[00:23:51] Operations, ops-eqiad: Update physical hostname labels for analytics 1017 & 1021 to read new hostnames notebook 1001 & 1002 - https://phabricator.wikimedia.org/T131216#2159621 (madhuvishy)
[00:25:23] Operations, ops-eqiad: Update physical hostname labels for analytics 1017 & 1021 to read new hostnames notebook 1001 & 1002 - https://phabricator.wikimedia.org/T131216#2159638 (madhuvishy)
[00:28:37] Operations, ops-eqiad: Update the visible label field in racktables for analytics 1017 & 1021 to notebook 1001 & 1002 - https://phabricator.wikimedia.org/T131217#2159642 (madhuvishy)
[00:29:42] Operations, ops-eqiad: Update the port description on the network switch for analytics 1017 and 1021 - https://phabricator.wikimedia.org/T131218#2159656 (madhuvishy)
[00:41:27] Operations, ops-eqiad, netops: Update the port description on the network switch for analytics 1017 and 1021 - https://phabricator.wikimedia.org/T131218#2159707 (Dzahn)
[00:45:06] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: puppet fail
[01:05:42] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 331.94 seconds
[01:13:25] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[01:14:02] Operations, Puppet: Reboot during puppet run causes /var/lib/puppet/state/agent_catalog_run.lock to be left and puppet to not start running again - https://phabricator.wikimedia.org/T127602#2159731 (scfc) `molly-guard` only works interactively, and I'm not sure if the problems are connected to a "regular"...
[01:27:51] (CR) Krinkle: "Where can I find those?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/275734 (owner: Aaron Schulz)
[01:49:31] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Replication lag: 0.28 seconds
[01:49:35] PROBLEM - MariaDB Slave Lag: x1 on db2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 314.38 seconds
[01:50:00] PROBLEM - MariaDB Slave Lag: x1 on db1031 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 329.51 seconds
[01:50:36] PROBLEM - MariaDB Slave Lag: x1 on db2008 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 372.74 seconds
[01:53:15] RECOVERY - MariaDB Slave Lag: x1 on db2009 is OK: OK slave_sql_lag Replication lag: 0.32 seconds
[01:53:32] RECOVERY - MariaDB Slave Lag: x1 on db1031 is OK: OK slave_sql_lag Replication lag: 0.17 seconds
[01:54:15] RECOVERY - MariaDB Slave Lag: x1 on db2008 is OK: OK slave_sql_lag Replication lag: 0.42 seconds
[02:27:38] (CR) Sabya: "Hi," [puppet] - https://gerrit.wikimedia.org/r/278555 (owner: Sabya)
[02:31:51] (CR) Sabya: "OK. Pls ignore my previous comment. Didn't notice the inline comments from Alexandros."
[puppet] - https://gerrit.wikimedia.org/r/278555 (owner: Sabya)
[02:32:04] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.18) (duration: 11m 29s)
[02:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:07:47] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.19) (duration: 18m 03s)
[03:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:17:16] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Mar 30 03:17:15 UTC 2016 (duration 9m 29s)
[03:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:50:32] (PS1) BBlack: remove a couple of inline attrs [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280377
[04:50:34] (PS1) BBlack: split lp->match allocation from lp itself [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280378
[04:50:36] (PS1) BBlack: remove lp->tmpbuf, bump scratch default to 4MB [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280379
[04:50:38] (PS1) BBlack: refactor match_assign/scratch/parser stuff [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280380
[04:50:40] (PS1) BBlack: minor cleanups from cppcheck [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280381
[05:39:41] !log tstarling@tin Synchronized php-1.27.0-wmf.19/extensions/WikimediaMaintenance/makeDumpList.php: (no message) (duration: 00m 38s)
[05:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:43:44] you are probably using some fields which are not indexed, but #wikimedia-labs is perhaps better place to ask
[05:45:32] Nikerabbit: I think you were meaning to reply in #mediawiki?
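The T127602 issue logged above (a reboot mid-run leaving /var/lib/puppet/state/agent_catalog_run.lock behind, so the agent never runs again) comes down to a liveness check on the lock. A minimal Python sketch of that check, under the assumption that the lock file stores the agent's PID; the helper name is illustrative, not code from the puppet tree:

```python
import os

def stale_lock(path="/var/lib/puppet/state/agent_catalog_run.lock"):
    """Return True if the lock file exists but no live process holds it.

    Assumes the lock file records the agent's PID; an empty or garbled
    file (e.g. left behind by a reboot) is also treated as stale.
    """
    try:
        with open(path) as f:
            pid = int(f.read().strip())
    except FileNotFoundError:
        return False          # no lock, nothing to clean up
    except ValueError:
        return True           # empty/garbled lock file: stale
    try:
        os.kill(pid, 0)       # signal 0: existence check only
        return False          # a live process still holds the lock
    except ProcessLookupError:
        return True           # recorded PID is gone: stale lock
```

If the check reports the lock as stale, removing the file lets the agent resume on its next scheduled run, which is the manual remedy the task describes.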
[05:46:12] p858snake: yes
[05:52:17] !log tstarling@tin Synchronized php-1.27.0-wmf.18/extensions/WikimediaMaintenance/makeDumpList.php: (no message) (duration: 00m 27s)
[05:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:06:02] !log on terbium: making visual diff dump lists with makeDumpList.php
[06:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:24:15] (PS10) KartikMistry: Enable non-default MT for some languages [puppet] - https://gerrit.wikimedia.org/r/277463 (https://phabricator.wikimedia.org/T129849)
[06:31:06] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:06] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:57] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:56] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:15] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:17] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:56] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:42:02] (CR) Gehel: [C: +1] "Thanks a lot! Seems to be fine..." [puppet] - https://gerrit.wikimedia.org/r/280343 (https://phabricator.wikimedia.org/T130329) (owner: Dzahn)
[06:44:13] Operations, Discovery, Elasticsearch, Patch-For-Review: Icinga should alert on free disk space < 15% on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329#2160016 (Gehel) Thanks @Dzahn ! I had this on my not too urgent todo list, but you've been much faster than me! Great!
[06:49:06] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:50:26] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[06:55:47] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:56:05] RECOVERY - HHVM rendering on mw1128 is OK: HTTP OK: HTTP/1.1 200 OK - 67729 bytes in 2.424 second response time
[06:56:06] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 4.506 second response time
[06:56:31] !log restarted hhvm on mw1128, mw1139
[06:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:56:36] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 67729 bytes in 1.275 second response time
[06:56:46] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.071 second response time
[06:57:36] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:47] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:58:26] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:37] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[07:07:34] (CR) Muehlenhoff: [C: -1] "See comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150) (owner: Elukey)
[07:14:18] (CR) Elukey: Add diamond nf_conntrack counter. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150) (owner: Elukey)
[07:39:36] (CR) Muehlenhoff: Add diamond nf_conntrack counter.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150) (owner: Elukey)
[08:00:56] (CR) Filippo Giunchedi: [C: -1] "net.netfilter.nf_conntrack_count vs net.ipv4.netfilter.ip_conntrack_count, but other than that LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150) (owner: Elukey)
[08:08:30] (PS2) Gehel: Allow browser to cache /portal static files. [puppet] - https://gerrit.wikimedia.org/r/280204 (https://phabricator.wikimedia.org/T126280)
[08:09:10] (PS5) Elukey: Add diamond nf_conntrack counter. [puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150)
[08:16:44] (CR) Muehlenhoff: "ip_conntrack_count and nf_conntrack_count are displaying the same values on all systems I checked, but we should rather read the value of " [puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150) (owner: Elukey)
[08:18:07] (CR) Elukey: Add diamond nf_conntrack counter. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150) (owner: Elukey)
[08:18:39] godog: thanks for --^
[08:19:59] (PS6) Elukey: Add diamond nf_conntrack counter. [puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150)
[08:21:32] np!
[08:22:10] !log restarting db1047 due to data corruption on Aria tables
[08:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:23:34] (CR) Filippo Giunchedi: [C: +1] Add diamond nf_conntrack counter.
[puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150) (owner: Elukey)
[08:27:45] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[08:29:23] (CR) Muehlenhoff: [C: +1] "Minor nitpick, otherwise LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150) (owner: Elukey)
[08:30:08] yeah, dbproxy is normal based on the log above
[08:30:18] (CR) ArielGlenn: [C: +2 V: +2] pylint and pep8 for listmediaperproject [dumps] (ariel) - https://gerrit.wikimedia.org/r/280254 (owner: ArielGlenn)
[08:30:25] (it is the secondary server)
[08:31:57] (PS7) Elukey: Add diamond nf_conntrack counter. [puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150)
[08:33:08] (CR) Elukey: [C: +2] Add diamond nf_conntrack counter. [puppet] - https://gerrit.wikimedia.org/r/280265 (https://phabricator.wikimedia.org/T131150) (owner: Elukey)
[08:33:57] (PS1) Ema: Use Varnish 4 on cp1043 [puppet] - https://gerrit.wikimedia.org/r/280393 (https://phabricator.wikimedia.org/T122880)
[08:34:05] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0]
[08:40:16] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0
[08:42:36] (PS2) Filippo Giunchedi: Staging: Clean up restbase seeds [puppet] - https://gerrit.wikimedia.org/r/280351 (owner: GWicke)
[08:42:44] (CR) Filippo Giunchedi: [C: +2 V: +2] Staging: Clean up restbase seeds [puppet] - https://gerrit.wikimedia.org/r/280351 (owner: GWicke)
[08:45:11] Operations, Patch-For-Review: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895#2160164 (fgiunchedi) FWIW I just saw this happening on palladium too, _before_ displaying the diff thus while updating from gerrit, repeating
`puppet-merge` works though, looks like so...
[08:45:26] volans: ^
[08:45:53] !log depooling maps-test2001 (to apply ferm)
[08:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:46:27] (PS2) Muehlenhoff: Enable base::firewall on maps master [puppet] - https://gerrit.wikimedia.org/r/280209
[08:46:36] jynus: followup question to https://lists.wikimedia.org/pipermail/analytics/2016-March/005072.html : is it OK to restart queries on s1-analytics-slave now? (it seems to be up and running)
[08:46:49] ... i read your advice to use dbstore1002 instead, but...
[08:46:49] (CR) Muehlenhoff: [C: +2 V: +2] Enable base::firewall on maps master [puppet] - https://gerrit.wikimedia.org/r/280209 (owner: Muehlenhoff)
[08:46:51] godog: thanks, I think is the same that apergos saw too other times, good to have it logged on the task, it definitely is something on gerrit side (git repo)
[08:47:53] ... i would like to run this on s1 because it's part of a series of queries where the preceding ones ran on s1 too and...
[08:48:10] yes, I also see that repeating the merge works
[08:48:27] ...
i have unfortunately been getting very different results for the same query and the same table on store
[08:49:00] (discussed this with madhu and nuria earlier and we did not bother you about this issue yet ;) but i might file a bug soon)
[08:49:47] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[08:50:13] Operations, Discovery, hardware-requests, Discovery-Search-Sprint: Relevance forge hardware - https://phabricator.wikimedia.org/T131184#2160173 (Gehel)
[08:50:48] Operations, Traffic: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#2160180 (fgiunchedi)
[08:55:11] (PS1) Muehlenhoff: Disable ferm on maps master, needs an additional rule first [puppet] - https://gerrit.wikimedia.org/r/280395
[08:55:30] (CR) Muehlenhoff: [C: +2 V: +2] Disable ferm on maps master, needs an additional rule first [puppet] - https://gerrit.wikimedia.org/r/280395 (owner: Muehlenhoff)
[08:56:51] HaeB, yes, queries can go there
[08:57:11] jynus: how long might the refilling take?
[08:57:48] HaeB, the tables are constantly being refilled from the master, the only thing that changes is how late they are from the master
[08:58:24] so it's just about the update lag, basically?
[08:58:25] i see
[08:58:27] !log repooling maps-test2001
[08:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:59:23] jynus, in any case, thanks!
[08:59:38] most seem to be 5 days behind
[08:59:48] enwiki and dewiki only minutes
[09:00:17] (i'm querying older eventlogging data, so that's not a concern in my case)
[09:02:07] (PS1) Elukey: Add netfliter nf_conntrack diamond collector to kafka brokers.
[puppet] - https://gerrit.wikimedia.org/r/280396 (https://phabricator.wikimedia.org/T131028)
[09:05:08] right now dbstore1002 is probably more accurante than db1047
[09:05:55] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: puppet fail
[09:06:25] (CR) Elukey: [C: +2] "Tested with https://puppet-compiler.wmflabs.org/2221/" [puppet] - https://gerrit.wikimedia.org/r/280396 (https://phabricator.wikimedia.org/T131028) (owner: Elukey)
[09:07:36] RECOVERY - puppet last run on maps-test2001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[09:07:37] except for MobileAppUploadAttempts_5257716
[09:07:40] !log depooling cp1043 for varnish4 upgrade (T122880)
[09:07:41] T122880: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880
[09:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:07:49] (CR) Filippo Giunchedi: "minor things, LGTM overall" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948) (owner: Mobrovac)
[09:08:37] jynus: also for older data? (from around the beginning of january, say)
[09:09:16] yes, in fact I can say that (except for that table) all rows from January to now are there
[09:09:19] jynus: and why? (the way madhu and nuria explained it to me made it sound that it was the other way around...
that s1-analytics might be more reliable than store)
[09:09:34] well, eventually, both will be
[09:10:06] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 690
[09:10:23] but db1047 just had issues and I have not yet completed a full resync
[09:10:52] (PS2) Ema: Use Varnish 4 on cp1043 [puppet] - https://gerrit.wikimedia.org/r/280393 (https://phabricator.wikimedia.org/T122880)
[09:11:04] (CR) Ema: [C: +2 V: +2] Use Varnish 4 on cp1043 [puppet] - https://gerrit.wikimedia.org/r/280393 (https://phabricator.wikimedia.org/T122880) (owner: Ema)
[09:11:10] jynus: ok, but that would not affect (say) eventlogging data from january 7, say?
[09:11:50] (the missing resync, i mean)
[09:11:58] I am actually checking since 2016-01-01
[09:13:15] I think the best way to check it is for you to run a quick count(*) from the timestamp that interests you an compare results
[09:13:33] if you see a difference, tell me
[09:15:06] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 613
[09:15:35] PROBLEM - Varnishkafka log producer on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[09:16:16] ---^ ema: I am very sad
[09:16:29] :P
[09:16:38] yep me too.
It's depooled though :)
[09:16:56] PROBLEM - Varnish HTTP maps-frontend - port 80 on cp1043 is CRITICAL: Connection refused
[09:17:07] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:19:28] (PS1) Muehlenhoff: Reenable base::firewall on maps2001 [puppet] - https://gerrit.wikimedia.org/r/280398
[09:25:06] RECOVERY - check_mysql on lutetium is OK: Uptime: 1105730 Threads: 1 Questions: 7445117 Slow queries: 8343 Opens: 85793 Flush tables: 2 Open tables: 64 Queries per second avg: 6.733 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:29:28] (CR) Muehlenhoff: [C: +2 V: +2] Reenable base::firewall on maps2001 [puppet] - https://gerrit.wikimedia.org/r/280398 (owner: Muehlenhoff)
[09:31:10] (CR) Elukey: [C: +1] remove a couple of inline attrs [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280377 (owner: BBlack)
[09:31:26] RECOVERY - Varnish HTTP maps-frontend - port 80 on cp1043 is OK: HTTP OK: HTTP/1.1 200 OK - 147 bytes in 0.037 second response time
[09:33:27] Operations, Gerrit, Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160230 (Peachey88)
[09:39:02] (PS1) Muehlenhoff: Move master ferm services into server.pp [puppet] - https://gerrit.wikimedia.org/r/280399
[09:40:17] (CR) Muehlenhoff: [C: +2 V: +2] Move master ferm services into server.pp [puppet] - https://gerrit.wikimedia.org/r/280399 (owner: Muehlenhoff)
[09:51:36] RECOVERY - Varnishkafka log producer on cp1043 is OK: PROCS OK: 1 process with command name varnishkafka
[09:53:25] (PS1) Filippo Giunchedi: cassandra: add restbase200[3-6] instances [dns] - https://gerrit.wikimedia.org/r/280400
[09:54:40] (CR) Filippo Giunchedi: [C: +2 V: +2] cassandra: add restbase200[3-6] instances [dns] - https://gerrit.wikimedia.org/r/280400 (owner: Filippo Giunchedi)
[09:56:53] Operations, Project-Admins, DevRel-March-2016: Operations-related subprojects/tags
reorganization - https://phabricator.wikimedia.org/T119944#2160271 (Aklapper) @Faidon: Any 'final' word on T119944#2140271 ?
[09:59:13] (PS1) Alexandros Kosiaris: maps: Fix mess introduced in I2a44ee9 [puppet] - https://gerrit.wikimedia.org/r/280401
[10:00:14] (CR) jenkins-bot: [V: -1] maps: Fix mess introduced in I2a44ee9 [puppet] - https://gerrit.wikimedia.org/r/280401 (owner: Alexandros Kosiaris)
[10:02:54] (CR) Alexandros Kosiaris: "For future reference, https://gerrit.wikimedia.org/r/#/c/280401/1 for more info as to why this was required after all" [puppet] - https://gerrit.wikimedia.org/r/280399 (owner: Muehlenhoff)
[10:02:56] (CR) Elukey: [C: +1] "Some comments, nothing major, LGTM!" (5 comments) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280294 (owner: BBlack)
[10:02:58] (PS1) Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - https://gerrit.wikimedia.org/r/280403
[10:03:08] (PS1) Ema: Varnish 4: fix dynamic directors inclusion [puppet] - https://gerrit.wikimedia.org/r/280404 (https://phabricator.wikimedia.org/T124279)
[10:03:39] (CR) Elukey: "Forgot to mention: tested in my Vagrant VM and everything works fine."
[software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280294 (owner: BBlack)
[10:04:32] (CR) jenkins-bot: [V: -1] Varnish 4: fix dynamic directors inclusion [puppet] - https://gerrit.wikimedia.org/r/280404 (https://phabricator.wikimedia.org/T124279) (owner: Ema)
[10:06:35] (CR) Alexandros Kosiaris: [C: +2 V: +2] "Puppet compiler says ok, noop in https://puppet-compiler.wmflabs.org/2222/" [puppet] - https://gerrit.wikimedia.org/r/280401 (owner: Alexandros Kosiaris)
[10:07:17] (PS2) Ema: Varnish 4: fix dynamic directors inclusion [puppet] - https://gerrit.wikimedia.org/r/280404 (https://phabricator.wikimedia.org/T124279)
[10:09:56] (PS1) Muehlenhoff: Fix ferm rules for postgres and redis access on maps master [puppet] - https://gerrit.wikimedia.org/r/280405
[10:11:02] (CR) Muehlenhoff: [C: +2 V: +2] Fix ferm rules for postgres and redis access on maps master [puppet] - https://gerrit.wikimedia.org/r/280405 (owner: Muehlenhoff)
[10:11:23] (PS3) Ema: Varnish 4: fix dynamic directors inclusion [puppet] - https://gerrit.wikimedia.org/r/280404 (https://phabricator.wikimedia.org/T124279)
[10:11:51] (CR) Ema: [C: +2 V: +2] Varnish 4: fix dynamic directors inclusion [puppet] - https://gerrit.wikimedia.org/r/280404 (https://phabricator.wikimedia.org/T124279) (owner: Ema)
[10:12:23] (PS1) Alexandros Kosiaris: maps: ship postgres tuning file into role module [puppet] - https://gerrit.wikimedia.org/r/280406
[10:12:33] (PS2) Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - https://gerrit.wikimedia.org/r/280403
[10:15:46] (CR) Alexandros Kosiaris: [C: +2] maps: ship postgres tuning file into role module [puppet] - https://gerrit.wikimedia.org/r/280406 (owner: Alexandros Kosiaris)
[10:15:53] (PS2) Alexandros Kosiaris: maps: ship postgres tuning file into role module [puppet] - https://gerrit.wikimedia.org/r/280406
[10:16:37] PROBLEM - puppet last run on
dbstore1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:19:23] (PS3) Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - https://gerrit.wikimedia.org/r/280403
[10:22:44] (CR) Elukey: [C: -1] "The change looks good and I've tested it on my VM. I went -1 because I'd like to discuss this change more with Andrew and Brandon:" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280295 (owner: BBlack)
[10:24:51] (PS4) Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - https://gerrit.wikimedia.org/r/280403
[10:25:07] (PS5) Mobrovac: scap::target: Allow scap's user to restart all services on a node [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948)
[10:25:30] !log starting regression/stress testing of codfw mediawiki infrastructure
[10:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:26:22] (CR) Mobrovac: scap::target: Allow scap's user to restart all services on a node (2 comments) [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948) (owner: Mobrovac)
[10:27:23] Failed requests: 21 (out of 10000, on blank page, we do not start well)
[10:28:47] (CR) Alexandros Kosiaris: "This LGTM.
That being said, the use of extra sudo rules that are meant to be different than the ones installed by defining service_name is" [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948) (owner: Mobrovac)
[10:29:45] (PS1) Alexandros Kosiaris: maps: osm-replicate memorylimit stability fix [puppet] - https://gerrit.wikimedia.org/r/280407
[10:30:08] (CR) Mobrovac: "Still looking good - https://puppet-compiler.wmflabs.org/2226/" [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948) (owner: Mobrovac)
[10:30:55] (PS2) Alexandros Kosiaris: maps: osm-replicate memorylimit stability fix [puppet] - https://gerrit.wikimedia.org/r/280407
[10:31:01] (CR) Alexandros Kosiaris: [C: +2 V: +2] maps: osm-replicate memorylimit stability fix [puppet] - https://gerrit.wikimedia.org/r/280407 (owner: Alexandros Kosiaris)
[10:32:53] (CR) Mobrovac: "@Alex, $sudo_rules is meant to be used when one needs to restart extra (dependent) services. I agree that it's confusing, but this has not" [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948) (owner: Mobrovac)
[10:34:12] (CR) Elukey: [C: +1] "WOW" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280378 (owner: BBlack)
[10:34:26] ignore that, it is length errors, so not real errors, I think
[10:35:05] (PS1) Ema: Varnish 4: call init_hashcircle after vslp director definition [puppet] - https://gerrit.wikimedia.org/r/280409 (https://phabricator.wikimedia.org/T124279)
[10:40:48] (CR) Ema: [C: +2 V: +2] Varnish 4: call init_hashcircle after vslp director definition [puppet] - https://gerrit.wikimedia.org/r/280409 (https://phabricator.wikimedia.org/T124279) (owner: Ema)
[10:41:11] akosiaris: hey, I figured out how to solve keyholder issue.
I worked it out and I'm running the scap in beta (without puppet apply)
[10:41:31] there is a strange issue, by doing "include scap" in puppet it returns
[10:41:39] hmm, this error is strange: E: Version '3.0.3-1' for 'scap' was not found
[10:41:48] RECOVERY - puppet last run on dbstore1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[10:41:53] 3:05 PM ladsgroup@deployment-ores-web:~$ apt-cache madison scap
[10:41:53] 3:05 PM scap | 3.1.0-1 | http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main amd64 Packages
[10:41:53] 3:05 PM scap | 3.1.0-1 | http://apt.wikimedia.org/wikimedia/ jessie-wikimedia/main Sources
[10:42:23] so, scap 3.0.3-1 is not in apt sources and it's scap 3.1 there instead
[10:42:53] should we change the scap puppets? or Am I missing something?
[10:43:24] (CR) BBlack: Code cleanup (3 comments) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280294 (owner: BBlack)
[10:43:59] file modules/scap/manifests/init.pp
[10:51:17] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:55:27] Amir1: indeed!
good find
[10:55:35] Amir1: https://phabricator.wikimedia.org/T130902 created the new pkg version
[10:57:18] (PS1) Ema: ganglia-varnish.py: use double quotes in metric description [puppet] - https://gerrit.wikimedia.org/r/280412 (https://phabricator.wikimedia.org/T122880)
[10:58:37] (CR) BBlack: [C: +1] ganglia-varnish.py: use double quotes in metric description [puppet] - https://gerrit.wikimedia.org/r/280412 (https://phabricator.wikimedia.org/T122880) (owner: Ema)
[10:59:00] (CR) Ema: [C: +2 V: +2] ganglia-varnish.py: use double quotes in metric description [puppet] - https://gerrit.wikimedia.org/r/280412 (https://phabricator.wikimedia.org/T122880) (owner: Ema)
[10:59:33] (PS1) Mobrovac: Scap: Bump version to 3.1.0-1 [puppet] - https://gerrit.wikimedia.org/r/280413 (https://phabricator.wikimedia.org/T130902)
[11:00:29] \o/
[11:00:33] Amir1: ^
[11:00:37] thanks mobrovac
[11:00:58] Amir1: you can cherry-pick this on deployment-puppetmaster.eqiad.wmflabs and it should "just work"(TM)
[11:01:10] I'm about to
[11:01:11] :D
[11:01:15] k cool
[11:03:21] done
[11:03:23] worked
[11:03:51] (CR) Elukey: [C: +1] "Tested on my VM" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280379 (owner: BBlack)
[11:05:04] Amir1: gr8! could you please +1 the patch and say so?
[11:05:14] Operations, ops-codfw: install SSDs in restbase2001-restbase2006 - https://phabricator.wikimedia.org/T127333#2160363 (fgiunchedi) Resolved→Open a: fgiunchedi→Papaul @Papaul I'm reopening this as we'll need to undo some parts, we're getting new hardware in T130218. namely we'll have a total of...
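The gerrit 280265 review thread earlier in the log converged on reading `net.netfilter.nf_conntrack_count` (the older `net.ipv4.netfilter.ip_conntrack_count` name was displaying the same values on the systems checked) together with the configured maximum. A minimal sketch of that kind of read, with the base path parameterized for testing; this is an illustration, not the actual diamond collector code from that change:

```python
def read_conntrack(base="/proc/sys/net/netfilter"):
    """Read the conntrack counters and derive a table-fill percentage.

    Uses the net.netfilter.* sysctl names as settled on in the review;
    on the kernels discussed, the net.ipv4.netfilter.ip_conntrack_*
    names are a legacy alias exposing the same values.
    """
    metrics = {}
    for name in ("nf_conntrack_count", "nf_conntrack_max"):
        with open("%s/%s" % (base, name)) as f:
            metrics[name] = int(f.read().strip())
    # the ratio is more useful for alerting than the raw count alone
    metrics["nf_conntrack_used_pct"] = (
        100.0 * metrics["nf_conntrack_count"] / metrics["nf_conntrack_max"]
    )
    return metrics
```

A collector publishing `nf_conntrack_used_pct` makes the "table nearly full, packets about to be dropped" condition visible regardless of how large `nf_conntrack_max` is tuned on a given host.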
[11:05:16] sure
[11:05:51] mobrovac: you added the other Amir as reviewer
[11:05:55] I'm Ladsgroup
[11:06:09] gr
[11:06:11] happens all the time most of the time it's the other way around
[11:06:43] i don't find this gerrit feature too friendly tbh
[11:07:28] (CR) Ladsgroup: [C: +1] "I manually added PS1 to beta puppetmaster and tested it in deployment-ores-web.deployment-prep. It worked like a charm" [puppet] - https://gerrit.wikimedia.org/r/280413 (https://phabricator.wikimedia.org/T130902) (owner: Mobrovac)
[11:08:01] it happens everywhere, mostly phabricator
[11:16:41] (CR) Elukey: [C: +1] "Tested on my VM" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280381 (owner: BBlack)
[11:18:23] codfw may have real stress now
[11:18:45] !log Updated cxserver to 5699a49
[11:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:20:19] (PS4) BBlack: Bump s-maxage for purged endpoints. [puppet] - https://gerrit.wikimedia.org/r/280091 (owner: Ppchelko)
[11:20:45] (CR) BBlack: [C: +2 V: +2] Bump s-maxage for purged endpoints. [puppet] - https://gerrit.wikimedia.org/r/280091 (owner: Ppchelko)
[11:21:26] memcache vs database requests is a huge difference
[11:22:28] (I know that is stating the obvious, but it is nice to actually see it)
[11:26:26] what are you using for stress testing btw?
[11:26:51] (PS1) Filippo Giunchedi: cassandra: add restbase200[3-6] instances [puppet] - https://gerrit.wikimedia.org/r/280416
[11:28:30] heh bblack, thnx for merging
[11:29:00] godog, just basic ab for now, to start
[11:33:36] Amir1: yeah 3.1.0 is the new version we should very soon be going ot
[11:33:57] jynus: nice! have you come across https://github.com/wg/wrk yet?
supposedly the modern version
[11:34:13] akosiaris: yeah :)
[11:34:23] I'm working on other issues
[11:34:43] mostly premission error and owner of /srv/ores
[11:35:02] godog, I do not really need anything fancy (I am not benchmarking or doing anything too strange)
[11:35:25] for now I am just runing some query on 5 mediawiki servers
[11:35:46] and see how it responds (errors, general load)
[11:36:11] (CR) Filippo Giunchedi: [C: +2 V: +2] cassandra: add restbase200[3-6] instances [puppet] - https://gerrit.wikimedia.org/r/280416 (owner: Filippo Giunchedi)
[11:36:53] bblack: I've puppet-merge'd your change too btw
[11:37:08] mobrovac: ^
[11:37:41] !log reimage restbase2004.codfw.wmnet
[11:37:43] in general, results are being positive, but there is a significant jump between 99% percentile and max
[11:37:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:37:57] godog: sorry, thanks!
[11:38:06] there are other issues I am seeing that seem wrong, but not codfw-related
[11:38:40] I intend to write a summary of the results later. Also, if you want me to test swift or something else, just tell me
[11:39:31] jynus: sure, thanks!
[11:40:47] (CR) Elukey: [C: +1] "Looks good and I've tested it on my VM.
The code looks cleaner and less entangled, but since this part of the code was the "hic sunt leone" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/280380 (owner: BBlack)
[11:40:49] (PS1) ArielGlenn: wikiqueries: remove special Config class, use dumps library version [dumps] (ariel) - https://gerrit.wikimedia.org/r/280418
[11:40:51] (PS1) ArielGlenn: remove RunSimpleCommand from wikiqueries and use dumps library version [dumps] (ariel) - https://gerrit.wikimedia.org/r/280419
[11:40:53] (PS1) ArielGlenn: remove DBServerInfo class from wikiqueries and use dumps library equiv [dumps] (ariel) - https://gerrit.wikimedia.org/r/280420
[11:40:55] (PS1) ArielGlenn: remove MultiVersion class from wikiqueries, no longer needed [dumps] (ariel) - https://gerrit.wikimedia.org/r/280421
[11:40:57] (PS1) ArielGlenn: remove MiscUtils class from wikiqueries and use dumps lib equivs [dumps] (ariel) - https://gerrit.wikimedia.org/r/280422
[11:40:59] (PS1) ArielGlenn: remove OutputFile and ContentFile classes from wikiqueries [dumps] (ariel) - https://gerrit.wikimedia.org/r/280423
[11:41:01] (PS1) ArielGlenn: remove QueryDir class from wikiqueries, replace by 2 lines of code [dumps] (ariel) - https://gerrit.wikimedia.org/r/280424
[11:41:03] (PS1) ArielGlenn: pylint and pep8 for wikiqueries [dumps] (ariel) - https://gerrit.wikimedia.org/r/280425
[11:41:10]
[11:41:43] (CR) Elukey: [C: +1] "Had a chat with Brandon on IRC.
For the moment we are going to remove this feature to proceed with the varnish 4 porting, then we'll re-ad" [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/280295 (owner: 10BBlack) [11:42:57] (03CR) 10ArielGlenn: [C: 032 V: 032] wikiqueries: remove special Config class, use dumps library version [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280418 (owner: 10ArielGlenn) [11:43:49] (03CR) 10ArielGlenn: [C: 032 V: 032] remove RunSimpleCommand from wikiqueries and use dumps library version [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280419 (owner: 10ArielGlenn) [11:45:06] (03CR) 10ArielGlenn: [C: 032 V: 032] remove DBServerInfo class from wikiqueries and use dumps library equiv [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280420 (owner: 10ArielGlenn) [11:45:34] (03CR) 10ArielGlenn: [C: 032 V: 032] remove MultiVersion class from wikiqueries, no longer needed [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280421 (owner: 10ArielGlenn) [11:46:06] I am using mw2008-12, just in case something breaks [11:46:22] (03CR) 10ArielGlenn: [C: 032 V: 032] remove MiscUtils class from wikiqueries and use dumps lib equivs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280422 (owner: 10ArielGlenn) [11:48:36] (03CR) 10ArielGlenn: [C: 032 V: 032] remove OutputFile and ContentFile classes from wikiqueries [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280423 (owner: 10ArielGlenn) [11:49:12] (03CR) 10ArielGlenn: [C: 032 V: 032] remove QueryDir class from wikiqueries, replace by 2 lines of code [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280424 (owner: 10ArielGlenn) [11:51:07] (03CR) 10ArielGlenn: [C: 032 V: 032] pylint and pep8 for wikiqueries [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280425 (owner: 10ArielGlenn) [12:00:35] (03PS1) 10ArielGlenn: move wikiqueries files into main dump dir [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280431 [12:01:37] (03CR) 10ArielGlenn: [C: 032] move wikiqueries files into main dump dir 
[dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280431 (owner: 10ArielGlenn) [12:02:59] (03PS1) 10ArielGlenn: remove a couple more excluded dirs from tox config [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280432 [12:03:36] (03CR) 10ArielGlenn: [C: 032] remove a couple more excluded dirs from tox config [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280432 (owner: 10ArielGlenn) [12:10:13] (03CR) 10BBlack: [C: 031] Remove loglines cache to mitigate a possible memory leak. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [12:11:55] (03CR) 10BBlack: [C: 031] delete *.email.donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/278353 (https://phabricator.wikimedia.org/T130414) (owner: 10Dzahn) [12:13:12] (03CR) 10BBlack: "Nothing's really blocking this directly, but at the same time, we should fix the other issues first IMHO, so that we can really see what w" [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) (owner: 10Smalyshev) [12:24:00] 6Operations, 10Continuous-Integration-Config, 10Dumps-Generation, 13Patch-For-Review, 7WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#2160482 (10ArielGlenn) All files in production use are now in operations/dumps/xmldumpsbackup (not the subdirec... 
[12:26:34] PROBLEM - confd service on cp1043 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive [12:32:27] (03PS5) 10Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - 10https://gerrit.wikimedia.org/r/280403 [12:36:00] (03PS2) 10MarcoAurelio: Setting $wgMetaNamespaces for an.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279724 (https://phabricator.wikimedia.org/T131006) [12:39:25] RECOVERY - confd service on cp1043 is OK: OK - confd is active [12:40:53] !log repooling cp1043 running varnish4 (T122880) [12:40:54] T122880: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880 [12:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:08] (03Abandoned) 10ArielGlenn: Fixed various issues detected by pyflakes [dumps] - 10https://gerrit.wikimedia.org/r/207708 (owner: 10Dereckson) [12:46:24] (03Abandoned) 10MarcoAurelio: Setting $wgMetaNamespaces for an.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279724 (https://phabricator.wikimedia.org/T131006) (owner: 10MarcoAurelio) [12:53:13] (03PS6) 10Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - 10https://gerrit.wikimedia.org/r/280403 [13:04:36] (03PS7) 10Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - 10https://gerrit.wikimedia.org/r/280403 [13:11:25] jouncebot next [13:11:25] In 1 hour(s) and 48 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160330T1500) [13:16:16] (03PS8) 10Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - 10https://gerrit.wikimedia.org/r/280403 [13:30:49] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2160555 (10Cmjohnson) No, we do not have enough disks. we are adding 5 each and removing 3. 
even if i used the remaining 3 spare i am still 1 disk short of being able to populate the las... [13:40:54] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [13:55:25] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:55:59] blergh ^ [13:56:14] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:57:14] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [13:57:56] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [13:59:07] (03PS9) 10Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - 10https://gerrit.wikimedia.org/r/280403 [13:59:37] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2160595 (10Eevans) >>! In T128107#2160555, @Cmjohnson wrote: > No, we do not have enough disks. we are adding 5 each and removing 3. even if i used the remaining 3 spare i am still 1 dis... [14:05:34] (03PS1) 10MarcoAurelio: Updating project logo for ast.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280445 (https://phabricator.wikimedia.org/T131247) [14:07:07] (03CR) 10MarcoAurelio: "Note to deployer: optiPNG needs to be run." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280445 (https://phabricator.wikimedia.org/T131247) (owner: 10MarcoAurelio) [14:08:14] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:08:53] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:12:37] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2064139 (10mobrovac) For posterity, here's @Eevans' maths as to why we are short of disks: ``` (15:32:20) urandom: actually, i think i get itt (15:32:34) mobrovac: have we miscalculated... [14:13:45] akosiaris: something going on with scb200[12] ? checks failing for services there but no action at all there ^ [14:13:50] (03PS10) 10Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - 10https://gerrit.wikimedia.org/r/280403 [14:14:14] (03PS1) 10Filippo Giunchedi: cassandra: add restbase2004 instances [puppet] - 10https://gerrit.wikimedia.org/r/280446 [14:15:35] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:15:57] mobrovac: doesn't seem like a problem with the services themselves [14:16:07] maybe the check timeouts ... [14:16:46] it shouldn't though [14:16:54] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:17:23] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:17:43] and now scb1002 ? [14:17:48] ok that's bad [14:19:29] wth? [14:20:38] I can reproduce it from neon [14:20:45] /usr/lib/nagios/plugins/check_nrpe -H scb1002.eqiad.wmnet -t 10 -c check_endpoints_citoid [14:20:45] CHECK_NRPE: Socket timeout after 10 seconds. [14:21:04] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:21:22] but citoid alone [14:21:27] /usr/lib/nagios/plugins/check_nrpe -H scb1001.eqiad.wmnet -t 10 -c check_endpoints_cxserver [14:21:28] All endpoints are healthy [14:21:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase2004 instances [puppet] - 10https://gerrit.wikimedia.org/r/280446 (owner: 10Filippo Giunchedi) [14:21:34] for example cxserver is working fine [14:21:34] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [14:22:07] running the service_checker script on scb1002 for citoid gives ReadTimeoutError("HTTPConnectionPool(host=u'127.0.0.1', port=1970): Read timed out. [14:22:39] no cpu/mem activity for citoid processes there [14:22:43] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [14:22:53] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:23:06] oh come on [14:23:47] and no loging.. last weird logline was back at 2016-03-30T14:05:27.780Z [14:23:58] TypeError: Cannot read property 'response' of undefined [14:24:10] and a stack track 500 internal error [14:24:52] "message":"Maximum call stack size exceeded" [14:25:03] this seems to have been the cause for the error above [14:25:26] i don't see that on scb1002 [14:25:37] not sure however they correlate with the report at hand.. but altough 15 minutes ago.. it just might [14:25:40] just an endless flux of useless logs [14:25:41] that's on scb1001 [14:25:50] (03PS1) 10DCausse: Bump CirrusSearchRequestSet rev to 121456865906 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280448 (https://phabricator.wikimedia.org/T128533) [14:25:54] which also raised an alert [14:26:23] Maximum call stack size exceeded is known [14:26:26] !log bootstrap cassandra-a on restbase2004 [14:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:26:34] (03CR) 10DCausse: [C: 04-1] "Should be safe to SWAT next Monday." 
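[Editor's note] The isolation step above — running `check_nrpe` for each endpoint check on the same hosts and noticing that only citoid times out, on several hosts at once — is what points at a shared dependency rather than a broken host. That reasoning can be sketched as a small pure function (a hypothetical helper, not actual Icinga/NRPE code):

```python
from collections import defaultdict

def shared_dependency_suspects(results):
    """Given {(host, check): passed_bool}, return checks that fail on more
    than one host while some *other* check still passes on those hosts.
    That pattern implicates a dependency shared by the check, not the hosts."""
    fail_hosts = defaultdict(set)
    ok_hosts = defaultdict(set)
    for (host, check), passed in results.items():
        (ok_hosts if passed else fail_hosts)[check].add(host)
    suspects = []
    for check, hosts in fail_hosts.items():
        healthy_elsewhere = any(
            h in ok_hosts[other]
            for other in ok_hosts if other != check
            for h in hosts
        )
        if len(hosts) > 1 and healthy_elsewhere:
            suspects.append(check)
    return sorted(suspects)
```

With citoid failing on scb1001/scb1002 while cxserver passes on both, only `citoid` is flagged — mirroring the conclusion reached in the channel.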
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/280448 (https://phabricator.wikimedia.org/T128533) (owner: 10DCausse) [14:26:40] that happens when a lot of redirects happen for a req [14:26:47] but that doesn't cause a worker to die [14:26:55] and certainly not all of them [14:27:14] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:27:25] I don't think any worker died [14:27:41] start dates on all processes is Mar 25 [14:29:29] !log restbase deploy 3ea08751a8 on restbase2004 after reimage [14:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:02] akosiaris: can you check puppet logs on scb to be sure it isn't messing with the services? [14:31:10] * mobrovac has no access to them [14:31:22] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [14:32:15] (03PS1) 10Luke081515: Add movefile to the editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280449 (https://phabricator.wikimedia.org/T131249) [14:34:14] mobrovac: nothing in there, just ssh key changes for restbase2004 [14:34:40] that /etc/ssh/ssh_known_hosts in case I wasn't clear [14:34:58] !log restbase rolling restart to apply https://gerrit.wikimedia.org/r/#/c/280091/ [14:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:03] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:35:30] it seems to be citoid specific after all [14:35:41] akosiaris: now i don't understand why most of the logs are almost useless [14:35:46] no messages no nothing [14:35:52] jsut the level and ts [14:37:07] (03PS2) 10Luke081515: Add movefile to the editor group at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280449 (https://phabricator.wikimedia.org/T131249) [14:38:34] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [14:38:45] this is starting to smell of blocking [14:39:12] damn it [14:39:17] and i yet have to pack [14:39:21] damn [14:39:22] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:39:54] akosiaris: there are a number of fixes in service-runner since the version used by citoid, this might help us greatly [14:40:03] could this anything have to do/be affected by codfw stress testing (I am only querying mediawiki)? Or can I continue? [14:40:23] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:40:53] unlikely jynus, nodes in both eqiad and codfw are flapping and citoid doesn't call mw at all [14:41:24] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:41:44] akosiaris: however, in order for me to update citoid we need to push through a config change for service::node [14:42:29] I am not sure how updating citoid would help here. We don't even have a suspect for what is going on [14:44:16] GET /api?search=PMC9999999&format=mediawiki is the failing one [14:44:37] that's the health check that is being pushed over to zotero, right ? 
[14:44:45] yes [14:44:51] cause it is starting to smell like zotero is at fault here [14:45:03] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:45:09] not that I would be surprised [14:45:46] akosiaris: zotero using 10.5% mem on sc1001 [14:45:54] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:46:06] 6.528g res mem [14:46:07] lol [14:46:12] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:46:33] zotero logs however are absolutely unhelpful [14:46:35] for example [14:46:42] zotero(3)(+0000000): HTTP GET http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&retmode=xml&id=9999999 [14:46:51] clearly related but that is about it [14:47:43] akosiaris: zotero is a front-end app at its core, so my guess would be they use window.alert() when debugging [14:47:43] :D [14:48:23] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [14:48:26] and of course I will sound like a broken record. We need to get rid of that app [14:48:37] but wait, why aren't we seeing zotero checks failures? [14:48:52] zotero checks are laughable [14:49:16] it's not meant to be monitored, so my best outcome was to post something and ask to be converted [14:49:39] akosiaris: there's some movement on the "get rid of zotero" front, cf https://phabricator.wikimedia.org/T93579 [14:50:30] akosiaris: turn it off and on again(TM) ? [14:50:53] lol [14:50:59] can't hurt to try [14:52:13] !log restart zotero on sca1001, sca1002 [14:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:52:33] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:53:09] heh, res mem back to 65k from 7g [14:53:13] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [14:53:26] so apart from everything else, it also has a memory leak [14:53:34] !log restart zotero on sca2001, sca2002 [14:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:37] not that I am surprised [14:53:43] yup [14:53:43] it's xulrunner we are talking about here [14:54:02] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:54:16] that's understandable, though - this is unlikely to be a real-world problem in a browser implementation [14:54:42] * mobrovac is said about that statement [14:54:47] s/said/sad/ [14:54:48] well, depends on the poor machine it runs on [14:55:12] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [14:55:29] if the poor researcher decides to leave firefox running for say 10 days and zotero alone ends up consuming 6GB of memory.. well ok restarting firefox solves the problem but still... [14:55:40] so, service_checker still fails [14:55:54] i think we restart zotero every 4 to 6 months, in which period its mem grows from 65k to 6g, so the growth rate is slow enough for desktops not to notice it i guess [14:56:36] yeah, it all depends on where the memory leaks happen in the end [14:56:46] anyway, the restart did not help [14:57:30] http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&retmode=xml&id=9999999 is not exactly happy in general though [14:57:37] (03CR) 10Luke081515: [C: 031] Enable Translate extension on AffCom (chapcomwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275289 (https://phabricator.wikimedia.org/T66122) (owner: 10MarcoAurelio) [14:57:38] I can't fetch it from my box either [14:58:02] ah here we are... a 502 from the upstream server [14:58:04] great!! 
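[Editor's note] The back-of-the-envelope above — resident memory growing from ~65 kB to ~6.5 GB over a restart interval of 4 to 6 months — implies a leak rate slow enough that a desktop user would plausibly never notice it. A quick sketch of that arithmetic (the figures are the rough ones quoted in the channel, not measurements):

```python
def leak_rate_mb_per_day(start_bytes, end_bytes, days):
    """Average resident-memory growth in MiB per day over the window."""
    return (end_bytes - start_bytes) / days / (1024 * 1024)

# ~65 kB -> ~6.5 GB over roughly 5 months (~150 days).
rate = leak_rate_mb_per_day(65 * 1024, int(6.5 * 1024 ** 3), 150)
```

That works out to a few tens of MiB per day — invisible over a normal desktop browser session, ruinous for a long-lived server process.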
[14:58:16] akosiaris: it's supposed to fail with a 404 iirc [14:58:30] ah [14:58:33] ok! [14:58:37] The proxy server could not handle the request GET /entrez/eutils/efetch.fcgi.Reason: Error reading from remote server [14:58:43] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:47] Error 502eutils.ncbi.nlm.nih.gov [14:58:47] Wed Mar 30 10:57:17 2016 [14:58:47] Apache [14:58:52] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160827 (10demon) >>! In T131189#2160229, @Paladox wrote: > Maybe it is a bug in gerrit. > > Since in https://gerrit-documentation.storage.googleapis.com/ReleaseNotes/ReleaseNotes-2.8.4.html it s... [14:58:54] lol [14:59:08] heh, that serves us right for depending on a governmental site working [14:59:27] akosiaris: so, let's just ack it? [14:59:42] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [14:59:48] it's flapping. ACK won't help much [14:59:51] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160836 (10Paladox) Ok, then it is a bug. Which hopefully we will be on gerrit 2.12 soon so the problem should be fixed. [14:59:55] indeed [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160330T1500). Please do the needful. [15:00:04] matt_flaschen kart_ mafk: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:12] PROBLEM - check_puppetrun on rigel is CRITICAL: CRITICAL: Puppet has 1 failures [15:00:24] * mafk here [15:00:37] it might make sense to change the check to something else ? 
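[Editor's note] The distinction being drawn above — the probe for a deliberately nonexistent PMCID is *supposed* to come back as a clean 404, while the 502 actually seen means the upstream proxy is broken — could be encoded in the check itself so that external outages are reported differently from service bugs. A hypothetical sketch, not the actual service_checker logic:

```python
def classify_probe(status):
    """Map an HTTP status from the nonexistent-PMCID probe to a check outcome.
    404 is the *expected* answer for an ID that does not exist; 5xx means the
    upstream (or its proxy) is down, which is outside our control and arguably
    should not page the same way a service fault does."""
    if status == 404:
        return "OK"             # service correctly reports the miss
    if 500 <= status < 600:
        return "UPSTREAM-DOWN"  # e.g. the 502 seen from the eutils proxy
    return "UNEXPECTED"         # anything else is a real service bug
```

Distinguishing the two outcomes would have saved the hour spent here ruling out the service, zotero, and puppet before suspecting the external database.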
[15:00:45] not sure what of course [15:00:56] and IIRC we 've been down that road before once already [15:00:58] 6Operations, 10Beta-Cluster-Infrastructure, 7Performance: Need a way to simulate replication lag to test replag issues - https://phabricator.wikimedia.org/T40945#2160841 (10Nikerabbit) [15:01:45] I can SWAT. [15:01:49] Present [15:01:59] * Luke081515 uses the evening SWAT for his patch today, because he is only available till 17:25 this SWAT time [15:02:00] here [15:02:06] mafk: do you have a mediawiki-config submodule bump? [15:02:10] thcipriani as usual lately :) [15:02:30] I mean, in my last 5 SWATs you were the one deploying the code [15:03:11] mafk: :D [15:03:12] (03PS1) 10Elukey: Add a new configuration file specific for Varnish 4 [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/280452 (https://phabricator.wikimedia.org/T124278) [15:03:29] akosiaris: that gov DB is rather important for citoid, so it's important to know about its disruption [15:03:42] (03PS1) 10Muehlenhoff: Detect outdated package versions and flag an error [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/280453 [15:03:57] mobrovac: why ? what is it that we can do about its disruption ? [15:04:10] meh, maybe I should adjust my client. Everytime Elukey is metioned here, my client pings, because "elukey" contains the work "luke" :-/ [15:04:37] akosiaris: well, at least we know about it, if we remove it, it would be even harder to deduce it, wouldn't it? [15:04:41] Luke081515: :D [15:04:42] also when somebody says "Use the force, luke!" [15:05:05] thcipriani: I can do code+2 on the wikimedia/portals ones, but somebody else need to merge those on the portals [15:05:25] mobrovac: but we don't know it. We lost already 1 hour trying to figure it out. 
And all of that because a gov site fails which we can do nothing about [15:05:39] (03CR) 10Ottomata: [C: 031] Add a new configuration file specific for Varnish 4 [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/280452 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [15:06:14] mobrovac: we should not be having alerts that trigger due to things that are out of our control failing [15:06:14] akosiaris: Yeah, that too [15:06:23] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:06:53] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:07:09] so, /me get's nervous, when he has to swat something, and the elukey is uploadins a lot of patches, or some review them :D [15:07:25] mobrovac: of course some exceptions can exist. Things like "hey is basic internet connectivity working ? Can I fetch google.com ?" [15:07:38] akosiaris: the problem with this check is that it also ensures citoid respond appropriately to a non-existent PMCID [15:07:59] mobrovac, that sounds more like a unit test. [15:08:02] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [15:08:12] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:08:16] akosiaris: and the req flow is such that if all fails, ask the gov DB [15:08:19] mobrovac, you could do a unit test mocking responses from upstream (502, 404), and check how Citoid handles it. [15:10:35] matt_flaschen: that govt db is used throughout citoid, so we'd effectively need to remove almost all monitoring checks [15:10:43] mobrovac: what would be funny is if suddenly PMCID=9999999 starts existing ... [15:11:14] * mobrovac eyes akosiaris for adding oil on the fire [15:11:29] * akosiaris just doing what he does best ;-) [15:12:22] but seriously, an inexistent PMCID is not the same check as gov db failing [15:12:31] mobrovac, well, you said that particular DB is a fallback, right? 
So you could monitor some tests which depend on externals that are more reliable, plus any against Citoid not depending on externals (e.g. invalid URL, cached data, etc.). Don't know, I'm just floating ideas. [15:12:48] akosiaris: the flapping notwithstanding, i'd like to proceed in updating service-runner for citoid, these useless logs are killing me [15:13:05] mobrovac: now that we know the problem, I am fine with that [15:13:16] (03CR) 10Ema: [C: 031] Add a new configuration file specific for Varnish 4 [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/280452 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [15:13:36] mafk: so it should be fine to +2 stuff to merge to portals master. Then for SWAT, I'd just need a patch for mediawiki-config that bumps the portals submodule to the head of master. [15:14:05] I +2'd the portals/master patch, but I'll have to make a patch for mediawiki-config once that merges to deploy. [15:14:42] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:14:47] I didn't know about that submodule stuff [15:15:06] thcipriani, the V-1 on https://gerrit.wikimedia.org/r/280369 is a spurious failure. [15:15:11] 15:04:08 npm ERR! enoent ENOENT, open '/mnt/jenkins-workspace/workspace/mwext-qunit/src/node_modules/karma/node_modules/socket.io/node_modules/socket.io-client/node_modules/engine.io-client/node_modules/component-inherit/package.json' is a spurious failure [15:15:12] things were easier handling this at Meta :) [15:15:16] thcipriani, should I re- [15:15:20] Should I re-+2 it? [15:15:22] matt_flaschen: yeah, i'll try to find some alternatives if possible [15:15:25] matt_flaschen: yeah, I saw that :(, please do. 
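[Editor's note] The suggestion above — mock the upstream's 502/404 responses in a unit test and assert on how citoid handles each — might look like the following with Python's `unittest.mock`. `handle_upstream` is a hypothetical stand-in for the service's response handling, not actual citoid code:

```python
import unittest
from unittest import mock

def handle_upstream(fetch):
    """Hypothetical handler: turn an upstream HTTP status into our own."""
    status = fetch()
    if status == 404:
        return 404   # propagate the clean miss to the caller
    if status >= 500:
        return 503   # shield callers from raw upstream errors
    return 200

class UpstreamHandlingTest(unittest.TestCase):
    def test_upstream_404_propagates(self):
        # A nonexistent ID should surface as a clean 404, not an error.
        self.assertEqual(handle_upstream(mock.Mock(return_value=404)), 404)

    def test_upstream_502_becomes_503(self):
        # An upstream proxy failure should map to our own 503.
        self.assertEqual(handle_upstream(mock.Mock(return_value=502)), 503)
```

Tests like these exercise the failure paths deterministically, whereas the production monitoring check can only observe whichever failure the external site happens to produce.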
[15:15:43] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2158668 (10hashar) Seems to be an issue with Gerrit queue: ``` ssh -p 29418 gerrit.wikimedia.org 'gerrit show-queue -w' Task State StartTime Command ----------------------------... [15:15:45] (03CR) 10Alexandros Kosiaris: [C: 032] Scap: Bump version to 3.1.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/280413 (https://phabricator.wikimedia.org/T130902) (owner: 10Mobrovac) [15:15:50] (03PS2) 10Alexandros Kosiaris: Scap: Bump version to 3.1.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/280413 (https://phabricator.wikimedia.org/T130902) (owner: 10Mobrovac) [15:15:57] (03CR) 10Alexandros Kosiaris: [V: 032] Scap: Bump version to 3.1.0-1 [puppet] - 10https://gerrit.wikimedia.org/r/280413 (https://phabricator.wikimedia.org/T130902) (owner: 10Mobrovac) [15:16:13] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:16:33] (03PS2) 10Muehlenhoff: Detect outdated package versions and flag an error [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/280453 [15:16:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Detect outdated package versions and flag an error [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/280453 (owner: 10Muehlenhoff) [15:17:07] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160947 (10hashar) From the Gerrit 2.8.4 release notes @paladox pointed: > Fix mail thread getting stuck when waiting for response from SMTP server. So looks like whenever the SMTP relay flap fo... [15:17:42] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160949 (10Paladox) @hashar would restarting gerrit fix that. [15:18:39] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160951 (10demon) >>! 
In T131189#2160949, @Paladox wrote: > @hashar would restarting gerrit fix that. No, it would flush the queue. [15:18:49] (03CR) 10Elukey: [C: 032] Add a new configuration file specific for Varnish 4 [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/280452 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [15:19:13] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:19:23] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:19:23] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2158668 (10akosiaris) >>! In T131189#2160951, @demon wrote: >>>! In T131189#2160949, @Paladox wrote: >> @hashar would restarting gerrit fix that. > > No, it would flush the queue. Flush or drop... [15:20:03] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:20:14] !log thcipriani@tin Synchronized php-1.27.0-wmf.19/extensions/Echo/modules/ooui/mw.echo.ui.NotificationBadgeWidget.js: SWAT: Change threshold for survey invitation from 2 unread notifs to 1 [[gerrit:280370]] (duration: 00m 28s) [15:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:23] ^ matt_flaschen check please [15:21:04] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Puppet has 1 failures [15:21:36] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160963 (10Paladox) https://code.google.com/p/gerrit/issues/detail?id=1528 [15:24:01] (03PS1) 10Elukey: Update the varnishkafka module to add a Varnish 4 compatible configuration template. 
[puppet] - 10https://gerrit.wikimedia.org/r/280455 (https://phabricator.wikimedia.org/T124278) [15:24:51] (03PS7) 10Andrew Bogott: Stop using old labs-ns0 and labs-ns1, move ns2/ns3 to ns0/ns1 [puppet] - 10https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) [15:25:12] RECOVERY - check_puppetrun on rigel is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [15:25:12] thcipriani, I checked it deployed properly. Figuring out a semi-realistic way to test now. [15:25:25] (03CR) 10Elukey: [C: 032] Update the varnishkafka module to add a Varnish 4 compatible configuration template. [puppet] - 10https://gerrit.wikimedia.org/r/280455 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [15:25:42] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:25:55] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160966 (10Paladox) @demon do you know when we can upgrade to gerrit 2.12. Also if we do restart the problem will come back after a fews per the link above says. So we carn't keep restarting gerr... 
[15:26:02] PROBLEM - Varnishkafka log producer on cp1043 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:26:03] matt_flaschen: kk, I'll keep going with deploys, lemme know if anything seems amiss that I can help with :) [15:26:08] (03CR) 10Andrew Bogott: [C: 032] Stop using old labs-ns0 and labs-ns1, move ns2/ns3 to ns0/ns1 [puppet] - 10https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) (owner: 10Andrew Bogott) [15:26:13] (03PS8) 10Andrew Bogott: Stop using old labs-ns0 and labs-ns1, move ns2/ns3 to ns0/ns1 [puppet] - 10https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) [15:26:31] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 323 bytes in 0.039 second response time [15:26:41] !log thcipriani@tin Synchronized php-1.27.0-wmf.19/extensions/ContentTranslation/modules/publish/ext.cx.publish.js: SWAT: Try to avoid JS error [[gerrit:280389]] (duration: 00m 29s) [15:26:43] (03CR) 10Andrew Bogott: [V: 032] Stop using old labs-ns0 and labs-ns1, move ns2/ns3 to ns0/ns1 [puppet] - 10https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) (owner: 10Andrew Bogott) [15:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:48] ^ kart_ check please [15:27:12] andrewbogott: ^ you? [15:27:33] andrewbogott: we have seen that tools home page alert before on dns issues, did it come from the above ns0/1/2/3 change? [15:27:35] godog: it's not something I'm doing but I think I have a fix [15:27:49] kk, thanks! [15:27:54] I'm gonig to silence it for a bit [15:28:04] should be fixed now actually [15:28:27] thcipriani: thanks. 
[15:28:33] some ttl just now expired and I didn't realize that my patch needed to be in before that [15:28:40] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 806892 bytes in 3.600 second response time [15:28:40] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [15:29:22] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:29:25] thcipriani, yeah, it worked on testwiki. I just mocked the edit count part. [15:29:28] akosiaris: fyi, the govt db is back in business ^ [15:29:34] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:29:40] matt_flaschen: awesome, thank you for checking! [15:30:12] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [15:30:44] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:32:34] !log thcipriani@tin Synchronized php-1.27.0-wmf.19/extensions/Translate/resources/js/ext.translate.editor.js: SWAT: Fix regressions in insertables placement [[gerrit:280414]] (duration: 00m 27s) [15:32:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:39] ^ kart_ check please [15:34:21] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 323 bytes in 0.013 second response time [15:34:49] Nikerabbit: Translate fix check :) [15:35:33] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [15:35:57] thcipriani: checking.. [15:36:24] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160973 (10demon) >>! In T131189#2160954, @akosiaris wrote: >>>! In T131189#2160951, @demon wrote: >>>>! In T131189#2160949, @Paladox wrote: >>> @hashar would restarting gerrit fix that. >> >> No... 
[15:36:48] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160975 (10hashar) >>! In T131189#2160966, @Paladox wrote: > @demon do you know when we can upgrade to gerrit 2.12. > > Also if we do restart the problem will come back after a fews per the link... [15:37:02] kart_: checking [15:38:01] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160979 (10Paladox) @hashar upgrading gerrit will fix the problem. But yes that question should have been asked on the other task. Would that task block this then. Since the only work around is t... [15:38:25] Nikerabbit: seems good on mw.org [15:38:46] kart_: looks bad on meta though [15:39:07] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160985 (10demon) >>! In T131189#2160979, @Paladox wrote: > Would that task block this then. Since the only work around is to restart gerrit but that is a horrible thing todo. That's not the only... [15:39:39] Nikerabbit: https://meta.wikimedia.org/wiki/Special:Version meta is on wmf18 [15:39:58] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160988 (10Paladox) >>! In T131189#2160985, @demon wrote: >>>! In T131189#2160979, @Paladox wrote: >> Would that task block this then. Since the only work around is to restart gerrit but that is a... [15:40:03] only testwiki(s) and mw.org on wmf19. [15:40:09] wikimedia/operations-apache-config created by wmfphab [15:40:09] https://github.com/wikimedia/operations-apache-config [15:40:14] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2160989 (10Paladox) [15:40:15] was that deliberate ostriches? [15:40:16] kart_: right, mw is good [15:40:26] meta will get the fix soon anyway [15:40:27] thcipriani: we're good. Thanks! 
[15:40:34] Nikerabbit: tonight :) [15:40:39] Krenair: Yes, because it was exploding on replication from Gerrit and spamming the hell out of my logs. [15:40:40] kart_: Nikerabbit cool, thanks for checking [15:40:46] k [15:40:59] :) [15:41:31] !log thcipriani@tin Synchronized php-1.27.0-wmf.18/extensions/Echo/modules/ooui/mw.echo.ui.NotificationBadgeWidget.js: SWAT: Change threshold for survey invitation from 2 unread notifs to 1 [[gerrit:280369]] (duration: 00m 28s) [15:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:36] ^ matt_flaschen check please [15:43:05] mafk: so it looks like there are 6 new commits in portals that aren't live, does this look right? https://phabricator.wikimedia.org/P2834 [15:43:53] thcipriani: I can't reply to changes which ain't mine... maybe jgirault can [15:44:02] akosiaris: So, I think I've figured out what's happened with Gerrit but I'm unsure on the best fix. Basically SMTP hiccuped, we hit a known bug in Gerrit (to be fixed in next upgrade). I killed the smtp job that caused the backup, but the queue won't continue processing because the thread still has the open socket to mx1001. Restarting gerrit will *work*, but I [15:44:02] stupidly cannot remember what that does to currently queued jobs (block shutdown? kill them all? hold until restart?). Is there a way to kill that particular open socket? [15:44:02] lately he's been taking care of portals [15:44:18] And if not: do we just declare those ~2300 emails lost and restart anyway? [15:44:25] *potentially lost [15:45:03] RECOVERY - Varnishkafka log producer on cp1043 is OK: PROCS OK: 1 process with command name varnishkafka [15:45:35] ostriches: can one thread dump the tasks ?
:D [15:45:43] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:45:44] no clue what magic can be achieved with the jvm :D [15:46:00] hashar: I have no idea what you mean :P [15:46:24] hook in the jvm to inspect the threads and collect their data (i.e. email payload) [15:46:49] What would that gain us short of seeing what e-mails were blocked? [15:48:24] ostriches: so I am trying this https://www.box.com/blog/remember-unix-runs-under-your-jvm/ [15:48:40] the idea is to call the close(2) system call on the socket for force gerrit to reopen it [15:48:41] thcipriani, works, tested the same way, on enwiki this time. [15:48:49] to force* [15:48:52] matt_flaschen: cool, thanks! [15:49:01] I got 3 smtp sockets listening [15:49:05] er open [15:49:29] Yeah that's what I saw poking around with lsof -i [15:49:32] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:49:33] sudo lsof -p 7834 | grep smtp [15:49:38] 229, 251, 461 [15:49:55] problem is, despite loading the debugging symbols, I get [15:49:56] No symbol "close" in current context. [15:50:03] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:50:06] and I am thinking this is because on gdb attach I get [15:50:13] Attaching to process 7834 [15:50:13] /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java (deleted): No such file or directory. [15:50:23] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:50:35] roughly java got upgraded, gerrit not restarted and now I am missing the file :( [15:50:46] still trying to figure it out [15:51:24] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2160997 (10GWicke) tl;dr: We don't have enough spares to temporarily stand up an extra node with 5T storage. 
[15:52:02] akosiaris: gdb -exec=`which java` -p ... ? [15:52:27] problem is I have no idea what java version we were running on the last gerrit restart... and I got 13 candidates [15:52:42] ori: will that have the exact same symbols ? [15:52:47] I can try .. let's see [15:52:58] (03PS1) 10Thcipriani: Bump portals [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280456 [15:53:38] mafk: ^ those are all the changes to portals that are undeployed [15:53:42] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [15:53:53] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:53:55] thcipriani: I see just a SHA-1 diff? [15:54:09] I wasn't aware we had to mirror the changes there [15:54:42] ideally, changes merged on wikimedia/portals should be mirrored auto by a bot or something [15:54:53] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [15:55:01] like translation updates from translatewiki [15:55:24] Causes problems on deployment hosts when stuff isn't deployed straight away [15:55:36] ori: ok that helped, but turns out that [15:55:41] (gdb) call close(251) [15:55:41] Entry point address is not known. [15:55:45] sigh ... [15:56:22] mafk: the sha1 diff is all you'll see for a submodule update, if you do: git diff fea616633..4c670a7e4 or git diff fea616633..HEAD in the portals repo, you'll see all the changes [15:56:44] apt-get download older java version, dpkg -x , then gdb -exec /path/to/extracted/package/files? [15:57:04] At this point I'm more inclined to actually just restart gerrit... [15:57:09] thcipriani: I'd not merge. I'm not sure the changes which ain't mine are good or not [15:57:28] please add jgirault and mxn as reviewers so they can check [15:57:31] ori: question is.. which of the 13 versions ? brute force approach ? [15:57:40] well... let's see [15:57:42] mafk: sounds good. Will do.
thanks :) [15:57:42] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: puppet fail [15:58:04] thcipriani: thanks, also, please add me to the patch as well [15:58:10] yep [15:58:10] as reviewer I mean [15:58:14] so I can stay updated [15:58:20] what a mess :S [15:58:23] akosiaris: Last restart was Mar11 [15:58:37] what are you trying to do? [15:58:41] ostriches: that helps! [15:58:49] -rw-r--r-- 1 root root 41012740 Feb 1 20:14 /var/cache/apt/archives/openjdk-7-jre-headless_7u95-2.6.4-0ubuntu0.12.04.1_amd64.deb [15:58:54] that might be our guy [15:59:01] er sorry entity .. [15:59:05] well something PC anyway [16:00:04] "package"? :) [16:00:35] (03CR) 10Thcipriani: "This bumps portals to the current head of master inside mediawiki-config. portals has the following changes: https://phabricator.wikimedia" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280456 (owner: 10Thcipriani) [16:00:42] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 806771 bytes in 8.875 second response time [16:03:39] (03CR) 10MarcoAurelio: "L1 and L2 on that Paste linked above are mine, but I can't tell about the others. I suggest to halt merging this until we can hear from JGi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280456 (owner: 10Thcipriani) [16:05:06] (03PS2) 10Ema: Install VMODs on Varnish 4 instances [puppet] - 10https://gerrit.wikimedia.org/r/279617 [16:05:29] (03PS1) 10Elukey: Add correct varnishkafka configuration files for Varnish 4 servers. [puppet] - 10https://gerrit.wikimedia.org/r/280459 (https://phabricator.wikimedia.org/T124278) [16:05:29] ostriches: success!!!! [16:05:42] ori: thanks! you really helped ;-) [16:06:10] gerrit queue is down to 2318 objects and dropping [16:06:37] (03CR) 10jenkins-bot: [V: 04-1] Add correct varnishkafka configuration files for Varnish 4 servers.
[puppet] - 10https://gerrit.wikimedia.org/r/280459 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [16:08:00] here come a bunch of emails [16:08:12] akosiaris: AWESOME! [16:08:24] akosiaris: Thanks for fixing it. I've got gerrit mail now. [16:08:36] I'll update the task [16:09:01] 6Operations, 10Gerrit, 10Mail: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2161050 (10Paladox) @akosiaris fixed the problem. Ive now got gerrit mail. [16:10:26] (03CR) 10Elukey: [C: 04-1] Add correct varnishkafka configuration files for Varnish 4 servers. [puppet] - 10https://gerrit.wikimedia.org/r/280459 (https://phabricator.wikimedia.org/T124278) [16:10:44] 6Operations, 10Gerrit, 10Mail, 7Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2161052 (10demon) p:5Unbreak!>3High So this ultimately was the upstream SMTP issue. @akosiaris was able to attach to the process to drop the open socket and e-mails should now be... [16:12:47] akosiaris: This is probably the most annoying bug I've spotted in what has been an otherwise incredibly stable 2.8.x branch of Gerrit. $reasons_to_upgrade++ [16:12:55] 6Operations, 10Gerrit, 10Mail, 7Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2161055 (10akosiaris) So for posterity's sake here's a transcript of what I did. Inspired by https://www.box.com/blog/remember-unix-runs-under-your-jvm/ (an acquaintance of mine), I... [16:13:05] (03CR) 10MarcoAurelio: "check experimental" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280445 (https://phabricator.wikimedia.org/T131247) (owner: 10MarcoAurelio) [16:13:06] I've been working on that this week tho [16:13:08] So, soon [16:13:53] And I'll do a gerrit restart today too, once the queue goes away [16:14:24] ostriches: Thanks [16:14:43] ostriches: cool.
[16:14:53] (03CR) 10MarcoAurelio: "check zend" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280445 (https://phabricator.wikimedia.org/T131247) (owner: 10MarcoAurelio) [16:15:18] ostriches: ah now that you are here, I need a new repo for ores. should I create it in gerrit or directly in diffusion ? (mind you, I lack the rights to create a new repo in diffusion) [16:16:02] Lemme fix the latter right now [16:16:45] (that's granted by the viral group Repo-Administrators. Any existing member can add new members) [16:17:18] And if you're looking at trying out Diffusion, we can definitely do that :) [16:17:41] Diffusion sounds fine to me [16:17:43] :-) [16:19:20] akosiaris: You should be able to use https://phabricator.wikimedia.org/diffusion/create/ now [16:19:25] thanks akosiaris and ostriches! [16:19:39] akosiaris: ! tell me about ores + wheels deployment! [16:19:40] so magically fixed ? :D [16:19:52] how you do? [16:19:59] 6Operations, 10Ops-Access-Requests: Grant reedy access to librenms - https://phabricator.wikimedia.org/T131252#2161064 (10Reedy) [16:20:02] 6Operations: Security audit for tftp on Carbon - https://phabricator.wikimedia.org/T122210#1898315 (10mark) What exactly is considered problematic about tftp being open? :) [16:20:03] or, should I wait til yuvi wakes up and ask him? [16:21:08] akosiaris: did you get the Gerrit issue fixed by closing a stale file descriptor with gdb? [16:21:16] (03PS2) 10Elukey: Add correct varnishkafka configuration files for Varnish 4 servers. [puppet] - 10https://gerrit.wikimedia.org/r/280459 (https://phabricator.wikimedia.org/T124278) [16:21:22] hashar: yup [16:21:34] ostriches: thanks! [16:21:35] * hashar awards Hacker badge to akosiaris [16:22:28] I was wondering if we could send a forged IP/TCP packet to close the connection [16:22:42] queue empty\O/ kudos [16:22:43] (03PS3) 10Elukey: Add correct varnishkafka configuration files for Varnish 4 servers.
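The trick behind the fix discussed above — invoking close(2) on a stuck file descriptor from outside the program so the owner is forced to reconnect — can be sketched in miniature. This is a hedged illustration only: a plain Python process stands in for the JVM, and os.close stands in for gdb's `call close(fd)` against the attached process.

```python
import errno
import os
import socket

# A process holding an open socket, standing in for gerrit's stuck
# SMTP thread.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
fd = s.fileno()            # the number you would spot with `lsof -p PID`

# What `(gdb) call close(251)` did from outside the JVM: invoke the
# close(2) system call on the raw descriptor.
os.close(fd)

# The higher-level object now holds a dead descriptor; its next use
# fails with EBADF, forcing the owner to reopen the connection.
try:
    s.getsockname()
    failure = None
except OSError as e:
    failure = e.errno

print(failure == errno.EBADF)
s.detach()   # keep the socket object from double-closing the stale fd
```

In the real incident the fd numbers (229, 251, 461) came from `sudo lsof -p 7834 | grep smtp`, and the call happened inside gdb attached to the gerrit JVM.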
[puppet] - 10https://gerrit.wikimedia.org/r/280459 (https://phabricator.wikimedia.org/T124278) [16:23:24] ottomata: still figuring that out fully. The idea is to deploy a repo with wheels (and git-fat possibly) to the hosts and then get a venv installed using those wheels. The wheels idea was yuvipanda's in order to avoid the non relocatable virtualenvs issue [16:23:49] hashar: no, I tried the TCP close approach by restarting exim on mx1001 [16:23:58] akosiaris: i am thinking about how to fix our eventlogging deployment, and use scap3 in the process [16:24:00] so i'm thinking about this a lot [16:24:08] currently all our deps are satisfied by deb packages that I maintain [16:24:14] trying to decide if we should change that [16:24:29] i was thinking of doing virtualenv in a deploy/ repo, similar to the way services does the node_modules thing [16:24:39] but, i'm not v familiar with virtualenv, and i just heard of wheels like yesterday [16:24:41] !log gerrit: restarting to clean up mismatched jvm versions [16:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:24:48] akosiaris: ^ [16:24:50] what's the non relocatable virtualenvs issue? compiled dependencies? [16:24:58] ostriches: cool! [16:25:10] ottomata: paths are hardcoded basically in virtualenvs [16:25:23] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [16:25:45] hm, yes, but if you activate a virtualenv, that is taken care of? [16:26:06] activate it on what ? the host you built it on ? another host ? another path ? [16:26:25] another host after it is deployed there [16:26:28] the problem is if you build a venv and move it around, chances are it might not work [16:26:31] oh, i see, you are saying its built locally [16:26:35] and deployed to a different location [16:26:36] huh. [16:26:40] hm [16:26:53] Hey, anyone know how can I login via deploy-service (in beta) to test?
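The "paths are hardcoded in virtualenvs" point from the exchange above is easy to verify: a freshly created venv bakes the absolute location of the interpreter that built it into `pyvenv.cfg` (and into the shebangs of its `bin/` scripts), so the directory can't simply be moved to another path or host. A minimal demonstration:

```python
import os
import tempfile
import venv

# Build a bare venv (with_pip=False keeps this fully offline).
vdir = os.path.join(tempfile.mkdtemp(), "venv")
venv.create(vdir, with_pip=False)

# pyvenv.cfg records the absolute path of the creating interpreter;
# relocating the venv invalidates it, which is why the ores setup
# re-creates the venv from shipped wheels on each target host rather
# than copying a venv around.
with open(os.path.join(vdir, "pyvenv.cfg")) as f:
    cfg = f.read()
print(cfg)   # includes a line like: home = <absolute path>
```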
[16:26:54] wheels are a binary format that is more universal and does not suffer from that [16:26:56] i haven't gotten far enough with this yet to have run into that, but ja that would be a problem [16:27:12] https://wikitech.wikimedia.org/wiki/Keyholder [16:27:23] this doesn't give any good idea how to do it [16:28:26] Amir1: scap3 is supposed to do that for you. trying to debug something ? [16:28:28] I tried "keyholder arm" and arming the deployservice user and then logging in [16:28:35] akosiaris: is there some wheels code i can look at somewhere? [16:28:49] akosiaris: yup, something works on my username but puppet can't do it [16:29:03] Amir1: if you are a project admin, you can su - deploy-service on the node and then do whatever [16:29:08] ottomata: not sure I follow... [16:29:29] what do you mean wheels code ? all you do is python setup.py bdist_wheel to generate a wheel [16:29:39] instead of python setup.py sdist [16:29:51] Amir1: hm, there's something wrong in that statement - puppet runs as root, so if you can do it, it can do it too for sure [16:30:06] well, actually you can do both. python setup.py bdist_wheel sdist upload --sign is my usual incantation [16:30:36] mobrovac: you're right but puppet gives me this [16:30:50] akosiaris: i guess i meant deployment setup [16:31:13] "sudo puppet agent --test --verbose [16:31:30] akosiaris: thanks (re: restart exim to trigger tcp close ) :-} [16:31:37] like, how what's the process yall are doing for ores in prod [16:31:48] Error: Execution of '/usr/bin/deploy-local --repo ores/deploy -D log_json:False' returned 70: [16:31:48] Error: /Stage[main]/Ores::Base/Scap::Target[ores/deploy]/Package[ores/deploy]/ensure: change from absent to present failed: [16:31:53] instructions, wrapper script, release process, etc. 
[16:32:05] ottomata: ah, the idea is an after-sync command in scap doing a pip install (that will only use those wheels and not reach out to the internet etc) [16:32:13] but that exact command does work (even without sudo) [16:32:17] ottomata: the moment we got something working I 'll let you know [16:32:35] Amir1: ah! you have to go onto deployment-tin and issue a deploy from /srv/deployment/ores/deploy [16:32:43] HM [16:32:44] Amir1: it's a work-around for scap basically [16:32:49] pip install to where? global? [16:32:59] or some local contained dir (virtualenv?) [16:33:04] Amir1: then you can run puppet on the node again after that [16:33:09] ottomata: hey, no, we make a venv [16:33:24] let me get configs for you [16:33:30] ok, so venv/bin/pip install using only provided wheels somehow [16:33:51] https://github.com/wiki-ai/ores-wikimedia-config/blob/master/scap [16:34:04] check cmd.sh [16:34:18] (or the fabfile in root) [16:34:36] mobrovac: thanks, let me check that [16:34:38] HUH [16:34:40] interesting [16:35:28] where do the submodules/wheels/*.whl come from? [16:35:54] it's a submodule [16:36:07] in research/ores/wheels.git (in gerrit) [16:36:11] ottomata, right now, we build those wheels on ores-compute-01.eqiad.wmflabs [16:36:23] Which matches our labs env (Jessie 8.3) [16:36:24] o/ halfak [16:36:26] o/ [16:36:35] * halfak watches Amir1 getting hammed with questions :) [16:36:43] hehe [16:36:46] *hammered [16:36:57] ok so, lemme give my understanding and you can correct me [16:37:05] 1. dependencies are packaged in wheels and committed to repo [16:37:16] yes [16:37:16] that repo is deployed along with source code to targets [16:37:22] yes [16:37:39] during deploy after checkout, venv/bin/pip install with wheels [16:38:00] and services are started using that venv [16:38:46] Amir1: how do --use-wheel and --no-deps work together [16:38:49] yup [16:38:51] are the wheels not actually copied into the venv?
[16:38:59] they should work together [16:39:14] OH, i see [16:39:16] because use-wheels automatically tries to download the deps [16:39:23] that is installing each of the wheels into the venv [16:39:33] ah, it will try to download even if it has the wheels locally? [16:39:35] so you need to build wheels for all deps (recursively) too [16:39:36] without --no-deps? [16:39:43] aye [16:40:13] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [16:40:15] it doesn't download anything if you use --use-wheel and --no-deps [16:40:18] ok [16:40:46] ok, so you don't pip install your source code into the venv though, right? [16:40:48] just the deps? [16:40:54] mobrovac: gave me permission error (public key) which means public key of deploy_user isn't added [16:41:06] keyholder arm worked but still the same error here [16:41:29] ottomata: we don't [16:41:37] do you set PYTHONPATH=/path/to/src [16:41:40] and activate the venv? [16:42:01] but we can package and do that (we have some deps that we package and install using wheels) [16:42:29] but for the ores package (not revscoring) I think we do, halfak knows this part better [16:44:25] hm [16:44:43] Amir1: do you have a betacluster-specific env in your deploy repo's scap/ dir? [16:45:01] Amir1: i don't fully understand your last statement [16:45:06] "but we can package and do that (we have some deps that we package and install using wheels)" [16:45:09] mobrovac: no [16:45:13] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 211 seconds ago with 0 failures [16:45:16] * halfak sees ping and reads [16:45:29] ottomata: ores is not just one package, it's several packages [16:45:41] the main one, I think it's not being installed [16:45:43] package == repository here?
[16:45:50] yup [16:45:53] (ores) [16:45:54] ok [16:45:56] OK So, 'ores' is a submodule in our deploy/config repository [16:46:04] Amir1: so you'll need one :P you get pubkey denied because it is trying to connect from beta to prod, which is a no-go [16:46:10] But we package up 'revscoring' as a wheel when deploying [16:46:16] but their deps like revscoring are being packaged, wheeled, and installed [16:46:18] revscoring is a dependency of ores? [16:46:21] Yes [16:46:22] ok [16:46:34] then when running code out of ores source [16:46:36] mobrovac: oh, we do that [16:46:38] how do you prep env? [16:46:46] you've got a venv with all deps [16:46:46] ottomata, yeah. that's right. [16:46:53] including ones you've written and packaged [16:47:00] with an active venv [16:47:06] targets are okay: https://github.com/wiki-ai/ores-wikimedia-config/blob/master/scap [16:47:09] mobrovac: ^ [16:47:09] the ores source won't be in PYTHONPATH, right? [16:47:18] because the ores code is not installed into the venv [16:47:24] ottomata, it's in the directory our uwsgi executes from [16:47:31] So it is in the implied context of "." [16:47:38] ah [16:47:39] ah! [16:47:40] there's actually a symlink to the submodule dir [16:47:40] ok. [16:47:41] hm [16:47:52] ok cool [16:47:54] So that "ores" --> "submodules/ores/ores" [16:47:56] 100% understood, i like this. [16:48:06] :) [16:48:07] i was trying to figure out how to do this with just virtualenvs, and it was not working [16:48:18] i like that you never pip install your source, even with -e [16:48:21] that was getting confusing [16:48:44] Yeah. It is a pain to maintain an ores wheel when we want to do minor changes. Much easier to manage the submodule. :) [16:49:01] But we still do produce a wheel for pypi :) [16:49:13] And do semantic versioning [16:49:37] indeed [16:49:41] ok [16:49:52] hmmm [16:49:59] Amir1: does running deploy-log --latest reveal something interesting?
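The deploy flow described above — dependency wheels shipped in the repo, then installed into a venv with no network access — can be shown end to end in miniature. Everything here is invented for illustration ("demopkg", all paths): a wheel is just a zip with metadata, so we hand-roll one and install it the way the scap hook would, using `--no-index --find-links` (the modern spelling of the offline guarantee; the log's `--use-wheel` flag belongs to older pip).

```python
import os
import subprocess
import sys
import tempfile
import zipfile

# Hand-roll a tiny wheel so the example needs no network at all.
base = tempfile.mkdtemp()
wheelhouse = os.path.join(base, "wheels")
os.makedirs(wheelhouse)
whl = os.path.join(wheelhouse, "demopkg-1.0-py3-none-any.whl")
with zipfile.ZipFile(whl, "w") as z:
    z.writestr("demopkg/__init__.py", "VALUE = 42\n")
    z.writestr("demopkg-1.0.dist-info/METADATA",
               "Metadata-Version: 2.1\nName: demopkg\nVersion: 1.0\n")
    z.writestr("demopkg-1.0.dist-info/WHEEL",
               "Wheel-Version: 1.0\nGenerator: by-hand\n"
               "Root-Is-Purelib: true\nTag: py3-none-any\n")
    z.writestr("demopkg-1.0.dist-info/RECORD", "")

# Deploy-time step: install strictly from the local wheelhouse,
# never reaching out to PyPI (--no-index).
target = os.path.join(base, "site")
subprocess.check_call([
    sys.executable, "-m", "pip", "install", "--quiet",
    "--no-index", "--find-links", wheelhouse,
    "--target", target, "demopkg",
])

sys.path.insert(0, target)
import demopkg
print(demopkg.VALUE)
```

The real setup installs into a venv (`venv/bin/pip install ...`) rather than a `--target` directory, and runs the service source from "." beside the venv instead of installing it.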
[16:50:07] (when run form deployment-tin) [16:50:12] s/form/from/ [16:50:13] PROBLEM - check_puppetrun on alnilam is CRITICAL: CRITICAL: Puppet has 1 failures [16:50:29] akosiaris: q for you then [16:50:41] at the moment, most eventlogging dependencies are already deb packages [16:50:56] there are 2 that i didn't build, and are only available in jessie-backports [16:51:03] hmm, actually maybe just one [16:51:14] (tornado) its used for eventlogging-service eventbus [16:51:23] that's the only eventlogging daemon that runs in jessie right now [16:51:40] the wheels solution isn't that different than using deb packages [16:51:54] you have to prebuild all deps either way [16:52:28] eventlogging is currently deployed on trusty with a git deploy global pip install, i want to use scap and avoid the global pip install [16:52:58] do you think I should look into this wheels route, or just continue to rely on deb packages for the time being, seeing as most of the deps are already debs [16:53:29] longterm ? yes. short term ? wait a bit. [16:53:47] medium term ? yes again I think. but it's difficult to define what medium-term is [16:54:17] all in all, yes eventlogging should go down that way, but wait a bit till we have fully fleshed it out [16:54:36] let me check [16:55:04] mobrovac: same thing: 16:38:38 [deployment-tin] [u'/usr/bin/deploy-local', u'-v', u'--repo', u'ores/deploy', u'-g', u'flower', u'fetch'] on deployment-ores-web.deployment-prep.eqiad.wmflabs returned [255]: Permission denied (publickey). [16:55:17] let me check my ssh agent to see if the key is added or not [16:55:36] ok akosiaris in that case, i will stick with .deb packages for now [16:57:28] 6Operations, 10Gerrit, 10Mail, 7Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2158668 (10Krenair) >>! In T131189#2161052, @demon wrote: > there's a ~2300 email backlog as of writing A disproportionately high percentage of which has now reached my inbox... show... 
[17:00:41] 6Operations: Reimage hooft with jessie and rename to bast3001 - https://phabricator.wikimedia.org/T123712#1936351 (10Dzahn) Once we drop the .esams. from hooft, what would we do with these? ``` ; esams Service aliases ; FIXME: all 3 of these are suspect, we don't do these per-subdomain elsewhere.. puppet.esams... [17:05:41] mobrovac: it seems keyholder doesn't add those keys to the agent [17:05:41] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [17:05:41] RECOVERY - check_puppetrun on alnilam is OK: OK: Puppet is currently enabled, last run 122 seconds ago with 0 failures [17:05:41] anyone with keyholder experience? [17:09:39] grrrit-wm stopped talking due to gerrit restart around 9.30 i think [17:10:09] PST. 50 min ago [17:10:16] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [17:10:20] 6Operations, 10hardware-requests: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2161254 (10Cmjohnson) [17:10:22] 6Operations, 10ops-eqiad: Update physical hostname labels for analytics 1017 & 1021 to read new hostnames notebook 1001 & 1002 - https://phabricator.wikimedia.org/T131216#2161251 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson Completed labels and racktables update analytics1017 => notebook1001 analytics1021... [17:10:46] 6Operations, 10ops-eqiad: Update the visible label field in racktables for analytics 1017 & 1021 to notebook 1001 & 1002 - https://phabricator.wikimedia.org/T131217#2161258 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson [17:10:48] 6Operations, 10hardware-requests: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2145652 (10Cmjohnson) [17:12:54] mutante, can only yuvipanda poke that at the moment? 
[17:12:55] 6Operations, 10hardware-requests: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760#2161280 (10Cmjohnson) [17:12:57] 6Operations, 10ops-eqiad, 10netops: Update the port description on the network switch for analytics 1017 and 1021 - https://phabricator.wikimedia.org/T131218#2161277 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson updated switches [17:13:16] Amir1, I think ori has keyholder experience :) [17:13:34] oh thanks Krenair [17:13:58] have you done deployment in tin in beta? [17:14:03] yeah [17:14:37] with scap [17:14:37] it gives me permission error when I use "deploy-service" [17:14:44] public_key [17:14:51] never successfully deployed with trebuchet [17:15:02] https://www.irccloud.com/pastebin/O3mlPhDw/ [17:15:09] it's scap ^ [17:15:16] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 206 seconds ago with 0 failures [17:15:32] what is that? [17:15:33] config: https://github.com/wiki-ai/ores-wikimedia-config/blob/master/scap/scap.cfg [17:15:37] scap3? [17:15:38] Amir1: ah, you probably need to be added to the deploy-service group if you're trying to use the deploy-service user [17:15:43] yeah [17:15:53] you may need to sudo -u [17:15:57] maybe jenkins-deploy or something [17:16:03] thcipriani: hey, thanks [17:16:16] can you add me if possible? [17:16:25] Krenair: doesn't work, tried that [17:16:26] in deployment-prep? you should have full sudo as root [17:16:46] Krenair: not sure, but the docs include " [17:16:52] "sudo su yuvipanda" [17:17:03] yeah, I have but doesn't work [17:17:08] i suppose i can as root [17:17:10] let me get you result [17:19:07] https://www.irccloud.com/pastebin/p56z0QRf/ [17:19:11] Krenair: ^ [17:19:34] thcipriani: btw. I'm doing it in deployment-prep not prod [17:20:10] sudo -u deploy-service -i [17:20:13] then run the command?
[17:20:45] oh, it probably doesn't like that you're in ~ladsgroup [17:21:02] Amir1: maybe your ops/puppet patch needs some changes? [17:21:05] Amir1: yup. So what's happening is keyholder won't let you use the deploy-service key since you're not in the deploy-service group (and there doesn't seem to be a deploy-service group :)) [17:21:20] ah! right thcipriani! [17:21:22] hehe [17:21:41] krenair@deployment-ores-web:~$ getent group deploy-service [17:21:41] deploy-service:x:998: [17:21:53] hmm [17:22:04] Krenair: it's working [17:22:24] eh, that group should be on deployment-tin for keyholder. [17:22:25] !log restarting grrrit-wm [17:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:22:35] great [17:22:39] * Krenair gets dinner [17:22:51] I found the real issue, since scap is the best software I've ever seen to hide the real cause of failures [17:23:08] lol [17:23:42] Amir1: try: deploy-log [17:23:57] /usr/bin/deploy-local', u'-v', u'--repo', u'ores/deploy', u'-g', u'flower', u'fetch'] on deployment-ores-web.deployment-prep.eqiad.wmflabs returned [255] : Permission denied (publickey). [17:24:00] I do [17:25:03] that's very simple but I can think of at least ten cases that ssh has issues but it fails without a warning [17:25:44] Amir1: halfak: https://phabricator.wikimedia.org/diffusion/1880/ .. still figuring out the acl for push, but otherwise repo is good to go [17:26:17] akosiaris, "mediawiki-services"?
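The gate Amir1 tripped over — keyholder only serves an armed key to members of the configured group — boils down to a group-membership check. A rough, hedged approximation (the "deploy-service" name is from the log; keyholder's actual implementation and socket paths differ per site):

```python
import grp
import os

def may_use_key(group_name):
    """Roughly what keyholder-proxy checks before serving a key:
    is the requesting user a member of the configured group?"""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        return False              # group doesn't exist at all
    return gid in os.getgroups() or gid == os.getgid()

# In the log: the group existed on deployment-ores-web but the check
# ran on deployment-tin, where it was missing, so every ssh attempt
# came back as "Permission denied (publickey)".
print(may_use_key("deploy-service"))
```

Note the failure surfaces as an opaque ssh publickey error on the target, which is why `deploy-log` kept showing `returned [255]: Permission denied (publickey)` with no hint about the group.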
[17:26:19] thanks [17:26:22] (03PS1) 10Dzahn: hooft->bast3001 in smokeping,dhcp/network comments [puppet] - 10https://gerrit.wikimedia.org/r/280466 (https://phabricator.wikimedia.org/T123712) [17:27:08] (03PS5) 10Dzahn: elastic: change disk space monitoring to alert at 15% [puppet] - 10https://gerrit.wikimedia.org/r/280343 (https://phabricator.wikimedia.org/T130329) [17:27:19] halfak: another defacto standard [17:27:25] (03CR) 10Dzahn: [C: 032] elastic: change disk space monitoring to alert at 15% [puppet] - 10https://gerrit.wikimedia.org/r/280343 (https://phabricator.wikimedia.org/T130329) (owner: 10Dzahn) [17:27:37] akosiaris, OK :) [17:27:54] (03PS1) 10Andrew Bogott: Replace horizon dns and proxy guis. [puppet] - 10https://gerrit.wikimedia.org/r/280467 [17:28:03] halfak: https://gerrit.wikimedia.org/r/#/admin/projects/?filter=services [17:28:12] andrewbogott: you got one on palladium [17:28:18] quite a few there already [17:28:25] "wikibase/data-model-services" lol [17:28:45] mutante: sorry, fixed [17:28:55] andrewbogott: np, thx [17:28:58] ottomata: https://phabricator.wikimedia.org/T130205#2138030 was my comment about wheels vs debs vs virtualenvs [17:29:15] akosiaris: thanks for taking care of grrrit-wm1! [17:29:19] halfak: I 'll create a "policy" project to be used for push rights [17:29:26] yuvipanda: thanks for that documentation ! [17:29:42] the sudo su yuvipanda part though proved not required ;-) [17:30:02] OO reading thanks yuvipanda [17:30:03] akosiaris: oh? that's interesting. not sure where the .kube/config came from otherwise [17:31:43] yuvipanda: I have no idea [17:31:49] akosiaris@tools-k8s-master-01:~$ kubectl --user=lolrrit-wm --namespace=lolrrit-wm get pods [17:31:49] NAME READY STATUS RESTARTS AGE [17:31:49] grrrit-v9g1e 1/1 Running 0 9m [17:31:53] see ? [17:31:55] interesting. [17:31:58] my $HOME is empty [17:32:41] akosiaris: hmm that's concerning :D [17:34:57] yuvipanda: could we do something with wheels and some local pypi (?) 
repo like we with jars and archiva and git fat? [17:35:14] ottomata: yeah, long term something like archiva would be great [17:35:49] like, if you could publish to a locally hosted pypi(? is that the right term) repo, and then wheel artifacts would be added to git with git fat [17:36:06] ottomata: yup. [17:37:06] ottomata: you can kinda already do that easily. [17:37:20] ottomata: can just be a static fileserver (nginx?) with directory listing turned on [17:39:30] ja but something would have to host the wheels there as symlinked shas for git fat to work [17:39:52] there is a script in the archiva role that creates those symlinks [17:40:00] ottomata: pip can just find them from a directory listing though [17:40:16] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [17:40:23] oh i see, aye, without git fat [17:41:15] ottomata: yeah. [17:41:22] (03PS1) 10Rush: toollabs: use resolver array for redundancy [puppet] - 10https://gerrit.wikimedia.org/r/280469 [17:41:49] 6Operations, 10Ops-Access-Requests, 15User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2161480 (10Dzahn) 5Open>3Resolved a:3Dzahn You have been subscribed to the list now. [17:43:35] 6Operations: Add "subscribe and read the ops@lists.wikimedai.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2161485 (10greg) [17:44:01] (03CR) 10Hoo man: [C: 032] Rebuild interwiki map [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280463 (owner: 10Hoo man) [17:44:06] (03CR) 10Andrew Bogott: [C: 031] "Yes please! I'd run this through the puppet compiler to make sure that the list is properly punctuated (comma? No comma? 
I'm not sure.)" [puppet] - https://gerrit.wikimedia.org/r/280469 (owner: Rush) [17:44:09] ottomata: that happens if you use pip with the --find-links and --no-index options [17:44:27] (Merged) jenkins-bot: Rebuild interwiki map [mediawiki-config] - https://gerrit.wikimedia.org/r/280463 (owner: Hoo man) [17:44:27] Operations: Add "subscribe and read the ops@lists.wikimedai.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2161499 (greg) Maybe we can't change L3, but we should at least change https://wikitech.wikimedia.org/wiki/Requesting_shell_access [17:45:14] hm ok [17:45:16] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 169 seconds ago with 0 failures [17:45:25] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2161503 (Legoktm) [17:45:57] (CR) Rush: [C: 2] toollabs: use resolver array for redundancy [puppet] - https://gerrit.wikimedia.org/r/280469 (owner: Rush) [17:46:02] !log hoo@tin Synchronized wmf-config/interwiki.php: Sync changes on meta (duration: 00m 42s) [17:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:46:17] PROBLEM - puppet last run on restbase-test2001 is CRITICAL: CRITICAL: Puppet last ran 17 hours ago [17:46:54] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2161485 (Dzahn) L3 (and L2) can be edited. There is "Edit Document", but what happens is that all people who signed the version before the edit will have this little icon next to the... [17:47:03] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2161485 (RobH) It was always my understanding that we wanted any deployers to also subscribe to the operations list. Ideally, they are also subscribed to wikitech.
Operations is us... [17:47:58] RECOVERY - puppet last run on restbase-test2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:48:12] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2161530 (RobH) We've modified the L2 to clarify things over time without legal reviewing it. (You can see in the document history.) Mostly it was copy-editing, but occasionally it... [17:51:17] PROBLEM - puppet last run on restbase-test2003 is CRITICAL: CRITICAL: Puppet last ran 17 hours ago [17:51:19] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2161539 (greg) https://wikitech.wikimedia.org/w/index.php?title=Requesting_shell_access&type=revision&diff=401835&oldid=340244 [17:52:28] ottomata: I'll be happy to help you setup this nginx based workflow thingy, btw. [17:52:57] RECOVERY - puppet last run on restbase-test2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:53:17] ottomata: an alternative to the 'just put nginx on a filesystem!' is to use devpi, which is like archiva [17:53:27] Operations, ops-codfw, RESTBase-Cassandra: restbase2004.codfw.wmnet: Failed disk/RAID - https://phabricator.wikimedia.org/T130990#2161545 (Eevans) [17:53:55] (CR) Alexandros Kosiaris: Assign roles::ores::web, roles::ores::worker to SCB (1 comment) [puppet] - https://gerrit.wikimedia.org/r/278990 (https://phabricator.wikimedia.org/T124201) (owner: Alexandros Kosiaris) [17:55:04] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2161546 (greg) So, a new section in L3 like: **Awareness** Subscribe and read the Operations mailing list ([[ https://lists.wikimedia.org/mailman/listinfo/ops | ops@lists.wikimedia....
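The wheel-hosting idea discussed above ([17:35]-[17:53]: publish wheels to a static file server with a directory listing, then install with pip's --find-links/--no-index; devpi as a heavier archiva-like option) can be sketched roughly as follows. This is an illustrative sketch, not the actual setup: the package name, paths, and the wheels.example.wmnet URL are all hypothetical.

```shell
# Rough sketch with hypothetical names: any static file server with a
# directory listing (nginx autoindex, or devpi) can act as the "local
# pypi" that pip installs from.

# 1. Build a wheel into the directory the file server would expose.
#    (A trivial local package here, so no network access is needed.)
mkdir -p /tmp/pkg /tmp/wheels
cat > /tmp/pkg/setup.py <<'EOF'
from setuptools import setup
setup(name='demo-pkg', version='0.1')
EOF
python3 -m pip wheel --no-build-isolation --no-deps \
    --wheel-dir=/tmp/wheels /tmp/pkg

# 2. Install using only that directory, never contacting PyPI.
#    --find-links may equally point at an HTTP directory listing,
#    e.g. --find-links=http://wheels.example.wmnet/ (hypothetical).
python3 -m pip install --no-index --find-links=/tmp/wheels demo-pkg
```

The git-fat layer mentioned at [17:39:30] would sit on top of this, tracking the wheel files by content hash rather than committing them to git directly.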
[17:55:34] (PS1) Dzahn: site.pp/hiera: rename hooft to bast3001 [puppet] - https://gerrit.wikimedia.org/r/280472 (https://phabricator.wikimedia.org/T123712) [17:59:01] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2161485 (Krenair) L3 is required for all shell users, I would this requirement was only for deployers? [18:00:17] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2161575 (greg) I personally think it should be required even for them, but, what do others think? [18:00:59] (PS1) Yuvipanda: tools: Register tools by IP rather than hostname [puppet] - https://gerrit.wikimedia.org/r/280473 [18:01:26] (PS2) Rush: tools: Register tools by IP rather than hostname [puppet] - https://gerrit.wikimedia.org/r/280473 (owner: Yuvipanda) [18:01:56] (CR) jenkins-bot: [V: -1] tools: Register tools by IP rather than hostname [puppet] - https://gerrit.wikimedia.org/r/280473 (owner: Yuvipanda) [18:02:08] (CR) Dzahn: "https://gerrit.wikimedia.org/r/#/c/280343/5/hieradata/role/common/elasticsearch/server.yaml" [puppet] - https://gerrit.wikimedia.org/r/193834 (owner: ArielGlenn) [18:02:25] (PS3) Yuvipanda: docker: Fix format of auth config file [puppet] - https://gerrit.wikimedia.org/r/280084 [18:02:32] (CR) Yuvipanda: [C: 2 V: 2] docker: Fix format of auth config file [puppet] - https://gerrit.wikimedia.org/r/280084 (owner: Yuvipanda) [18:02:43] (PS3) Yuvipanda: tools: Register tools by IP rather than hostname [puppet] - https://gerrit.wikimedia.org/r/280473 [18:03:14] (PS2) Dzahn: hooft->bast3001 in smokeping,dhcp/network comments [puppet] - https://gerrit.wikimedia.org/r/280466 (https://phabricator.wikimedia.org/T123712) [18:04:50] (CR) Dzahn: [C: 2] "network/dhcp changes are comments only" [puppet] - https://gerrit.wikimedia.org/r/280466
(https://phabricator.wikimedia.org/T123712) (owner: Dzahn) [18:05:55] i merged both, yuvi [18:06:02] whoops, thanks mutante [18:07:11] (PS11) Ladsgroup: [WIP] Scap3 deployment configurations for ores [puppet] - https://gerrit.wikimedia.org/r/280403 [18:07:59] (PS1) Chad: pep8: Fix all mixed spaces/tab indentation (W191) [puppet] - https://gerrit.wikimedia.org/r/280475 [18:10:10] ostriches: ^ is that good to merge? I'll merge! (fuck tabs!) [18:10:16] PROBLEM - check_puppetrun on heka is CRITICAL: CRITICAL: Puppet has 1 failures [18:10:19] yuvipanda: Should be :) [18:10:30] (CR) Yuvipanda: [C: 2] "FUCK TABS!" [puppet] - https://gerrit.wikimedia.org/r/280475 (owner: Chad) [18:10:33] It was a mostly automated replacement of "\t" to " " [18:10:45] ostriches: done. let's see if anything breaks [18:11:01] Operations, DBA, Patch-For-Review, Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#2161628 (jcrespo) The following tests have been done: * Mass-requ... [18:11:05] :) down with tabs [18:11:57] $ pep8 . | grep W191 | wc -l [18:11:57] 1339 [18:11:58] (in the mediawiki config repo too ? *:g*:) [18:12:01] Before the patch ^ [18:12:03] :p [18:12:11] nice [18:15:34] We went from 2385 violations to 977! [18:15:34] go me [18:15:35] (PS2) Andrew Bogott: Replace horizon dns and proxy guis. [puppet] - https://gerrit.wikimedia.org/r/280467 [18:15:35] RECOVERY - check_puppetrun on heka is OK: OK: Puppet is currently enabled, last run 177 seconds ago with 0 failures [18:16:42] poor tabs :( [18:16:53] (CR) Andrew Bogott: [C: 2] Replace horizon dns and proxy guis.
[puppet] - https://gerrit.wikimedia.org/r/280467 (owner: Andrew Bogott) [18:20:12] !log starting restbase update to 9fe8676d [18:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:20:26] MatmaRex: it's ok .. if you are a guitar player [18:22:38] (PS4) Yuvipanda: tools: Register tools by IP rather than hostname [puppet] - https://gerrit.wikimedia.org/r/280473 [18:23:17] (CR) Yuvipanda: [C: 2 V: 2] tools: Register tools by IP rather than hostname [puppet] - https://gerrit.wikimedia.org/r/280473 (owner: Yuvipanda) [18:30:03] (PS2) Ottomata: Use eventbus topic config from mediawiki/event-schemas repo [puppet] - https://gerrit.wikimedia.org/r/280097 [18:31:24] (CR) Ottomata: [C: 2 V: 2] Use eventbus topic config from mediawiki/event-schemas repo [puppet] - https://gerrit.wikimedia.org/r/280097 (owner: Ottomata) [18:31:30] (PS1) Dzahn: install_server: rename hooft to bast3001 [puppet] - https://gerrit.wikimedia.org/r/280478 (https://phabricator.wikimedia.org/T123712) [18:33:25] (PS1) Ottomata: Remove non existent dependency in eventbus.pp [puppet] - https://gerrit.wikimedia.org/r/280479 [18:34:50] (CR) Ottomata: [C: 2] Remove non existent dependency in eventbus.pp [puppet] - https://gerrit.wikimedia.org/r/280479 (owner: Ottomata) [18:36:11] (PS1) Chad: deploy: delete id_rsa.pub [puppet] - https://gerrit.wikimedia.org/r/280480 [18:38:04] (PS1) Chad: delete gitweb_config.perl [puppet] - https://gerrit.wikimedia.org/r/280481 [18:38:10] (PS1) ArielGlenn: move dump prod files into subdir [dumps] - https://gerrit.wikimedia.org/r/280482 [18:39:33] (CR) ArielGlenn: [C: 2] move dump prod files into subdir [dumps] - https://gerrit.wikimedia.org/r/280482 (owner: ArielGlenn) [18:43:19] Operations, ops-eqiad: upgrade package_builder machine with SSD - https://phabricator.wikimedia.org/T130759#2161794 (Cmjohnson) @fgiunchedi When would you like to
do this? [18:46:30] (PS1) GWicke: Remove main node IPs from codfw restbase seeds [puppet] - https://gerrit.wikimedia.org/r/280484 [18:48:13] (CR) GWicke: "The fact that 2004 is still listed in restbase seeds is producing about 100 warnings to logstash per second, so it would be good to merge " [puppet] - https://gerrit.wikimedia.org/r/280484 (owner: GWicke) [18:49:30] (PS2) GWicke: Remove main node IPs from codfw restbase seeds [puppet] - https://gerrit.wikimedia.org/r/280484 [18:50:05] (PS1) Ottomata: No-op refactor of eventlogging module [puppet] - https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) [18:50:07] (CR) Ppchelko: [C: 1] Remove main node IPs from codfw restbase seeds [puppet] - https://gerrit.wikimedia.org/r/280484 (owner: GWicke) [18:51:35] (PS1) Alex Monk: Horizon proxy dashboard: Fix backend URL format [puppet] - https://gerrit.wikimedia.org/r/280488 [18:53:21] (CR) Eevans: [C: 1] Remove main node IPs from codfw restbase seeds [puppet] - https://gerrit.wikimedia.org/r/280484 (owner: GWicke) [18:53:34] Operations, Discovery, hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2161832 (RobH) Please note that the order for 16 systems was placed today on blocking task T129381. Since the blocking task is private, I wanted to provide a public updat...
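The W191 cleanup discussed earlier ([18:07]-[18:15]: a mostly automated tab-to-spaces replacement, verified by counting pep8 W191 hits before and after) can be reproduced with standard tools. This is an illustration only; demo.py is a made-up file, not one from the actual patch:

```shell
# Illustration of the tab-to-space conversion discussed above.
# pep8's W191 flags indentation that contains tabs; expand(1) does
# the same "\t" -> spaces rewrite that the patch automated.
printf 'def f():\n\tif True:\n\t\treturn 1\n' > demo.py

TAB="$(printf '\t')"
grep -c "^${TAB}" demo.py            # 2 tab-indented lines before

expand -t 4 demo.py > demo.py.new && mv demo.py.new demo.py

grep -c "^${TAB}" demo.py || true    # 0 afterwards
```

Counting pep8 output before and after, as done at [18:11:57], then confirms the violation count dropped.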
[18:54:33] Operations, Gerrit, Mail, Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2161836 (hashar) Might be fixed in Gerrit 2.8.4 by {a7e343131777750d11c12d06354e52aaae9badc5} [18:55:03] !log finished restbase update to 9fe8676d [18:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:26] (CR) Yuvipanda: [C: 2] Remove main node IPs from codfw restbase seeds [puppet] - https://gerrit.wikimedia.org/r/280484 (owner: GWicke) [18:56:45] Operations, Gerrit, Mail, Upstream: Only receiving few emails from Gerrit - https://phabricator.wikimedia.org/T131189#2161857 (hashar) Notably https://gerrit.googlesource.com/gerrit/+/a7e343131777750d11c12d06354e52aaae9badc5%5E%21/#F4 sets a timeout on the socket [18:56:49] (PS2) Dzahn: install_server: rename hooft to bast3001 [puppet] - https://gerrit.wikimedia.org/r/280478 (https://phabricator.wikimedia.org/T123712) [18:57:02] (CR) Dzahn: [C: 2] install_server: rename hooft to bast3001 [puppet] - https://gerrit.wikimedia.org/r/280478 (https://phabricator.wikimedia.org/T123712) (owner: Dzahn) [18:58:18] (PS2) Ottomata: No-op refactor of eventlogging module [puppet] - https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) [18:58:41] !log doing a puppet run in codfw restbase nodes and performing a rolling restart of restbase [18:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160330T1900). Please do the needful.
[19:00:22] * thcipriani does needful [19:03:21] (PS1) Thcipriani: group1 wikis to 1.27.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/280490 [19:07:09] (CR) Thcipriani: [C: 2] group1 wikis to 1.27.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/280490 (owner: Thcipriani) [19:07:34] (Merged) jenkins-bot: group1 wikis to 1.27.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/280490 (owner: Thcipriani) [19:07:48] (PS1) Dzahn: DHCP: set next-server for public-esams subnet to carbon [puppet] - https://gerrit.wikimedia.org/r/280491 (https://phabricator.wikimedia.org/T123712) [19:07:54] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.19 [19:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:15] blerg. lots of Notice: Undefined index: title in /srv/mediawiki/php-1.27.0-wmf.19/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php on line 682 [19:14:33] title and namespace :\ [19:15:47] (PS2) Alex Monk: Horizon proxy dashboard: Fix backend URL format [puppet] - https://gerrit.wikimedia.org/r/280488 [19:16:41] (CR) Dzahn: [C: 2] DHCP: set next-server for public-esams subnet to carbon [puppet] - https://gerrit.wikimedia.org/r/280491 (https://phabricator.wikimedia.org/T123712) (owner: Dzahn) [19:17:31] thcipriani, looking [19:17:37] thanks [19:17:46] (PS6) Yuvipanda: tools: Add class that helps build kubernetes [puppet] - https://gerrit.wikimedia.org/r/279648 (https://phabricator.wikimedia.org/T129311) [19:18:35] (PS3) Andrew Bogott: Horizon proxy dashboard: Fix backend URL format [puppet] - https://gerrit.wikimedia.org/r/280488 (owner: Alex Monk) [19:18:53] (CR) Andrew Bogott: [C: 2] "Hotfixed on labtestweb, looks good."
[puppet] - https://gerrit.wikimedia.org/r/280488 (owner: Alex Monk) [19:20:00] (PS3) Ottomata: No-op refactor of eventlogging module [puppet] - https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) [19:21:32] (CR) jenkins-bot: [V: -1] No-op refactor of eventlogging module [puppet] - https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) (owner: Ottomata) [19:22:28] (PS4) Ottomata: No-op refactor of eventlogging module [puppet] - https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) [19:23:31] thcipriani, https://gerrit.wikimedia.org/r/280494 [19:24:18] (PS5) Ottomata: No-op refactor of eventlogging module [puppet] - https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) [19:25:00] MaxSem: okie doke. wmf if you want to backport to wmf.19. [19:25:17] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:31:21] (PS2) Dzahn: rename hooft.esams to bast3001 [dns] - https://gerrit.wikimedia.org/r/280464 (https://phabricator.wikimedia.org/T123712) [19:33:34] MaxSem: this isn't popping up with too much frequency, very occasional large spikes :\ [19:33:53] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2162100 (RobH) @krenair is right, and in my head I just thought that the majority of our shell users have advanced rights, but not all. Do we want to require all shell users to subs... [19:34:41] Operations, Discovery, Elasticsearch, Patch-For-Review: Icinga should alert on free disk space < 15% on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329#2162102 (Dzahn) needs confirmation on neon itself, in the generated icinga config..
don't see it just yet [19:34:51] Operations, Discovery, Elasticsearch, Patch-For-Review: Icinga should alert on free disk space < 15% on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329#2162103 (Dzahn) p:Triage>Normal [19:34:58] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [19:35:20] Operations, Discovery, Elasticsearch: Icinga should alert on free disk space < 15% on Elasticsearch hosts - https://phabricator.wikimedia.org/T130329#2132816 (Dzahn) [19:35:35] (PS1) Ottomata: [WIP] Run eventlogging services out of deployed eventlogging source path [puppet] - https://gerrit.wikimedia.org/r/280497 (https://phabricator.wikimedia.org/T131263) [19:36:55] (CR) Dzahn: [C: 2] rename hooft.esams to bast3001 [dns] - https://gerrit.wikimedia.org/r/280464 (https://phabricator.wikimedia.org/T123712) (owner: Dzahn) [19:39:38] !log hooft - updating root password [19:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:40:16] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: puppet fail [19:42:37] (PS2) Dzahn: site.pp/hiera: rename hooft to bast3001 [puppet] - https://gerrit.wikimedia.org/r/280472 (https://phabricator.wikimedia.org/T123712) [19:43:31] !log hooft is going to be reinstalled and renamed, affects esams bastion and ganglia during the install [19:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:43:57] (CR) Dzahn: [C: 2] site.pp/hiera: rename hooft to bast3001 [puppet] - https://gerrit.wikimedia.org/r/280472 (https://phabricator.wikimedia.org/T123712) (owner: Dzahn) [19:45:16] PROBLEM - check_puppetrun on alnitak is CRITICAL: CRITICAL: puppet fail [19:48:20] !log hooft revoking puppet cert, salt key, reboot into PXE [19:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:50:16] RECOVERY -
check_puppetrun on alnitak is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160330T2000). [20:00:18] no mobileapps deployment [20:00:36] (PS6) Ottomata: No-op refactor of eventlogging module [puppet] - https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) [20:00:38] (PS2) Ottomata: [WIP] Run eventlogging services out of deployed eventlogging source path [puppet] - https://gerrit.wikimedia.org/r/280497 (https://phabricator.wikimedia.org/T131263) [20:00:42] (PS1) Dzahn: resolving::domain_search: drop esams.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/280503 (https://phabricator.wikimedia.org/T123712) [20:01:52] (CR) jenkins-bot: [V: -1] [WIP] Run eventlogging services out of deployed eventlogging source path [puppet] - https://gerrit.wikimedia.org/r/280497 (https://phabricator.wikimedia.org/T131263) (owner: Ottomata) [20:02:25] (PS3) Ottomata: [WIP] Run eventlogging services out of deployed eventlogging source path [puppet] - https://gerrit.wikimedia.org/r/280497 (https://phabricator.wikimedia.org/T131263) [20:03:03] !log update restbase to 1b52276f canary deploy to restbase1005 [20:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:13] (PS4) Ottomata: [WIP] Run eventlogging services out of deployed eventlogging source path [puppet] - https://gerrit.wikimedia.org/r/280497 (https://phabricator.wikimedia.org/T131263) [20:03:34] Operations, ops-codfw: install SSDs in restbase2001-restbase2006 - https://phabricator.wikimedia.org/T127333#2162402 (Papaul) a:Papaul>fgiunchedi complete [20:05:46] (PS1) Dzahn: DHCP: don't use esams.wm.org as domain name [puppet] -
https://gerrit.wikimedia.org/r/280505 [20:07:51] !log starting parsoid deploy [20:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:09:22] (PS1) Dzahn: network.pp: drop SLAAC addresses for hooft [puppet] - https://gerrit.wikimedia.org/r/280506 [20:10:37] !log starting update restbase to 1b52276f [20:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:10:48] Operations, Patch-For-Review: Reimage hooft with jessie and rename to bast3001 - https://phabricator.wikimedia.org/T123712#2162426 (Dzahn) >>! In T123712#1948876, @faidon wrote: > After it's done, we should drop the .esams.wikimedia.org suffix from everywhere (at least `base::resolving::domain_search` und... [20:11:45] !log synced code; restarted parsoid on wtp1004 as a canary [20:11:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:41] (PS5) Ottomata: Run eventlogging services out of deployed eventlogging source path [puppet] - https://gerrit.wikimedia.org/r/280497 (https://phabricator.wikimedia.org/T131263) [20:14:17] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:15:38] !log finished update restbase to 1b52276f [20:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:20:10] !log finished deploying parsoid version a20ef276 [20:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:25:33] (PS1) Andrew Bogott: Horizon proxy panel: Remove terminal . in proxy frontend [puppet] - https://gerrit.wikimedia.org/r/280509 [20:26:11] (PS2) Andrew Bogott: Horizon proxy panel: Remove terminal . in proxy frontend [puppet] - https://gerrit.wikimedia.org/r/280509 [20:27:41] (CR) Alex Monk: [C: 1] Horizon proxy panel: Remove terminal .
in proxy frontend [puppet] - https://gerrit.wikimedia.org/r/280509 (owner: Andrew Bogott) [20:27:51] (CR) Yuvipanda: [C: 1] Horizon proxy panel: Remove terminal . in proxy frontend [puppet] - https://gerrit.wikimedia.org/r/280509 (owner: Andrew Bogott) [20:27:53] (CR) Andrew Bogott: [C: 2] Horizon proxy panel: Remove terminal . in proxy frontend [puppet] - https://gerrit.wikimedia.org/r/280509 (owner: Andrew Bogott) [20:28:09] (CR) Ottomata: [C: 1] Add correct varnishkafka configuration files for Varnish 4 servers. [puppet] - https://gerrit.wikimedia.org/r/280459 (https://phabricator.wikimedia.org/T124278) (owner: Elukey) [20:30:08] (PS7) Yuvipanda: tools: Add class that helps build kubernetes [puppet] - https://gerrit.wikimedia.org/r/279648 (https://phabricator.wikimedia.org/T129311) [20:31:43] (CR) Ottomata: "This is applied in beta, and looks good in puppet compiler: https://puppet-compiler.wmflabs.org/2234/" [puppet] - https://gerrit.wikimedia.org/r/280486 (https://phabricator.wikimedia.org/T131263) (owner: Ottomata) [20:32:32] (CR) Ottomata: "Applied and tested in beta, puppet compiler for prod hosts looks good: https://puppet-compiler.wmflabs.org/2235/eventlog1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/280497 (https://phabricator.wikimedia.org/T131263) (owner: Ottomata) [20:34:10] Operations, Ops-Access-Requests, User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2162559 (Dereckson) Thanks. [20:40:11] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2162593 (Dzahn) What Robh said, otherwise to get shell you have to sign L3 and that tells you you need to do something for which you have to sign L2. Or you can just tell everybody
[20:41:33] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2162596 (RobH) Als L2 isn't public viewable (I've never really understood why). For folks to sign that NDA, they ahve to be added to a pending NDA group to view and sign. We'd need... [20:41:35] Operations, Traffic, HTTPS, Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2162597 (Dzahn) @Krenair what do you think about removing the Apache from that entirely, also HTTP? [20:41:37] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2162598 (greg) Bah, right.... shall I undo this edit? https://wikitech.wikimedia.org/w/index.php?title=Requesting_shell_access&type=revision&diff=401835&oldid=340244 task->invalid?... [20:42:08] (PS1) ArielGlenn: update README to reflect that prod scripts are now in master [dumps] (ariel) - https://gerrit.wikimedia.org/r/280560 [20:42:29] (CR) Alex Monk: "also delete the mw_rc_irc::apache class and any files?" [puppet] - https://gerrit.wikimedia.org/r/280342 (https://phabricator.wikimedia.org/T130981) (owner: Dzahn) [20:42:45] Operations, Traffic, HTTPS, Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2162599 (Krenair) Fine with me. @Krinkle? [20:43:03] (PS1) Alex Monk: Only allow creation of proxies directly under wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/280567 (https://phabricator.wikimedia.org/T131270) [20:44:44] Operations, Ops-Access-Requests, User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2162608 (hashar) Awesome, thank you @Dzahn / @RobH :-) What is the next step now? Shows up on a SWAT slot and start break^H^H^H^Hdeploying stuff pairi...
[20:45:22] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2162609 (RobH) I would indeed undo that edit, though when you made it, I agreed with it! [20:46:28] Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2162614 (RobH) Though perhaps change to wikitech-l? Ops list should have a huge overlap, and outage and other site issues are commonly reported to operations, and then followed to w... [20:46:46] (PS2) Alex Monk: Only allow creation of proxies directly under wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/280567 (https://phabricator.wikimedia.org/T131270) [20:47:29] (CR) ArielGlenn: [C: 2] update README to reflect that prod scripts are now in master [dumps] (ariel) - https://gerrit.wikimedia.org/r/280560 (owner: ArielGlenn) [20:47:31] (CR) jenkins-bot: [V: -1] Only allow creation of proxies directly under wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/280567 (https://phabricator.wikimedia.org/T131270) (owner: Alex Monk) [20:49:47] (CR) Alex Monk: "tested in labtest" [puppet] - https://gerrit.wikimedia.org/r/280567 (https://phabricator.wikimedia.org/T131270) (owner: Alex Monk) [20:49:50] Operations, Traffic, HTTPS, Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2162646 (Krinkle) Sounds fine. I think it'd be nice if we can find a way to serve the redirect from misc-web-lb, but that's probably not feasible.
[20:50:12] (CR) jenkins-bot: [V: -1] Only allow creation of proxies directly under wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/280567 (https://phabricator.wikimedia.org/T131270) (owner: Alex Monk) [20:51:01] (CR) Andrew Bogott: [C: 1] "Looks good to me, once pep8 is happy" [puppet] - https://gerrit.wikimedia.org/r/280567 (https://phabricator.wikimedia.org/T131270) (owner: Alex Monk) [20:52:34] Operations, Ops-Access-Requests, User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2162653 (greg) @Dereckson, which timeslots are you most available for? 15:00 UTC or 23:OO UTC? [20:55:14] Operations, Ops-Access-Requests, User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2162658 (Dereckson) 23:00 UTC, and generally fairly available at 15:00 too [20:55:25] (PS1) Andrew Bogott: Proxy panel: More . juggling [puppet] - https://gerrit.wikimedia.org/r/280578 [20:57:39] (CR) Alex Monk: [C: 1] "looks good but I didn't test" [puppet] - https://gerrit.wikimedia.org/r/280578 (owner: Andrew Bogott) [21:01:56] (PS2) Andrew Bogott: Proxy panel: More . juggling [puppet] - https://gerrit.wikimedia.org/r/280578 [21:04:09] (CR) Andrew Bogott: [C: 2] Proxy panel: More . juggling [puppet] - https://gerrit.wikimedia.org/r/280578 (owner: Andrew Bogott) [21:05:07] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[21:05:11] !log maxsem@tin Synchronized php-1.27.0-wmf.19/extensions/CirrusSearch/: https://gerrit.wikimedia.org/r/#/c/280549/ (duration: 00m 40s) [21:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:05:41] (CR) Krinkle: [C: 1] "Haven't tested but looks good :)" [puppet] - https://gerrit.wikimedia.org/r/280204 (https://phabricator.wikimedia.org/T126280) (owner: Gehel) [21:05:44] (PS1) ArielGlenn: add dumps to scap3 deployment repo config [puppet] - https://gerrit.wikimedia.org/r/280579 [21:06:07] (PS3) Andrew Bogott: Only allow creation of proxies directly under wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/280567 (https://phabricator.wikimedia.org/T131270) (owner: Alex Monk) [21:06:39] (PS2) ArielGlenn: add dumps to scap3 deployment repo config [puppet] - https://gerrit.wikimedia.org/r/280579 [21:06:41] (CR) Krinkle: "I'd recommend maybe applying this locally to one of the debug app servers first and verify the results from a browser using https://wikite" [puppet] - https://gerrit.wikimedia.org/r/280204 (https://phabricator.wikimedia.org/T126280) (owner: Gehel) [21:07:28] (CR) Andrew Bogott: [C: 2] Only allow creation of proxies directly under wmflabs.org [puppet] - https://gerrit.wikimedia.org/r/280567 (https://phabricator.wikimedia.org/T131270) (owner: Alex Monk) [21:07:53] (PS3) ArielGlenn: add dumps to scap3 deployment repo config [puppet] - https://gerrit.wikimedia.org/r/280579 [21:09:23] (CR) ArielGlenn: [C: 2] add dumps to scap3 deployment repo config [puppet] - https://gerrit.wikimedia.org/r/280579 (owner: ArielGlenn) [21:09:30] yay [21:09:58] Operations, Traffic, HTTPS, Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2162699 (Krenair) The only way I can think of to get out of having to put a certificate on the box would be to forward traffic on ports 80 and 443 to misc-web-lb,
but... [21:14:39] mutante: Trivial tech debt-- https://gerrit.wikimedia.org/r/#/c/280481/ :) [21:16:55] (CR) BryanDavis: [C: 1] "Looks like ancient crap to me." [puppet] - https://gerrit.wikimedia.org/r/280480 (owner: Chad) [21:17:55] bd808: thx [21:25:55] godog, this keeps happening in -releng: PROBLEM - SSH on deployment-ms-be02 is CRITICAL: Server answer [21:26:08] I believe that instance is yours [21:26:43] (CR) Thcipriani: [C: 1] deploy: delete id_rsa.pub [puppet] - https://gerrit.wikimedia.org/r/280480 (owner: Chad) [21:26:56] PROBLEM - MariaDB disk space on silver is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%) [21:27:35] andrewbogott: ^ something filling up silver? [21:28:37] PROBLEM - Disk space on silver is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%) [21:28:48] what is it? [21:29:04] I don't know what's happening but I'm looking [21:30:16] .../var/tmp is 2.7G [21:30:27] bacula? [21:30:45] it didn't have too much headroom to begin w/ that that ate the last of it [21:30:48] ah, yeah, jynus, is that you? [21:30:54] no [21:31:32] jynus: didn't you do something regarding backups and silver in response to csteipp's request? [21:31:53] ot 2.7G Mar 30 21:24 labswiki-20160329.sql.gz [21:31:55] yes [21:32:02] but always revocerint to /srv [21:32:05] i also tried to restore that [21:32:16] did you change the destination? [21:32:28] or did you recover to the default destination? [21:32:33] I didn't change anything, just seeing this big file in /var/tmp/bacula-restores [21:32:36] yes, that was me starting that job [21:32:39] oh, sorry, that was to mutante [21:32:50] .../srv here is on / I think /a is mounted as md2 [21:32:55] i wanted to see if that error about decryption depends on the client. sorry [21:32:59] /a [21:33:01] I mean [21:33:02] the good thing is.. [21:33:04] it worked [21:33:09] no, it didn't [21:33:19] labswiki-20160329 I do not need [21:33:37] so can I rm -rf /var/tmp/bacula-restores ?
[21:33:37] that is after the last backup, which is not useful [21:33:42] yes [21:33:42] sigh, ok, so it's just that one day that is somehow corrupted? [21:33:45] yes [21:33:57] done, thanks [21:34:00] no, multiple days [21:34:07] RECOVERY - Disk space on silver is OK: DISK OK [21:34:20] I tested since 28 February [21:34:24] RECOVERY - MariaDB disk space on silver is OK: DISK OK [21:34:26] all unrecovered [21:35:10] make sure you recover to the large partition or to another host, as I did [21:36:34] it wouldn't hurt to also get a larger volume, if space is not a huge problem? Is it a VM? [21:36:49] (not high priority, though) [21:37:01] it's physical [21:37:09] and no lvm2 [21:37:16] PROBLEM - Host pay-lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [21:37:17] :-/ [21:37:37] so one more to migrate at some point, like phab's [21:38:12] ^that is more important [21:38:18] PROBLEM - Host mintaka is DOWN: PING CRITICAL - Packet loss = 100% [21:38:25] but I think it is the known issue [21:38:40] with network? [21:38:46] that smacks of that issue w/ the srx periodically [21:38:59] PROBLEM - Host payments2002 is DOWN: PING CRITICAL - Packet loss = 100% [21:39:24] well, good thing we're all trained to ignore fundraising outages [21:39:49] Jeff_Green: Do you have an encouraging word to share about those alerts? [21:39:50] payments is up, so non-critical [21:39:59] PROBLEM - Host betelgeuse is DOWN: PING CRITICAL - Packet loss = 100% [21:40:01] there's that yeah [21:40:05] atm my brain checks for codfw and verifies the payments sites I know that all seem good [21:40:07] i suspect we just lost a firewall [21:40:07] so allow me to retire [21:40:12] but yeah this is pretty not ideal [21:40:37] RECOVERY - Host payments2002 is UP: PING OK - Packet loss = 0%, RTA = 37.05 ms [21:40:47] and we're back...(?) 
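The silver disk-space scramble above (a 2.7G bacula restore under /var/tmp eating the last of /) is the classic du drill. A generic sketch of that hunt, assuming GNU coreutils; the wrapper function and the paths are illustrative, not commands quoted from the log:

```bash
#!/bin/bash
# Illustrative disk-hog hunt: stay on one filesystem (-x, so a full /
# doesn't drag in /a or other mounts) and list the largest second-level
# directories in MB. Function name and defaults are hypothetical.
find_hogs() {
  du -x --max-depth=2 --block-size=1M "$1" 2>/dev/null | sort -n | tail -n "${2:-10}"
}

find_hogs /        # largest directories on the root filesystem
find_hogs /var 5   # top 5 under /var (would surface /var/tmp/bacula-restores)
```

On a box like silver with no LVM headroom, this is usually the fastest way to decide what is safe to `rm -rf`.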
[21:41:11] *sigh* [21:41:15] I really need my phone to make a different noise for recovery alerts [21:41:15] why does this always happen when I'm deploying major puppet changes, so I have a heart attack [21:41:21] lol [21:41:27] I hope we got a coredump this time [21:41:34] because the case went nowhere last time around [21:41:44] yeah that would be good [21:42:15] nope :/ [21:42:21] -rw-rw---- 1 root wheel 0 Mar 30 21:34 /var/tmp/flowd_octeon_hm.core.3.gz [21:42:45] we should just rip them out and replace them with a couple of openbsd boxes [21:43:08] RECOVERY - Host pay-lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 37.42 ms [21:44:19] RECOVERY - Host mintaka is UP: PING OK - Packet loss = 0%, RTA = 37.09 ms [21:45:08] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [21:45:18] RECOVERY - Host betelgeuse is UP: PING OK - Packet loss = 0%, RTA = 36.98 ms [21:45:20] why are we paging on every single host btw? [21:45:32] some of these are part of a redundant pair right? [21:45:37] ? [21:46:44] i guess I could spend the time to figure out how to page for individual hosts for the two cases where it applies, but that's pretty low on the list of priorities [21:50:17] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: puppet fail [21:50:49] looking ^^^ [21:51:05] puppet runs fine... [21:55:17] RECOVERY - check_puppetrun on pay-lvs1002 is OK: OK: Puppet is currently enabled, last run 249 seconds ago with 0 failures [22:03:01] 6Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2162903 (10greg) Undid. 
[22:03:27] 6Operations: Add "subscribe and read the ops@lists.wikimedia.org mailing list" to L3 - https://phabricator.wikimedia.org/T131262#2162911 (10greg) 5Open>3declined /me walks away slowly [22:03:46] mark: btw, with the raised hangouts limit, we may want to consider using Hangouts again in the future for the weekly meeting [22:13:41] !log smokeping/netmon - remove hooft/bast3001, restart service [22:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:22:09] (03PS1) 10Madhuvishy: [WIP] Update entries for analytics1017 and 1021 to notebook1001 and 1002 [dns] - 10https://gerrit.wikimedia.org/r/280593 (https://phabricator.wikimedia.org/T130760) [22:24:54] robh: I made ^ patch - had a couple questions that I left in the commit message. [22:27:39] 6Operations, 10Traffic, 7HTTPS, 13Patch-For-Review: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2163530 (10Dzahn) a:3Dzahn [22:29:53] mutante: When were you gonna e-mail ops-l telling me I can't use hooft anymore? [22:30:21] * Krinkle fixes ssh config [22:31:25] Krinkle: sorry, i was hoping it'd be done already but it's extremely slow [22:32:01] oh bast3001 doesn't connect either [22:32:07] no [22:32:11] I thought this was the moment hooft was permanently renamed [22:32:13] or is it? [22:32:14] because the installer is running right now [22:32:24] yes [22:32:29] anyway, I'll update my ssh config either way [22:32:31] using 1001 for now [22:32:42] yes please. i will mail for sure when you can use 3001 [22:36:30] (03CR) 10RobH: "no need to add reverse entries for items that are being renamed, so just make sure there are reverse entries for the notebook hostnames. " [dns] - 10https://gerrit.wikimedia.org/r/280593 (https://phabricator.wikimedia.org/T130760) (owner: 10Madhuvishy) [22:37:17] madhuvishy: uh, your change doesnt touch production dns [22:37:22] yet your message says it does? 
[22:38:08] (03CR) 10RobH: "Also the patchset message says it covers production dns, but I only see mgmt dns updates in ps1." [dns] - 10https://gerrit.wikimedia.org/r/280593 (https://phabricator.wikimedia.org/T130760) (owner: 10Madhuvishy) [22:38:40] robh: aah - that might be me misunderstanding - I thought the entries in wmnet are the prod dns ones. Because your comment on phab that lists the things that need to be done said "update production dns entries to remove old hostnames" [22:39:18] wmnet has both [22:39:26] but you only modified the subnets for mgmt. [22:39:32] you didnt touch the production entries [22:39:52] oh i missed it [22:39:54] right [22:39:59] so to do that you have to know where they are located (racktables helps) and then find them and modify [22:40:06] if they havent changed rows or vlans, its easy [22:40:11] if they change vlans, less so. [22:41:18] robh: I think nothing was changed - which is what I understand from https://phabricator.wikimedia.org/T131216 [22:41:46] yep, so no need to know where they are [22:41:51] just need to find and replace the existing entries [22:42:07] wait, maybe not [22:42:09] uhh [22:42:18] madhuvishy: i may be wrong, they didnt move their racks [22:42:27] but right now they are in analytics vlan i bet. [22:42:30] yes [22:42:31] and they likely need to move. [22:42:36] and we asked not to move it [22:42:39] oh, ok [22:42:41] then its easy. [22:42:53] yeah cool, i'll fix the missed entries [22:42:57] thanks! [22:43:04] quite welcome =] [22:43:18] im happy to review/merge when needed [22:43:18] 6Operations: smokeping config puppetization issue? - https://phabricator.wikimedia.org/T131326#2163604 (10Dzahn) [22:44:10] (03PS8) 10Yuvipanda: tools: Add class that helps build kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/279648 (https://phabricator.wikimedia.org/T129311) [22:45:06] 6Operations: smokeping config puppetization issue? 
- https://phabricator.wikimedia.org/T131326#2163624 (10Dzahn) the puppet edit was here: https://gerrit.wikimedia.org/r/#/c/280466/2/modules/smokeping/files/config.d/Targets so modules/smokeping/files/config.d/Targets the file on the server is /etc/smokeping/c... [22:47:38] (03PS9) 10Yuvipanda: tools: Add class that helps build kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/279648 (https://phabricator.wikimedia.org/T129311) [22:48:43] (03CR) 10jenkins-bot: [V: 04-1] tools: Add class that helps build kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/279648 (https://phabricator.wikimedia.org/T129311) (owner: 10Yuvipanda) [22:51:28] 6Operations, 10Ops-Access-Requests: Grant reedy access to librenms - https://phabricator.wikimedia.org/T131252#2161064 (10Dzahn) https://wikitech.wikimedia.org/wiki/LibreNMS says " User creds are stored in MySQL: # grep auth_mechanism /srv/deployment/librenms/librenms/config.php To add a user, run:... [22:51:46] robh: how would I go about adding a ipv6 address? I'm not sure where to find it [22:52:57] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [5000000.0] [22:53:25] (03PS1) 10Madhuvishy: [WIP] Update entries for analytics1017 and 1021 to notebook1001 and 1002 [dns] - 10https://gerrit.wikimedia.org/r/280602 (https://phabricator.wikimedia.org/T130760) [22:53:47] (03Abandoned) 10Madhuvishy: [WIP] Update entries for analytics1017 and 1021 to notebook1001 and 1002 [dns] - 10https://gerrit.wikimedia.org/r/280602 (https://phabricator.wikimedia.org/T130760) (owner: 10Madhuvishy) [22:54:25] madhuvishy: you would add a AAAA record right after the A record [22:54:34] in the DNS zone template [22:54:46] mutante: that I see :) what would the address be though? [22:56:34] mutante: Hm.. do you know where adminbot's config.py is maintained? It seems it's not marking its own edits as bots. The default config.py has wiki_bot = True. 
- https://github.com/search?utf8=%E2%9C%93&q=wiki_bot+%40wikimedia&type=Code&ref=searchresults [22:56:34] ah, yes, you use this puppet line that creates it for you [22:56:34] presumably there is a private version somewhere that contains the password as well etc. [22:56:34] madhuvishy: interface::add_ip6_mapped { 'main': } you add this on a node. see site.pp [22:56:35] madhuvishy: then puppet runs and adds the IP to interface, it's going to be a "mapped" address, where the v4 address appears in the v6 address [22:56:35] (03PS2) 10Madhuvishy: [WIP] Update entries for analytics1017 and 1021 to notebook1001 and 1002 [dns] - 10https://gerrit.wikimedia.org/r/280593 (https://phabricator.wikimedia.org/T130760) [22:56:35] yeah, you just append the end of the ipv5 address into the ipv6 space for the vlan [22:56:35] it looks far more complex than it really is [22:56:35] madhuvishy: then you edit DNS and use the one you got on the interface from puppet [22:56:38] ipv6 [22:56:42] not ipv5 =P [22:56:51] mutante: uh, i dont think you have to do it via that... [22:57:00] pretty sure you can just make the mapping manually [22:57:05] but both work yes [22:57:30] checking... [22:57:50] example for mapped address: v4 10.64.5.12 v6 2620:0:861:104:10:64:5:12 [22:57:54] yeah [22:57:57] okay I'll leave it for now - and once we update the puppet code for the servers with new hostnames - will add ipv6 entry for this [22:57:59] so all wmnet ipv6 are the same [22:58:02] 2620:0:861:104:ipv4 [22:58:05] ah [22:58:14] its more complicated in public subnets ;] [22:58:27] oh wait [22:58:28] im wrong. [22:58:31] madhuvishy: ^ [22:59:00] the 2620:0:861: stays the same [22:59:09] but the 104 changes depending on something, still looking. [22:59:15] robh: okay :) [22:59:16] likely row [22:59:28] each row is its own private subnet, so lets see what the ipv6 prefix is for these [22:59:53] depends on the row [23:00:04] RoanKattouw ostriches Krenair MaxSem: Dear anthropoid, the time has come. 
Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160330T2300). [23:00:05] Luke081515 MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:14] * Luke081515 is here [23:00:15] * MaxSem looks around [23:00:36] madhuvishy: so analytics1021 is one of these right? [23:00:46] madhuvishy: add interface::add_ip6_mapped { 'main': } on the nodes [23:00:55] robh: an1021 already has it. 1017 is missing ipv6 [23:01:01] ok, cool [23:01:04] Who can SWAT? [23:01:05] ok, it'll be me [23:01:12] ok [23:01:18] 6Operations, 10Ops-Access-Requests, 15User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2163662 (10greg) https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=402850&oldid=402831 and https://wikitech.wikimedia.org/w/... [23:01:19] You can deploy your patch first [23:01:50] nah, yours will go through Zuul much faster anyway [23:01:56] mutante: okay. robh I think it'll be fine if we add the ipv6 entry after puppet says what it is. Would that work? [23:02:00] Krinkle: i checked private puppet but there are no secrets for adminbot and it runs in tool labs. so i dont really know where that is [23:02:13] madhuvishy: yes, exactly [23:02:18] madhuvishy: yep, but now im gonna try to see what cuz it bugs meeeeeeee [23:02:18] ok, I'm ready [23:02:18] heh [23:02:28] PROBLEM - Host hooft is DOWN: PING CRITICAL - Packet loss = 100% [23:02:30] but indeed, you should typically be fine to do it post install [23:02:34] i just dont like that i dont know the answer. 
[23:02:37] robh: okay :) [23:02:59] (03PS3) 10Madhuvishy: Update entries for analytics1017 and 1021 to notebook1001 and 1002 [dns] - 10https://gerrit.wikimedia.org/r/280593 (https://phabricator.wikimedia.org/T130760) [23:03:05] i always just did it that way, let puppet add it to the interface and then use the one you get [23:03:17] that number in between is the row though [23:03:33] found it [23:03:38] (03CR) 10MaxSem: [C: 032] Add movefile to the editor group at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280449 (https://phabricator.wikimedia.org/T131249) (owner: 10Luke081515) [23:03:44] 1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa [23:03:47] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [23:03:54] open that file up, and you can see the ipv6 prefix for each subnet [23:04:08] ; analytics1-c-eqiad (2620:0:861:106::/64) [23:04:19] which is where notebook1001 is [23:04:33] (03Merged) 10jenkins-bot: Add movefile to the editor group at arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280449 (https://phabricator.wikimedia.org/T131249) (owner: 10Luke081515) [23:04:33] so 106 [23:04:49] yep, and you have to add it in wmnet and in the 1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa [23:04:59] i just had to stare at the repo and jog my memory, heh [23:05:21] what mutante suggested would also work and give you the answer i suppose, but i have never done it that way [23:05:28] (both work!) [23:05:53] mutante: k, I'll check on tools [23:06:08] though his way tells you how to format the reverse entry i think, so its easier. 
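The "mapped" scheme mutante and robh describe above — the v4 octets reused verbatim as the last four groups of the v6 address, under the per-row /64 prefix — can be sketched as a small bash helper. This is illustrative, not a WMF tool; the example values are the ones quoted in the log:

```bash
#!/bin/bash
# Illustrative helper (not a WMF script): build the "mapped" IPv6 address
# by reusing the IPv4 octets, verbatim, as the last four groups under the
# row's /64 prefix.
mapped_v6() {
  local prefix=$1 v4=$2
  echo "${prefix}:${v4//./:}"   # each dot in the v4 address becomes a colon
}

# Example from the log, row prefix 2620:0:861:104:
mapped_v6 2620:0:861:104 10.64.5.12   # -> 2620:0:861:104:10:64:5:12
```

The resulting address then goes on an AAAA line right after the host's A record in the wmnet zone template, per mutante's note above.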
[23:06:28] cuz now i have to recall how to do that, and i dont recall, heh [23:06:42] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/280449/ (duration: 00m 43s) [23:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:50] Luke081515, ^ [23:06:53] robh: okay :) I updated the patch for the other things - feel free to review it any time [23:06:53] !log torrus - follow instruction for deadlock problem [23:06:56] * robh resigns himself to leaving it alone for 15 minutes and seeing if it still bugs him [23:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:07:00] cool [23:07:03] will do now [23:07:04] MaxSem: Checked, works. Thanks for SWAT, I will close the task now [23:08:06] madhuvishy: looks good i'm merging [23:08:11] well, rebasing dependency hell [23:08:14] (03PS4) 10RobH: Update entries for analytics1017 and 1021 to notebook1001 and 1002 [dns] - 10https://gerrit.wikimedia.org/r/280593 (https://phabricator.wikimedia.org/T130760) (owner: 10Madhuvishy) [23:08:17] heh [23:08:21] robh: great! 
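Formatting the matching reverse entry for the 1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa zone mentioned above is the part robh says he has to "jog his memory" on: each group is zero-padded to four hex digits, then the whole address is reversed nibble by nibble. A bash sketch of that transformation, again illustrative rather than the tool puppet uses:

```bash
#!/bin/bash
# Illustrative only: produce the nibble-reversed ip6.arpa name for a
# colon-separated IPv6 address (no "::" compression assumed). Each group
# is zero-padded to 4 hex digits, concatenated, then reversed with dots.
v6_reverse() {
  local full="" g out="" i
  IFS=: read -ra groups <<< "$1"
  for g in "${groups[@]}"; do
    full+=$(printf '%04x' "0x$g")
  done
  for ((i=${#full}-1; i>=0; i--)); do
    out+="${full:i:1}."
  done
  echo "${out}ip6.arpa"
}

# Mapped address for notebook-style hosts in analytics1-c-eqiad (2620:0:861:106::/64):
v6_reverse 2620:0:861:106:10:64:5:12
```

The output ends in 1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa, i.e. exactly the zone file robh names in the log.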
thanks :) [23:08:35] !log maxsem@tin Synchronized php-1.27.0-wmf.19/extensions/Kartographer/: https://gerrit.wikimedia.org/r/#/c/280601/ (duration: 00m 34s) [23:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:07] tfinc, works now [23:09:31] (03CR) 10RobH: [C: 032] Update entries for analytics1017 and 1021 to notebook1001 and 1002 [dns] - 10https://gerrit.wikimedia.org/r/280593 (https://phabricator.wikimedia.org/T130760) (owner: 10Madhuvishy) [23:09:32] MaxSem: you win this one :) [23:09:44] yurik: --^ [23:10:13] awesome, thx MaxSem :) [23:10:42] !log maxsem@tin Synchronized php-1.27.0-wmf.18/extensions/Kartographer/: https://gerrit.wikimedia.org/r/#/c/280600/ (duration: 00m 31s) [23:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:11:02] madhuvishy: dns change is live [23:11:08] coool [23:11:12] anybody needs stuff deployed? [23:11:27] this swat was too quick! [23:11:41] robh: now puppet I suppose. I should first remove all the old references, and then patch with updated new ones right? [23:12:01] madhuvishy: well, if the old references are in dhcpd, just change the hostnames for the install_server module update [23:12:10] but for the netboot.cfg you wanna change [23:12:18] ideally we always put things in an lvm, and dual disks become raid 1 [23:12:39] so in the install_server module, in netboot.cfg its likely not in lvm [23:13:08] * madhuvishy looks at puppet [23:13:24] analytics1017) echo partman/raid1-30G.cfg ;; \ [23:13:25] analytics1015) echo partman/raid1-lvm-ext4.cfg ;; \ [23:13:37] so no entry for 1021 that i can see at a quick glance [23:13:46] but, neither of those are great... these are both... checking.. [23:14:03] oh, these are huge boxes [23:14:14] they have raid10 hw likely enabled, so no raid crap in partman. 
[23:15:20] robh: i understand very little of these things at this point [23:15:34] ok, no worries, im about to give you the answer =] [23:15:52] 1021 might have been decommissioned before. [23:16:08] first step is someone in ops has to reinstall these for you, since you wont be able to connect to the drac anyhow [23:16:17] lets see if it has an entry [23:16:17] robh: okay [23:16:50] yuvipanda: will you do the reinstalls? [23:16:57] I could, yeah :) [23:17:11] hrmm, 1021 doesnt. so since I'll have to pull that i may as well pull the mac info and make a patchset. typically i do this stuff for most folks [23:17:20] but you guys knew what you wanted and got it approved and stuff [23:17:25] or yuvi can ;D [23:17:35] robh: I'll be grateful if you could :D [23:17:39] yuvipanda: have you polled drac interfaces for mac info before? [23:17:44] heh [23:17:44] nope [23:17:49] I've used the reinstall script in palladium [23:17:51] that's about it [23:17:52] yeah, no worries [23:18:00] lemme poke at these, we need to confirm raid settings [23:18:08] gimme a few minutes [23:18:10] ok! :D [23:18:50] are these jessie or trusty? [23:19:04] robh: they need to be jessie, yeah [23:19:08] not sure what they are now [23:19:50] torrrrus.. i am still trying to get it back [23:21:24] robh: thank you! [23:21:43] welcome [23:21:48] ACKNOWLEDGEMENT - Host hooft is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reinstall [23:23:42] (03PS1) 10RobH: setting install params for notebook100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/280604 [23:24:16] did you guys decommission the analytics1017 and analytics1021 names out of icinga? [23:24:49] cuz you cannot reinstall or offline these until you do that, or you'll generate alerts for the old hostnames. [23:25:05] robh: not sure. 
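The netboot.cfg lines madhuvishy quotes above are arms of a shell case statement mapping hostname to a d-i partman recipe. A stripped-down illustration of the pattern — the analytics recipes are the ones from the log, while the notebook entry and the default are hypothetical placeholders, not the actual repo contents:

```bash
#!/bin/bash
# Illustration of the netboot.cfg pattern quoted in the log: the installer
# asks which partman recipe a host gets, and a case statement answers.
# notebook100[12] and the fallback recipe below are hypothetical examples.
partman_recipe() {
  case $1 in
    analytics1017) echo partman/raid1-30G.cfg ;;
    analytics1015) echo partman/raid1-lvm-ext4.cfg ;;
    notebook100[12]) echo partman/raid10-gpt.cfg ;;  # hypothetical recipe name
    *) echo partman/standard.cfg ;;                  # hypothetical default
  esac
}

partman_recipe analytics1017   # -> partman/raid1-30G.cfg
```

This is why robh notes the huge raid10-hw boxes need "no raid crap in partman": the controller presents one virtual disk, so the recipe can stay simple.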
they have been unused for a while [23:25:19] then you have to assume they havent [23:25:30] or i guess i will, but i dont wanna babysit neon =p [23:26:03] I'm looking at icinga now [23:26:04] stop puppet agent, run puppetstoredconfigclean.rb on master, run puppet on neon [23:26:06] yuvipanda: i'll take care of puppet and salt keys, and icinga. but you guys need to ensure all other cruft is gone [23:26:21] all of this is in the lifecycle steps as well [23:26:29] but if puppet runs again it will add itself back [23:26:45] yeah, 1021 and 1017 are both in icinga [23:27:09] urgh, that should have been done before this stuff [23:27:30] * yuvipanda reads https://wikitech.wikimedia.org/wiki/Server_Lifecycle [23:27:46] so yeah, review the lifecycle, and decom the use of the old names in icinga [23:28:05] (or i can if you dont wanna see how, but then next time you ask to skip the line im not doing it ;) [23:28:24] cuz you totally just jumped queue on other hw-requests in front of you [23:28:24] heh [23:28:25] hehe [23:28:26] yeah [23:29:10] im on the serial console for 1017 [23:29:14] i'll stop its puppet now [23:29:26] ok [23:29:44] * yuvipanda is getting on neon [23:30:03] well, first you have to do the puppetstoredconfigclean [23:30:06] that mutante pointed out [23:30:12] and unsign puppet and salt [23:30:57] hmm [23:30:59] where's puppetstoredconfigclean.rb [23:31:16] I can't seem to find it in /usr/local/bin [23:31:25] bah [23:31:27] nvm [23:31:29] I'm an idiot [23:31:39] Manually run puppetstoredconfigclean.rb on the puppet master. [23:31:43] sudo on palladium [23:31:50] yup, I was on neon [23:31:56] > Can't find host analytics1017.eqiad.wmnet. 
[23:32:05] it's already gone from icinga [23:32:15] cool, if it doesnt show in icinga someone likely did it already [23:32:27] it was in icinga 10mins ago :D [23:32:36] puppet is halted on both machines, so now they wont add themselves back into puppet [23:32:39] and thus back to icinga [23:32:49] right ok [23:33:02] clearing out 1021 now [23:33:04] then I"ll unsign both [23:33:15] cool, once neon runs puppet after you do that then i can reboot them both into their bios [23:33:25] and confirm their raid is still raid10 and wasnt changed to anything wonky [23:33:28] then reinstall =] [23:33:58] (03CR) 10RobH: [C: 032] setting install params for notebook100[1-2] [puppet] - 10https://gerrit.wikimedia.org/r/280604 (owner: 10RobH) [23:34:31] !log cleaned out puppet cert for analytics 1017 and 1021 [23:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:34:47] also make sure to check salt keys on neodymium [23:35:20] yeah doing that now [23:35:53] !log cleaned salt keys for analytics1017 and 1021 [23:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:18] I see the puppetrun in neon just took out 1021 [23:36:24] let me wait for it to complete [23:37:10] * yuvipanda waits [23:37:52] ok, checking notebook1001's bios/raid cuz it wasnt in icinga anyhow [23:37:57] (well, 1015 wasnt) [23:37:59] sorry17 [23:38:07] numbers are hard. [23:38:22] 6Operations, 10Ops-Access-Requests: Grant reedy access to librenms - https://phabricator.wikimedia.org/T131252#2161064 (10Krenair) `ls: cannot access /srv/deployment/librenms/librenms/config.php: No such file or directory` [23:38:33] icinga is clean [23:39:53] Surely it needs to be run on netmon1001 as where it's hosted? [23:40:12] Reedy: where whats hosted? [23:40:22] librenms? [23:40:28] "Currently hosted on netmon1001." [23:40:45] oh, you are referring to older discussion, sorry misunderstood, heh [23:40:55] oh, wikibugs, i just phase it out. 
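The decommission choreography robh and yuvipanda walk through above — stop the agent, clean stored configs on the puppet master, unsign the cert and salt key, then let neon's puppet run drop the icinga checks — can be captured as a dry-run checklist. The commands echoed are the ones named in the log or standard puppet/salt invocations; the wrapper function is illustrative and only prints, it executes nothing:

```bash
#!/bin/bash
# Dry-run sketch of the decommission steps discussed above. It prints the
# commands named in the log rather than running them; nothing here talks
# to real infrastructure.
decom_steps() {
  local host=$1
  echo "puppet agent --disable 'decom'      # on $host itself"
  echo "puppetstoredconfigclean.rb $host    # on the puppet master (palladium)"
  echo "puppet cert clean $host             # unsign the puppet cert"
  echo "salt-key -d $host                   # on the salt master (neodymium)"
  echo "puppet agent -t                     # on the icinga host (neon) to drop checks"
}

decom_steps analytics1017.eqiad.wmnet
```

As the log notes, order matters: if the agent is still running on the host, it re-registers itself and the checks reappear in icinga on neon's next puppet run.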
heh [23:41:04] (its not ignored, cuz technically we spam with it so i need to hear it) [23:41:06] hehe [23:41:52] 6Operations: torrus broken - https://phabricator.wikimedia.org/T131329#2163811 (10Dzahn) [23:41:53] yuvipanda: So, it seems that analytics wasnt using the hw raid on these [23:42:07] i imagine you would want to, since it's hw raid and takes less cpu cycles [23:42:16] typically we setup a raid10 [23:42:18] !log maxsem@tin Synchronized php-1.27.0-wmf.19/extensions/Kartographer/: (no message) (duration: 00m 27s) [23:42:23] robh: +1 [23:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:29] cool, i'll make it happen now [23:42:33] how difficult is that? [23:42:36] cooool [23:42:42] thanks! [23:45:14] hrmm [23:45:20] wtf... it says applying changes in raid but doesnt [23:46:02] lemme check 1002 [23:47:12] it says it can take minutes, but gives no progress indicator on either [23:47:20] so im just going to set a 5 minutes timer and see if it's done then [23:52:42] ok! [23:53:04] and nope [23:53:06] hrmm... [23:53:34] oh, hell [23:53:37] these are h310s [23:53:42] this explains it all. [23:53:57] yuvipanda: we likely need to have chris swap out the h310 controllers on these before you want to use them [23:54:04] checking to ensure we have h710s on site [23:54:42] yuvipanda: so we have 10 replacement controllers, to go in exactly these kinds of systems [23:54:51] ones ordered with the shitty controller and not yet swapped out since they were in use [23:55:02] robh: ouch, I see. [23:55:05] if you guys can afford to wait another day, chris can install the proper controllers, setup raid10, and we can install [23:55:07] robh: ok! [23:55:26] i'll create the onsite task as a blocker to T130760 [23:55:27] T130760: Allocate 2 analytics machines to experiment with a jupyterhub notebook service - https://phabricator.wikimedia.org/T130760 [23:55:35] awesome! 
thanks [23:57:25] 6Operations, 10ops-eqiad: replace h310 with h710 controller in notebook1001 & notebook1002 poweredge r720xd systems - https://phabricator.wikimedia.org/T131331#2163887 (10RobH) [23:57:44] that was annoying the crap out of me, and then i decided to start over again and saw it was a shite controller [23:57:50] i thought we had replaced them all by now, heh [23:58:13] yuvipanda: so once the controller is swapped, carbon is ready for the install, has all the new info for partitioning and such [23:58:30] the blocking task asks chris to setup a raid10, so then you'll be set to pxe boot into the installer. [23:59:09] though now i'm making a note that i bet all the analytics 720xd have the crap h310 [23:59:18] and need swap eventually. [23:59:28] (03PS2) 10Krinkle: errorpages: Remove X-Wikimedia-Debug header from 404.php response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279571 [23:59:35] (03PS3) 10Krinkle: errorpages: Clean up 404.php code and simplify replacement url [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279572