[00:00:51] ebernhardson? [00:00:59] Krenair: works as expected [00:01:10] !log krenair Synchronized php-1.25wmf21/extensions/Flow/container.php: https://gerrit.wikimedia.org/r/#/c/199168/ (duration: 00m 05s) [00:01:14] Logged the message, Master [00:01:15] ebernhardson, ^ [00:02:15] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [00:02:34] Krenair: works there too, thanks [00:02:39] yw [00:02:43] (swat is over) [00:13:22] (03PS4) 10Gergő Tisza: [WIP] Make vbench more generic [puppet] - 10https://gerrit.wikimedia.org/r/197240 (https://phabricator.wikimedia.org/T92701) [00:14:15] (03PS5) 10Gergő Tisza: [WIP] Make vbench more generic [puppet] - 10https://gerrit.wikimedia.org/r/197240 (https://phabricator.wikimedia.org/T92701) [00:28:16] RECOVERY - Disk space on palladium is OK: DISK OK [00:34:00] (03CR) 10Dzahn: "watched iptables counters for unexpectedly dropped packets. ganglia still working. lgtm." [puppet] - 10https://gerrit.wikimedia.org/r/172434 (owner: 10John F. Lewis) [00:37:05] (03CR) 10Gergő Tisza: "Ori: I pasted the output to T92701#1143733." [puppet] - 10https://gerrit.wikimedia.org/r/197240 (https://phabricator.wikimedia.org/T92701) (owner: 10Gergő Tisza) [00:37:22] (03CR) 10Gergő Tisza: "https://phabricator.wikimedia.org/T92701#1143733" [puppet] - 10https://gerrit.wikimedia.org/r/197240 (https://phabricator.wikimedia.org/T92701) (owner: 10Gergő Tisza) [00:38:29] !log restarting cassandra on restbase1006 [00:38:32] Logged the message, Master [00:49:05] 6operations, 5Patch-For-Review: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1143780 (10Dzahn) we need to provide a dump of the data and provide it to @AKoval_WMF [01:04:18] 10Ops-Access-Requests, 6operations, 7database: Can't access x1-analytics-slave - https://phabricator.wikimedia.org/T93708#1143812 (10Mattflaschen) 3NEW [01:22:10] 6operations, 7Monitoring: Restrict edit rights in grafana? - https://phabricator.wikimedia.org/T93710#1143848 (10GWicke) 3NEW [01:22:21] 6operations, 7Monitoring: Restrict edit rights in grafana? - https://phabricator.wikimedia.org/T93710#1143855 (10GWicke) [01:47:33] 6operations: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1143904 (10Dzahn) [01:48:14] !log deployed scap/scap-sync-20150324-014557 [01:48:19] Logged the message, Master [01:50:17] (03PS1) 10Spage: Redirect dev.wikimedia.org URLs [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) [01:52:48] 7Puppet, 6Multimedia, 6Release-Engineering, 6Scrum-of-Scrums, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1143911 (10Tgr) I almost forgot, but in production we should use pgsql, not mysql. The Sentry devs said mysql is not really supported - it works, but pe... [01:55:22] (03CR) 10Dzahn: "please run "refreshDomainRedirects" locally, that should create ../redirects.conf. from the .dat. then upload both" [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage) [01:58:09] (03CR) 10Dzahn: "jenkins used to detect this in the past, as in "operations/apache-config.git , now detects rewrites.conf not being updated when redirects."
[puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage) [02:00:28] 6operations, 10Continuous-Integration: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1143917 (10Dzahn) example why this check is missed: https://gerrit.wikimedia.org/r/#/c/199182/1 [02:28:14] !log l10nupdate Synchronized php-1.25wmf21/cache/l10n: (no message) (duration: 04m 57s) [02:28:23] Logged the message, Master [02:31:46] !log LocalisationUpdate completed (1.25wmf21) at 2015-03-24 02:30:42+00:00 [02:31:50] Logged the message, Master [02:33:40] (03CR) 10jenkins-bot: [V: 04-1] Redirect dev.wikimedia.org URLs [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage) [02:34:27] (03CR) 10Legoktm: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage) [02:50:32] !log l10nupdate Synchronized php-1.25wmf22/cache/l10n: (no message) (duration: 05m 04s) [02:50:39] Logged the message, Master [02:54:04] !log LocalisationUpdate completed (1.25wmf22) at 2015-03-24 02:53:01+00:00 [02:54:08] Logged the message, Master [02:59:01] !log zirconium - tmp. disable puppet, tmp. enable contacts to make dump, make myself admin of civicrm [02:59:07] Logged the message, Master [03:05:16] PROBLEM - puppet last run on mw2100 is CRITICAL: CRITICAL: puppet fail [03:24:15] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [03:31:41] (03CR) 10Krinkle: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/199182 (https://phabricator.wikimedia.org/T372) (owner: 10Spage) [04:09:17] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:13:26] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59673 bytes in 0.056 second response time [04:16:06] PROBLEM - Disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /srv/ssd 6023 MB (3% inode=87%): [04:20:36] PROBLEM - Disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /srv/ssd 5765 MB (3% inode=87%): [04:22:35] <^d> bleh, still? [04:25:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [04:26:26] PROBLEM - HTTP 5xx req/min on graphite2001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [04:26:35] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [04:30:17] ^ 5xx/min is tiny peaky spikes, it's just me doing cache main stuff [04:30:23] s/main/maint/ [04:38:26] RECOVERY - Graphite Carbon on graphite2001 is OK: OK: All defined Carbon jobs are runnning. [04:42:56] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. 
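An aside on the zirconium !log entry above: temporarily disabling puppet before a one-off manual change is the standard move here, since the next scheduled agent run would otherwise revert the change mid-work. A minimal sketch of that workflow, with a hypothetical reason string (the real host and steps are as logged above):

    # pause puppet so it does not revert the manual change (reason is illustrative)
    sudo puppet agent --disable 'tmp: enabling contacts to take a dump'
    # ... do the manual work, e.g. take the database dump ...
    sudo puppet agent --enable
    sudo puppet agent --test   # converge the host back to its puppetized state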
[04:43:37] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:44:06] RECOVERY - HTTP 5xx req/min on graphite2001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:44:07] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [04:58:56] RECOVERY - Disk space on gallium is OK: DISK OK [05:11:26] PROBLEM - Disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /srv/ssd 5169 MB (3% inode=86%): [05:26:22] (03CR) 10Glaisher: Add import sources for cawikinews (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198786 (https://phabricator.wikimedia.org/T93203) (owner: 10Gerardduenas) [06:13:51] (03PS1) 10Yuvipanda: tools: Specify full path for uwsgi mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/199196 [06:13:55] ori: ^ [06:14:09] (03PS2) 10Yuvipanda: tools: Specify full path for uwsgi mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/199196 [06:14:19] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Specify full path for uwsgi mountpoint [puppet] - 10https://gerrit.wikimedia.org/r/199196 (owner: 10Yuvipanda) [06:20:54] ori: yup, that did indeed fix it :) [06:20:59] now I’m curious how it worked at all before [06:21:19] ori: I’m letting it run uwsgi itself atm :) [06:26:54] (03CR) 10Ori.livneh: "already merged, but confirming that it looks sane to me" [puppet] - 10https://gerrit.wikimedia.org/r/199196 (owner: 10Yuvipanda) [06:30:06] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:06] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:36] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:46] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:23] YuviPanda|zz: thanks! 
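On the uwsgi mountpoint fix self-merged above (gerrit 199196): uwsgi's mount option maps a URL prefix to a WSGI entry point, and a relative script path resolves against whatever working directory the daemon happens to start in, so spelling out the full path is the robust form. A rough sketch of the idea, with a hypothetical tool name and paths (not the actual Tool Labs config):

    [uwsgi]
    ; fragile: resolved relative to uwsgi's cwd, so it can break depending on how uwsgi is launched
    ; mount = /mytool=app.py
    ; robust: full path to the application, as in the fix
    mount = /mytool=/data/project/mytool/src/app.py
    manage-script-name = true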
[06:31:27] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:24] ori: :) [06:33:25] PROBLEM - puppet last run on amssq46 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:46] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:26] PROBLEM - puppet last run on mw2097 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:26] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:16] PROBLEM - puppet last run on mw2003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:55] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: Puppet has 1 failures [06:42:41] (03PS1) 10Chmarkine: donate - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/199200 (https://phabricator.wikimedia.org/T40516) [06:45:52] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Mar 24 06:44:46 UTC 2015 (duration 44m 45s) [06:46:00] Logged the message, Master [06:46:06] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:46:07] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:46:16] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on amssq46 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:46:45] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:47:36] RECOVERY - puppet last run on mw2097 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:35] RECOVERY - puppet last run on mw2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:49:06] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:49:45] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:16] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:56] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:55:49] (03PS6) 10Gergő Tisza: Make vbench more generic [puppet] - 10https://gerrit.wikimedia.org/r/197240 (https://phabricator.wikimedia.org/T92701) [07:02:15] (03CR) 10Gergő Tisza: "Note that the ve role hasn't been fully tested as it cannot be installed on a labs machine due to resource conflicts." 
[puppet] - 10https://gerrit.wikimedia.org/r/197240 (https://phabricator.wikimedia.org/T92701) (owner: 10Gergő Tisza) [07:27:34] (03PS7) 10Gergő Tisza: Make vbench more generic [puppet] - 10https://gerrit.wikimedia.org/r/197240 (https://phabricator.wikimedia.org/T92701) [07:30:45] (03PS8) 10Gergő Tisza: Make vbench more generic [puppet] - 10https://gerrit.wikimedia.org/r/197240 (https://phabricator.wikimedia.org/T92701) [07:53:46] PROBLEM - puppet last run on db2003 is CRITICAL: CRITICAL: puppet fail [08:09:28] PROBLEM - puppet last run on mw2013 is CRITICAL: CRITICAL: Puppet has 1 failures [08:12:20] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:21:39] good morning [08:25:50] RECOVERY - puppet last run on mw2013 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [08:35:36] !log restarting Jenkins for some plugins upgrades [08:35:39] Logged the message, Master [08:38:39] RECOVERY - Graphite Carbon on graphite2001 is OK: OK: All defined Carbon jobs are runnning. [08:42:04] PROBLEM - Graphite Carbon on graphite2001 is CRITICAL: CRITICAL: Not all configured Carbon instances are running. [08:42:54] 7Puppet, 6Multimedia, 6Release-Engineering, 6Scrum-of-Scrums, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1144467 (10Gilles) Excellent, P421 updated and sentry itself builds ok. Next I'm going to submit these debs for review. I expect that there might be som... [08:43:35] PROBLEM - nutcracker process on mw2077 is CRITICAL: Connection refused by host [08:44:03] PROBLEM - puppet last run on mw2077 is CRITICAL: Connection refused by host [08:44:14] PROBLEM - salt-minion processes on mw2077 is CRITICAL: Connection refused by host [08:44:34] PROBLEM - DPKG on mw2077 is CRITICAL: Connection refused by host [08:44:44] PROBLEM - Disk space on mw2077 is CRITICAL: Connection refused by host [08:45:33] PROBLEM - RAID on mw2077 is CRITICAL: Connection refused by host [08:45:55] PROBLEM - configured eth on mw2077 is CRITICAL: Connection refused by host [08:46:15] PROBLEM - dhclient process on mw2077 is CRITICAL: Connection refused by host [08:46:24] PROBLEM - nutcracker port on mw2077 is CRITICAL: Connection refused by host [08:54:50] 6operations, 10Continuous-Integration, 3Continuous-Integration-Isolation, 7Upstream: Create a Debian package for NodePool - https://phabricator.wikimedia.org/T89142#1144478 (10hashar) [08:55:03] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 3Continuous-Integration-Isolation, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1144479 (10hashar) [08:56:33] RECOVERY - RAID on mw2077 is OK: OK: no RAID installed [08:56:53] RECOVERY - salt-minion processes on mw2077 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:57:03] RECOVERY - configured eth on mw2077 is OK: NRPE: Unable to read output [08:57:04] RECOVERY - DPKG on mw2077 is OK: All packages OK [08:57:14] RECOVERY - dhclient process on mw2077 is OK: PROCS OK: 0 processes with command name dhclient [08:57:23] RECOVERY - Disk space on mw2077 is OK: DISK OK [08:57:24] RECOVERY - nutcracker port on mw2077 is OK: TCP OK - 0.000 second response time on port 11212 [08:57:44] RECOVERY - nutcracker process on mw2077 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [09:00:46] 6operations, 10ops-codfw: mw2050 management unreachable - 
https://phabricator.wikimedia.org/T93729#1144483 (10Joe) 3NEW [09:01:05] ACKNOWLEDGEMENT - puppet last run on ms-be1016 is CRITICAL: CRITICAL: Puppet has 2 failures Filippo Giunchedi https://phabricator.wikimedia.org/T93614 [09:01:08] ACKNOWLEDGEMENT - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Puppet has 2 failures Filippo Giunchedi https://phabricator.wikimedia.org/T93614 [09:01:18] ACKNOWLEDGEMENT - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Puppet has 2 failures Filippo Giunchedi https://phabricator.wikimedia.org/T93614 [09:07:15] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: Puppet has 6 failures [09:08:55] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [09:14:59] 6operations, 10Continuous-Integration: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1144521 (10hashar) Nobody bothered reviewing the patches I proposed back in October / November though :( I kind of lost interest in pushing this though. Quo... [09:16:10] 6operations, 6Analytics-Engineering, 7Graphite, 7Icinga, 5Patch-For-Review: icinga UNKNOWN Varnishkafka Delivery Errors / varnishkafka data not in graphite - https://phabricator.wikimedia.org/T92965#1144522 (10fgiunchedi) 5Open>3Resolved alarms created, resolving [09:23:19] PROBLEM - DPKG on mw2053 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:23:39] PROBLEM - Disk space on mw2053 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:24:29] PROBLEM - RAID on mw2053 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:24:59] PROBLEM - configured eth on mw2053 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:25:09] PROBLEM - dhclient process on mw2053 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:25:29] PROBLEM - nutcracker port on mw2053 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:25:39] PROBLEM - nutcracker process on mw2053 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:25:58] PROBLEM - puppet last run on mw2053 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:26:20] RECOVERY - DPKG on mw2053 is OK: All packages OK [09:26:28] RECOVERY - configured eth on mw2053 is OK: NRPE: Unable to read output [09:26:39] RECOVERY - dhclient process on mw2053 is OK: PROCS OK: 0 processes with command name dhclient [09:26:48] RECOVERY - Disk space on mw2053 is OK: DISK OK [09:26:58] RECOVERY - nutcracker port on mw2053 is OK: TCP OK - 0.000 second response time on port 11212 [09:27:09] RECOVERY - nutcracker process on mw2053 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [09:27:29] RECOVERY - RAID on mw2053 is OK: OK: no RAID installed [09:31:00] PROBLEM - Host eventlog1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:10] PROBLEM - Host rcs1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:22] hm phabricator is broken [09:31:39] PROBLEM - Host erbium is DOWN: PING CRITICAL - Packet loss = 100% [09:31:40] (03PS1) 10Giuseppe Lavagetto: mediawiki: add dsh entries for codfw [puppet] - 10https://gerrit.wikimedia.org/r/199230 [09:31:41] aude, how?
[09:31:46] server error [09:31:49] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:49] PROBLEM - Host osmium is DOWN: PING CRITICAL - Packet loss = 100% [09:31:59] PROBLEM - Host graphite1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:59] PROBLEM - Host logstash1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:59] PROBLEM - Host stat1003 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:59] PROBLEM - Host rdb1001 is DOWN: PING CRITICAL - Packet loss = 100% [09:31:59] PROBLEM - Host hafnium is DOWN: PING CRITICAL - Packet loss = 100% [09:32:00] PROBLEM - Host platinum is DOWN: PING CRITICAL - Packet loss = 100% [09:32:05] <_joe_> what? [09:32:07] but then why are all these hosts down? [09:32:07] ah... yeah. [09:32:10] PROBLEM - Host caesium is DOWN: PING CRITICAL - Packet loss = 100% [09:32:10] PROBLEM - Host lead is DOWN: PING CRITICAL - Packet loss = 100% [09:32:10] PROBLEM - Host logstash1003 is DOWN: PING CRITICAL - Packet loss = 100% [09:32:10] PROBLEM - Host gold is DOWN: PING CRITICAL - Packet loss = 100% [09:32:10] PROBLEM - Host logstash1002 is DOWN: PING CRITICAL - Packet loss = 100% [09:32:10] PROBLEM - Host gadolinium is DOWN: PING CRITICAL - Packet loss = 100% [09:32:10] PROBLEM - Host iridium is DOWN: PING CRITICAL - Packet loss = 100% [09:32:17] 503 here too for phabricator. [09:32:19] PROBLEM - Host labsdb1006 is DOWN: PING CRITICAL - Packet loss = 100% [09:32:19] PROBLEM - Host labsdb1007 is DOWN: PING CRITICAL - Packet loss = 100% [09:32:44] <_joe_> shit [09:33:49] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:34:10] PROBLEM - Disk space on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:35:02] :( [09:35:10] phabricator down o_O [09:35:20] 503 [09:35:28] Steinsplitter: yeah [09:35:31] known, but thanks [09:36:13] :-) [09:36:29] PROBLEM - puppet last run on mw2053 is CRITICAL: CRITICAL: Puppet has 6 failures [09:39:04] <_joe_> Steinsplitter: thanks, logstash, graphite and hadoop are down as well, btw [09:39:22] <_joe_> it's a full rack that is suddenly unreachable [09:39:36] sounds like that [09:39:40] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [09:40:53] phabricator is dead :( [09:41:16] yurik: apparently some rack is down [09:41:28] it's on iridium [09:41:31] yurik: A whole rack is dead or cut off [09:41:32] according to puppet [09:41:36] most probably either power failure or a network switch [09:41:39] icinga-wm> PROBLEM - Host iridium is DOWN: PING CRITICAL - Packet loss = 100% [09:41:39] yep [09:41:44] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: add dsh entries for codfw [puppet] - 10https://gerrit.wikimedia.org/r/199230 (owner: 10Giuseppe Lavagetto) [09:42:49] i wonder where i should file a bug if phabricator dies :D i'm off into the sky, heading to NYC... see everyone soon on the other side of the pond [09:43:08] see ya! [09:43:49] <_joe_> hoo: yes, eqiad rack c4 is out of the network [09:43:57] <_joe_> I suspect some switch problem [09:45:50] _joe_: please !log it [09:46:02] <_joe_> matanya: what should I log, sorry? [09:46:31] c4 is out of network ?
[09:46:37] <_joe_> nope [09:46:48] <_joe_> we log only actions that we perform out of puppet [09:46:57] <_joe_> we don't log outages to the SAL [09:47:53] ah, thanks and sorry for the noise [09:50:09] PROBLEM - HHVM busy threads on mw1034 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [86.4] [09:50:09] <_joe_> !log running scap sync-common on all codfw mw* servers so that they don't kill scap on next deploy [09:50:18] Logged the message, Master [09:50:18] PROBLEM - HHVM queue size on mw1034 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [80.0] [09:50:19] RECOVERY - Host gadolinium is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [09:50:19] RECOVERY - Host logstash1002 is UP: PING OK - Packet loss = 0%, RTA = 1.99 ms [09:50:19] RECOVERY - Host graphite1001 is UP: PING OK - Packet loss = 0%, RTA = 3.20 ms [09:50:19] RECOVERY - Host labsdb1006 is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [09:50:19] RECOVERY - Host eventlog1001 is UP: PING OK - Packet loss = 0%, RTA = 2.05 ms [09:50:20] RECOVERY - Host gold is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [09:50:20] RECOVERY - Host rdb1001 is UP: PING OK - Packet loss = 0%, RTA = 1.63 ms [09:50:21] RECOVERY - Host iridium is UP: PING OK - Packet loss = 0%, RTA = 1.12 ms [09:50:21] RECOVERY - Host erbium is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms [09:50:22] <_joe_> we're back [09:50:22] RECOVERY - Host labsdb1007 is UP: PING OK - Packet loss = 0%, RTA = 1.39 ms [09:50:22] RECOVERY - Host logstash1003 is UP: PING OK - Packet loss = 0%, RTA = 0.52 ms [09:50:23] RECOVERY - Host hafnium is UP: PING OK - Packet loss = 0%, RTA = 2.43 ms [09:50:23] RECOVERY - Host rcs1001 is UP: PING OK - Packet loss = 0%, RTA = 2.27 ms [09:50:23] <_joe_> :) [09:50:29] RECOVERY - Host osmium is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [09:50:29] RECOVERY - Host stat1003 is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms [09:50:29] RECOVERY - Host platinum is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [09:50:29] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [09:50:29] RECOVERY - Host lead is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [09:50:29] RECOVERY - Host logstash1001 is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [09:50:37] yay, Phab back. Thanks. [09:50:40] yupii [09:50:43] 7Puppet, 6Multimedia, 6Release-Engineering, 6Scrum-of-Scrums, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1144532 (10Gilles) Of course, I overlooked one very important issue: versions. Looking at Debian Jessie, I would have to downgrade 10 stock jessie pytho... 
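For context on _joe_'s !log above: running sync-common across all codfw mw* hosts leans on the dsh group machinery that the just-merged "add dsh entries for codfw" change (gerrit 199230) feeds. A hedged sketch of that kind of fan-out; the group name mediawiki-installation is taken from checks that appear later in this log, while the exact flags and command form are illustrative:

    # run a command on every host in a dsh group, concurrently (-c),
    # prefixing each output line with the machine name (-M)
    dsh -g mediawiki-installation -M -c -- sync-common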
[09:50:58] RECOVERY - Disk space on analytics1027 is OK: DISK OK [09:51:09] <_joe_> I did nothing, thank paravoid [09:51:24] <_joe_> just noticed "HHVM busy threads on mw1034 is CRITICAL" [09:51:28] <_joe_> which comes from graphite [09:53:39] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 81 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 75, timed_out: False, active_primary_shards: 49, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 66, initializing_shards: 6, number_of_data_nodes: 3 [09:53:50] PROBLEM - puppet last run on platinum is CRITICAL: CRITICAL: puppet fail [09:53:58] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 81 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 75, timed_out: False, active_primary_shards: 49, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 66, initializing_shards: 6, number_of_data_nodes: 3 [09:54:08] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: puppet fail [09:54:08] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: puppet fail [09:54:09] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: puppet fail [09:54:19] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: puppet fail [09:54:19] PROBLEM - puppet last run on erbium is CRITICAL: CRITICAL: puppet fail [09:54:20] PROBLEM - puppet last run on lead is CRITICAL: CRITICAL: puppet fail [09:54:28] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 74 threshold =0.1% breach: status: yellow, number_of_nodes: 3, unassigned_shards: 68, timed_out: False, active_primary_shards: 49, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 73, initializing_shards: 6, number_of_data_nodes: 3 [09:54:28] PROBLEM - puppet last run on gold is CRITICAL: CRITICAL: puppet fail [09:54:29] PROBLEM - puppet last run on caesium is CRITICAL: CRITICAL: puppet fail [09:54:58] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [09:57:30] PROBLEM - puppet last run on virt1011 is CRITICAL: CRITICAL: puppet fail [09:58:39] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:58:59] RECOVERY - puppet last run on erbium is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [10:01:02] logstash ES is likely confused, I think it might be recovering tho [10:01:26] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:02:35] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [10:02:45] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [10:02:46] RECOVERY - puppet last run on gold is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [10:03:25] RECOVERY - puppet last run on caesium is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:03:46] RECOVERY - puppet last run on platinum is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [10:03:49] thanks _joe_ ! 
[10:03:59] and paravoid [10:04:57] <_joe_> jobrunners look healthy now, after a 20 min hiatus [10:05:15] RECOVERY - puppet last run on lead is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:07:09] someone here familiar with mailman? [10:08:11] 7Puppet, 6Multimedia, 6Release-Engineering, 6Scrum-of-Scrums, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1144543 (10Gilles) Posted the request for repo creation here https://www.mediawiki.org/wiki/Git/New_repositories/Requests [10:08:31] 6operations, 10ops-eqiad: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1144544 (10faidon) 3NEW [10:09:24] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:11:25] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] servermon - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/199134 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [10:18:26] back ... [10:30:05] hoo: Dear anthropoid, the time has come. Please deploy Deploy Capiunto on beta (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150324T1030). [10:32:31] (03PS1) 10Giuseppe Lavagetto: dsh: add more codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/199240 [10:34:49] <_joe_> hoo: look very hard for issues with hhvm and the lua extension [10:34:54] <_joe_> when deploying that [10:35:26] RECOVERY - puppet last run on virt1011 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [10:38:05] hey what's up? [10:38:38] _joe_: Ok... will do [10:38:55] what was wrong with rack c4? [10:39:17] <_joe_> mark: asw-c4-eqiad had to be rebooted again, as in https://wikitech.wikimedia.org/wiki/Incident_documentation/20141130-Eqiad-Rack-C4 [10:39:25] hm [10:39:30] aude: https://gerrit.wikimedia.org/r/198776 [10:40:33] (03PS2) 10Giuseppe Lavagetto: dsh: add more codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/199240 [10:40:59] _joe_: is anyone writing an incident report? [10:41:16] <_joe_> mark: not sure, maybe paravoid? [10:41:25] <_joe_> he did reboot the switch [10:41:31] ok [10:41:33] <_joe_> if not, I can do it for sure [10:41:38] i'll ask [10:41:50] (03CR) 10Giuseppe Lavagetto: [C: 032] dsh: add more codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/199240 (owner: 10Giuseppe Lavagetto) [10:41:54] (03CR) 10Aude: [C: 031] "looks like this should work." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198776 (https://phabricator.wikimedia.org/T93418) (owner: 10Hoo man) [10:42:14] aude: Thanks :) [10:42:36] (03PS2) 10Hoo man: Deploy Capiunto on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198776 (https://phabricator.wikimedia.org/T93418) [10:43:21] (03CR) 10Hoo man: [C: 032] Deploy Capiunto on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198776 (https://phabricator.wikimedia.org/T93418) (owner: 10Hoo man) [10:44:46] (03Merged) 10jenkins-bot: Deploy Capiunto on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198776 (https://phabricator.wikimedia.org/T93418) (owner: 10Hoo man) [10:47:58] mh... change doesn't propagate to beta [10:49:49] _joe_: Did you add the codfw servers to scap? [10:50:03] -> dsh [10:50:33] <_joe_> hoo: just now [10:50:36] <_joe_> hoo: just now? [10:50:38] Ok [10:50:47] Let me fix that [10:50:50] !log hoo Synchronized wmf-config/: Deploy Capiunto on beta, for consistency (duration: 01m 44s) [10:50:53] <_joe_> hoo: what's the problem?
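A note on the servermon HSTS change akosiaris merged above (gerrit 199134, "Enable HSTS max-age=7 days"): the whole change boils down to sending one response header, with the max-age expressed in seconds (7 days = 604800). In Apache terms, for instance, it would look roughly like this; whether servermon actually sits behind Apache or nginx is an assumption here:

    Header always set Strict-Transport-Security "max-age=604800"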
Logged the message, Master [10:51:21] <_joe_> hoo: some of them are running sync-common right now [10:51:33] <_joe_> as per my !log earlier [10:52:29] (03PS1) 10Hoo man: Add .codfw.wmnet to tin's domain search [puppet] - 10https://gerrit.wikimedia.org/r/199243 [10:52:33] _joe_: ^ that [10:52:48] <_joe_> oh, ok [10:52:51] <_joe_> right [10:53:19] <_joe_> I tried to run sync-common until now, not to run scap [10:53:22] <_joe_> of course [10:53:48] Yeah... have an initial run of sync-common on all of these [10:53:59] then you should probably run a full scap, just to be sure [10:54:12] My change doesn't propagate to beta :S [10:54:25] Now, as I'm speaking, it did [10:56:11] <_joe_> eheh [10:56:38] <_joe_> hoo: I'll run it with the SWATters in the afternoon, I guess [10:56:48] yeah, that makes sense [11:07:13] hashar: ping [11:09:22] hashar: Looks like the automatic deploy on beta is broken... and I can't manually deploy without removing the scap lock that jenkins holds [11:10:56] 7831 Tue Mar 24 10:54:58 2015 python /usr/local/bin/scap S ? 00:00:00 python /usr/local/bin/scap beta-scap-eqiad (build #46247) [11:12:21] Ok... seems to be fine (but very slow) [11:13:00] 11:11:49 sync-common: 0% (ok: 0; fail: 0; left: 4) [11:13:00] 11:12:27 sync-common: 25% (ok: 1; fail: 0; left: 3) [11:13:12] 38 seconds to sync to a single host? :/ [11:13:15] Krenair: Yeah [11:13:22] It's cpu bound [11:13:25] weird [11:13:46] Now it's done syncing, building cdbs [11:13:55] godog: sorry for the lateness :/ [11:14:02] hoo: hello [11:14:09] hoo: which automatic deploy? [11:14:13] hashar: Not an issue, solved by no [11:14:14] * now [11:14:24] hashar: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ [11:14:28] It was just very slow [11:14:44] maybe because it had to refresh l10n cache files [11:14:50] Upgrading beta's bastion to an xlarge instance would surely help as I saw it to be CPU bound at times [11:14:52] hashar: np! [11:15:01] build time trend: https://integration.wikimedia.org/ci/job/beta-scap-eqiad/buildTimeTrend [11:15:23] hashar: That's still a lot [11:15:35] But ok, good to know it's just slow [11:15:48] it went from less than a minute (build 45582 and previous) to more than 10 minutes (since build 45585). That is a bug! [11:15:54] Oh :S [11:16:36] filing a bug :) [11:23:13] https://phabricator.wikimedia.org/T93737 [11:48:38] !log remove per-partition iostat data from graphite1001, obsolete [11:48:41] Logged the message, Master [11:53:17] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [12:04:53] !log bounce elasticsearch on logstash1001, shards unallocated/initializing [12:04:58] Logged the message, Master [12:13:53] godog: are the graphite hosts just fine since we dropped Jenkins statsd metrics? [12:14:38] hashar: yep it is, we've got new hardware too [12:16:27] nice [12:21:22] !log disabling puppet on osmium for an hour to avoid perturbing a VE benchmarking suite [12:21:27] Logged the message, Master [12:24:25] godog: when creating my packages I took some notes at https://wikitech.wikimedia.org/wiki/Cowbuilder [12:24:50] godog: which includes how to manually create a cow image with apt.wm.org pinned :) [12:31:21] bd808 manybubbles ^d after the switch crash in row c hosting logstash elasticsearch had unassigned/initializing shards (only replicas tho) restarting ES on logstash1001 didn't help, failed/corrupted index for logstash-2015.03.05 for example [12:31:29] hashar: cool!
thanks for that [12:32:00] godog: oh boy! why do we have this trouble with logstash, what in the world is the deal here [12:32:11] * hashar blames java [12:33:03] (03PS2) 10coren: Labs: upgrade codfw labstores to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/198728 (https://phabricator.wikimedia.org/T93740) [12:33:55] Someone do a quick +1 to ^^ pretty please? [12:34:16] godog: did you take some action? it seems to still be recovering [12:34:29] manybubbles: I tried restarting ES on logstash1001 only [12:34:51] manybubbles: [2015-03-24 12:04:20,520][INFO ][node ] [logstash1001] closed [12:35:27] manybubbles: though the messages after that are not encouraging, you can see it freaking out above at 9:3x when all nodes were by themselves [12:35:35] CorruptIndexException[codec footer mismatch: actual footer=1108984883 vs expected footer=-1071082520 [12:35:52] yeah, no route to host [12:36:40] between 9:50 and 12:04 was it doing anything? it says it found the master node so I would expect things to be better [12:37:22] manybubbles: not that I can tell, icinga alerts stayed there unchanged [12:37:55] (03CR) 10Filippo Giunchedi: [C: 031] Labs: upgrade codfw labstores to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/198728 (https://phabricator.wikimedia.org/T93740) (owner: 10coren) [12:37:58] Coren: sure [12:39:24] godog: I wonder if it is https://github.com/elastic/elasticsearch/issues/8707 [12:39:45] godog: Danke. [12:40:06] (03CR) 10coren: [C: 032] Labs: upgrade codfw labstores to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/198728 (https://phabricator.wikimedia.org/T93740) (owner: 10coren) [12:40:46] godog: I've moved the corrupted file out of the way to see if that causes it to recover it [12:42:08] 6operations, 10ops-eqiad: asw-c4-eqiad hardware fault? - https://phabricator.wikimedia.org/T93730#1144818 (10faidon) [12:42:57] manybubbles: ack, thanks [12:43:20] it certainly went nuts with bandwidth (recovery?) after the switch came back [12:44:00] godog: yes - remember that recovery is really really inefficient. [12:44:16] godog: there is a proposal to fix that: https://github.com/elastic/elasticsearch/issues/10032 [12:44:24] which would help a ton with logstash stuff [12:47:53] *nod* [12:55:50] 6operations: Provide dh-virtualenv 0.9 package on apt.wikimedia.org Precise and Trusty distributions - https://phabricator.wikimedia.org/T91631#1144834 (10akosiaris) 5Open>3Resolved Packages rebuilt and uploaded on apt.wikimedia.org. For jessie the package has been uploaded on the backports component while f... [12:55:52] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 3Continuous-Integration-Isolation, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1144836 (10akosiaris) [12:56:45] (03CR) 10Alexandros Kosiaris: [C: 032] Add .codfw.wmnet to tin's domain search [puppet] - 10https://gerrit.wikimedia.org/r/199243 (owner: 10Hoo man) [13:15:13] (03CR) 10JanZerebecki: [C: 031] donate - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/199200 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [13:15:40] 6operations, 10MediaWiki-extensions-Graph, 6Services, 10service-template-node, 7service-runner: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1144876 (10mobrovac) [13:16:20] godog: guess what is missing in Precise? dh-python! 
:) [13:16:47] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 11, timed_out: False, active_primary_shards: 49, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 134, initializing_shards: 2, number_of_data_nodes: 3 [13:17:37] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 11, timed_out: False, active_primary_shards: 49, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 134, initializing_shards: 2, number_of_data_nodes: 3 [13:17:37] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 11, timed_out: False, active_primary_shards: 49, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 134, initializing_shards: 2, number_of_data_nodes: 3 [13:17:42] kart_: Are you going to prepare the extension-updating patches against core for your SWATs? You could likely combine both extensions into one core-update, since the one is just unit test stuff. [13:19:05] (03CR) 10JanZerebecki: [C: 031] iegreview - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/199142 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [13:20:27] (03PS1) 10Andrew Bogott: Firewall for Holmium [puppet] - 10https://gerrit.wikimedia.org/r/199258 [13:20:40] (03CR) 10JanZerebecki: [C: 031] dbtree - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/199139 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [13:25:19] hashar: precise? :) [13:25:26] paravoid: yup [13:25:35] this is like current-2 [13:25:47] why do you care about precise? [13:25:50] I am working on having the python files bytecompiled on postinst and then removed in prerm :) [13:26:08] luckily python-minimal has the underlying utilities (pycompile and pyclean) [13:26:28] paravoid: the Zuul server / scheduler is on gallium which is Precise [13:27:07] as well as several labs instances we are using to run php 5.3 [13:28:13] I wouldn't waste any more time on this platform [13:31:08] yeah [13:31:12] gotta migrate everything to jessie [13:31:19] not sure what to do with PHP 5.3 though :( [13:36:45] (03PS2) 10Andrew Bogott: Firewall for Holmium [puppet] - 10https://gerrit.wikimedia.org/r/199258 [13:38:46] (03CR) 10Andrew Bogott: [C: 032] Firewall for Holmium [puppet] - 10https://gerrit.wikimedia.org/r/199258 (owner: 10Andrew Bogott) [13:43:00] <^d> manybubbles: logstash recover(ed|ing) ok now? /me just caught scrollback [13:43:35] (03PS2) 10Giuseppe Lavagetto: scap: add codfw proxies [puppet] - 10https://gerrit.wikimedia.org/r/197885 [13:44:35] (03CR) 10Giuseppe Lavagetto: [C: 032] "codfw servers are now in the dsh list, this has to be merged" [puppet] - 10https://gerrit.wikimedia.org/r/197885 (owner: 10Giuseppe Lavagetto) [13:47:19] <_joe_> whoever is SWATting today, it may get to be a lot of fun [13:47:30] <_joe_> it's going to be the first scap run with codfw enabled :) [13:47:34] <^d> Heh, wanna test that now?
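The three logstash RECOVERY lines above are apparently read from Elasticsearch's cluster health API; the same data can be pulled by hand when chasing unassigned or initializing shards, as godog and manybubbles were doing earlier. A sketch, assuming the usual ES port on a logstash node:

    # cluster-wide health: status plus unassigned/initializing shard counts
    curl -s 'http://localhost:9200/_cluster/health?pretty'
    # per-shard view, filtered down to anything not yet started
    curl -s 'http://localhost:9200/_cat/shards' | grep -v STARTED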
[13:47:51] <^d> Could do a no-op or two before the day's swat & train begin [13:47:56] <_joe_> ^d: in 20 minutes, so that puppet has run everywhere [13:48:04] * ^d nods [13:48:31] <_joe_> I did run a sync-common everywhere in codfw already [13:48:39] _joe_: so, we will be victim :D [13:48:52] (03PS7) 10Alexandros Kosiaris: Don't include a node in its own seeds [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [13:48:57] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 1.22860337202e-137 [13:49:20] <_joe_> that's kind of small [13:49:33] <_joe_> kart_: I don't think there should be any issue tbh [13:49:48] _joe_: :) [13:50:05] With Opsens around, we don't need to worry :) [13:50:56] (03PS3) 10Giuseppe Lavagetto: lvs: add loadbalancers for appservers, api and rendering [puppet] - 10https://gerrit.wikimedia.org/r/195899 (https://phabricator.wikimedia.org/T92377) [13:51:33] (03CR) 10Alexandros Kosiaris: "Spotted 2 problems. I uploaded a fix for them in PS7" (032 comments) [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [13:52:15] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "LGTM" [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/195483 (https://phabricator.wikimedia.org/T91617) (owner: 10GWicke) [13:52:15] <_joe_> this ^^ may get us paged. It should not, but still... [13:52:30] <_joe_> (my change, not alex/gabriel's one) [13:52:41] yeah, I got that [13:52:47] I was like "????" [13:53:36] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 3Continuous-Integration-Isolation, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1144922 (10hashar) The Precise package has been polished following up the 1/1 with Filippo ( https://gerrit.wikimedia.o... [13:53:37] <_joe_> eheh [13:54:05] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs: add loadbalancers for appservers, api and rendering [puppet] - 10https://gerrit.wikimedia.org/r/195899 (https://phabricator.wikimedia.org/T92377) (owner: 10Giuseppe Lavagetto) [13:55:44] !log reinstalling labstore2001 with Jessie [13:55:53] Logged the message, Master [13:57:37] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [14:00:17] (03PS1) 10Alexandros Kosiaris: Update cassandra submodule [puppet] - 10https://gerrit.wikimedia.org/r/199261 [14:05:41] akosiaris: thnx ! [14:16:00] <_joe_> !log restarting pybal on lvs2003 [14:16:04] Logged the message, Master [14:19:01] YuviPanda|zz: working? [14:19:27] andrewbogott: hey [14:19:32] Not atm but soon [14:19:35] andrewbogott: sup [14:19:46] um… want to ping me when you actually /are/ working? :) [14:19:53] andrewbogott: cool :) [14:20:00] In one hour? [14:20:09] This week hasn't been very productive.... [14:20:11] sure [14:20:12] Sorry [14:20:32] YuviPanda|zz: I’m amazed you’ve been showing up here at all, you have a lot to do! [14:27:27] (03PS1) 10Nikerabbit: Add pool counter config for Translate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199263 (https://phabricator.wikimedia.org/T54728) [14:28:59] (03CR) 10Nikerabbit: "There is no logic for the actual values, just what felt sensible to me."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/199263 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [14:33:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] import upstream 0.7.0 [debs/statsite] - 10https://gerrit.wikimedia.org/r/193095 (https://phabricator.wikimedia.org/T90111) (owner: 10Filippo Giunchedi) [14:34:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] import debian directory [debs/statsite] - 10https://gerrit.wikimedia.org/r/193096 (https://phabricator.wikimedia.org/T90111) (owner: 10Filippo Giunchedi) [14:34:18] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] add debian patches to upstream [debs/statsite] - 10https://gerrit.wikimedia.org/r/193097 (https://phabricator.wikimedia.org/T90111) (owner: 10Filippo Giunchedi) [14:45:37] ^d: Since you're already talking about testing scap and codfw with _j.oe_, are you going to SWAT today? [14:46:28] manybubbles: are you doing swat today? [14:46:35] or anomie [14:46:47] aude: We haven't decided who yet [14:46:49] ok [14:47:10] we have a patch but would like to deploy it myself, after you are done [14:47:56] or maybe in an ~hour from now [14:48:12] There shouldn't be any problem with that [14:48:51] ok [14:48:54] <^d> Yeah I can [14:49:06] <^d> (was planning, to, that is) [14:49:11] just need to prepare our branch update first [14:49:41] Glaisher: Has anyone sanity-checked https://gerrit.wikimedia.org/r/#/c/195938/ for performance implications? [14:51:53] anomie: no, afaik [14:52:20] Glaisher: Should they? [14:52:39] I don't think so.. [14:54:38] manybubbles: Just a heads up, I've scheduled the wgContentNamespaces for SWAT [14:54:47] or ^d [14:54:50] (03CR) 10Faidon Liambotis: [C: 031] exim4.conf.SMTP_IMAP_MM.erb local mail can cause loops [puppet] - 10https://gerrit.wikimedia.org/r/199090 (owner: 10Rush) [14:55:17] <^d> hmm dewikiversity [14:55:18] <^d> ok [14:58:08] (03CR) 10Alexandros Kosiaris: [C: 032] Update cassandra submodule [puppet] - 10https://gerrit.wikimedia.org/r/199261 (owner: 10Alexandros Kosiaris) [14:58:12] (03PS2) 10Rush: exim4.conf.SMTP_IMAP_MM.erb local mail can cause loops [puppet] - 10https://gerrit.wikimedia.org/r/199090 [14:58:20] (03CR) 10Rush: [C: 032] exim4.conf.SMTP_IMAP_MM.erb local mail can cause loops [puppet] - 10https://gerrit.wikimedia.org/r/199090 (owner: 10Rush) [14:59:20] <_joe_> ^d: whenever you want, we can do a scap dry-run [14:59:21] (03CR) 10Rush: [V: 032] exim4.conf.SMTP_IMAP_MM.erb local mail can cause loops [puppet] - 10https://gerrit.wikimedia.org/r/199090 (owner: 10Rush) [14:59:38] <^d> Well swat's starting now, so we'll just get to test it with real shit :) [14:59:46] <_joe_> yeah [14:59:55] awesome [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150324T1500). Please do the needful. [15:00:11] _joe_: how many mw's in codfw now? 
[15:00:22] <_joe_> greg-g: 200-ish [15:00:46] * greg-g nods [15:00:50] <_joe_> it's 214, but less than 10 still to install [15:00:54] <_joe_> ^d: http://blogs.msdn.com/cfs-filesystemfile.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-32-02-metablogapi/8054.image_5F00_thumb_5F00_35C6E986.png [15:01:11] <^d> _joe_: hell yeah [15:01:30] <^d> kart_: Merging your extension changes now [15:01:34] (03PS1) 10Eevans: send additional metrics to graphite [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/199264 (https://phabricator.wikimedia.org/T78514) [15:01:43] ^d: ack [15:02:08] <^d> Can you do the submodule updates or do you need me to do those? [15:02:29] ^d: go ahead. you'll be quicker :) [15:02:53] just noting that the CX patches are expected not to pass Jenkins before the ULS patches are merged [15:03:10] <^d> They gated behind it so qunit looks like it passed [15:03:43] I didn't expect it to be that intelligent ;) [15:03:54] <^d> I did it in the right order [15:04:27] ^d: awesome :) [15:05:01] Nikerabbit: it passes. so we're good. [15:05:30] (03CR) 10Mobrovac: [C: 031] send additional metrics to graphite [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/199264 (https://phabricator.wikimedia.org/T78514) (owner: 10Eevans) [15:08:24] (03CR) 10Chad: [C: 032] Fix typo in DismissableSiteNotice configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198542 (https://phabricator.wikimedia.org/T59732) (owner: 10Glaisher) [15:08:32] (03Merged) 10jenkins-bot: Fix typo in DismissableSiteNotice configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198542 (https://phabricator.wikimedia.org/T59732) (owner: 10Glaisher) [15:09:15] ^d: all ok. wmf22? [15:09:21] 6operations, 10ops-codfw: mw2050 management unreachable - https://phabricator.wikimedia.org/T93729#1145111 (10Papaul) @Joe i am able to login using the internal IP papaul@papaul-XPS-L322X:~$ ssh root@10.193.1.150 root@10.193.1.150's password: /admin1-> [15:09:31] <^d> kart_: Working on it [15:09:32] !log demon Synchronized wmf-config/CommonSettings.php: typofix in sitenotice (duration: 00m 11s) [15:09:34] (03PS1) 10coren: WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) [15:09:38] Logged the message, Master [15:09:47] <^d> _joe_: Ok, lots of explosions [15:09:50] <^d> Lemme pastebin [15:09:57] <_joe_> :( [15:09:59] <_joe_> ok [15:10:28] (03CR) 10coren: [C: 04-2] "WIP do not push!" [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [15:10:51] <^d> _joe_: https://phabricator.wikimedia.org/P425 [15:11:18] <^d> mw2148 proxy failed, and 27 total apaches failed [15:12:02] :( [15:12:15] ^d: doesn't look like the patch is working. :/ [15:12:31] <_joe_> ^d: ok, kind of strange btw [15:12:58] (03PS7) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [15:13:03] <_joe_> ^d: ok, mw2148 I can fix [15:13:42] <^d> Glaisher: Bug in actual code? [15:13:55] might be.. [15:14:39] <^d> _joe_: Ah, I see why the rest failed [15:14:40] <_joe_> ^d: maybe the 27 apaches failed because of the proxy failing? [15:14:40] <^d> eg: Copying to mw2193.codfw.wmnet from mw2148.codfw.wmnet [15:14:47] <^d> Yep [15:14:47] <_joe_> exactly [15:15:19] <^d> (ideally they should pick another proxy, but that's another day) [15:15:19] <_joe_> ^d: in ~ 1 hour it will be ok [15:15:52] <_joe_> I can exclude it from the list for now? 
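On the failure mode being debugged above: scap distributes in two hops, from the deployment master out to per-datacenter rsync proxies and then from each proxy to the apaches assigned to it, which is why one dead proxy (mw2148) dragged 27 apaches down with it ("Copying to mw2193.codfw.wmnet from mw2148.codfw.wmnet"). A rough sketch of the two stages; the rsync module name and paths are illustrative, not scap's literal implementation:

    # stage 1: deployment master pushes the staging tree to each proxy
    rsync -a /srv/mediawiki-staging/ rsync://mw2148.codfw.wmnet/common
    # stage 2: every other apache pulls from its assigned proxy, so if
    # that proxy is down, its whole fan-out slice fails the copy
    rsync -a rsync://mw2148.codfw.wmnet/common/ /srv/mediawiki/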
[15:17:08] (03PS8) 10Alexandros Kosiaris: Package builder module [puppet] - 10https://gerrit.wikimedia.org/r/194471 [15:17:28] <^d> That'd work [15:20:16] <_joe_> ^d: what would work? [15:20:32] <^d> Excluding it from the list for now [15:21:16] <_joe_> yes, doing it right away [15:24:37] (03PS1) 10Giuseppe Lavagetto: scap: remove temporarily mw2148 from the list of scap proxies [puppet] - 10https://gerrit.wikimedia.org/r/199269 [15:25:30] (03PS2) 10Gerardduenas: Add import sources for cawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198786 (https://phabricator.wikimedia.org/T93203) [15:25:47] (03CR) 10Giuseppe Lavagetto: [C: 032] scap: remove temporarily mw2148 from the list of scap proxies [puppet] - 10https://gerrit.wikimedia.org/r/199269 (owner: 10Giuseppe Lavagetto) [15:27:25] <_joe_> ^d: a few minutes and it should be fine [15:28:30] <^d> mmk [15:30:57] <_joe_> ^d: if only the puppet code for the deployment server wasn't so horrible [15:31:08] <_joe_> it takes more than one minute for a no-op run... [15:32:43] <_joe_> ^d: done, now you should be ok [15:33:01] <^d> Ok let's give it a shot [15:33:30] <_joe_> ^d: maybe you'll find a couple hosts failing (one being mw2148, btw) [15:33:40] <_joe_> but that should be it [15:33:48] !log demon Synchronized wmf-config/CommonSettings.php: once more, with feeling (duration: 00m 12s) [15:33:53] Logged the message, Master [15:33:57] <^d> Yay all succeeded \o/ [15:34:10] <_joe_> whoa [15:34:14] (03CR) 10Chad: [C: 032] Add import sources and set wgImportTargetNamespace at ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198751 (https://phabricator.wikimedia.org/T93218) (owner: 10Glaisher) [15:34:23] (03Merged) 10jenkins-bot: Add import sources and set wgImportTargetNamespace at ptwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198751 (https://phabricator.wikimedia.org/T93218) (owner: 10Glaisher) [15:34:32] <_joe_> ^d: am I wrong if I say that it's not significantly slower than before? [15:34:58] <^d> Not unacceptably so [15:35:11] <^d> (not like eqiad-pmtpa before) [15:35:28] (03PS3) 10Gerardduenas: Add import sources for cawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198786 (https://phabricator.wikimedia.org/T93203) [15:35:45] <_joe_> ok [15:36:10] <_joe_> well a full sync-common of the last three weeks of deploys took ~ 2 minutes per server [15:36:18] <_joe_> which is more than acceptable I guess [15:36:36] andrewbogott: hi [15:36:47] !log demon Synchronized wmf-config/InitialiseSettings.php: import sources for ptwiki (duration: 00m 11s) [15:36:50] Logged the message, Master [15:36:56] _joe_: sync-common pull from tin and not from the nearest rsync proxy [15:37:05] So it's not optimized [15:37:05] andrewbogott: internet is a little shit, but I’m going to be stuck in traffic for the next 1h, so working :) [15:37:12] <_joe_> hoo: yes, so I expect it to be better [15:37:14] so I think this is kind of expected as it's uncommon [15:37:46] <_joe_> hoo: my point is, I remember scap taking ~ 15 minutes when we had pmtpa [15:37:50] <^d> Glaisher: Your ptwiki import patch is live [15:37:51] <_joe_> but, bbiab [15:38:09] ~15 minutes? More like ~45 minutes [15:38:35] Oh ~15 per host [15:38:48] <^d> Glaisher: Sorry for the delay, scap issues. 
Merging the submodule updates now [15:39:08] (03CR) 10Alexandros Kosiaris: [C: 04-1 V: 04-1] "Package building fails during configure phase with:" [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [15:41:43] YuviPanda: My question is pretty simple. I want to add a puppet var to all labs instances ‘use_dnsmasq=true’ [15:41:52] ^d: submodule update? I think you meant to ping kart_ ? :) [15:42:01] <^d> I did [15:42:03] <^d> Whoops [15:42:10] (03PS20) 10Faidon Liambotis: protoproxy/sslcert/cache: nginx ssl_stapling_file support [puppet] - 10https://gerrit.wikimedia.org/r/198110 (https://phabricator.wikimedia.org/T86666) (owner: 10BBlack) [15:42:12] YuviPanda Or, alternatively to all projects. Or both. Or something. [15:42:14] hehe [15:42:19] (03CR) 10Faidon Liambotis: [C: 031] protoproxy/sslcert/cache: nginx ssl_stapling_file support [puppet] - 10https://gerrit.wikimedia.org/r/198110 (https://phabricator.wikimedia.org/T86666) (owner: 10BBlack) [15:42:22] andrewbogott: ah, so hiera is only per-project atm. [15:42:31] I’m wondering if you have a suggestionfor a better way to do that than just inserting them into ldap wholesale [15:42:32] bblack: Task: TNNNN doesn't work (I could swear it used to); you need Bug: TNNNN [15:42:36] andrewbogott: so you can turn it off / on on a per-proiject way [15:42:45] andrewbogott: you just default it to ‘true' [15:42:57] andrewbogott: and have it be ‘do_not_use_dnsmasq’? [15:43:08] <^d> paravoid: We really should loosen that. Faux-footer things like that are ugly [15:43:10] andrewbogott: and then you can turn it to true per project, and do it per project? [15:43:18] <^d> Change-Id got us in a bad habit :p [15:43:19] ^d: what do you mean? [15:43:24] andrewbogott: o r just have hiera(‘use_dnsmasq’, true) [15:43:25] and override it to false per project [15:43:28] YuviPanda: except I want the default to be not to use dnsmasq, and for that to be the case for new projects... [15:43:35] oh, you mean just match TNNNN everywhere in the commit message? [15:43:42] <^d> Yeah [15:43:43] PROBLEM - configured eth on mw2148 is CRITICAL: Connection refused by host [15:43:50] <^d> If you mention it, the task gets a mention of the commit [15:44:03] PROBLEM - dhclient process on mw2148 is CRITICAL: Connection refused by host [15:44:23] PROBLEM - mediawiki-installation DSH group on mw2148 is CRITICAL: Host mw2148 is not in mediawiki-installation dsh group [15:44:35] YuviPanda: I don’t want to have to keep setting no_dnsmasq for everything forever [15:44:42] PROBLEM - nutcracker port on mw2148 is CRITICAL: Connection refused by host [15:44:47] 6operations, 10ops-eqiad: Replace PEM that was taken from spare ex4500 wmf5738 - https://phabricator.wikimedia.org/T93621#1145224 (10Cmjohnson) updating task with emails that I've received Attachment ASSET.iso successfully uploaded and added. Conversation opened. 1 read message. Skip to content Using Wikimed... [15:44:51] andrewbogott: oh, right, you want it to vary on ‘new project’ and ‘old project' [15:44:53] ^d: no worry. 
let me know when done :) [15:44:53] PROBLEM - nutcracker process on mw2148 is CRITICAL: Connection refused by host [15:45:13] PROBLEM - puppet last run on mw2148 is CRITICAL: Connection refused by host [15:45:13] andrewbogott: I think wholesale insertion into ldap is the best bet atm, can’t think of any other easy way to say ‘just old parts' [15:45:33] PROBLEM - salt-minion processes on mw2148 is CRITICAL: Connection refused by host [15:45:33] PROBLEM - DPKG on mw2148 is CRITICAL: Connection refused by host [15:45:38] YuviPanda: ok. currently does hiera override ldap or is it the other way around? [15:45:52] andrewbogott: it doesn’t do globals at all, so they don’t interact in any way [15:45:53] PROBLEM - Disk space on mw2148 is CRITICAL: Connection refused by host [15:45:57] we can make that happen if we want, of course [15:46:08] hiera(‘variable’, $::ldap_variable) [15:46:13] so then hiera variable will override ldap_variable [15:46:22] Ah, I see. ok. [15:46:27] 6operations, 10ops-eqiad: Replace PEM that was taken from spare ex4500 wmf5738 - https://phabricator.wikimedia.org/T93621#1145242 (10Cmjohnson) Attachment ASSET.iso successfully uploaded and added. Conversation opened. 4 messages. All messages read. Skip to content Using Wikimedia Foundation Mail with screen... [15:46:29] Fair enough, I’ll think a bit more. [15:47:02] PROBLEM - RAID on mw2148 is CRITICAL: Connection refused by host [15:51:27] <^d> Glaisher: I'm leery of $wmgAbuseFilterEmergencyDisableThreshold for commonswiki [15:51:47] hm? [15:51:55] <^d> 5% and 30% are a big difference [15:52:25] <^d> I think there's a performance impact with this switch [15:52:52] ah, well. I will discuss that in the task [15:53:06] <^d> Yeah let's just clarify [15:53:12] <^d> If we can be sure it's ok we'll do it next swat [15:53:27] andrewbogott: cool :) [15:53:31] andrewbogott: we use the [15:53:41] andrewbogott: we use the ‘hiera with fallback to ldap if hiera variable is not defined’ a fair bit [15:54:35] (03CR) 10Chad: [C: 032] Add 'Kurs' (106) to $wgContentNamespaces at dewikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197938 (https://phabricator.wikimedia.org/T93071) (owner: 10Glaisher) [15:56:01] 6operations: detail blockers for procurement in phabricator - https://phabricator.wikimedia.org/T93760#1145273 (10RobH) 3NEW [15:56:07] 6operations: detail blockers for procurement in phabricator - https://phabricator.wikimedia.org/T93760#1145280 (10RobH) a:3RobH [15:56:23] PROBLEM - puppet last run on mw2154 is CRITICAL: CRITICAL: puppet fail [15:58:31] (03PS2) 10Alexandros Kosiaris: Ganeti module/role introduced [puppet] - 10https://gerrit.wikimedia.org/r/198794 (https://phabricator.wikimedia.org/T87258) [16:01:02] (03Merged) 10jenkins-bot: Add 'Kurs' (106) to $wgContentNamespaces at dewikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197938 (https://phabricator.wikimedia.org/T93071) (owner: 10Glaisher) [16:01:55] (03CR) 10Glaisher: [C: 04-1] "see task" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195938 (https://phabricator.wikimedia.org/T87431) (owner: 10Glaisher) [16:03:33] PROBLEM - Disk space on palladium is CRITICAL: DISK CRITICAL - free space: / 4046 MB (3% inode=94%): [16:04:23] <_joe_> mmmh [16:04:25] !log demon Synchronized wmf-config/InitialiseSettings.php: dewikiversity content namespaces (duration: 00m 12s) [16:04:30] Logged the message, Master [16:04:53] <_joe_> I'll look at palladium [16:05:04] <^d> Glaisher: content namespaces updated [16:05:08] <^d> Running the cirrus fixer thingie 
now [16:05:13] _joe_: looking as well [16:05:33] <_joe_> akosiaris: oh ok, then I'll leave it to you [16:05:33] ^d: updateArticles.php too, right? [16:05:42] updateArticleCount.php* [16:05:53] (03PS1) 10Papaul: add mgmt asset tag info for wtp200(1-20) [dns] - 10https://gerrit.wikimedia.org/r/199275 [16:06:47] <_joe_> akosiaris: /var is 92 G [16:06:58] figures [16:07:12] !log demon Synchronized php-1.25wmf21/extensions/UniversalLanguageSelector: (no message) (duration: 00m 11s) [16:07:15] Logged the message, Master [16:07:29] !log demon Synchronized php-1.25wmf22/extensions/UniversalLanguageSelector: (no message) (duration: 00m 11s) [16:07:32] Logged the message, Master [16:07:42] <_joe_> akosiaris: 86 GB of /var/lib/puppet/reports [16:07:51] !log demon Synchronized php-1.25wmf21/extensions/ContentTranslation: (no message) (duration: 00m 11s) [16:07:52] <_joe_> lol [16:07:54] Logged the message, Master [16:08:09] !log demon Synchronized php-1.25wmf22/extensions/ContentTranslation: (no message) (duration: 00m 12s) [16:08:12] Logged the message, Master [16:09:23] <^d> Glaisher: That just finished [16:09:28] <^d> The cirrus thing will take a little longer [16:09:31] <^d> kart_: You're all live [16:09:49] ^d: confirmed. [16:09:52] Thanks! [16:10:57] ^d: Cool. Thanks :) [16:11:21] _joe_: haha. I wonder if that’s codfw’s impact [16:11:22] (03PS1) 10Faidon Liambotis: Re-add maerlant @ esams, format as jessie [puppet] - 10https://gerrit.wikimedia.org/r/199277 [16:11:31] <_joe_> YuviPanda: it is [16:11:35] yeah it is [16:11:39] _joe_: we can reduce the amount of time we keep them. I don’t know if anyone looks at them [16:11:48] is codfw going to get its own puppetmaster? [16:11:49] <_joe_> YuviPanda: no one [16:11:58] also are we moving them to jessie / trusty at some point? [16:12:00] <_joe_> YuviPanda: eventually-ish [16:12:06] <_joe_> to both questions [16:12:19] (03CR) 10Faidon Liambotis: [C: 032] Re-add maerlant @ esams, format as jessie [puppet] - 10https://gerrit.wikimedia.org/r/199277 (owner: 10Faidon Liambotis) [16:12:24] heh right [16:12:55] hmm, gtg, I trust you guys will take care of palladium :-) [16:13:05] sorry [16:14:16] (03PS2) 10Gerardduenas: Create and modify groups in eswikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198749 (https://phabricator.wikimedia.org/T93371) [16:15:13] RECOVERY - puppet last run on mw2154 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:15:29] (03PS1) 10Yuvipanda: test [puppet] - 10https://gerrit.wikimedia.org/r/199278 [16:15:31] (03PS1) 10Yuvipanda: test2 [puppet] - 10https://gerrit.wikimedia.org/r/199279 [16:15:33] (03PS1) 10Yuvipanda: puppetmaster: Keep reports only for 16 hours [puppet] - 10https://gerrit.wikimedia.org/r/199280 [16:15:39] _joe_: ^ [16:15:40] bah, stupid test commits [16:16:03] (03PS2) 10Yuvipanda: puppetmaster: Keep reports only for 16 hours [puppet] - 10https://gerrit.wikimedia.org/r/199280 [16:16:23] (03Abandoned) 10Yuvipanda: test [puppet] - 10https://gerrit.wikimedia.org/r/199278 (owner: 10Yuvipanda) [16:16:31] (03PS3) 10Giuseppe Lavagetto: puppetmaster: Keep reports only for 16 hours [puppet] - 10https://gerrit.wikimedia.org/r/199280 (owner: 10Yuvipanda) [16:16:41] ^d: are you done?
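[Editor's note: the "Keep reports only for 16 hours" change uploaded above (Gerrit 199280) is the fix for the 86 GB of /var/lib/puppet/reports that filled palladium's /var. As a rough illustration only, not the merged patch, pruning like this can be expressed in Puppet with a tidy resource; the path is the default report directory and the 16-hour age matches the commit message:]

    # Prune puppetmaster report YAMLs older than 16 hours.
    tidy { '/var/lib/puppet/reports':
        age     => '16h',
        recurse => true,
        matches => '*.yaml',
    }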
[16:16:41] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: Keep reports only for 16 hours [puppet] - 10https://gerrit.wikimedia.org/r/199280 (owner: 10Yuvipanda) [16:16:56] (03Abandoned) 10Yuvipanda: test2 [puppet] - 10https://gerrit.wikimedia.org/r/199279 (owner: 10Yuvipanda) [16:16:59] <^d> aude: I am, thanks for waiting [16:17:02] ok [16:17:17] * aude will proceed to deploy our patch (for test.wikidata) [16:22:25] <_joe_> !log manually deleting puppet reports [16:22:28] 6operations, 10Fundraising Dash: Create sandbox site for Dash - https://phabricator.wikimedia.org/T87809#1145397 (10Jgreen) I don't expect to have time to look at this until after we've dealt with all of the issues raised in the PCI gap assessment. [16:22:30] Logged the message, Master [16:23:16] <^d> Glaisher: cirrus should be all fixed up [16:23:31] gerrit is slow... [16:23:33] neat :) [16:23:35] err jenkins [16:23:53] RECOVERY - Disk space on palladium is OK: DISK OK [16:24:32] aude: Is it loading stuff from packagist? [16:24:47] Because that's apparently down/ very slow [16:27:29] it's mediawiki core submodule update [16:27:47] _joe_: ty [16:28:15] (03PS3) 10Negative24: Remove dummy redirect for fab-01 [puppet] - 10https://gerrit.wikimedia.org/r/198535 [16:28:34] <_joe_> YuviPanda: thank you! [16:29:02] (03PS3) 10Negative24: Configure Labs Phabricators with default local repo store [puppet] - 10https://gerrit.wikimedia.org/r/198769 (https://phabricator.wikimedia.org/T93615) [16:31:13] 6operations, 6Commons, 10Wikimedia-Site-requests, 5Patch-For-Review: Change the value of wmgAbuseFilterEmergencyDisableCount for commonswiki - https://phabricator.wikimedia.org/T87431#1145414 (10Steinsplitter) [16:31:15] zzzz [16:31:27] zzZzz [16:31:32] :) [16:31:41] Morning Ops! It's Chip from OIT. Does one of you have a moment to private message me about a simple but urgent task? [16:32:10] Sorry to ask the favor, you all are the only ones with the credentials to do this quickly. [16:32:14] cndiv: sure [16:32:28] YuviPanda: thanks, dm coming... [16:32:56] cool [16:32:57] <_joe_> if it's something security-sensitive [16:33:05] <_joe_> I won't use IRC [16:33:11] <_joe_> or use OTR at least [16:33:59] irc isn't safe. always use OTR for PM :-P [16:34:18] 6operations, 10hardware-requests, 3Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1145415 (10hashar) + Dan Duvall If need be, we can have a hangout together to refine the procurement ticket. [16:34:24] maybe someone has tim to merge this https://gerrit.wikimedia.org/r/#/c/198242/ ? Needed for the editathon on the 26th
Needef for editathon at 26 [16:34:31] *tim = time [16:35:05] Steinsplitter: looking [16:35:18] thanks :-) [16:36:10] (03CR) 10Aude: [C: 032] Whitelisting domain for Nordiska museet to allow GWT upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198242 (https://phabricator.wikimedia.org/T93104) (owner: 10Steinsplitter) [16:36:13] ok :) [16:37:04] (03Merged) 10jenkins-bot: Whitelisting domain for Nordiska museet to allow GWT upload [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198242 (https://phabricator.wikimedia.org/T93104) (owner: 10Steinsplitter) [16:37:10] and jenkins is done [16:37:15] thx :-) [16:39:18] 6operations, 10ops-codfw, 3wikis-in-codfw: PXE doesn't work on mc2017-18 - https://phabricator.wikimedia.org/T90586#1145431 (10Joe) [16:39:19] 6operations, 3wikis-in-codfw: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1145430 (10Joe) [16:39:30] 6operations, 3wikis-in-codfw: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1145432 (10Joe) 5Open>3Resolved [16:39:31] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#1145433 (10Joe) [16:39:54] 6operations, 3wikis-in-codfw: Setup jobrunners cluster in codfw - https://phabricator.wikimedia.org/T86889#1145434 (10Joe) 5Open>3Resolved a:3Joe [16:39:55] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#979019 (10Joe) [16:40:40] 6operations, 5Patch-For-Review, 3codfw-appserver-setup, 3wikis-in-codfw: Set up load balancing for appservers in dallas - https://phabricator.wikimedia.org/T92377#1145439 (10Joe) 5Open>3Resolved [16:40:41] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Setup the main appservers cluster in codfw - https://phabricator.wikimedia.org/T86893#1145441 (10Joe) [16:41:01] !log aude Synchronized php-1.25wmf22/extensions/Wikidata: Update Wikidata - includes security fix and bug fixes (duration: 00m 19s) [16:41:06] Logged the message, Master [16:41:25] 6operations, 10MediaWiki-Configuration, 5Patch-For-Review, 3codfw-appserver-setup, 3wikis-in-codfw: Configure mediawiki to operate in the Dallas DC - https://phabricator.wikimedia.org/T91754#1145446 (10Joe) [16:41:26] 6operations, 5Patch-For-Review, 3codfw-appserver-setup, 7database, 3wikis-in-codfw: Grant access to the databases to codfw appserver networks - https://phabricator.wikimedia.org/T93211#1145445 (10Joe) 5Open>3Resolved [16:42:09] !log aude Synchronized wmf-config/InitialiseSettings.php: Whitelist domain for GWT (duration: 00m 13s) [16:42:10] Steinsplitter: done :) [16:42:12] Logged the message, Master [16:42:21] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1145448 (10Joe) This has been solved thanks to @Faidon's patch to the debian-installer [16:42:39] 6operations, 5Patch-For-Review, 3wikis-in-codfw: setup & deploy rdb2001-2004 - https://phabricator.wikimedia.org/T92011#1145458 (10Joe) 5Open>3Resolved a:5RobH>3Joe [16:43:01] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums, 3Continuous-Integration-Isolation: Review Jenkins isolation architecture with Antoine - https://phabricator.wikimedia.org/T92324#1145462 (10hashar) I have lost track of whom I should review this with or whether any action is needed on my side. The ta... 
[16:44:05] 7Puppet, 6operations, 5Patch-For-Review, 3wikis-in-codfw: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1145465 (10Joe) 5Open>3Resolved [16:44:06] 6operations, 5Patch-For-Review, 3wikis-in-codfw: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1145466 (10Joe) [16:44:11] <_joe_> sorry, spring cleanings [16:44:33] RECOVERY - configured eth on mw2148 is OK: NRPE: Unable to read output [16:44:53] RECOVERY - RAID on mw2148 is OK: OK: no RAID installed [16:44:54] RECOVERY - DPKG on mw2148 is OK: All packages OK [16:44:54] RECOVERY - salt-minion processes on mw2148 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:44:54] RECOVERY - dhclient process on mw2148 is OK: PROCS OK: 0 processes with command name dhclient [16:45:13] RECOVERY - Disk space on mw2148 is OK: DISK OK [16:45:33] RECOVERY - nutcracker port on mw2148 is OK: TCP OK - 0.000 second response time on port 11212 [16:45:44] RECOVERY - nutcracker process on mw2148 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:46:21] 6operations, 5Patch-For-Review, 3wikis-in-codfw: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1145476 (10Joe) 5Open>3Resolved a:3Joe [16:46:22] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#1145478 (10Joe) [16:46:39] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#979019 (10Joe) [16:46:40] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Setup the main appservers cluster in codfw - https://phabricator.wikimedia.org/T86893#1145481 (10Joe) 5Open>3Resolved [16:47:20] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#979019 (10Joe) [16:47:21] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Setup the api appservers cluster in codfw - https://phabricator.wikimedia.org/T86892#1145486 (10Joe) 5Open>3Resolved a:3Joe [16:48:23] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#1145493 (10Joe) [16:48:26] 6operations, 3wikis-in-codfw: Setup videoscalers cluster in codfw - https://phabricator.wikimedia.org/T86891#1145491 (10Joe) 5Open>3Resolved a:3Joe [16:48:46] 6operations, 3wikis-in-codfw: Setup imagescalers cluster in codfw - https://phabricator.wikimedia.org/T86890#1145503 (10Joe) 5Open>3Resolved a:3Joe [16:48:47] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#979019 (10Joe) [16:56:54] RECOVERY - puppet last run on mw2148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:57:04] (03CR) 10GWicke: [C: 031] send additional metrics to graphite [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/199264 (https://phabricator.wikimedia.org/T78514) (owner: 10Eevans) [17:00:05] maxsem, kaldari: Dear anthropoid, the time has come. Please deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150324T1700). 
[17:03:53] (03CR) 10JanZerebecki: "Looks good, should wait until: 2015-03-30" [puppet] - 10https://gerrit.wikimedia.org/r/199126 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [17:13:33] 6operations, 10ops-codfw: mw2088 has a faulty RAM - https://phabricator.wikimedia.org/T93370#1145569 (10Papaul) Dell will send a replacement memory stick tomorrow. [17:18:33] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:22:13] ^d: ^ speak of the devil, and it goes down :P [17:22:27] <^d> it should restart itself in a minute [17:26:08] 6operations, 7Monitoring: Restrict edit rights in grafana? - https://phabricator.wikimedia.org/T93710#1143848 (10GWicke) [17:26:44] 6operations, 7Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1145612 (10GWicke) [17:28:50] (03PS1) 10Faidon Liambotis: Kill esams dead/non-existent hosts [dns] - 10https://gerrit.wikimedia.org/r/199287 [17:28:52] (03PS1) 10Faidon Liambotis: Kill toolserver IPv4/IPv6 subnets [dns] - 10https://gerrit.wikimedia.org/r/199288 [17:28:53] (03PS1) 10Faidon Liambotis: [RFC] Move maerlant out of .esams.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/199289 [17:28:58] (03PS1) 10Faidon Liambotis: Fix wikimedia.org typo under palladium's hiera [puppet] - 10https://gerrit.wikimedia.org/r/199291 [17:29:00] (03PS1) 10Faidon Liambotis: ipsec: drop the custom $site detection [puppet] - 10https://gerrit.wikimedia.org/r/199292 [17:29:02] (03PS1) 10Faidon Liambotis: autoinstall: properly set up misc esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/199293 [17:29:04] (03PS1) 10Faidon Liambotis: [RFC] maerlant.esams.wikimedia.org -> maerlant.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/199294 [17:29:06] (03CR) 10jenkins-bot: [V: 04-1] [RFC] Move maerlant out of .esams.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/199289 (owner: 10Faidon Liambotis) [17:30:03] 6operations, 7Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1145616 (10GWicke) [17:30:24] (03PS2) 10Faidon Liambotis: [RFC] Move maerlant out of .esams.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/199289 [17:31:25] (03CR) 10Faidon Liambotis: [C: 032] Fix wikimedia.org typo under palladium's hiera [puppet] - 10https://gerrit.wikimedia.org/r/199291 (owner: 10Faidon Liambotis) [17:32:03] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [17:33:19] (03CR) 10Gage: [C: 032] ipsec: drop the custom $site detection [puppet] - 10https://gerrit.wikimedia.org/r/199292 (owner: 10Faidon Liambotis) [17:33:24] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Puppet has 1 failures [17:35:32] (03PS1) 10coren: Labs: Monitor network staturation on labstores [puppet] - 10https://gerrit.wikimedia.org/r/199297 (https://phabricator.wikimedia.org/T92629) [17:35:41] YuviPanda: Is that right? ^^ [17:36:37] (03CR) 10jenkins-bot: [V: 04-1] autoinstall: properly set up misc esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/199293 (owner: 10Faidon Liambotis) [17:38:03] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1145677 (10GWicke) @eevans, icinga alerts on graphite data are set up in puppet [like this](https://github.com/wikimedia/oper... 
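[Editor's note: GWicke's truncated comment above points at the same mechanism Coren's 199297 uses: Icinga checks driven by graphite data. A sketch of such a check, assuming the monitoring::graphite_threshold define from operations/puppet (parameter names from memory and may differ); the metric is a diamond network counter and the warn/crit values echo the 50m/75m-over-10% figures settled on later in this log:]

    monitoring::graphite_threshold { 'labstore-network-saturation':
        description => 'network saturation on labstore eth0 (tx)',
        metric      => 'servers.labstore1001.network.eth0.tx_byte',
        warning     => 50000000,   # ~50 MB/s sustained
        critical    => 75000000,   # ~75 MB/s sustained
        from        => '30min',
        percentage  => 10,         # alert when >10% of samples exceed this
    }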
[17:38:08] (03CR) 10Yuvipanda: [C: 04-1] "You can also test this on neon by running the check_graphite commandline and passing the appropriate values (look at the define to see whi" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/199297 (https://phabricator.wikimedia.org/T92629) (owner: 10coren) [17:38:14] Coren: :D needs testing, and also a typo [17:39:37] 6operations, 6Phabricator: Phabricator's phd can't sudo to user phd - https://phabricator.wikimedia.org/T93477#1145717 (10chasemp) [17:40:41] (03PS1) 10Rush: phab phd user owns repo management scripts [puppet] - 10https://gerrit.wikimedia.org/r/199298 [17:41:45] 7Puppet, 6Labs, 5Patch-For-Review: puppet-run is confused by stale lock files - https://phabricator.wikimedia.org/T92766#1145775 (10BBlack) 5Open>3Resolved a:3BBlack I'm assuming the fixup from a week ago worked for labs as well, closing. Re-open if not! :) [17:42:17] (03PS2) 10Rush: phab phd user owns repo management scripts [puppet] - 10https://gerrit.wikimedia.org/r/199298 [17:42:34] please someone kick git.w.o it is not responding to me [17:43:57] ^d: it seems it still did not restart itself [17:44:07] (03CR) 10Rush: [C: 032] phab phd user owns repo management scripts [puppet] - 10https://gerrit.wikimedia.org/r/199298 (owner: 10Rush) [17:45:07] !log restart gitblit on antimony [17:45:09] jzerebecki: ^ [17:45:11] Logged the message, Master [17:45:25] thx [17:45:40] <^d> gitblit sucks [17:46:05] yup [17:46:06] use github, etc [17:46:11] yea why was cgit not used? [17:46:22] 10Ops-Access-Requests, 6operations: Access request: +2 on cassandra submodule for services team members - https://phabricator.wikimedia.org/T93775#1145807 (10GWicke) 3NEW [17:46:28] i mean it works on kernel.org which certainly has more traffic [17:46:40] <^d> I wanted to at some point. [17:46:43] <^d> I can't remember why not [17:46:53] (03CR) 10Faidon Liambotis: [C: 04-1] "First pass!" (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/198794 (https://phabricator.wikimedia.org/T87258) (owner: 10Alexandros Kosiaris) [17:46:53] <^d> gitweb was terrible [17:47:00] 10Ops-Access-Requests, 6operations: Access request: +2 on cassandra submodule for services team members - https://phabricator.wikimedia.org/T93775#1145815 (10GWicke) [17:47:42] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:47:43] 10Ops-Access-Requests, 6operations: Access request: +2 on cassandra submodule for services team members - https://phabricator.wikimedia.org/T93775#1145807 (10GWicke) [17:47:44] gitweb is not designed for traffic (caching and stuff) [17:48:12] <^d> Yeah, gitweb doesn't know what caching is [17:48:26] (03CR) 10GWicke: "Alternative idea (grant +2 on puppet/cassandra repo to services members) at https://phabricator.wikimedia.org/T93775 ." [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans) [17:48:38] <^d> gitblit seemed nice, but falls over at our scale. 
it's really designed for a few dozen repos and like workgroup-level traffic [17:48:57] <^d> it's really the # of repos it barfs on [17:49:23] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [17:49:53] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59665 bytes in 0.136 second response time [17:49:55] ok git.w.o works again [17:50:06] * jzerebecki goes recheck tons of patches [17:50:21] (03PS2) 10coren: Labs: Monitor network staturation on labstores [puppet] - 10https://gerrit.wikimedia.org/r/199297 (https://phabricator.wikimedia.org/T92629) [17:50:36] <^d> jzerebecki: Why on earth are patches hitting gitblit? [17:50:41] YuviPanda: Works on neon [17:50:42] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:50:57] Hi all. What are the settings that could let me change the Expires HTTP header to be a date in the future instead of the UNIX epoch date. [17:51:18] Coren: so I usually test them fully by reducing the threshold and make sure there’s a crit and warn output as well. [17:51:24] Coren: can you try that as well? [17:51:26] ^d: because nobody works on the bug that says to not get mediawiki from git.w.o for testing [17:51:44] Yep. I did. Dropping them by an order of magnitude pops up a critical. [17:52:02] <^d> jzerebecki: You know I set up git replication to integration hosts for /exactly/ this reason right? [17:52:17] Coren: sweet. [17:52:18] <^d> Hitting gerrit/gitblit for full clones to CI is batshit. [17:52:27] I see notes about wgUseXVO and others in https://wikitech.wikimedia.org/wiki/MediaWiki_caching but nothing else. [17:52:41] ^d: it gets the zip or tar, let me find the bugreport [17:52:59] <^d> Ugh, pull that crap from github! [17:53:03] <^d> They actually don't suck at it [17:53:12] <^d> gitblit explodes on tar generation. zero caching [17:54:30] (03CR) 10Yuvipanda: [C: 031] Labs: Monitor network staturation on labstores [puppet] - 10https://gerrit.wikimedia.org/r/199297 (https://phabricator.wikimedia.org/T92629) (owner: 10coren) [17:54:42] Will merge after meeting. [17:54:47] Coren: ^ lgtm, but I want paravoid’s opinions on the numbers [17:55:04] Ah, good point. I may have been overly conservative. [17:55:56] (Or maybe insufficiently paranoid) :-) [17:56:11] ^d: https://phabricator.wikimedia.org/T74001 says to use zuul-cloner, it seems two scripts and then wikidata/base stuff still needs to be converted [17:56:20] !log set email for User:ProGTX@global, attached enwiki [17:56:25] Logged the message, Master [17:56:51] (03CR) 10Legoktm: [C: 031] Add a couple of missing extensions from the entry list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198813 (owner: 10Chad) [17:57:43] (03PS1) 1020after4: VarnishStatusCollector for diamond.
[puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [17:57:47] (03CR) 10Chad: [C: 032] Add a couple of missing extensions from the entry list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198813 (owner: 10Chad) [17:57:48] 7Puppet, 6Phabricator, 5Patch-For-Review: Phabricator labs repo management isn't configured - https://phabricator.wikimedia.org/T93615#1145908 (10Negative24) [17:57:56] 6operations: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1145909 (10Dzahn) 3NEW [17:57:58] 7Puppet, 6Phabricator, 5Patch-For-Review: Phabricator labs repo management isn't configured - https://phabricator.wikimedia.org/T93615#1142020 (10Negative24) [17:58:31] (03Merged) 10jenkins-bot: Add a couple of missing extensions from the entry list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/198813 (owner: 10Chad) [17:58:37] 10Ops-Access-Requests, 6operations, 6Phabricator, 6Release-Engineering, 5Patch-For-Review: Mukunda needs sudo on iridium (phab host) - https://phabricator.wikimedia.org/T93151#1145918 (10RobH) This access request has been granted via ops meeting review. I'm owning this task and implementing it later today. [17:58:44] (03CR) 10jenkins-bot: [V: 04-1] VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) (owner: 1020after4) [17:59:24] <^d> twentyafterfour: You're about to start the train, right? [17:59:46] ^d: yeah [18:00:04] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150324T1800). Please do the needful. [18:00:15] <^d> twentyafterfour: Ok. We started having scap sync to all of codfw too. We should've ironed out any problems but if you see anything look odd give a shout [18:00:46] !log demon Synchronized wmf-config/extension-list: (no message) (duration: 00m 12s) [18:00:51] Logged the message, Master [18:00:58] <^d> (actually, _joe_ did all the work, I just shouted when it broke :p) [18:01:03] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: puppet fail [18:02:05] (03CR) 10Yuvipanda: [C: 04-1] VarnishStatusCollector for diamond. (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) (owner: 1020after4) [18:02:37] (03PS2) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [18:02:59] (03PS4) 10Rush: Configure Labs Phabricators with default local repo store [puppet] - 10https://gerrit.wikimedia.org/r/198769 (https://phabricator.wikimedia.org/T93615) (owner: 10Negative24) [18:03:18] (03CR) 10Rush: [C: 032 V: 032] Configure Labs Phabricators with default local repo store [puppet] - 10https://gerrit.wikimedia.org/r/198769 (https://phabricator.wikimedia.org/T93615) (owner: 10Negative24) [18:04:04] twentyafterfour: btw, I left lots of comments on PS1 :) [18:06:05] 7Blocked-on-Operations, 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1145949 (10faidon) Yeah, I was initially supportive but these are becoming a major PITA. I'd like us to kill the submodules. For... 
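[Editor's note: the VarnishStatusCollector under review above (199302) is a custom diamond collector, a small Python class that diamond loads to push metrics to graphite. A sketch of how such a collector is typically hooked up on the Puppet side, assuming the diamond::collector define from operations/puppet; the source path and settings here are hypothetical:]

    diamond::collector { 'VarnishStatus':
        # hypothetical location for the collector shipped by the module
        source   => 'puppet:///modules/varnish/varnish_status_collector.py',
        settings => {
            'interval' => '30',   # seconds between varnishstat samples
        },
    }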
[18:06:30] (03CR) 10Faidon Liambotis: [C: 031] move cassandra submodule into puppet repo [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans) [18:06:33] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:07:33] 6operations, 5Patch-For-Review: Setup poolcounter servers for codfw - https://phabricator.wikimedia.org/T93261#1145961 (10Dzahn) [18:07:34] 6operations, 10ops-codfw: subra/wmf5816 - relabel system / setup mgmt / update racktables - https://phabricator.wikimedia.org/T93272#1145960 (10Dzahn) 5Resolved>3Open [18:08:08] !log manually attached User:Secret@enwiki to global [18:08:12] Logged the message, Master [18:08:46] 6operations, 10ops-codfw: subra/wmf5816 - relabel system / setup mgmt / update racktables - https://phabricator.wikimedia.org/T93272#1133458 (10Dzahn) the OS installer fails to detect any disks here when it does hardware detection, unlike with suhail where it worked normal. could you check if the hardware is a... [18:08:51] 7Puppet, 6Phabricator, 5Patch-For-Review: Phabricator labs repo management isn't configured - https://phabricator.wikimedia.org/T93615#1145968 (10Negative24) [18:08:59] 6operations, 10ops-codfw: subra/wmf5816 - relabel system / setup mgmt / update racktables - https://phabricator.wikimedia.org/T93272#1145969 (10Dzahn) a:5Papaul>3RobH [18:09:49] 6operations, 10ops-codfw: subra/wmf5816 - relabel system / setup mgmt / update racktables - https://phabricator.wikimedia.org/T93272#1133458 (10Dzahn) p:5Normal>3High [18:10:23] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [18:10:57] (03PS4) 10Nuria: [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 [18:12:23] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 (owner: 10Nuria) [18:12:27] 7Blocked-on-Operations, 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1145996 (10GWicke) Another option for consideration: {T93775}. [18:17:07] (03PS5) 10Nuria: [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 [18:18:26] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 (owner: 10Nuria) [18:18:44] !log Starting deployment train: group1 to 1.25wmf22 [18:18:49] Logged the message, Master [18:18:52] (03CR) 10Yuvipanda: [C: 04-1] "Should probably do this for labstore1003 as well?" [puppet] - 10https://gerrit.wikimedia.org/r/199297 (https://phabricator.wikimedia.org/T92629) (owner: 10coren) [18:18:55] Coren: ^ just realized [18:19:42] YuviPanda: labstore1003 is dumps; we probably want to do it a little differently - or at least have different sensitivity settings. [18:19:53] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:20:01] It's not going to be a labs_storage::server box; probably labs_storage::dumps [18:20:28] Speaking of: paravoid, do you have an opinion on the values chosen for https://gerrit.wikimedia.org/r/199297 ? 
[18:21:22] Coren: it's 1gbps, not 100mbps [18:21:27] that value says tx_bytes, though [18:21:32] which is why this works probably [18:21:35] this is set to 400mbps right now [18:22:03] 6operations, 10ops-codfw: mw2088 has a faulty RAM - https://phabricator.wikimedia.org/T93370#1146024 (10RobH) a:3Papaul Reassigning this to @Papaul so he can replace the memory. @Papaul: When you have this in, please coordinate with someone in IRC to have the system shutdown properly for you to swap the mem... [18:22:08] (03PS1) 1020after4: Group1 wikis to 1.25wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199303 [18:22:11] 6operations, 10ops-codfw: mw2088 has a faulty RAM - https://phabricator.wikimedia.org/T93370#1146026 (10RobH) p:5Triage>3High [18:22:28] I mean Bps not bps. I'll fix the comment accordingly. But 50m/75m over 10% sounds right for the thresholds? [18:22:33] 6operations: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1146028 (10RobH) a:3Dzahn [18:22:45] 100MBps is still wrong [18:22:51] also MBps is not a unit :) [18:22:53] (03CR) 1020after4: [C: 032] Group1 wikis to 1.25wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199303 (owner: 1020after4) [18:23:03] (03PS6) 10Nuria: [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 [18:23:17] I meant fix to 1gbps. [18:23:28] (03Merged) 10jenkins-bot: Group1 wikis to 1.25wmf22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199303 (owner: 1020after4) [18:23:35] 10Ops-Access-Requests, 6operations, 7database: Can't access x1-analytics-slave - https://phabricator.wikimedia.org/T93708#1146032 (10RobH) a:3Springle [18:24:06] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Testing pageviews, logster and wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/197411 (owner: 10Nuria) [18:24:26] (03CR) 10Faidon Liambotis: [C: 04-1] Labs: Monitor network staturation on labstores (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199297 (https://phabricator.wikimedia.org/T92629) (owner: 10coren) [18:25:26] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to php-1.25wmf22 [18:25:35] Logged the message, Master [18:27:07] (03CR) 10coren: Labs: Monitor network staturation on labstores (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/199297 (https://phabricator.wikimedia.org/T92629) (owner: 10coren) [18:27:40] ok, looks like the deployment went smoothly [18:28:08] (03PS2) 10Faidon Liambotis: autoinstall: properly set up misc esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/199293 [18:28:10] (03PS2) 10Faidon Liambotis: maerlant.esams.wikimedia.org -> maerlant.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/199294 [18:28:30] (03PS3) 10coren: Labs: Monitor network staturation on labstores [puppet] - 10https://gerrit.wikimedia.org/r/199297 (https://phabricator.wikimedia.org/T92629) [18:28:41] has something site notice related been deployed right now? 
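[Editor's note: unpacking the Bps/bps mixup above: the labstore uplink is 1 Gbps ≈ 125 MB/s, so the 50 MB/s warning and 75 MB/s critical thresholds sit at roughly 40% and 60% of link capacity. Note also that 50 MB/s is exactly 400 Mbps, the other figure quoted, which is presumably where the bytes-versus-bits confusion crept in.]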
[18:28:49] (03CR) 10Faidon Liambotis: [C: 032] autoinstall: properly set up misc esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/199293 (owner: 10Faidon Liambotis) [18:28:49] the hide link no longer works [18:29:00] (03CR) 10RobH: [C: 032] make twentyafterfour a phabricator-admin [puppet] - 10https://gerrit.wikimedia.org/r/197798 (https://phabricator.wikimedia.org/T93151) (owner: 10Dzahn) [18:29:12] (03PS3) 10Faidon Liambotis: autoinstall: properly set up misc esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/199293 [18:29:13] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:30:19] (03CR) 10Faidon Liambotis: [V: 032] autoinstall: properly set up misc esams hosts [puppet] - 10https://gerrit.wikimedia.org/r/199293 (owner: 10Faidon Liambotis) [18:30:41] 10Ops-Access-Requests, 6operations, 6Phabricator, 6Release-Engineering, 5Patch-For-Review: Mukunda needs sudo on iridium (phab host) - https://phabricator.wikimedia.org/T93151#1146054 (10RobH) 5Open>3Resolved https://gerrit.wikimedia.org/r/#/c/197798/ is now live, and @mmodell has the same sudu right... [18:30:56] twentyafterfour: your sudo rights on phab just merged, live on palladium, will be live on server after its next puppet run [18:31:25] (03PS1) 10Dzahn: correct einsteinium node name [puppet] - 10https://gerrit.wikimedia.org/r/199304 [18:33:09] (03PS2) 10Dzahn: correct einsteinium node name [puppet] - 10https://gerrit.wikimedia.org/r/199304 [18:34:16] (03PS2) 10coren: WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) [18:36:01] 6operations, 7Mail, 7Monitoring: Mailing lists alerts - https://phabricator.wikimedia.org/T93783#1146071 (10faidon) 3NEW [18:36:17] 6operations, 5Patch-For-Review: Setup poolcounter servers for codfw - https://phabricator.wikimedia.org/T93261#1146082 (10RobH) [18:36:19] 6operations, 10ops-codfw: subra/wmf5816 - relabel system / setup mgmt / update racktables - https://phabricator.wikimedia.org/T93272#1146080 (10RobH) 5Open>3Resolved So this is an older system with H310 raid controller (not really raid, just crappy controller). So I've set them to non raid disks and reboo... [18:36:28] (03PS5) 10BBlack: add generic nrpe script check-fresh-files-in-dir.py [puppet] - 10https://gerrit.wikimedia.org/r/198387 [18:36:40] (03CR) 10BBlack: [C: 032 V: 032] add generic nrpe script check-fresh-files-in-dir.py [puppet] - 10https://gerrit.wikimedia.org/r/198387 (owner: 10BBlack) [18:38:22] (03PS21) 10BBlack: protoproxy/sslcert/cache: nginx ssl_stapling_file support [puppet] - 10https://gerrit.wikimedia.org/r/198110 (https://phabricator.wikimedia.org/T86666) [18:38:41] (03CR) 10BBlack: [C: 032 V: 032] protoproxy/sslcert/cache: nginx ssl_stapling_file support [puppet] - 10https://gerrit.wikimedia.org/r/198110 (https://phabricator.wikimedia.org/T86666) (owner: 10BBlack) [18:38:46] (03CR) 10Dzahn: [C: 032] "einsteinium.eqiad.wmnet has address 10.64.48.94" [puppet] - 10https://gerrit.wikimedia.org/r/199304 (owner: 10Dzahn) [18:39:48] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1146106 (10GWicke) [18:40:44] (03CR) 10Dzahn: "einsteinium now got the role: 05-role-Titan-test-host]/ensure: created etc." 
[puppet] - 10https://gerrit.wikimedia.org/r/199304 (owner: 10Dzahn) [18:41:14] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Puppet has 1 failures [18:41:23] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 1 failures [18:41:52] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:13] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:14] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:23] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:24] bblack: i can confirm both changes at once. einsteinium got a /usr/local/sbin/update-ocsp file [18:42:32] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:32] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:33] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:33] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:33] PROBLEM - puppet last run on wtp1014 is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:43] PROBLEM - puppet last run on mc1008 is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:43] PROBLEM - puppet last run on db2041 is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:43] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [18:42:48] (03PS6) 10BBlack: test OCSP Stapling on cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/198388 [18:42:52] PROBLEM - puppet last run on mw2035 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:02] PROBLEM - puppet last run on ocg1002 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:02] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:43:02] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:03] PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:05] uhm [18:43:13] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:13] PROBLEM - puppet last run on analytics1024 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:13] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:32] PROBLEM - puppet last run on labsdb1007 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:32] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:32] PROBLEM - puppet last run on mw2198 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:33] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:33] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:43] Error: /Stage[main]/Base::Monitoring::Host/File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/base/monitoring/check-fresh-files-in-dir.py [18:43:48] looks like strontium thing? 
[18:43:52] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 1 failures [18:43:53] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [18:44:01] yes, it does, because it works here: [18:44:02] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Puppet has 1 failures [18:44:03] PROBLEM - puppet last run on mw2174 is CRITICAL: CRITICAL: Puppet has 1 failures [18:44:03] PROBLEM - puppet last run on mw2025 is CRITICAL: CRITICAL: Puppet has 1 failures [18:44:11] Notice: /Stage[main]/Base::Monitoring::Host/File[/usr/lib/nagios/plugins/check-fresh-files-in-dir.py]/ensure: defined content [18:44:13] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [18:44:20] and that's mw1136 which was in the list of failures [18:44:26] says already up to date [18:44:39] yea, puppet run finished fine when i started it [18:44:57] maybe it's just that all those hosts hit a race condition where one master had the resource defined and the other didn't see the source file yet [18:45:00] sad [18:45:09] it seems like that is the case [18:45:20] it just worked on mw1184, one of the failed hosts [18:45:26] so it's transient too [18:45:35] yeah just a race that's already over [18:45:43] did someone push a mass puppet update? [18:45:57] it does seem odd that so many landed on that one short window [18:46:12] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [18:46:52] actually I guess it's not that odd, my perceptions are wrong [18:47:04] 6operations: detail blockers for procurement in phabricator - https://phabricator.wikimedia.org/T93760#1146124 (10RobH) [18:47:20] if puppet averages a minute per run, and we're running 20 minute spacing, we'd expect 5% of the fleet to hit a race like that [18:47:25] 6operations: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1146126 (10RobH) p:5Triage>3High [18:47:34] * ^d pokes _joe_ gently about https://gerrit.wikimedia.org/r/#/c/197533/ [18:47:37] :) [18:47:38] true [18:47:44] 6operations, 7Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1146129 (10Eevans) I think we need to have graphs setup for most/all of these metrics we are collecting, (including those to be added by https://gerrit.wikimedia.org/r/199264). I kn... [18:47:46] ~30 failures would be about 5% of 600. it's ballpark-ish [18:48:03] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1146131 (10chasemp) [18:48:27] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1146132 (10RobH) a:5RobH>3chasemp @chasemp: I think the above details what I need for procurement in phabricator. Please review and provide feedback.
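[Editor's note: the back-of-envelope above, spelled out: with two puppetmasters (palladium and strontium) syncing a new file independently, any agent whose run overlaps the sync gap can see the new File resource on one master but miss its source on the other. If a run lasts about 1 minute and runs are spaced 20 minutes apart, roughly 1/20 = 5% of agents are mid-run at any instant, and 5% of a ~600-host fleet is ~30 hosts, matching the burst of transient failures seen here.]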
[18:48:43] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:48:51] and then it's about icinga refresh interval too [18:49:22] (03PS3) 10Faidon Liambotis: maerlant.esams.wikimedia.org -> maerlant.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/199294 [18:49:24] (03PS1) 10Faidon Liambotis: autoinstall: switch public1-esams to Debian [puppet] - 10https://gerrit.wikimedia.org/r/199306 [18:49:26] (03CR) 10BBlack: [C: 032] test OCSP Stapling on cp1008 [puppet] - 10https://gerrit.wikimedia.org/r/198388 (owner: 10BBlack) [18:50:26] (03PS2) 10Faidon Liambotis: Kill esams dead/non-existent hosts [dns] - 10https://gerrit.wikimedia.org/r/199287 [18:50:28] (03PS2) 10Faidon Liambotis: Kill toolserver IPv4/IPv6 subnets [dns] - 10https://gerrit.wikimedia.org/r/199288 [18:50:30] (03PS3) 10Faidon Liambotis: Move maerlant out of .esams.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/199289 [18:50:37] (03PS2) 10Faidon Liambotis: autoinstall: switch public1-esams to Debian [puppet] - 10https://gerrit.wikimedia.org/r/199306 [18:50:45] (03CR) 10Faidon Liambotis: [C: 032 V: 032] autoinstall: switch public1-esams to Debian [puppet] - 10https://gerrit.wikimedia.org/r/199306 (owner: 10Faidon Liambotis) [18:53:00] (03PS1) 10BBlack: OCSP bugfix for 8aa4e450 [puppet] - 10https://gerrit.wikimedia.org/r/199307 [18:53:05] Steinsplitter: no longer works? [18:53:22] (03CR) 10BBlack: [C: 032 V: 032] OCSP bugfix for 8aa4e450 [puppet] - 10https://gerrit.wikimedia.org/r/199307 (owner: 10BBlack) [18:54:18] clicking on the hide link = no effect. But the functionality seems back; looks like it was a temporary cache issue. (Purge) [18:56:44] (03PS1) 10BBlack: bugfix for cp1008 test (puppet scoping?) [puppet] - 10https://gerrit.wikimedia.org/r/199308 [18:57:23] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [18:57:43] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:58:03] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:58:12] (03CR) 10BBlack: [C: 032] bugfix for cp1008 test (puppet scoping?)
[puppet] - 10https://gerrit.wikimedia.org/r/199308 (owner: 10BBlack) [18:58:23] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:58:32] RECOVERY - puppet last run on ocg1002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [18:58:52] RECOVERY - puppet last run on analytics1024 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [18:58:53] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:59:03] RECOVERY - puppet last run on labsdb1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:59:04] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:59:04] RECOVERY - puppet last run on mw2198 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:59:23] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:59:23] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:59:32] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:59:33] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [18:59:33] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:59:52] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [18:59:52] RECOVERY - puppet last run on wtp1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:59:53] RECOVERY - puppet last run on mc1008 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [18:59:53] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:00:03] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:00:03] RECOVERY - puppet last run on mw2035 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [19:00:14] RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:00:23] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:00:46] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:01:36] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:02:26] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:03:08] (03PS1) 10BBlack: update-ocsp-all: fix varname, kill normal output [puppet] - 10https://gerrit.wikimedia.org/r/199310 [19:03:26] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:03:35] RECOVERY - puppet last run on mw2025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:03:37] (03CR) 10BBlack: [C: 032 V: 032] update-ocsp-all: fix varname, kill normal output [puppet] - 10https://gerrit.wikimedia.org/r/199310 (owner: 
10BBlack) [19:07:29] 6operations, 7HTTPS, 3HTTPS-by-default, 5Patch-For-Review, 7Performance: HTTPS performance tuning - https://phabricator.wikimedia.org/T86666#1146172 (10BBlack) OCSP Stapling testing on cp1008 looks good so far. I want to leave it on just the test host for a day or two first, though, to observe at least... [19:08:50] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1146180 (10chasemp) @mark are you ok with us approaching this? Previously you wanted to be more comfortable with Phab permissions. What do you think? [19:08:53] 6operations, 7HTTPS, 3HTTPS-by-default, 7Performance: HTTPS performance tuning - https://phabricator.wikimedia.org/T86666#1146182 (10BBlack) [19:09:01] (03CR) 10Faidon Liambotis: [C: 031] "Verified with nmap." [dns] - 10https://gerrit.wikimedia.org/r/199287 (owner: 10Faidon Liambotis) [19:10:35] 6operations, 10ops-codfw: mw2050 management unreachable - https://phabricator.wikimedia.org/T93729#1146194 (10RobH) If I am on iron, and attempt to ssh to mw2050.mgmt.codfw.wmnet, the password for root doesn't work. If I ssh to the IP, it does. debug1: Server host key: RSA f4:4e:df:88:65:1a:89:37:66:a8:be:d9:... [19:11:28] 6operations, 10RESTBase: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1146174 (10GWicke) [19:16:26] 6operations, 10RESTBase: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1146212 (10GWicke) [19:17:33] 6operations, 10RESTBase: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1146224 (10GWicke) p:5Triage>3Normal [19:19:48] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1146243 (10RobH) I suppose the next steps are: add procurement project, add said project to direct email allowance settings, test with drop down [19:20:22] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1146244 (10RobH) We also need to confirm with @mark that we are fine with all users in the #wmf-nda group viewing procurement. [19:21:08] 6operations, 10RESTBase: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1146246 (10GWicke) [19:26:44] 6operations, 6Phabricator, 6Project-Creators: create procurement project - https://phabricator.wikimedia.org/T93796#1146285 (10RobH) 3NEW [19:29:33] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#1146296 (10RobH) a:5RobH>3None [19:33:44] (03PS2) 10Chmarkine: donate - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/199200 (https://phabricator.wikimedia.org/T40516) [19:33:45] 6operations, 6Labs, 10hardware-requests: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1146302 (10RobH) virt1000 is a dual X5647 @ 2.93GHz w/ 32GB. Also if the replacement has to be under warranty, it'll be slightly more challenging. Is there a particular entry... [19:35:20] Hah! Amusing. After I start the rsync I'll probably have a much easier time of catching the 5 minute bandit as /its/ performance will also drop noticeably!
:-) [19:37:38] (03PS2) 10Chmarkine: gdash - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/198469 (https://phabricator.wikimedia.org/T40516) [19:46:29] 6operations, 7Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1146333 (10GWicke) @eevans, it might make sense to add this comment to T88585 / T78514 or create a new task for pre-generating the JSON definition for the grafana dashboards of a cla... [19:48:50] paravoid: https://gerrit.wikimedia.org/r/#/c/199297/ now with fixes! [19:51:07] 6operations, 10Continuous-Integration: fix failures of jenkins job operations-puppet-puppetlint-strict - https://phabricator.wikimedia.org/T93642#1146346 (10Dzahn) This is not really done though. [19:53:36] (03CR) 10Dzahn: [C: 032] dbtree - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/199139 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [19:54:44] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1146356 (10Krenair) >>! In T93760#1146244, @RobH wrote: > We also need to confirm with @mark that we are fine with all users in the #wmf-nda group viewing procurement. That conflicts with this: > Pr... [19:54:54] (03CR) 10Dzahn: "jenkins-bot: you already verified it, why "needs verified", it has it" [puppet] - 10https://gerrit.wikimedia.org/r/199139 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [19:55:18] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/199139 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [19:57:59] (03PS1) 10BryanDavis: Copy l10n CDB files to rebuildLocalisationCache.php tmp dir [tools/scap] - 10https://gerrit.wikimedia.org/r/199318 (https://phabricator.wikimedia.org/T93737) [19:58:36] (03PS1) 10Chmarkine: Add "always" flag when add HSTS header in Apache [puppet] - 10https://gerrit.wikimedia.org/r/199319 [20:00:13] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra compaction is getting behind - https://phabricator.wikimedia.org/T93140#1146385 (10GWicke) We also enabled trickle_fsync, which made a big difference to latency under heavy write load by writing changes out continuously rather than... [20:00:21] (03PS2) 10Chmarkine: Add "always" flag when add HSTS header in Apache [puppet] - 10https://gerrit.wikimedia.org/r/199319 (https://phabricator.wikimedia.org/T40516) [20:00:27] (03PS1) 10Cenarium: Tweaks to testwiki/enwiki flaggedrevs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 [20:02:05] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1146395 (10Dzahn) The default access policy should be to be open (to people who already signed NDAs) unless we have $reason why these are different from other private things. [20:03:23] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1146400 (10Eevans) I think we need to have graphs setup for most/all of these metrics we are collecting, (including those to...
[20:04:18] 6operations, 7Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1143848 (10Eevans) @Gwicke whoops, yeah I meant to apply that to T78514 [20:06:05] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 6Scrum-of-Scrums: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1146416 (10greg) [20:08:15] (03PS2) 10Chmarkine: doc - Enable HSTS max-age=7 days [puppet] - 10https://gerrit.wikimedia.org/r/198819 (https://phabricator.wikimedia.org/T40516) [20:08:21] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 6Scrum-of-Scrums: Jenkins is using php-luasandbox 1.9-1 for zend unit tests; precise should be upgraded to 2.0-7+wmf2.1 or equivalent - https://phabricator.wikimedia.org/T88798#1146425 (10greg) [20:10:51] 6operations, 10ops-codfw: subra/wmf5816 - relabel system / setup mgmt / update racktables - https://phabricator.wikimedia.org/T93272#1146439 (10Dzahn) the installer now detects disks and formats them. after being done it keeps PXE booting though, so it's a cycle of reinstalling and rebooting. i went to BIOS a... [20:11:45] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 42.94 ms [20:15:19] akosiaris: Do you know who in the maps project would best know what should/should not need to be backed up in their project? (Like, can some rendered tiles be excluded, etc) [20:16:16] PROBLEM - salt-minion processes on labstore2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [20:17:17] (03PS2) 10Chmarkine: scholarships - Increase HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/199126 (https://phabricator.wikimedia.org/T40516) [20:17:24] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1146449 (10Eevans) Also, I think we need availability monitoring of Cassandra that goes beyond monitoring the process. Thoug... [20:31:17] Coren: file a bug and cc him? :) [20:31:18] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1146464 (10Eevans) As for alerts, we also need a small number of threshold-generated notifications for some of these performa... [20:31:43] YuviPanda: It's not bugworthy; it's in the "nice to have" category [20:32:17] Coren: hmm, alright. but I guess having it ‘fixed’ might increase efficiency of the codfw rsyncs / backups? [20:33:15] It will, but then so would rm -rf. :-) I'm hoping the prodding on the lists + informal chats will do. :-) [20:33:33] I don't want to mandate it in anyways because the process has to be robust enough even if nobody does it. [20:33:59] Coren: idk, I’d rather have us get this up on phab and have it documented. More visibility, and you aren’t the only bottleneck, etc :) but ok if you think it’s not worth the effort. [20:35:02] That's odd. I seem to be missing a disk in labstore2001 [20:35:19] !log rebooting labstore2001 to look at its bios [20:35:24] Logged the message, Master [20:37:06] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra compaction is getting behind - https://phabricator.wikimedia.org/T93140#1146484 (10Eevans) @GWicke, I think it should be closed; The original issue is for all intents solved. 
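On Eevans' point above that Cassandra availability monitoring should go beyond checking for the process: the cheapest step up is probing the client-facing CQL port rather than the pid, which catches a JVM that is alive but not serving. A sketch of that idea, assuming the default native-transport port 9042; a fuller check would log in and run a trivial query via a driver, which this skips:

```python
#!/usr/bin/env python3
# Rough sketch of the idea in T78514: check that Cassandra is actually
# accepting client connections, not merely that the process exists.
import socket
import sys

def cql_port_open(host, port=9042, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == '__main__':
    host = sys.argv[1] if len(sys.argv) > 1 else 'localhost'
    ok = cql_port_open(host)
    print('%s CQL port %s' % (host, 'open' if ok else 'closed'))
    # Icinga-style exit codes: 0 = OK, 2 = CRITICAL.
    sys.exit(0 if ok else 2)
```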
We can track the threshold alert in T78514, or create a... [20:37:30] 6operations, 10RESTBase: (nodetool) cleanup needed on restbase1006 - https://phabricator.wikimedia.org/T93079#1146489 (10Eevans) [20:37:31] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Cassandra compaction is getting behind - https://phabricator.wikimedia.org/T93140#1146487 (10Eevans) 5Open>3Resolved [20:37:56] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [20:40:38] Ah, replaced disk that was not onlined. [20:42:24] ... that was part of the os array. [20:42:27] * Coren curses. [20:47:00] (03PS3) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [20:47:57] YuviPanda: addressed your comments on the diamond collector...except still no error handling [20:49:06] (03CR) 10jenkins-bot: [V: 04-1] VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) (owner: 1020after4) [20:49:26] RECOVERY - Host labstore2001 is UP: PING OK - Packet loss = 0%, RTA = 43.16 ms [20:51:07] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail [20:59:20] (03PS4) 1020after4: VarnishStatusCollector for diamond. [puppet] - 10https://gerrit.wikimedia.org/r/199302 (https://phabricator.wikimedia.org/T88705) [21:01:53] 6operations, 10Deployment-Systems, 6Services: Evaluate Docker as a container deployment tool - https://phabricator.wikimedia.org/T93439#1146530 (10GWicke) [21:08:26] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:20:22] (03CR) 10Hashar: "This might have broken the l10nupdate on the beta cluster. Since March 17th 17:00 UTC, it keeps rebuilding the whole l10n cache :( T93737" [puppet] - 10https://gerrit.wikimedia.org/r/197355 (https://phabricator.wikimedia.org/T88442) (owner: 10Yuvipanda) [21:25:26] PROBLEM - puppet last run on vanadium is CRITICAL: CRITICAL: puppet fail [21:29:53] (03PS1) 10RobH: setting francium install params + raid10 partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/199434 (https://phabricator.wikimedia.org/T93113) [21:32:23] (03CR) 10RobH: [C: 032] setting francium install params + raid10 partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/199434 (https://phabricator.wikimedia.org/T93113) (owner: 10RobH) [21:39:06] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1484 bytes in 0.304 second response time [21:42:37] RECOVERY - puppet last run on vanadium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:42:44] (03PS1) 10BBlack: explicit service pathname for update-ocsp-all [puppet] - 10https://gerrit.wikimedia.org/r/199506 [21:43:03] (03CR) 10BBlack: [C: 032 V: 032] explicit service pathname for update-ocsp-all [puppet] - 10https://gerrit.wikimedia.org/r/199506 (owner: 10BBlack) [21:45:28] 6operations, 5Patch-For-Review: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1146653 (10RobH) [21:54:52] /join mediawiki [21:54:52] /join wikimedia [21:54:52] /join mediawiki-parsoid [21:54:53] /join mediawiki-visualeditor [21:54:53] /join wikimedia-collaboration [21:54:53] /join wikimedia-labs [21:54:53] /join wikimedia-operations [21:54:54] /join mediawiki [21:54:54] /join wikimedia [21:54:55] /join mediawiki-parsoid [21:54:55] /join mediawiki-visualeditor 
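For context on the VarnishStatusCollector review above (and its still-missing error handling): a Diamond collector subclasses diamond.collector.Collector, implements collect(), and emits metrics via self.publish(). The sketch below is a guess at the general shape, not the patch under review; it assumes `varnishstat -1 -j` for JSON counter output and shows the kind of error handling the review asked for:

```python
# Hypothetical sketch in the spirit of the VarnishStatusCollector change
# above -- not the actual patch under review.
import json
import subprocess

import diamond.collector


class VarnishStatusCollector(diamond.collector.Collector):

    def collect(self):
        try:
            out = subprocess.check_output(['varnishstat', '-1', '-j'])
            stats = json.loads(out)
        except (OSError, subprocess.CalledProcessError, ValueError) as e:
            # The missing error handling: log and skip this cycle instead
            # of letting the exception propagate into the diamond agent.
            self.log.error('varnishstat failed: %s', e)
            return
        for name, field in stats.items():
            # varnishstat's JSON includes a non-counter "timestamp" key;
            # only publish entries that actually carry a value.
            if isinstance(field, dict) and 'value' in field:
                self.publish(name.replace('.', '_'), field['value'])
```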
[21:54:56] /join wikimedia-collaboration [21:54:56] /join wikimedia-labs [21:54:57] /join wikimedia-operations [21:55:08] whoops [21:55:15] fail, Negative24 ;) [21:55:29] indeed [21:56:07] so fail [21:57:32] Platonides: That is an example of how to not do scripting in CIRC :) [21:59:11] 6operations, 10ops-eqiad: install 4 * 3TB disks in francium - https://phabricator.wikimedia.org/T93114#1146683 (10RobH) 5Resolved>3Open SDC is showing some kind of error in the installer. It also shows some init error during post, but I cannot make it out via redirection (as one line overwrites the others... [21:59:12] 6operations, 5Patch-For-Review: deploy francium for html/zim dumps - https://phabricator.wikimedia.org/T93113#1146685 (10RobH) [21:59:24] 6operations, 10ops-eqiad: install 4 * 3TB disks in francium - sdc error - https://phabricator.wikimedia.org/T93114#1146688 (10RobH) [22:03:46] 7Blocked-on-Operations, 6Scrum-of-Scrums, 6Zero, 6Zero-Team, 7Varnish: Some traffic is not identified as Zero in Varnish - https://phabricator.wikimedia.org/T88366#1146696 (10DFoy) [22:05:15] 6operations, 10ops-eqiad: install 4 * 3TB disks in francium - sdc error - https://phabricator.wikimedia.org/T93114#1146704 (10RobH) I cannot quite get the error off the POST, but installer shows: │ Input/output error during read on /dev/sdc │ [22:05:48] deploying WikiGrok updates with Greg's permission... [22:07:25] 6operations, 5Patch-For-Review: Make puppet the sole manager of user keys - https://phabricator.wikimedia.org/T92475#1146721 (10RobH) We had gone and removed all the rogue accounts on the cluster, so this seemed like the next logical step to locking down user accounts and access. Mostly though so someone woul... [22:20:44] !log Created wikigrok_claims and wikigrok_responses tables on wikidatawiki and testwikidatawiki. Before that, accidentally created on enwiki, so had to uncreate. [22:20:51] Logged the message, Master [22:20:58] (03PS1) 10BBlack: improved error capture for OCSP updater [puppet] - 10https://gerrit.wikimedia.org/r/199512 [22:21:52] 6operations, 6MediaWiki-Core-Team, 7Wikimedia-log-errors: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1146814 (10chasemp) p:5Unbreak!>3High [22:21:55] fyi aude, hoo and Lydia_WMDE, I'm deploying WikiGrok updates that introduce response storage (not aggregation or anything) [22:22:24] MaxSem: thanks for the heads up [22:22:46] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet last ran 10 hours ago [22:24:05] Puppet on osmium is me, I just re-enabled it after disabling it last night. [22:24:10] 6operations, 10hardware-requests, 3wikis-in-codfw: setup deployment server in codfw (tin equivalent) - https://phabricator.wikimedia.org/T91678#1146838 (10RobH) a:3mark @mark, Can you offer some insight on how to proceed on this? I recall that you proposed it sit in a VM, but the reasoning from @bd808 su... [22:25:46] 6operations, 10MediaWiki-extensions-Sentry, 6Multimedia, 10hardware-requests: Procure hardware for Sentry - placeholder (not a live request) - https://phabricator.wikimedia.org/T93138#1146863 (10RobH) I'm pulling this off hardware-requests, as that project is for active hardware requests. Once you guys ha... 
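The francium failure above ("Input/output error during read on /dev/sdc") is the classic signature of a failing disk rather than a partman problem. A generic triage sketch, assuming smartmontools is installed, Python 3.7+, and root access; this is illustrative, not the procedure actually used on francium:

```python
#!/usr/bin/env python3
# Generic triage for a disk throwing read errors: ask SMART for overall
# health, then attempt a short raw read of the device (needs root).
import subprocess

DISK = '/dev/sdc'  # the suspect disk

# smartctl exits non-zero on failing health, so capture rather than check.
smart = subprocess.run(['smartctl', '-H', DISK],
                       capture_output=True, text=True)
print(smart.stdout)

# Reading the first 64 MiB; an I/O error here points at the medium or
# controller rather than, say, a partitioning problem.
try:
    with open(DISK, 'rb') as f:
        f.read(64 * 1024 * 1024)
    print('raw read OK')
except OSError as e:
    print('raw read failed: %s' % e)
```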
[22:25:55] 6operations, 10MediaWiki-extensions-Sentry, 6Multimedia: Procure hardware for Sentry - placeholder (not a live request) - https://phabricator.wikimedia.org/T93138#1146864 (10RobH) [22:26:11] 7Blocked-on-Operations, 7Puppet, 6operations, 10Beta-Cluster: Setup a mediawiki03 (or what not) on Beta Cluster that we can direct the security scanning work to - https://phabricator.wikimedia.org/T72181#1146865 (10greg) [22:26:38] (03CR) 10BBlack: [C: 032 V: 032] improved error capture for OCSP updater [puppet] - 10https://gerrit.wikimedia.org/r/199512 (owner: 10BBlack) [22:27:48] !log maxsem Synchronized php-1.25wmf21/extensions/WikiGrok/: bump (duration: 00m 11s) [22:27:56] Logged the message, Master [22:28:10] o_0, we are syncing to codfw apaches! [22:28:17] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail [22:28:21] (also, it's slower:P) [22:30:21] (03PS1) 10Chmarkine: noc - redirect HTTP to HTTPS; enable HSTS 7 days [puppet] - 10https://gerrit.wikimedia.org/r/199515 [22:30:35] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [22:31:42] (03PS2) 10Chmarkine: noc - redirect HTTP to HTTPS; enable HSTS 7 days [puppet] - 10https://gerrit.wikimedia.org/r/199515 (https://phabricator.wikimedia.org/T40516) [22:34:02] !log maxsem Synchronized php-1.25wmf22/extensions/WikiGrok/: bump (duration: 00m 11s) [22:34:07] Logged the message, Master [22:40:13] 10Ops-Access-Requests, 6operations, 7database: Can't access x1-analytics-slave - https://phabricator.wikimedia.org/T93708#1146914 (10Springle) Fixed. The box handling x1-analytics-slave CNAME fell off the radar for `research` user password updates. Leaving this ticket open until I fix puppet. [22:41:06] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:43:08] 10Ops-Access-Requests, 6operations, 7database: Can't access x1-analytics-slave - https://phabricator.wikimedia.org/T93708#1146917 (10Springle) Actually, I think the `research` grant might be slightly wrong too for X1 as it uses `%wiki%` rather than `%wik%` (no `i`). That means presently databases like `enwik... [22:44:16] PROBLEM - dhclient process on labstore2001 is CRITICAL: Connection refused by host [22:44:16] PROBLEM - configured eth on labstore2001 is CRITICAL: Connection refused by host [22:44:26] PROBLEM - puppet last run on labstore2001 is CRITICAL: Connection refused by host [22:44:45] PROBLEM - NTP on labstore2001 is CRITICAL: NTP CRITICAL: No response from NTP server [22:44:46] PROBLEM - RAID on labstore2001 is CRITICAL: Connection refused by host [22:45:22] PROBLEM - DPKG on labstore2001 is CRITICAL: Connection refused by host [22:45:26] PROBLEM - Disk space on labstore2001 is CRITICAL: Connection refused by host [22:46:12] 6operations, 6Mobile-Apps, 6Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1146921 (10bearND) The repos are in Gerrit now: https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/services/mobileapps https://gerrit.wikimedia.org/r/#/admin/projects/... [22:47:06] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:48:37] Bah. Missed my window. 
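To make Springle's `%wiki%` vs `%wik%` point above concrete: MySQL database-level grants interpret `%` and `_` as LIKE wildcards, so the two patterns authorize different sets of databases. The truncated comment doesn't name the affected databases; `enwiktionary` below is used purely as an illustration of a name that contains `wik` without containing `wiki`. A small sketch, assuming Python 3:

```python
#!/usr/bin/env python3
# Illustrate why a grant on "%wiki%" misses databases a "%wik%" grant covers.
import re

def like_match(pattern, name):
    # Translate SQL LIKE to a regex: "%" -> ".*", "_" -> ".", rest literal.
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')
        elif ch == '_':
            parts.append('.')
        else:
            parts.append(re.escape(ch))
    return re.fullmatch(''.join(parts), name) is not None

for db in ('enwiki', 'enwiktionary', 'enwikibooks'):
    print(db, like_match('%wiki%', db), like_match('%wik%', db))
# enwiki        True  True
# enwiktionary  False True   <- matched only by the broader pattern
# enwikibooks   True  True
```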
[22:51:02] (03PS1) 10Chmarkine: transparency: make it HTTPS only and enable HSTS [puppet] - 10https://gerrit.wikimedia.org/r/199517 (https://phabricator.wikimedia.org/T40516) [22:58:39] Is gwicke the only one with a patch for swat today? Or is that list about to get a ton of stuff added? [22:58:43] jouncebot, next [22:58:43] In 0 hour(s) and 1 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150324T2300) [22:59:11] Krenair: you know how it goes :) [22:59:28] gwicke, want to make the submodule updates? [23:00:04] RoanKattouw, ^d, Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150324T2300). [23:00:08] Krenair: I might not call it 'wanting'.. but I'll do it [23:00:11] haha [23:00:19] I can do it if you don't want to [23:00:40] Krenair: I would appreciate it [23:01:48] Krenair, I need to push a fix urgently, need 2 mins [23:01:53] okay, go ahead [23:03:56] !log maxsem Synchronized php-1.25wmf21/extensions/WikiGrok: (no message) (duration: 00m 13s) [23:04:01] Logged the message, Master [23:04:42] !log maxsem Synchronized php-1.25wmf22/extensions/WikiGrok: (no message) (duration: 00m 12s) [23:04:47] Logged the message, Master [23:05:20] Krenair, I'm done, thanks [23:05:24] yw [23:07:13] MaxSem: rolled back or fixed it with new patch? [23:07:31] greg-g, fixed [23:07:38] * greg-g nods [23:07:52] gwicke, I might have broken my local branch trying to do that [23:08:23] 6operations, 10Analytics, 6Scrum-of-Scrums, 10Wikipedia-App-Android-App, and 3 others: Avoid cache fragmenting URLs for Share a Fact shares - https://phabricator.wikimedia.org/T90606#1147030 (10dr0ptp4kt) https://gerrit.wikimedia.org/r/#/c/198805/ submitted for review for Varnish. [23:09:42] Krenair: ouch, broken branches are painful [23:09:59] Krenair: let me break my own [23:11:59] gwicke, try wmf21 [23:12:02] I think my wmf22 is still ok [23:12:57] oh ok [23:13:01] looks like you got it [23:13:12] Krenair: https://gerrit.wikimedia.org/r/#/c/199527/ and https://gerrit.wikimedia.org/r/#/c/199527/ [23:13:26] eh, stupid Chrome copy bug [23:13:28] https://gerrit.wikimedia.org/r/#/c/199526/ [23:13:49] let me add those on the deployments page [23:14:14] when will RestBaseUpdateJobs be using the proper system of extension deployment branching? [23:15:11] Krenair: right now it would just add another step for us [23:15:34] would that be easier to handle for you? [23:15:53] I'd prefer it to just be consistent with everyone else [23:16:47] I think Jenkins is stuck [23:17:27] it's processing jobs... [23:17:28] or at least we're a long way behind in the queue [23:18:05] it's all waiting on this: https://integration.wikimedia.org/ci/job/mediawiki-phpunit-zend/4092/C [23:18:08] -C [23:18:26] like 9th [23:18:29] sigh [23:18:42] 7320 / 9705 ( 75%) [23:18:55] that is making progress though [23:18:57] 93% [23:18:57] so I'll let it run [23:19:02] * greg-g nods [23:19:03] please do [23:19:21] (let's just hope it passes :) ) [23:19:25] 7Puppet, 6Phabricator, 5Patch-For-Review: Phabricator labs repo management isn't configured - https://phabricator.wikimedia.org/T93615#1147090 (10Negative24) 5Open>3Resolved [23:19:30] if it fails, everything will requeue :) [23:20:20] ... I think it just re-queued everything anyway greg-g? [23:20:54] huh... 
not sure what happened there, that job completed successfully [23:21:03] 23:19:48 Finished: SUCCESS [23:33:39] gwicke, zuul restarted, we're at the front of the queue now [23:35:29] Krenair: pole position ftw! [23:37:12] (03CR) 10Tim Landscheidt: Ensure that apt preferences are named *.pref (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/195081 (https://phabricator.wikimedia.org/T60681) (owner: 10Tim Landscheidt) [23:37:53] bit ridiculous we're 25 minutes in though [23:39:12] yeah; I don't envy the masters of CI, having to deal with arbitrary code with (currently) very little in the way of resource limitations or isolation [23:40:25] I don't either [23:40:34] I just don't think we should depend on it so heavily while it's in this state [23:41:21] at least we have an override button [23:41:40] Which people don't like us to use [23:42:14] understandably; but if all breaks at least we aren't completely paralyzed [23:45:19] wtf [23:45:33] ohhh [23:45:38] different case in file names [23:45:42] that's... helpful >_> [23:46:05] gwicke, wmf22 syncing [23:46:10] yeah, it's not very consistent [23:46:10] !log krenair Synchronized php-1.25wmf22/extensions/RestBaseUpdateJobs/RestbaseUpdateJob.php: https://gerrit.wikimedia.org/r/#/c/199527/1 (duration: 00m 11s) [23:46:17] Logged the message, Master [23:46:19] -rw-rw-r-- 1 krenair wikidev 4354 Mar 23 23:22 RestbaseUpdate.hooks.php [23:46:20] -rw-rw-r-- 1 krenair wikidev 8113 Mar 24 23:43 RestbaseUpdateJob.php [23:46:20] -rw-rw-r-- 1 krenair wikidev 130 Mar 23 23:22 RestBaseUpdateJobs.php [23:46:20] -rw-rw-r-- 1 krenair wikidev 2951 Mar 23 23:22 RestbaseUpdate.php [23:46:54] RestBase - I was wondering why the hell I was getting jobS.php rather than job.php [23:46:58] * gwicke keeps an eye on logstash [23:47:23] Krenair: I missed that in the initial review [23:47:43] should rename it really to reflect the extension directory [23:47:55] still waiting on jenkins for wmf21 [23:48:01] let me know when you're happy with wmf22 [23:48:08] looks good so far [23:50:45] Aaaaaaugh PERC! [23:51:00] Well, "megaraid in general" [23:51:38] Krenair: still no issues in logstash [23:52:15] I just spent four fscking hours trying to debug a boot issue before I noticed that the H700 controller "helpfully" swapped sda and sdb because it remembered some other order previously. [23:52:28] gwicke, happy for wmf21 then? [23:52:34] Krenair: yup [23:52:36] go for it [23:52:55] the changes are quite small & were tested [23:53:16] syncing [23:53:25] !log krenair Synchronized php-1.25wmf21/extensions/RestBaseUpdateJobs/RestbaseUpdateJob.php: https://gerrit.wikimedia.org/r/#/c/199526/ (duration: 00m 14s) [23:53:33] Logged the message, Master [23:53:48] wow, we deploy to >450 hosts now? I think that's larger than a few days ago? [23:53:59] is codfw in that set maybe? [23:54:16] Krenair: feel the power?
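The four-hour H700 debugging session above is why boot and fstab configuration is better keyed to stable identifiers than to sda/sdb ordering, which RAID controllers are free to shuffle between boots. A minimal sketch that lists the persistent /dev/disk/by-id names alongside the kernel names they currently resolve to (Linux-only; assumes udev has populated the directory):

```python
#!/usr/bin/env python3
# Map stable disk identifiers to their current kernel device names, so
# configuration can reference IDs instead of relying on enumeration order.
import os

BYID = '/dev/disk/by-id'

for link in sorted(os.listdir(BYID)):
    target = os.path.realpath(os.path.join(BYID, link))
    print('%-60s -> %s' % (link, target))
```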
[23:54:17] it is [23:54:21] hah [23:54:29] ah that'll be it, codfw [23:54:40] (03Draft1) 10Negative24: Configure Puppet to use phd group setting [puppet] - 10https://gerrit.wikimedia.org/r/199538 [23:55:19] (03CR) 10Alex Monk: [C: 032] Use elseif instead of else if [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192894 (owner: 10Southparkfan) [23:55:34] (03Merged) 10jenkins-bot: Use elseif instead of else if [mediawiki-config] - 10https://gerrit.wikimedia.org/r/192894 (owner: 10Southparkfan) [23:55:52] Krenair: looks all good, job request continue at same rate etc [23:55:58] great [23:56:07] doing some trivial config change from my list [23:56:50] !log krenair Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/192894/ - should be a noop (duration: 00m 11s) [23:56:55] Logged the message, Master [23:57:48] ok [23:58:01] Krenair: thanks for handling the deploy, despite all the waiting! [23:58:06] RECOVERY - DPKG on labstore2001 is OK: All packages OK [23:58:06] RECOVERY - Disk space on labstore2001 is OK: DISK OK [23:58:12] hah, thanks for being patient [23:58:27] RECOVERY - dhclient process on labstore2001 is OK: PROCS OK: 0 processes with command name dhclient [23:58:27] RECOVERY - configured eth on labstore2001 is OK: NRPE: Unable to read output [23:58:45] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [23:59:06] RECOVERY - RAID on labstore2001 is OK: OK: optimal, 60 logical, 60 physical