[00:24:10] (PS3) Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278836 (https://phabricator.wikimedia.org/T124200)
[00:24:12] (PS6) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[00:24:14] (PS4) Alexandros Kosiaris: Add the role::ores::redis class [puppet] - https://gerrit.wikimedia.org/r/278758 (https://phabricator.wikimedia.org/T124200)
[00:34:46] (PS4) Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278836 (https://phabricator.wikimedia.org/T124200)
[00:34:48] (PS7) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[00:34:50] (PS5) Alexandros Kosiaris: Add the role::ores::redis class [puppet] - https://gerrit.wikimedia.org/r/278758 (https://phabricator.wikimedia.org/T124200)
[00:39:04] (PS5) Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278836 (https://phabricator.wikimedia.org/T124200)
[00:39:06] (PS8) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[00:39:08] (PS6) Alexandros Kosiaris: Add the role::ores::redis class [puppet] - https://gerrit.wikimedia.org/r/278758 (https://phabricator.wikimedia.org/T124200)
[00:39:57] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[00:41:47] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[00:44:12] (PS6) Alexandros Kosiaris: ores: Collapse the redis configs into one
stanza [puppet] - https://gerrit.wikimedia.org/r/278836 (https://phabricator.wikimedia.org/T124200)
[00:44:14] (PS9) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[00:44:16] (PS7) Alexandros Kosiaris: Add the role::ores::redis class [puppet] - https://gerrit.wikimedia.org/r/278758 (https://phabricator.wikimedia.org/T124200)
[00:47:07] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[00:47:37] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[00:47:38] akosiaris: it's late!
[00:48:42] (CR) Alexandros Kosiaris: "I 'd like to merge this. Just tested it on labs VMs and works as expected. Merging it however does mean a brief period of downtime for ORE" [puppet] - https://gerrit.wikimedia.org/r/278836 (https://phabricator.wikimedia.org/T124200) (owner: Alexandros Kosiaris)
[00:48:56] ori: yes it is
[00:48:58] point taken
[00:49:05] I am going to sleep. c ya
[00:49:22] :)
[00:49:25] good night
[00:52:16] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[00:53:56] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[01:03:06] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[01:03:28] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.044 second response time on port 9042
[01:08:38] PROBLEM - Apache HTTP on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:09:27] PROBLEM - HHVM rendering on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:10:16] PROBLEM - salt-minion processes on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:10:26] PROBLEM - nutcracker port on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:10:27] PROBLEM - Check size of conntrack table on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:10:36] PROBLEM - SSH on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:10:47] PROBLEM - RAID on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:10:58] PROBLEM - dhclient process on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:11:07] PROBLEM - configured eth on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:11:07] PROBLEM - nutcracker process on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:11:08] PROBLEM - Disk space on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:11:37] PROBLEM - HHVM processes on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:11:37] PROBLEM - puppet last run on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:11:46] PROBLEM - DPKG on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:29:38] hey, ori, if you have time, please check this patch: https://gerrit.wikimedia.org/r/#/c/278762/
[01:29:46] it's very simple
[01:36:28] RECOVERY - salt-minion processes on mw1126 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:36:37] RECOVERY - nutcracker port on mw1126 is OK: TCP OK - 0.000 second response time on port 11212
[01:41:58] PROBLEM - nutcracker port on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:43:37] PROBLEM - salt-minion processes on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:44:29] (PS3) Ori.livneh: Flake8 for HHVM [puppet] - https://gerrit.wikimedia.org/r/278762 (owner: Ladsgroup)
[01:45:40] (CR) Ori.livneh: [C: 2 V: 2] "The change to the gdb printer is not safe, because the fact that the variable is unused in the file does not mean that it cannot be import" [puppet] - https://gerrit.wikimedia.org/r/278762 (owner: Ladsgroup)
[02:02:25] thanks ori :)
[02:05:06] PROBLEM - nutcracker port on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:20:26] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[02:21:57] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[02:32:37] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[02:32:47] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.040 second response time on port 9042
[02:35:26] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:36:08] PROBLEM - HHVM rendering on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:36:18] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:36:27] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:36:56] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:37:37] PROBLEM - puppet last run on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:37:56] PROBLEM - puppet last run on mw1121 is CRITICAL: CRITICAL: puppet fail
[02:37:57] PROBLEM - Check size of conntrack table on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:38:16] PROBLEM - nutcracker port on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:38:16] PROBLEM - Disk space on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:38:26] PROBLEM - SSH on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:38:27] PROBLEM - salt-minion processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:38:27] PROBLEM - HHVM processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:38:36] RECOVERY - DPKG on mw1140 is OK: All packages OK
[02:38:46] PROBLEM - dhclient process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:38:46] PROBLEM - configured eth on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:38:56] PROBLEM - RAID on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:38:56] PROBLEM - nutcracker process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:39:06] PROBLEM - DPKG on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:44:58] RECOVERY - Disk space on mw1136 is OK: DISK OK
[02:44:58] RECOVERY - nutcracker port on mw1136 is OK: TCP OK - 0.000 second response time on port 11212
[02:45:17] RECOVERY - SSH on mw1136 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[02:45:17] RECOVERY - salt-minion processes on mw1136 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[02:45:17] RECOVERY - HHVM processes on mw1136 is OK: PROCS OK: 6 processes with command name hhvm
[02:45:28] RECOVERY - dhclient process on mw1136 is OK: PROCS OK: 0 processes with command name dhclient
[02:45:36] RECOVERY - configured eth on mw1136 is OK: OK - interfaces up
[02:45:37] RECOVERY - RAID on mw1136 is OK: OK: no RAID installed
[02:45:46] RECOVERY - nutcracker process on mw1136 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[02:45:57] RECOVERY - DPKG on mw1136 is OK: All packages OK
[02:46:16] RECOVERY - Disk space on mw1126 is OK: DISK OK
[02:46:16] RECOVERY - dhclient process on mw1126 is OK: PROCS OK: 0 processes with command name dhclient
[02:46:16] RECOVERY - puppet
last run on mw1136 is OK: OK: Puppet is currently enabled, last run 26 minutes ago with 0 failures
[02:46:17] RECOVERY - nutcracker process on mw1126 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[02:46:36] RECOVERY - Check size of conntrack table on mw1136 is OK: OK: nf_conntrack is 0 % full
[02:50:07] RECOVERY - HHVM processes on mw1126 is OK: PROCS OK: 6 processes with command name hhvm
[02:51:27] PROBLEM - dhclient process on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:51:27] PROBLEM - nutcracker process on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:51:28] PROBLEM - Disk space on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:53:07] RECOVERY - nutcracker process on mw1126 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[02:53:07] RECOVERY - dhclient process on mw1126 is OK: PROCS OK: 0 processes with command name dhclient
[02:53:16] RECOVERY - Disk space on mw1126 is OK: DISK OK
[02:54:06] RECOVERY - nutcracker port on mw1126 is OK: TCP OK - 0.000 second response time on port 11212
[02:54:06] RECOVERY - SSH on mw1126 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[02:55:18] PROBLEM - HHVM processes on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:58:27] PROBLEM - nutcracker process on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:58:27] PROBLEM - dhclient process on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:58:28] PROBLEM - Disk space on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:59:17] PROBLEM - nutcracker port on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:59:27] PROBLEM - SSH on mw1126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:00:07] RECOVERY - dhclient process on mw1126 is OK: PROCS OK: 0 processes with command name dhclient
[03:00:07] RECOVERY - nutcracker process on mw1126 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[03:00:16] RECOVERY - Disk space on mw1126 is OK: DISK OK
[03:00:17] RECOVERY - configured eth on mw1126 is OK: OK - interfaces up
[03:00:37] RECOVERY - HHVM processes on mw1126 is OK: PROCS OK: 6 processes with command name hhvm
[03:00:38] RECOVERY - DPKG on mw1126 is OK: All packages OK
[03:04:08] RECOVERY - puppet last run on mw1121 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[03:05:37] PROBLEM - nutcracker process on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:05:37] PROBLEM - dhclient process on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:05:46] PROBLEM - Disk space on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:05:46] PROBLEM - configured eth on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:06:06] PROBLEM - DPKG on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:06:06] PROBLEM - HHVM processes on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:11:07] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[03:11:07] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[03:19:37] PROBLEM - configured eth on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:19:37] PROBLEM - Disk space on mw1126 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:32:16] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[03:32:17] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[04:18:03] (CR) Andrew Bogott: [C: 1] Flake8 on openstack, part I [puppet] - https://gerrit.wikimedia.org/r/278761 (owner: Ladsgroup)
[04:18:09] (PS2) Andrew Bogott: Flake8 on openstack, part I [puppet] - https://gerrit.wikimedia.org/r/278761 (owner: Ladsgroup)
[04:19:39] (CR) Andrew Bogott: [C: 2] Flake8 on openstack, part I [puppet] - https://gerrit.wikimedia.org/r/278761 (owner: Ladsgroup)
[04:19:58] PROBLEM - NTP on mw1126 is CRITICAL: NTP CRITICAL: No response from NTP server
[04:44:47] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:47:17] PROBLEM - puppet last run on mw1122 is CRITICAL: CRITICAL: Puppet has 75 failures
[04:51:15] (Abandoned) KartikMistry: CX: Use config.yaml to read registry [puppet] - https://gerrit.wikimedia.org/r/260575 (owner: KartikMistry)
[04:51:52] (Abandoned) KartikMistry: Beta: Add cxserver registry to Beta [puppet] - https://gerrit.wikimedia.org/r/266668 (owner: KartikMistry)
[05:57:38] RECOVERY - DPKG on mw1126 is OK: All packages OK
[05:57:56] RECOVERY - HHVM processes on mw1126 is OK: PROCS OK: 12 processes with command name hhvm
[05:58:07] RECOVERY - SSH on mw1126 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0)
[05:58:07] RECOVERY - NTP on mw1126 is OK: NTP OK: Offset -0.02866220474 secs
[05:58:07] RECOVERY - salt-minion processes on mw1126 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[05:58:07] RECOVERY - nutcracker port on mw1126 is OK: TCP OK - 0.000 second response time on port 11212
[05:58:08] RECOVERY - Check size of conntrack table on mw1126 is OK: OK: nf_conntrack is 0 % full
[05:58:58] RECOVERY - nutcracker process on mw1126 is OK: PROCS OK: 1 process
with UID = 108 (nutcracker), command name nutcracker
[05:58:58] RECOVERY - dhclient process on mw1126 is OK: PROCS OK: 0 processes with command name dhclient
[05:58:58] RECOVERY - configured eth on mw1126 is OK: OK - interfaces up
[05:58:58] RECOVERY - RAID on mw1126 is OK: OK: no RAID installed
[05:58:58] RECOVERY - Disk space on mw1126 is OK: DISK OK
[05:59:16] RECOVERY - HHVM rendering on mw1126 is OK: HTTP OK: HTTP/1.1 200 OK - 67302 bytes in 1.087 second response time
[05:59:57] RECOVERY - Apache HTTP on mw1126 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.046 second response time
[06:01:37] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:04:50] <_joe_> !log restarted hhvm on mw1122, memory leak
[06:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:14:48] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:15:34] Hoi, https://en.wikipedia.org/wiki/Category:West_Virginia_University_faculty has NULL as output
[06:15:39] what is wrong ?
[06:15:41] _joe_: You seeing #wikimedia-tech ?
[06:18:52] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139824 (Peachey88)
[06:23:31] <_joe_> Leah: nope, and it's 7 AM here, what's up?
[06:23:59] _joe_: Multiple users reporting content pages displaying "NULL" compared to desired content
[06:24:39] <_joe_> yes I'm reading now
[06:25:00] <_joe_> GerardM: it seems to be some form of cache corruption
[06:25:11] <_joe_> so just purging the pages works
[06:25:32] How do I do that? ? CNTRL F5 ?
[06:25:40] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139810 (Legoktm) Somehow "NULL" got cached in varnish. ``` legoktm@terbium:~$ echo "https://commons.wikimedia.org/wiki/Special:Recent...
[06:25:49] _joe_: yeah, legoktm just tried that with varnish and the first test case seems to have worked
[06:25:53] thanks
[06:25:59] GerardM: &action=purge iirc
[06:26:06] <_joe_> GerardM: ^^
[06:26:11] that should work for articles, but not special pages
[06:26:22] I can do that server-side
[06:26:27] (special pages)
[06:28:05] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139830 (Legoktm) If you ?action=purge on the affected articles, they should be fixed. We'll (probably not me though) figure out some w...
[06:28:24] I guess the next step is figuring out whether MediaWiki is outputting the NULL or if MediaWiki's output is getting corrupted in Varnish.
[06:30:08] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:27] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:37] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:09] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139833 (Legoktm) Please don't purge https://pl.wikisource.org/wiki/Dyskusja_wikiskryby:Nawider
[06:31:26] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:46] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:36] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:36] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:47] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:51:22] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139810 (Vort) In ru-wiki, this problem hit the Special:Watchlist page. So I think this bug needs a critical priority.
[06:54:27] <_joe_> !log banning all pages with content-length of 25 from the caches, T130575
[06:54:28] T130575: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575
[06:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:56:21] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139845 (Joe) p:Triage>Unbreak!
[06:56:26] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:57:16] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:57:39] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:57:57] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:28] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:37] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:35] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139810 (Joe) My hackish ban should've removed all the current pages with NULL content (at least if they're gzipped). The fact remain...
[07:00:17] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139848 (Joe) Please let me know if you see this again, I'm not confident at all this was a "permanent" fix (which would mean the error...
[07:00:59] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139810 (John_of_Reading) https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Weird_NULL_issue_when_attempting_to_view...
[07:04:24] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139857 (Joe) @John_of_Reading I think my ban of the cached content should've stopped this from happening now, so let's see if more re...
[07:05:05] (PS2) KartikMistry: WIP: cxserver: Read config from cxserver/deploy [puppet] - https://gerrit.wikimedia.org/r/278235
[07:45:34] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139894 (Joe) p:Unbreak!>High
[07:46:14] Operations, Wikimedia-General-or-Unknown: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2139810 (Joe) I tested one of the affected urls against all appservers, and they're all responding correctly now.
[07:48:57] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[07:49:37] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[07:51:02] !log restarting hhvm on mw1119, mw1121, mw1136 and mw1140, all got stuck over night
[07:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:52:26] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 7.410 second response time
[07:52:28] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 67324 bytes in 0.553 second response time
[07:52:47] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.098 second response time
[07:53:07] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 1.796 second response time
[07:53:16] RECOVERY - Apache HTTP on mw1121 is OK: HTTP OK: HTTP/1.1 301 Moved
Permanently - 614 bytes in 0.061 second response time
[07:53:31] <_joe_> moritzm: ouch I was looking into those :)
[07:53:37] RECOVERY - HHVM rendering on mw1121 is OK: HTTP OK: HTTP/1.1 200 OK - 67324 bytes in 0.511 second response time
[07:53:39] oh, sorry
[07:53:47] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 67332 bytes in 0.379 second response time
[07:53:48] <_joe_> it's a memory leak anyways, it will happen again :)
[07:54:08] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 67324 bytes in 0.430 second response time
[07:55:06] I'll also restart cassandra on restbase2004, it's out of heap memory again
[07:56:36] !log restarted cassandra on restbase2004, it ran out of heap memory
[07:56:37] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[07:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:57:46] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on port 9042
[07:59:25] !log installing git security updates
[07:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:15:27] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:17:17] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:19:12] (PS10) Alexandros Kosiaris: Apply the role::ores::redis class to oresdb100{1,2} [puppet] - https://gerrit.wikimedia.org/r/278759 (https://phabricator.wikimedia.org/T125562)
[08:19:14] (PS8) Alexandros Kosiaris: Add the role::ores::redis class [puppet] - https://gerrit.wikimedia.org/r/278758 (https://phabricator.wikimedia.org/T124200)
[08:20:47] RECOVERY - DPKG on labmon1001 is OK: All packages OK
[08:25:04] (Abandoned) Elukey: Update Analytics cdh submodule after https://gerrit.wikimedia.org/r/#/c/277984/ [puppet] - https://gerrit.wikimedia.org/r/278713
(https://phabricator.wikimedia.org/T129838) (owner: Elukey)
[08:26:47] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[08:27:48] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[08:29:48] (CR) Elukey: HDFS Namenode automatic failover support - bug fixes. (1 comment) [puppet/cdh] - https://gerrit.wikimedia.org/r/278748 (https://phabricator.wikimedia.org/T129838) (owner: Elukey)
[08:32:17] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[08:32:31] (PS2) Elukey: HDFS Namenode automatic failover support - bug fixes. [puppet/cdh] - https://gerrit.wikimedia.org/r/278748 (https://phabricator.wikimedia.org/T129838)
[08:33:18] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on port 9042
[08:41:58] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:43:01] (CR) Elukey: [C: 2] HDFS Namenode automatic failover support - bug fixes. [puppet/cdh] - https://gerrit.wikimedia.org/r/278748 (https://phabricator.wikimedia.org/T129838) (owner: Elukey)
[08:45:59] !log installing squid/jessie security updates
[08:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:50:13] Operations, User-mobrovac, codfw-rollout, codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2139965 (Joe) a:Joe
[08:51:16] Operations, User-mobrovac, codfw-rollout, codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#1973660 (Joe) Over the weekend, I hacked on conftool a bit and I think...
[08:59:39] (CR) Ema: [C: 1] Fix varnishkafka cronspam due to non existent rsyslog action. [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/278750 (https://phabricator.wikimedia.org/T129344) (owner: Elukey)
[09:00:05] moritzm: Respected human, time to deploy terbium/tin maintenance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160322T0900). Please do the needful.
[09:05:58] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[09:06:56] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[09:08:19] (CR) Filippo Giunchedi: [C: 1] "to be merged during SWAT" [puppet] - https://gerrit.wikimedia.org/r/278330 (https://phabricator.wikimedia.org/T130393) (owner: Eevans)
[09:09:48] (CR) Filippo Giunchedi: [C: 1] "LGTM! to answer your question, yeah one of the things to pay attention to is not start sending too many metrics to graphite" [puppet] - https://gerrit.wikimedia.org/r/278278 (https://phabricator.wikimedia.org/T130365) (owner: Gehel)
[09:11:46] (CR) Filippo Giunchedi: [C: 1] "today's puppet SWAT" [puppet] - https://gerrit.wikimedia.org/r/277265 (https://phabricator.wikimedia.org/T128787) (owner: Eevans)
[09:12:22] godog: ^ about your comment, what is too much? In this case, we are going to add quite a few metrics, multiplied by the number of elasticsearch servers, that's already > 100...
[09:14:56] !log rebooting tin for kernel upgrade
[09:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:18:45] Operations, Ops-Access-Requests, Discovery, Maps, Patch-For-Review: Requesting maps-admins access for Eric Evans - https://phabricator.wikimedia.org/T130412#2135290 (Gehel) I support that as well...
[09:19:06] gehel: the biggest limiting factor ATM for graphite is disk space, so unless you are adding say tens of thousands of metrics it should be fine
[09:19:30] godog: Ok, so I'll start thinking about merging that change...
[09:19:31] gehel: but we're soon going to add more space, so that'll be no longer a problem for a bit
[09:19:43] * gehel needs to have a closer look into how our graphite servers work...
[09:19:55] godog: limiting factor is space? not IO ?
[09:21:54] gehel: there's a fair bit of IO but not pegged, there's 4x ssd in raid10
[09:22:08] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad
[09:23:36] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:24:37] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:24:46] godog: If I read that graph correctly, we receive about 30K data points per second? I was expecting a lot more...
[09:25:48] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[09:26:46] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[09:27:09] (CR) Elukey: [C: 2] Fix varnishkafka cronspam due to non existent rsyslog action. [puppet/varnishkafka] - https://gerrit.wikimedia.org/r/278750 (https://phabricator.wikimedia.org/T129344) (owner: Elukey)
[09:29:37] <_joe_> gehel: it's the request received i think; every request can have multiple data points iirc
[09:30:54] <_joe_> "points per update" might be related
[09:32:20] if we're talking about statsd traffic _joe_ is correct, the 30k/s figure is "one datapoint per line" traffic to the carbon servers
[09:32:38] statsd traffic is around 150k metrics/s
[09:34:58] committed points is at 30K/s ...
[09:37:16] (PS1) Giuseppe Lavagetto: conftoool: remove the debug appservers from the pool [puppet] - https://gerrit.wikimedia.org/r/278849
[09:37:52] (CR) Ema: [C: 1] conftoool: remove the debug appservers from the pool [puppet] - https://gerrit.wikimedia.org/r/278849 (owner: Giuseppe Lavagetto)
[09:37:54] (PS2) Jcrespo: Depool db1024 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/278264 (https://phabricator.wikimedia.org/T130351)
[09:38:25] hashar: FYI in https://phabricator.wikimedia.org/T130446 I take it that's a typo ldap-labs-codfw. vs ldap-labs.codfw ?
[09:38:42] (PS2) Giuseppe Lavagetto: conftoool: remove the debug appservers from the pool [puppet] - https://gerrit.wikimedia.org/r/278849 (https://phabricator.wikimedia.org/T130575)
[09:43:30] Operations, Wikimedia-General-or-Unknown, Patch-For-Review: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2140042 (Joe) I just found out that three out of four of the [[ https://wikitech.wikimedia.org/wiki/X-Wikimedia-D...
[09:43:49] (03CR) 10Giuseppe Lavagetto: [C: 032] conftoool: remove the debug appservers from the pool [puppet] - 10https://gerrit.wikimedia.org/r/278849 (https://phabricator.wikimedia.org/T130575) (owner: 10Giuseppe Lavagetto) [09:44:05] (03PS1) 10Muehlenhoff: Revert temporary bump of connection table, underlying bug has been fixed [puppet] - 10https://gerrit.wikimedia.org/r/278850 [09:46:10] !log rolling reboot of sca* for kernel upgrades [09:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:46:19] (03CR) 10Jcrespo: [C: 032] Depool db1024 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278264 (https://phabricator.wikimedia.org/T130351) (owner: 10Jcrespo) [09:47:58] 6Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2140075 (10Joe) I am also resolving the ticket as it seems like no more incidents have been reported [09:48:10] 6Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: Multiple users reporting content pages displaying "NULL" compared to desired content - https://phabricator.wikimedia.org/T130575#2140076 (10Joe) 5Open>3Resolved a:3Joe [09:49:06] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:49:26] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1024 for maintenance (duration: 01m 15s) [09:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:50:06] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [09:56:10] (03PS4) 10Gehel: Adding metric collection to nginx in the context of elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/278278 (https://phabricator.wikimedia.org/T130365) [09:56:34] (03CR) 10Gehel: [C: 032] Adding metric collection to 
nginx in the context of elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/278278 (https://phabricator.wikimedia.org/T130365) (owner: 10Gehel) [09:57:42] (03PS3) 10Ema: Port varnishreqstats and varnishstatsd to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/277790 (https://phabricator.wikimedia.org/T128788) [09:59:07] !log stopping and cloning db1024 to db1074 and db1076 [09:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:59:43] (03CR) 10Ema: Port varnishreqstats and varnishstatsd to new VSL API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277790 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [10:07:54] (03PS1) 10Elukey: Update the varnishkafka module for https://gerrit.wikimedia.org/r/#/c/278750/1 [puppet] - 10https://gerrit.wikimedia.org/r/278855 (https://phabricator.wikimedia.org/T129344) [10:09:07] (03CR) 10Elukey: [C: 032] Update the varnishkafka module for https://gerrit.wikimedia.org/r/#/c/278750/1 [puppet] - 10https://gerrit.wikimedia.org/r/278855 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [10:09:26] (03CR) 10Elukey: [V: 032] Update the varnishkafka module for https://gerrit.wikimedia.org/r/#/c/278750/1 [puppet] - 10https://gerrit.wikimedia.org/r/278855 (https://phabricator.wikimedia.org/T129344) (owner: 10Elukey) [10:14:17] 6Operations, 10Continuous-Integration-Config: Switch CI from jsduck deb package to a gemfile/bundler system - https://phabricator.wikimedia.org/T109005#2140162 (10hashar) [10:20:54] !log rolling reboot of ocg* for kernel upgrades [10:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:25:38] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 301 MB (3% inode=62%) [10:26:46] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 267 MB (3% inode=62%) [10:27:23] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Enable 
metric collection on nginx for elasticsearch - https://phabricator.wikimedia.org/T130365#2140192 (10Gehel) nginx metrics are now published to Graphite. I added some basic nginx metrics to the [[ https://grafana-admin.wi... [10:34:40] it is /var/log/ocg.log [10:35:16] moritzm, can you take a look at those when they come up, if you can? [10:35:34] rotate them, check issues, open a ticket? [10:35:37] 6Operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2140215 (10hashar) @Krinkle thanks for your detailed explanation of node/npm version matching on T124474#2135319 . Yesterday I triggered the `npm-node-4.3` on MediaWiki extensions and all of them passed just fine with the node... [10:35:53] jynus: yeah, will have a look [10:35:53] !log Updated WikibaseQualityConstraints data on wikidata (wikidatawiki.wbqc_constraints) [10:35:57] 6Operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2140218 (10hashar) [10:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:37:09] we have a bunch of servers with 9GB of / [10:37:34] maybe deleting the .deb cache, too [10:37:56] RECOVERY - Disk space on ocg1002 is OK: DISK OK [10:39:44] <_joe_> open a ticket on ocg, good luck [10:39:59] yeah, I'll open a more general ticket on that, the partitioning has room for improvement, approx 400 GB free on /srv, but only 9G for / [10:40:09] _joe_, well, for us [10:40:22] one like that^ [10:40:28] RECOVERY - Disk space on ocg1003 is OK: DISK OK [10:42:56] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [10:43:37] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused [10:55:00] I've downtimed those while we're investigating in T130254 [10:55:00] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [11:00:07] 6Operations, 6Discovery, 7Elasticsearch: Have 
dedicated master nodes for elasticsearch - https://phabricator.wikimedia.org/T130590#2140227 (10Gehel) [11:01:37] 6Operations: Increase size of root partition on ocg* servers - https://phabricator.wikimedia.org/T130591#2140241 (10MoritzMuehlenhoff) [11:01:45] (03PS1) 10ArielGlenn: https redirect for dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) [11:02:57] 6Operations, 10Analytics, 10Datasets-General-or-Unknown, 10Traffic, 13Patch-For-Review: http://dumps.wikimedia.org should redirect to https:// - https://phabricator.wikimedia.org/T128587#2140256 (10ArielGlenn) I remember there was a discussion and I don't remember why it was wontfixed then. But we can s... [11:03:17] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active [11:03:56] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.038 second response time on port 9042 [11:06:55] 6Operations, 6Project-Admins, 3DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2140271 (10Aklapper) >>! In T119944#2135724, @RobH wrote: >>>! In T119944#2129274, @Aklapper wrote: >> I created H140: //Add DC-Ops project to ops-$site projects//:... 
[11:07:04] 6Operations, 6Project-Admins, 3DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2140272 (10Aklapper) 5Resolved>3Open [11:13:19] (03CR) 10Elukey: "+1 for the effort but I left a comment for the nginx config (might be irrelevant)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/278861 (https://phabricator.wikimedia.org/T128587) (owner: 10ArielGlenn) [11:19:42] 6Operations, 6Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2140296 (10fgiunchedi) [11:24:20] 6Operations, 10Analytics, 10Datasets-General-or-Unknown, 10Traffic, 13Patch-For-Review: http://dumps.wikimedia.org should redirect to https:// - https://phabricator.wikimedia.org/T128587#2079753 (10hashar) Some past tasks: * {T83675} * Declined: {T60292} The later had: > https://download.wikimedia.org/m... [11:36:51] (03PS1) 10Alex Monk: [WIP] openstack: Add proxy panel files [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) [11:37:34] 6Operations, 6Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2140368 (10MoritzMuehlenhoff) There's a similar report for openldap 2.4.40 at http://www.openldap.org/lists/openldap-technical/201504/msg00005.html There are three memory leak fixes related t... 
[11:38:21] (03CR) 10jenkins-bot: [V: 04-1] [WIP] openstack: Add proxy panel files [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [11:40:51] (03PS2) 10Alex Monk: [WIP] openstack: Add proxy panel files [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) [11:42:12] (03CR) 10jenkins-bot: [V: 04-1] [WIP] openstack: Add proxy panel files [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [11:53:12] PROBLEM - MariaDB Slave IO: s2 on db1024 is CRITICAL: CRITICAL slave_io_state could not connect [11:53:31] PROBLEM - mysqld processes on db1024 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [11:53:37] hmm? [11:53:38] PROBLEM - MariaDB Slave Lag: s2 on db1024 is CRITICAL: CRITICAL slave_sql_lag could not connect [11:53:39] it is ok [11:53:52] downtime expired due to issues, but it is depooled [11:54:12] I was fixing the issues, forgot about extending the downtime [11:54:24] so the issues are real, but not user facing [11:54:26] PROBLEM - MariaDB Slave SQL: s2 on db1024 is CRITICAL: CRITICAL slave_sql_state could not connect [11:54:29] see last log [11:54:53] alright [11:56:24] I think there are issues with some TCP packets with the firewall [11:56:43] sometimes connections get "stuck" [11:57:20] (03PS4) 10Filippo Giunchedi: diamond: send labs instance metrics via graphite/carbon [puppet] - 10https://gerrit.wikimedia.org/r/268360 (https://phabricator.wikimedia.org/T121861) [11:57:39] not 99.9% of them, but on opening or closing TCP connections [12:02:05] RECOVERY - MariaDB Slave SQL: s2 on db1074 is OK: OK slave_sql_state not a slave [12:02:11] RECOVERY - MariaDB Slave IO: s2 on db1074 is OK: OK slave_io_state not a slave [12:11:37] (03PS1) 10MarcoAurelio: Configuring wgMetaNamespace for an.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278876 
(https://phabricator.wikimedia.org/T130599) [12:12:09] (03CR) 10jenkins-bot: [V: 04-1] Configuring wgMetaNamespace for an.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278876 (https://phabricator.wikimedia.org/T130599) (owner: 10MarcoAurelio) [12:12:50] hmm [12:12:53] forgot > [12:14:15] !log nodetool decommission restbase1003 T125842 [12:14:15] T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842 [12:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:14:38] (03PS2) 10MarcoAurelio: Configuring wgMetaNamespace for an.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278876 (https://phabricator.wikimedia.org/T130599) [12:16:41] godog: rb1003 has been depooled ? [12:17:28] mobrovac: no, but rb shouldn't be affected by the decommission [12:17:51] kk [12:20:55] (03PS3) 10MarcoAurelio: Enable signature button at NS:102 for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272479 (https://phabricator.wikimedia.org/T127688) [12:22:21] 6Operations, 10Reading-Web, 10Traffic, 6Wikipedia-iOS-App-Product-Backlog, 13Patch-For-Review: Setup up Site Association file for Universal Link Support - https://phabricator.wikimedia.org/T111829#2140453 (10jcrespo) 5Resolved>3Open I am seeing half a million requests per minute to /.well-known/apple... [12:23:41] (03CR) 10MarcoAurelio: "Scheduled for deployement. If I am not mistaken, deployer should later run namespaceDupes.php once this is merged, although not sure about" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278876 (https://phabricator.wikimedia.org/T130599) (owner: 10MarcoAurelio) [12:24:34] (03CR) 10MarcoAurelio: "Scheduled for deployement. Thanks for the review, Dereckson." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/272479 (https://phabricator.wikimedia.org/T127688) (owner: 10MarcoAurelio) [12:24:49] (03PS5) 10Filippo Giunchedi: diamond: send labs instance metrics via graphite/carbon [puppet] - 10https://gerrit.wikimedia.org/r/268360 (https://phabricator.wikimedia.org/T121861) [12:24:58] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] diamond: send labs instance metrics via graphite/carbon [puppet] - 10https://gerrit.wikimedia.org/r/268360 (https://phabricator.wikimedia.org/T121861) (owner: 10Filippo Giunchedi) [12:51:33] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Puppet last ran 5 days ago [12:53:49] ores redis things [12:54:00] Woops. Accidental paste :) [12:54:02] lol [12:56:51] (03PS1) 10Hashar: beta: drop references to ArticleCreationWorkflow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278883 [13:08:13] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [13:10:52] PROBLEM - puppet last run on mw2104 is CRITICAL: CRITICAL: Puppet has 1 failures [13:14:21] (03CR) 10BBlack: [C: 031] Port varnishreqstats and varnishstatsd to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/277790 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [13:16:03] PROBLEM - puppet last run on mw1125 is CRITICAL: CRITICAL: Puppet has 11 failures [13:26:45] I'm looking for data on pool counter usage (for Cirrus -> Elasticsearch in my case). Does anyone know if we have something like that in Graphite? Or somewhere else? [13:27:48] gehel: what kind of data are you looking for ? [13:28:50] akosiaris: when switching to codfw, I expect additional latency, so tuning will be required on the pool counter protecting elasticsearch. It also seems that we have not tuned it for a fairly long time. [13:29:24] akosiaris: so basically, I'd like to have a view of the usage over time, see if we have peaks, how close to the limit they are... 
[13:30:22] gehel: ebernhardson and I have talked about this a few times, with a sort of gridlock iirc between "this would be best as a general pool stats thing, outside of even elastic" and the pragmatic "but we need it here now". I can't recall where we left off, but I know he had boilerplate for it [13:30:43] !log installing various security updates on mediawiki canary servers (along with HHVM restarts): graphite2, libldap, pixman, sqlite, pygments, gnutls26 [13:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:31:33] chasemp: I'll check with ebernhardson when he arrives. Thanks! And yes, I was expecting something fairly generic, metrics per pool or something similar... [13:32:07] chasemp: hey :) [13:32:22] chasemp: not sure if you're already aware of it, there is a critical labstore alert about nfs-exports [13:32:25] we may have even done it, I know we invested time into tracking errors across elastic requests gehel like timeout vs failed search [13:32:35] paravoid: ah no I'll look [13:36:44] RECOVERY - puppet last run on mw2104 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [13:37:42] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exports is active [13:39:39] gehel: what kind of tuning do you expect to need to do ? 
because apart from the timeout on the mediawiki side to account for the extra 36ms, I don't see much else that would really require tuning [13:40:48] akosiaris: in the past they (search and really nik) used the pool counter limit as the poor man's load backoff for the elastic cluster [13:41:09] so in theory if we have issues on codfw cluster they may be thinking of reducing that limit to accommodate [13:41:34] there are a few other knobs but that was one historically [13:41:37] <_joe_> it's not poor man's backoff, poolcounter is a nice, scalable and reliable shared lock service [13:41:48] ah, so tuning on the client side for elasticsearch ? well that is not documented in https://wikitech.wikimedia.org/wiki/PoolCounter [13:41:51] akosiaris: the size of the pool. We expect to have more requests in flight, so to have the same usage of the E/S server, we might need to increase the pool size [13:41:55] only mediawiki is [13:42:10] <_joe_> gehel: I think you're overstating the problem [13:42:23] <_joe_> let me explain this better [13:42:32] honestly, I don't understand all the details about pool counter, so I mainly want to dig into whatever we have to understand it better [13:42:50] <_joe_> if the cluster is able to respond with the same response times per se [13:43:13] <_joe_> the only thing that is longer is the mediawiki request, by a little bit [13:43:45] <_joe_> so yes, you will have higher counts on poolcounter, but I would not expect it to be a significant change [13:43:57] <_joe_> gehel: what is the mean latency of an ES request? [13:44:35] _joe_: very small ~15ms, so 36ms increase due to net latency might cause some troubles [13:44:39] _joe_: around 100ms [13:44:54] (for autocomplete queries) [13:44:57] <_joe_> err, which is the correct one? :P [13:45:01] er, those 2 numbers don't match up. 
The have an order of magnitude difference [13:45:07] They* [13:45:50] my bad, was looking at the wrong curve, mean is around 10ms [13:45:58] p50 is around 8ms [13:46:16] https://grafana-admin.wikimedia.org/dashboard/db/elasticsearch-percentiles?from=1458049570580&to=1458654370580&var-cluster=eqiad [13:46:22] <_joe_> so in the first case, it would make sense to raise the poolcounter numbers, yes [13:46:26] the morelike queries on codfw are more like 100ms [13:47:24] in the end, it is mainly an exercise for me to try to understand all this better and not assume it is working as best as it can [13:49:06] (03PS1) 10Elukey: Remove rdb1001 from the Media Wiki Job Queue pool for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278888 (https://phabricator.wikimedia.org/T123675) [13:49:18] hmm, so we can assume that any operation that needs to acquire a lock from poolcounter will suffer some latency [13:49:33] adding 36ms to 10ms is quite a bit [13:49:51] assuming that happening is actually making sense [13:50:13] akosiaris: we should have HTTP pooling in place before the switch, so the SSL handshake overhead should not be an issue. [13:50:37] still, I'd like to know where to look *before* we actually have an issue... [13:51:25] gehel: if we don't have any data from poolcounter then I'm afraid that the only thing we can monitor is the number of poolcounter rejections :/ [13:51:56] well, poolcounter has had no diamond up to now. Best possible date to write one... is today ;-) [13:52:05] :) [13:52:06] diamond collector, that is [13:52:16] the protocol is really easy [13:52:37] https://wikitech.wikimedia.org/wiki/PoolCounter#Testing [13:54:26] (03PS1) 10Elukey: Remove rdb1001 from the Job runners config for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/278889 (https://phabricator.wikimedia.org/T123675) [13:54:51] akosiaris: those stats are global, not per counter (as I see it in the page you linked). 
Do you know if there is a way to get stats per counter? [13:54:58] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove rdb1001 from the Media Wiki Job Queue pool for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278888 (https://phabricator.wikimedia.org/T123675) (owner: 10Elukey) [13:54:59] * gehel needs to read some code... [13:55:12] gehel: sorry, I dont :-( [13:55:53] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove rdb1001 from the Job runners config for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/278889 (https://phabricator.wikimedia.org/T123675) (owner: 10Elukey) [13:56:00] do you know where the code for poolcounter is? How is the project named? [13:56:53] the mediawiki extension is here https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/extensions/PoolCounter . the server side code here: http://apt.wikimedia.org/wikimedia/pool/main/p/poolcounter/ [13:57:03] for what I see we don't have a gerrit repo for it yet [13:57:16] which means only one thing. It works quite flawlessly all these years [13:57:22] It seems that the server side code is inside the extension [13:57:30] <_joe_> gehel: yes [13:57:47] (03Abandoned) 10Hashar: Move default config into a file [dumps] - 10https://gerrit.wikimedia.org/r/43156 (owner: 10Awight) [13:57:56] <_joe_> gehel: under daemon/ [13:58:04] heh, I had never noticed that [13:58:05] _joe_: thanks! Found it. [14:00:40] (03CR) 10Elukey: [C: 032] Remove rdb1001 from the Media Wiki Job Queue pool for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278888 (https://phabricator.wikimedia.org/T123675) (owner: 10Elukey) [14:03:13] !log elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: Remove rdb1001 from the Redis Job Queues for maintenance (duration: 00m 25s) [14:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:23] (03CR) 10Elukey: [C: 032] Remove rdb1001 from the Job runners config for maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/278889 (https://phabricator.wikimedia.org/T123675) (owner: 10Elukey) [14:08:00] (03PS2) 10Muehlenhoff: Tools: Add add_environment/keep_environment to exim configurations [puppet] - 10https://gerrit.wikimedia.org/r/277866 (owner: 10Tim Landscheidt) [14:09:31] !log removed rdb1001 from the JobRunner config (hieradata/eqiad/mediawiki/jobrunner). Forcing also a puppet run and a jobchron restart on all the Job Runners and VideoScalers in eqiad. [14:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:00] (03CR) 10Muehlenhoff: [C: 032 V: 032] Tools: Add add_environment/keep_environment to exim configurations [puppet] - 10https://gerrit.wikimedia.org/r/277866 (owner: 10Tim Landscheidt) [14:19:33] (03PS4) 10Halfak: Adds compute node role to ores [puppet] - 10https://gerrit.wikimedia.org/r/278455 (https://phabricator.wikimedia.org/T130461) [14:25:54] (03CR) 10Andrew Bogott: "Added several in-line comments, mostly things you're probably already doing :)" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [14:26:03] (03PS1) 10Giuseppe Lavagetto: Allow defining the conftool entities via a schema file [software/conftool] - 10https://gerrit.wikimedia.org/r/278892 [14:27:30] (03CR) 10jenkins-bot: [V: 04-1] Allow defining the conftool entities via a schema file [software/conftool] - 10https://gerrit.wikimedia.org/r/278892 (owner: 10Giuseppe Lavagetto) [14:27:44] (03PS1) 10Mobrovac: RESTBase: Use the cassandra back-end [puppet] - 10https://gerrit.wikimedia.org/r/278893 [14:27:46] (03PS1) 10Ema: varnishlog4: reduce sleep time when no events are ready to 10ms [puppet] - 10https://gerrit.wikimedia.org/r/278894 (https://phabricator.wikimedia.org/T128788) [14:33:40] 6Operations, 10Monitoring, 13Patch-For-Review: switch diamond to use graphite line protocol - https://phabricator.wikimedia.org/T121861#2140771 (10fgiunchedi) 
merged this now for labs instances, metrics from `tcp` collector will change type as counters don't get derived metrics like in statsd ``` 309088 Mar... [14:36:06] 6Operations, 6Discovery, 7Elasticsearch: Collect metrics on pool counter usage - https://phabricator.wikimedia.org/T130617#2140792 (10Gehel) [14:37:17] !log puppet disabled on rdb1001/rdb1002 (redis master slave) as part of the rdb1001 re-image. rdb1002 set with SLAVEOF NO ONE as precaution. [14:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:45] (03CR) 10BBlack: [C: 031] varnishlog4: reduce sleep time when no events are ready to 10ms [puppet] - 10https://gerrit.wikimedia.org/r/278894 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [14:42:29] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, and 2 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#2140811 (10Andrew) [14:42:42] (03PS1) 10Halfak: Adds https only redirect to ores-web [puppet] - 10https://gerrit.wikimedia.org/r/278898 [14:42:48] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, and 2 others: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367#2140814 (10yuvipanda) [14:43:15] (03PS1) 10Muehlenhoff: Set exim environment for labs instances [puppet] - 10https://gerrit.wikimedia.org/r/278899 [14:43:50] (03PS2) 10Muehlenhoff: Set exim environment for labs instances [puppet] - 10https://gerrit.wikimedia.org/r/278899 [14:43:57] (03CR) 10jenkins-bot: [V: 04-1] Adds https only redirect to ores-web [puppet] - 10https://gerrit.wikimedia.org/r/278898 (owner: 10Halfak) [14:44:17] (03CR) 10Ema: [C: 032 V: 032] varnishlog4: reduce sleep time when no events are ready to 10ms [puppet] - 10https://gerrit.wikimedia.org/r/278894 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [14:44:44] jouncebot: ntext [14:44:47] jouncebot: next [14:44:47] In 0 hour(s) and 15 minute(s): Morning SWAT (Max 8 patches) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160322T1500) [14:45:17] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/278892 (owner: 10Giuseppe Lavagetto) [14:45:57] (03PS2) 10BBlack: Varnish: stream all pass traffic [puppet] - 10https://gerrit.wikimedia.org/r/276475 [14:46:26] (03PS2) 10Giuseppe Lavagetto: Allow defining the conftool entities via a schema file [software/conftool] - 10https://gerrit.wikimedia.org/r/278892 [14:46:48] (03CR) 10Hashar: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199) (owner: 10Giuseppe Lavagetto) [14:48:14] (03CR) 10jenkins-bot: [V: 04-1] Allow defining the conftool entities via a schema file [software/conftool] - 10https://gerrit.wikimedia.org/r/278892 (owner: 10Giuseppe Lavagetto) [14:48:19] (03PS3) 10Giuseppe Lavagetto: Allow defining the conftool entities via a schema file [software/conftool] - 10https://gerrit.wikimedia.org/r/278892 [14:48:22] (03PS2) 10Halfak: Adds https only redirect to ores-web [puppet] - 10https://gerrit.wikimedia.org/r/278898 [14:48:31] (03CR) 10jenkins-bot: [V: 04-1] Add select mode [software/conftool] - 10https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199) (owner: 10Giuseppe Lavagetto) [14:49:41] (03CR) 10BBlack: [C: 032] Varnish: stream all pass traffic [puppet] - 10https://gerrit.wikimedia.org/r/276475 (owner: 10BBlack) [14:49:58] (03CR) 10jenkins-bot: [V: 04-1] Allow defining the conftool entities via a schema file [software/conftool] - 10https://gerrit.wikimedia.org/r/278892 (owner: 10Giuseppe Lavagetto) [14:52:22] (03CR) 10Yuvipanda: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/278898 (owner: 10Halfak) [14:52:28] (03PS1) 10BBlack: Varnish: protect against external streampass header setting [puppet] - 10https://gerrit.wikimedia.org/r/278901 [14:52:45] (03CR) 10Jforrester: [C: 031] beta: drop references to ArticleCreationWorkflow 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/278883 (owner: 10Hashar) [14:53:23] (03CR) 10BBlack: [C: 032 V: 032] Varnish: protect against external streampass header setting [puppet] - 10https://gerrit.wikimedia.org/r/278901 (owner: 10BBlack) [14:54:41] (03CR) 10Yuvipanda: [C: 04-1] Adds compute node role to ores (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/278455 (https://phabricator.wikimedia.org/T130461) (owner: 10Halfak) [14:55:22] (03PS5) 10Halfak: Adds compute node role to ores [puppet] - 10https://gerrit.wikimedia.org/r/278455 (https://phabricator.wikimedia.org/T130461) [14:56:12] (03PS6) 10Yuvipanda: Adds compute node role to ores [puppet] - 10https://gerrit.wikimedia.org/r/278455 (https://phabricator.wikimedia.org/T130461) (owner: 10Halfak) [14:56:15] !log deploying VCL changes for do_stream on all pass traffic [14:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:56:21] (03CR) 10Yuvipanda: [C: 032 V: 032] Adds compute node role to ores [puppet] - 10https://gerrit.wikimedia.org/r/278455 (https://phabricator.wikimedia.org/T130461) (owner: 10Halfak) [14:58:36] (03PS1) 10Jdrewniak: Bumping portals to master. Deploying A/B test T124111 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278902 (https://phabricator.wikimedia.org/T124112) [15:00:04] anomie ostriches thcipriani marktraceur Krenair aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160322T1500). Please do the needful. [15:00:04] kart_ mafk: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:10] here [15:00:42] I can SWAT this morning. [15:00:49] morning [15:01:49] Nikerabbit: Morning :) [15:02:56] maf...no mafk :\ [15:04:50] blerg. 
Looks like https://gerrit.wikimedia.org/r/#/c/278843/ has a failing test now that it's trying to merge :( https://integration.wikimedia.org/ci/job/npm-node-4.3/1327/ [15:06:11] is it voting? [15:07:03] ah [15:07:26] seems as though it magically re-ran? https://integration.wikimedia.org/ci/job/npm-node-4.3/1331/ [15:07:33] look like it passed! [15:07:35] :) [15:08:47] 6Operations, 10ops-eqiad: db1067 degraded RAID - https://phabricator.wikimedia.org/T130517#2140868 (10Cmjohnson) a:3Cmjohnson [15:11:05] (03Abandoned) 10Andrew Bogott: Move horizon.wikimedia.org to labsconsole.wikimedia.org. [puppet] - 10https://gerrit.wikimedia.org/r/276262 (owner: 10Andrew Bogott) [15:12:31] (03PS1) 10Jcrespo: Repool db1024; pool db1074 and db1076 for the first time [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278906 (https://phabricator.wikimedia.org/T130351) [15:13:13] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:15:36] I have https://gerrit.wikimedia.org/r/278906 pending, ping me when you are finished, please [15:16:22] (03PS1) 10Ottomata: Create eventschemas module, use this in eventbus role [puppet] - 10https://gerrit.wikimedia.org/r/278907 (https://phabricator.wikimedia.org/T127099) [15:17:48] (03CR) 10Jcrespo: [C: 04-1] "there is lag on db1024 right now. Needs some minutes to be ready." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/278906 (https://phabricator.wikimedia.org/T130351) (owner: 10Jcrespo) [15:20:14] 7Blocked-on-Operations, 6Operations, 10EventBus, 6Services, and 3 others: New Service Request - Change Propagation - https://phabricator.wikimedia.org/T128463#2140922 (10mobrovac) p:5Triage>3High a:3akosiaris [15:21:32] * mafk here for swat [15:25:04] mafk: hello :) [15:25:25] (03PS4) 10Giuseppe Lavagetto: Allow defining the conftool entities via a schema file [software/conftool] - 10https://gerrit.wikimedia.org/r/278892 [15:25:27] (03PS4) 10Giuseppe Lavagetto: Add select mode [software/conftool] - 10https://gerrit.wikimedia.org/r/278552 (https://phabricator.wikimedia.org/T128199) [15:25:45] thcipriani: has it started yet? [15:25:53] mafk: yes [15:26:07] k [15:26:09] mafk: yup. Just getting the ContentTranslation patch setup now. [15:28:26] !log thcipriani@tin Synchronized php-1.27.0-wmf.17/extensions/ContentTranslation/modules/tools/ext.cx.tools.mt.js: SWAT: Fix JS error in MT tool: MTControlCard.providers undefined [[gerrit:278843]] (duration: 00m 25s) [15:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:32] ^ kart_ Nikerabbit check please [15:29:13] checking. [15:29:34] works for me [15:30:14] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278876 (https://phabricator.wikimedia.org/T130599) (owner: 10MarcoAurelio) [15:30:36] * mafk feels pinged [15:30:46] (03Merged) 10jenkins-bot: Configuring wgMetaNamespace for an.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278876 (https://phabricator.wikimedia.org/T130599) (owner: 10MarcoAurelio) [15:31:49] thcipriani: seems good. Thanks! [15:32:09] kart_: Nikerabbit cool, thanks for checking! 
[15:34:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272479 (https://phabricator.wikimedia.org/T127688) (owner: 10MarcoAurelio) [15:34:24] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Configuring wgMetaNamespace for an.wiktionary [[gerrit:278876]] (duration: 00m 29s) [15:34:25] (03CR) 10BBlack: [C: 032] move most of esams to standard layout [dns] - 10https://gerrit.wikimedia.org/r/270285 (owner: 10BBlack) [15:34:26] ^ mafk check please [15:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:39] thcipriani: did you run namespacesDupes? [15:34:53] (03CR) 10BBlack: [C: 032] remove corp ORIGIN statement [dns] - 10https://gerrit.wikimedia.org/r/278722 (owner: 10BBlack) [15:35:04] (03Merged) 10jenkins-bot: Enable signature button at NS:102 for frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272479 (https://phabricator.wikimedia.org/T127688) (owner: 10MarcoAurelio) [15:35:22] at first it looks it works [15:35:54] (03CR) 10BBlack: [C: 032] remove esams ORIGIN statement [dns] - 10https://gerrit.wikimedia.org/r/278721 (owner: 10BBlack) [15:36:14] mafk: namespaceDupes.php run [15:36:33] ktnx thcipriani - namespace change works [15:36:48] mafk: cool, thanks for checking it :) [15:36:54] (03CR) 10BBlack: [C: 032] remove redundant wikimedia.org. trailers [dns] - 10https://gerrit.wikimedia.org/r/278723 (owner: 10BBlack) [15:36:58] 6Operations, 10ops-eqiad: db1067 degraded RAID - https://phabricator.wikimedia.org/T130517#2140946 (10Cmjohnson) The servers has 1 year remaining in warranty. New disk has been ordered from Dell. Congratulations: Work Order SR927112290 was successfully submitted. 
[15:38:08] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable signature button at NS:102 for frwiki [[gerrit:272479]] (duration: 00m 30s) [15:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:12] ^ mafk check please [15:38:20] doing [15:39:06] thcipriani: signature button displays for me at NS:102@frwiki, as requested [15:39:25] will close both tasks as resolved [15:39:34] mafk: great, thank you :) [15:39:36] since Stashbot looks lazy today [15:40:50] (03PS2) 10Jcrespo: Repool db1024; pool db1074 and db1076; depool db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278906 (https://phabricator.wikimedia.org/T130351) [15:43:25] (03PS3) 10BBlack: maps DNS 2/2: enable geodns routing [dns] - 10https://gerrit.wikimedia.org/r/268240 (https://phabricator.wikimedia.org/T109162) [15:52:40] 6Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2141001 (10RobH) So it turns out the CPU we have picked out cannot work with the 12 * 4TB, due to both drawing too much voltage. However, it can work with 24 * 2TB, so we're having that quote generate... [15:56:11] !log elasticsearch: creating wikimania2017wiki indices in codfw [15:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:00:03] 7Puppet, 6Labs: Receiving puppet run failure alert for instance where manual puppet runs complete fine - https://phabricator.wikimedia.org/T129403#2141024 (10dschwen) 5Open>3Resolved a:3dschwen No further emails received. Closing. Thanks! [16:00:04] akosiaris moritzm: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160322T1600). [16:00:04] urandom twentyafterfour mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:20] * urandom is available! 
[16:00:24] me too [16:03:29] 6Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2141036 (10Ottomata) Bump. [16:03:33] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2141037 (10Ottomata) Bump. [16:03:35] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2141038 (10Ottomata) Bump. [16:03:37] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2141039 (10Ottomata) Bump. [16:05:22] urandom: I looked into https://gerrit.wikimedia.org/r/#/c/277265 earlier the day and looks good to me, but wondering about your comment at https://phabricator.wikimedia.org/T128787#2135047, do you plan to initiate a rolling restart after this is merged? [16:06:06] moritzm: i do yeah [16:06:32] moritzm: but i was going to wait for that, and https://gerrit.wikimedia.org/r/#q,278330,n,z [16:06:53] they both touch the same file (actually, one will have to be rebased after the other is merged) [16:07:25] * twentyafterfour too [16:07:36] moritzm: technically, Cassandra/logback should reload the config when the file changes, but i was going to rolling restart the Cassandra clusters anyway [16:09:37] urandom: ok, will merge 277265 and 278330 now [16:09:50] (03PS2) 10Muehlenhoff: make logstash messages separable by cluster [puppet] - 10https://gerrit.wikimedia.org/r/278330 (https://phabricator.wikimedia.org/T130393) (owner: 10Eevans) [16:09:50] moritzm: great! 
[16:10:49] (03CR) 10Muehlenhoff: [C: 032 V: 032] make logstash messages separable by cluster [puppet] - 10https://gerrit.wikimedia.org/r/278330 (https://phabricator.wikimedia.org/T130393) (owner: 10Eevans) [16:10:50] so, the deployment calendar is obviously wrong re: deployment times, it's 16:10 UTC now, not 17:10 UTC [16:11:17] it's affected by daylight stupid time [16:12:11] mobrovac: deployment calendar is pinned to SF time :) [16:13:34] moritzm: hrmm, i guess https://gerrit.wikimedia.org/r/#/c/277265 doesn't want to rebase cleanly [16:13:38] at least not through the ui [16:13:56] yeah, stupid gerrit. currently rebasing manually [16:13:56] (03CR) 1020after4: [C: 031] Add a deployment source & target class for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [16:15:35] (03PS9) 1020after4: Add a deployment source & target class for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) [16:15:49] https://gerrit.wikimedia.org/r/#/c/274502/ rebased cleanly [16:16:01] (03PS2) 10Muehlenhoff: Filter StatusLogger messages from UDP appender [puppet] - 10https://gerrit.wikimedia.org/r/277265 (https://phabricator.wikimedia.org/T128787) (owner: 10Eevans) [16:16:17] (03PS1) 10Elukey: Revert "Remove rdb1001 from the Media Wiki Job Queue pool for maintenance." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278922 [16:16:34] (03PS2) 10Elukey: Revert "Remove rdb1001 from the Media Wiki Job Queue pool for maintenance." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278922 [16:17:11] (03CR) 10Eevans: [C: 031] Filter StatusLogger messages from UDP appender [puppet] - 10https://gerrit.wikimedia.org/r/277265 (https://phabricator.wikimedia.org/T128787) (owner: 10Eevans) [16:17:20] SWAT team: can I proceed with https://gerrit.wikimedia.org/r/#/c/278922/ ? 
[16:17:41] (probably Cc: urandom twentyafterfour) [16:17:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Filter StatusLogger messages from UDP appender [puppet] - 10https://gerrit.wikimedia.org/r/277265 (https://phabricator.wikimedia.org/T128787) (owner: 10Eevans) [16:18:08] mafk, can you point to the exact discussion you reference in https://phabricator.wikimedia.org/T100070#2141006 ? [16:18:10] 6Operations, 6Discovery, 10Kartotherian, 10Maps, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2141170 (10BBlack) [16:18:12] 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Set up proper edge Varnish caching for maps cluster - https://phabricator.wikimedia.org/T109162#2141169 (10BBlack) [16:18:16] moritzm: thanks! [16:18:44] Krenair: was on this irc channel I think, though I don't remember the data [16:18:46] *date [16:18:58] (03PS1) 10Filippo Giunchedi: varnish: redirect upload.wikimedia.org to commons [puppet] - 10https://gerrit.wikimedia.org/r/278924 (https://phabricator.wikimedia.org/T130449) [16:18:59] elukey: looks ok to me [16:19:15] super, didn't want to interfere with your work :) [16:19:23] adding rdb1001 back to the pool then [16:19:32] Jan 31 22:41:19 csteipp_afk: is it possible to block a certain user-agent from editting? [16:19:35] This one mafk? [16:19:43] wow [16:19:45] yep [16:19:47] how fast [16:19:49] 6Operations, 10Traffic, 7Design, 13Patch-For-Review: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2136256 (10fgiunchedi) yeah what @Bawolff said! please see related patch to serve 301s out of varnish and redirect to commo... [16:19:51] (03CR) 10Elukey: [C: 032] Revert "Remove rdb1001 from the Media Wiki Job Queue pool for maintenance." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/278922 (owner: 10Elukey) [16:19:59] Jan 31 22:41:34 mafk: No, not at this time [16:21:05] !log Restarting Cassandra on restbase1007.eqiad.wmnet (canary) : T130393, T128787 [16:21:06] T130393: Create separate Kibana dashboards for production Cassandra clusters - https://phabricator.wikimedia.org/T130393 [16:21:06] T128787: Reduce Cassandra logstash output - https://phabricator.wikimedia.org/T128787 [16:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:21:17] !log elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: REVERT - Remove rdb1001 from the Redis Job Queues for maintenance (duration: 00m 25s) [16:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:00] uhoh, for puppet swat I just noticed strontium reporting unmerged changes, moritzm [16:23:07] I'm assuming it is fine to force it [16:24:20] 6Operations, 10Traffic, 7Design, 13Patch-For-Review: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2141217 (10Krenair) @fgiunchedi: Can you please point to where the message I saw comes from? [16:24:31] godog: oh, seems the push triggered by palladium failed, how can I trigger a manually resync? [16:25:09] moritzm: on palladium, sudo -u gitpuppet ssh strontium.eqiad.wmne [16:25:13] t [16:25:55] (03CR) 10BBlack: [C: 031] "Looks approximately correct to human eyes, need to test/validate a bit." [puppet] - 10https://gerrit.wikimedia.org/r/278924 (https://phabricator.wikimedia.org/T130449) (owner: 10Filippo Giunchedi) [16:26:05] 6Operations, 10Reading-Web, 10Traffic, 6Wikipedia-iOS-App-Product-Backlog, 13Patch-For-Review: Setup up Site Association file for Universal Link Support - https://phabricator.wikimedia.org/T111829#2141234 (10Krenair) Should really be a separate ticket. We could probably just add some extra apache rule to... 
[16:27:20] 6Operations, 10Traffic, 7Design, 13Patch-For-Review: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2141247 (10fgiunchedi) @krenair sure, that comes from `root` container in swift eqiad, our `rewrite.py` middleware is servi... [16:27:22] godog: why would /srv/deployment/cassandra/logstash-logback-encoder/lib/janino-2.7.8.jar be a git-fat ASCII file? [16:27:38] godog: what did i miss? [16:27:46] urandom: not hydrated by git-fat possibly, only on one host? [16:28:12] godog: checking... [16:28:19] fat doesn't get hydrated, but you get the idea [16:28:31] (03PS3) 10Jcrespo: Repool db1024; pool db1074 and db1076; depool db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278906 (https://phabricator.wikimedia.org/T130351) [16:28:31] godog: looks like everywhere [16:28:40] godog: well, no [16:28:49] godog: half of everywhere [16:29:15] thcipriani: can you check if namespaceDupes.php returned any error? cause I'm getting an error when processing with bot a page in the NS_PROJECT namespace [16:29:18] godog: what is the solution? [16:29:26] or Krenair [16:29:35] urandom: mhh ok, perhaps a git deploy start / git deploy sync [16:29:48] (03CR) 10Jcrespo: [C: 032] Repool db1024; pool db1074 and db1076; depool db1044 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278906 (https://phabricator.wikimedia.org/T130351) (owner: 10Jcrespo) [16:29:51] from the affected machines? or tin? [16:30:07] 6Operations, 10Traffic, 7Design, 13Patch-For-Review: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2141254 (10Krenair) Okay, but where is the HTML itself (within the puppet repo somewhere, I would hope)? It should get remo... [16:30:14] mafk: when I ran it this morning, I got a message like, "Looks good!" i can run it again... [16:30:40] mafk, is there a namespace problem somewhere? 
[16:30:40] thcipriani: look at the message I got [16:30:46] Krenair: an.wiktionary [16:30:56] will paste error from py console [16:30:59] give me a sec [16:31:13] urandom: on tin [16:31:28] godog: ok [16:32:00] mafk: Krenair https://phabricator.wikimedia.org/P2800 [16:32:01] (03PS1) 10Elukey: Revert "Remove rdb1001 from the Job runners config for maintenance." [puppet] - 10https://gerrit.wikimedia.org/r/278929 [16:32:11] (03PS2) 10Elukey: Revert "Remove rdb1001 from the Job runners config for maintenance." [puppet] - 10https://gerrit.wikimedia.org/r/278929 [16:32:43] !log jynus@tin Synchronized wmf-config/db-codfw.php: Add db1074 and db1076 eqiad databases (duration: 00m 31s) [16:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:49] godog: that doesn't seem to have helped :( [16:33:12] thcipriani, Krenair https://phabricator.wikimedia.org/P2801 [16:33:38] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1024; pool db1074 and db1076; depool db1044 (duration: 00m 28s) [16:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:53] thcipriani, I'll deal with this [16:33:54] godog: I keep getting this: [16:33:54] 24/28 minions completed checkout [16:33:55] Continue? ([d]etailed/[C]oncise report,[y]es,[n]o,[r]etry): [16:35:00] godog: wait... maybe it worked [16:35:10] * urandom lights a black candle [16:35:19] (03CR) 10Elukey: [C: 032] Revert "Remove rdb1001 from the Job runners config for maintenance." [puppet] - 10https://gerrit.wikimedia.org/r/278929 (owner: 10Elukey) [16:35:29] mafk, 0 pages to fix, 0 were resolvable. 
[16:35:40] when browsing to https://an.wiktionary.org/wiki/Biquizionario:Tabierna I get a valid page [16:35:50] maybe a python issue then [16:35:58] will try again [16:36:13] mysql:wikiadmin@db1035 [anwiktionary]> select * from page where page_title = 'Biquizionario:Tabierna'; [16:36:14] Empty set (0.02 sec) [16:36:29] urandom: continue pressing "c" until you get 28/28 [16:36:33] (03CR) 10Gehel: Filter StatusLogger messages from UDP appender (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277265 (https://phabricator.wikimedia.org/T128787) (owner: 10Eevans) [16:37:04] !log re-added rdb1001 (Job Queue master) to the Jobrunners' config. Forcing puppet agent and restarting jobchron on all jobrunners/videoscalers in eqiad. [16:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:48] the palladium -> strontium sync broke for some reason: The initial run triggered by palladium threw a "error: unable to create file modules/role/manifests/labs/ores/compute.pp (Permission denied)" [16:38:52] I tried a manual sync, but that only returns https://phabricator.wikimedia.org/P2802 [16:38:56] ^^^ urandom moritzm: About the gerrit change above, why do we need an evaluator and not a simple logger configuration (change already merged, I'm just trying to learn something today) [16:39:44] gehel: if i understand the question, then it's to remove it from one appender (UDP), but not the other (FILE) [16:39:50] urandom: did turning it off and on again work? [16:40:00] godog: :) [16:40:05] godog: in a manner [16:40:16] 6Operations, 10Traffic, 7Design, 13Patch-For-Review: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2141266 (10Krenair) Also that code references robots.txt which is also broken in the same way And requests for favicon.ico... [16:40:30] anyone familiar with the sync setup knows how that happened and how to fix it? 
[16:40:39] urandom: then we should probably use 'additivity' [16:40:51] gehel: not sure i follow [16:42:29] https://www.irccloud.com/pastebin/kncs5Wqy/ [16:43:20] urandom: probably something similar to the paste above. We send this logger to a specific appender, and with 'additivity=false' we ensure it is not falling back to parent logger configuration. [16:43:50] it probably works either way, but for a simple use case I find it surprising to need to add 2 additional jars as dependencies... [16:44:46] gehel: huh, you might be right, i'll circle back around to this when i have my hands less full [16:45:12] urandom: ping me if you need some help [16:45:22] * gehel has been playing with Java logging for a long long time... [16:45:24] gehel: thanks! [16:45:27] (03PS2) 10Ottomata: Create eventschemas module, use this in eventbus role [puppet] - 10https://gerrit.wikimedia.org/r/278907 (https://phabricator.wikimedia.org/T127099) [16:46:53] (03PS3) 10Alex Monk: [WIP] openstack: Add proxy panel files [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) [16:47:33] (03CR) 10Alex Monk: [C: 04-1] "PS3 should hopefully deal with jenkins failures, but not Andrew's comments" [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [16:49:00] (03PS1) 10Elukey: Notify Jobchron when jobrunner.conf changes. 
[puppet] - 10https://gerrit.wikimedia.org/r/278931 (https://phabricator.wikimedia.org/T123675) [16:49:17] (03CR) 10jenkins-bot: [V: 04-1] [WIP] openstack: Add proxy panel files [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [16:49:49] (03CR) 10Ppchelko: [C: 031] RESTBase: Use the cassandra back-end [puppet] - 10https://gerrit.wikimedia.org/r/278893 (owner: 10Mobrovac) [16:49:55] !log stopping db1044 mysql to clone to db1075 [16:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:44] 6Operations, 13Patch-For-Review: Reinstall redis servers (Job queues) with Jessie (NOTE: rdb1002 is special and is excluded!) - https://phabricator.wikimedia.org/T123675#2141331 (10elukey) All Redis Job Queue are on Debian! [16:51:02] 6Operations, 13Patch-For-Review, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2141335 (10elukey) [16:53:16] 6Operations, 10Traffic, 7Design, 13Patch-For-Review: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2141340 (10fgiunchedi) @Krenair in swift eqiad ATM ``` root@ms-fe1001:~# swift list root crossdomain.xml favicon.ico index... [16:53:42] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
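Editorial sketch of gehel's 16:43 'additivity=false' suggestion: send one noisy logger to a single destination and stop it from also reaching the parent's appenders. This is not the Cassandra/Logback configuration. It is an analogue written with Python's standard logging module, where `propagate = False` is assumed to play roughly the role of Logback's additivity flag. The logger name `StatusLogger` comes from the change title; `ListHandler`, the `demo` parent logger and both handler stand-ins are hypothetical.

```python
import logging


class ListHandler(logging.Handler):
    """In-memory handler; a hypothetical stand-in for a Logback appender."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


file_like = ListHandler()  # stand-in for the FILE appender
udp_like = ListHandler()   # stand-in for the UDP (logstash) appender

# Parent logger: everything logged here goes to both destinations.
# A named logger is used instead of the real root to avoid global side effects.
parent = logging.getLogger("demo")
parent.setLevel(logging.INFO)
parent.handlers = [file_like, udp_like]
parent.propagate = False

# Noisy child logger: attached to the file-like handler only.
# propagate=False keeps its records away from the parent's handlers,
# which is the behaviour 'additivity=false' is described as giving.
noisy = logging.getLogger("demo.StatusLogger")
noisy.handlers = [file_like]
noisy.propagate = False

parent.info("regular message")
noisy.info("status message")

print("file:", file_like.messages)
print("udp: ", udp_like.messages)
```

With this wiring the status message reaches only the file-like handler, while regular messages still reach both, and no extra filter or evaluator classes are involved.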
[16:54:02] PROBLEM - puppet last run on mw2202 is CRITICAL: CRITICAL: puppet fail [16:54:55] !log Performing rolling restart of AQS Cassandra cluster : T130393, T128787 [16:54:56] T130393: Create separate Kibana dashboards for production Cassandra clusters - https://phabricator.wikimedia.org/T130393 [16:54:56] T128787: Reduce Cassandra logstash output - https://phabricator.wikimedia.org/T128787 [16:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:14] (03CR) 10Ottomata: [C: 032] "tested in beta" [puppet] - 10https://gerrit.wikimedia.org/r/278907 (https://phabricator.wikimedia.org/T127099) (owner: 10Ottomata) [16:56:19] 6Operations, 10Traffic, 7Design, 13Patch-For-Review: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2141350 (10Krenair) Agreed. [16:57:16] 6Operations, 10Traffic, 7Design, 13Patch-For-Review: Do something better than an "Unauthorized" error page at https://upload.wikimedia.org/ - https://phabricator.wikimedia.org/T130449#2141353 (10Krenair) 5Open>3Resolved a:3fgiunchedi [16:59:01] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 63 failures [16:59:10] (03PS2) 10Yuvipanda: labs: Add support for custom cnames in labs recursor [puppet] - 10https://gerrit.wikimedia.org/r/278705 (https://phabricator.wikimedia.org/T118758) [17:00:04] yurik gwicke cscott arlolra subbu mdholloway: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160322T1700). Please do the needful. [17:00:13] nothing for parsoid [17:00:25] urandom: mobrovac ok to merge https://gerrit.wikimedia.org/r/#/c/278893/ ? 
[17:00:48] i plan on doing a mobileapps deployment in a few minutes [17:01:08] (03CR) 10Yuvipanda: [C: 032] labs: Add support for custom cnames in labs recursor [puppet] - 10https://gerrit.wikimedia.org/r/278705 (https://phabricator.wikimedia.org/T118758) (owner: 10Yuvipanda) [17:01:55] akosiaris: yup, it's a no-op [17:02:00] ok [17:02:06] just making sure [17:02:10] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: Use the cassandra back-end [puppet] - 10https://gerrit.wikimedia.org/r/278893 (owner: 10Mobrovac) [17:02:16] (03PS2) 10Alexandros Kosiaris: RESTBase: Use the cassandra back-end [puppet] - 10https://gerrit.wikimedia.org/r/278893 (owner: 10Mobrovac) [17:02:22] (03CR) 10Alexandros Kosiaris: [V: 032] RESTBase: Use the cassandra back-end [puppet] - 10https://gerrit.wikimedia.org/r/278893 (owner: 10Mobrovac) [17:02:24] thnx akosiaris [17:02:33] PROBLEM - logstash process on logstash1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [17:03:18] !log Rolling restart of AQS Cassandra cluster complete : : T130393, T128787 [17:03:19] T130393: Create separate Kibana dashboards for production Cassandra clusters - https://phabricator.wikimedia.org/T130393 [17:03:20] T128787: Reduce Cassandra logstash output - https://phabricator.wikimedia.org/T128787 [17:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:05:00] !log starting mobileapps deployment [17:05:01] (03CR) 10Halfak: [C: 031] "Seems to work in practice too." [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [17:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:05:25] (03CR) 10Giuseppe Lavagetto: [C: 031] Notify Jobchron when jobrunner.conf changes. 
[puppet] - 10https://gerrit.wikimedia.org/r/278931 (https://phabricator.wikimedia.org/T123675) (owner: 10Elukey) [17:07:38] 6Operations, 15User-mobrovac, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2141416 (10Joe) So, to circumstantiate my ideas a bit more: * Services... [17:08:20] (03PS1) 10Ottomata: Clone mediawiki/event-schemas in refinery role [puppet] - 10https://gerrit.wikimedia.org/r/278937 (https://phabricator.wikimedia.org/T126501) [17:12:35] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2141444 (10RobH) p:5Triage>3High This has been pending approval for a week, so I'm setting it to high priority (was needs triage). [17:19:21] (03CR) 10Dzahn: [C: 031] dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [17:19:23] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [17:21:34] (03CR) 10Dzahn: [C: 031] "literally "Looks good to me, but someone else must approve" jynus asked to wait" [dns] - 10https://gerrit.wikimedia.org/r/278344 (https://phabricator.wikimedia.org/T125827) (owner: 10Papaul) [17:23:10] (03CR) 10Ori.livneh: [C: 031] Notify Jobchron when jobrunner.conf changes. 
[puppet] - 10https://gerrit.wikimedia.org/r/278931 (https://phabricator.wikimedia.org/T123675) (owner: 10Elukey) [17:23:22] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:24:11] (03CR) 10Dzahn: "added Jaime and Moritz" [dns] - 10https://gerrit.wikimedia.org/r/278344 (https://phabricator.wikimedia.org/T125827) (owner: 10Papaul) [17:24:49] (03CR) 10Jcrespo: [C: 04-1] Decom:Removed production DNS for db200[1-7] Bug:T125827 [dns] - 10https://gerrit.wikimedia.org/r/278344 (https://phabricator.wikimedia.org/T125827) (owner: 10Papaul) [17:25:06] (03PS2) 10Elukey: Notify Jobchron when jobrunner.conf changes. [puppet] - 10https://gerrit.wikimedia.org/r/278931 (https://phabricator.wikimedia.org/T123675) [17:25:33] (03PS4) 10Alex Monk: [WIP] openstack: Add proxy panel files [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) [17:26:32] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:26:49] (03PS3) 10Elukey: Notify Jobchron when jobrunner.conf changes. [puppet] - 10https://gerrit.wikimedia.org/r/278931 (https://phabricator.wikimedia.org/T123675) [17:29:21] (03CR) 10Elukey: "Applied Ori's suggestion!" [puppet] - 10https://gerrit.wikimedia.org/r/278931 (https://phabricator.wikimedia.org/T123675) (owner: 10Elukey) [17:29:36] (03CR) 10Elukey: [C: 032] "Applied Ori's suggestion!" [puppet] - 10https://gerrit.wikimedia.org/r/278931 (https://phabricator.wikimedia.org/T123675) (owner: 10Elukey) [17:30:32] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[17:33:35] !log starting wmf.18 branching [17:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:33:44] !log mobileapps deployment complete, deployed 85856f7 [17:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:42] (03PS1) 10Yuvipanda: Followup to I28b0dfaecd [puppet] - 10https://gerrit.wikimedia.org/r/278940 [17:35:03] elukey: congrats on all redises jq on debian! [17:35:54] (03CR) 10Yuvipanda: [C: 032] Followup to I28b0dfaecd [puppet] - 10https://gerrit.wikimedia.org/r/278940 (owner: 10Yuvipanda) [17:37:28] (03CR) 10Sabya: "Any idea what is the FAILURE about? regarding operations-puppet-tox-pep8-jessie?" [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [17:39:46] (03CR) 10Elukey: "Added a comment to improve varnishstatsd4, nothing huge. Ema, I'll let you decide if it is worth or not :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277790 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [17:40:04] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: stat1001 access + sudo rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2130333 (10mark) Lacking an Ops meeting this week and potentially the next, I'm just going to approve this request. [17:41:03] godog: \o/ (all credits to _joe_ for the patience though) [17:41:07] (03PS3) 10Milimetric: Re-organize analytics dumps to their own page [puppet] - 10https://gerrit.wikimedia.org/r/269696 (https://phabricator.wikimedia.org/T115344) [17:42:56] 6Operations, 10Ops-Access-Requests, 6Discovery, 10Maps, 13Patch-For-Review: Requesting maps-admins access for Eric Evans - https://phabricator.wikimedia.org/T130412#2135290 (10mark) Considering the lack of Ops meeting this week and potentially the next, I'll approve this now. 
[17:43:13] (03PS6) 10Legoktm: dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [17:43:18] 6Operations, 10Reading-Web, 10Traffic, 6Wikipedia-iOS-App-Product-Backlog, 13Patch-For-Review: Setup up Site Association file for Universal Link Support - https://phabricator.wikimedia.org/T111829#2141565 (10jcrespo) 5Open>3Resolved I've just confirmed it on https://developer.apple.com/library/ios/do... [17:43:22] (03CR) 10Legoktm: [C: 031] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [17:45:58] (03PS1) 10Yuvipanda: labs: return CNAMEs only when asked for [puppet] - 10https://gerrit.wikimedia.org/r/278941 [17:46:47] (03PS2) 10Dzahn: Include statistics-admins on stat1001 (role statistics::web), include nuria in that group [puppet] - 10https://gerrit.wikimedia.org/r/278316 (https://phabricator.wikimedia.org/T130226) (owner: 10Ottomata) [17:47:08] (03PS3) 10Dzahn: Include statistics-admins on stat1001 (role statistics::web), include nuria in that group [puppet] - 10https://gerrit.wikimedia.org/r/278316 (https://phabricator.wikimedia.org/T130226) (owner: 10Ottomata) [17:47:40] (03CR) 10Dzahn: [C: 032] "has approval on ticket now" [puppet] - 10https://gerrit.wikimedia.org/r/278316 (https://phabricator.wikimedia.org/T130226) (owner: 10Ottomata) [17:51:06] 6Operations, 6Mobile-Apps, 10Traffic: Millions of request per minute to /.well-known/apple-app-site-association producing 404s - https://phabricator.wikimedia.org/T130647#2141611 (10jcrespo) [17:59:13] (03PS1) 10Rush: ssh-key-ldap-lookup multiple server array handling [puppet] - 10https://gerrit.wikimedia.org/r/278944 (https://phabricator.wikimedia.org/T130583) [18:00:01] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: stat1001 access + sudo rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2141658 (10Dzahn) 
5Open>3Resolved @ottomata you were right about avoiding to add another new group and after @mark's approval i merged that now. i ran pu... [18:01:52] (03PS2) 10Rush: ssh-key-ldap-lookup multiple server array handling [puppet] - 10https://gerrit.wikimedia.org/r/278944 (https://phabricator.wikimedia.org/T130583) [18:02:18] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: stat1001 access + sudo rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2141662 (10Dzahn) @nuria see above, you have users on stat1001 now [18:05:42] (03PS2) 10Dzahn: add eevans to maps-admins group [puppet] - 10https://gerrit.wikimedia.org/r/278349 (https://phabricator.wikimedia.org/T130412) (owner: 10Eevans) [18:06:36] (03CR) 10Dzahn: [C: 032] "has support and approval on ticket, in lieu of ops meeting" [puppet] - 10https://gerrit.wikimedia.org/r/278349 (https://phabricator.wikimedia.org/T130412) (owner: 10Eevans) [18:07:19] (03PS1) 10Ottomata: Redirect both stderr and stdout to mylvmbackup logfile [puppet] - 10https://gerrit.wikimedia.org/r/278946 [18:08:19] 6Operations, 10Ops-Access-Requests, 6Discovery, 10Maps, 13Patch-For-Review: Requesting maps-admins access for Eric Evans - https://phabricator.wikimedia.org/T130412#2141703 (10Dzahn) 5Open>3Resolved a:3Dzahn merged. please let us know if any unexpected issues (after puppet ran on servers you should... 
[18:08:57] (03PS2) 10Ottomata: Redirect both stderr and stdout to mylvmbackup logfile [puppet] - 10https://gerrit.wikimedia.org/r/278946 [18:09:06] (03CR) 10Ottomata: [C: 032 V: 032] Redirect both stderr and stdout to mylvmbackup logfile [puppet] - 10https://gerrit.wikimedia.org/r/278946 (owner: 10Ottomata) [18:09:53] 6Operations, 10Ops-Access-Requests, 6Discovery, 10Maps: Requesting maps-admins access for Eric Evans - https://phabricator.wikimedia.org/T130412#2141708 (10Dzahn) [18:10:09] (03CR) 10Yuvipanda: [C: 031] ssh-key-ldap-lookup multiple server array handling [puppet] - 10https://gerrit.wikimedia.org/r/278944 (https://phabricator.wikimedia.org/T130583) (owner: 10Rush) [18:10:15] urandom: ^ [18:10:30] (03PS3) 10Rush: ssh-key-ldap-lookup multiple server array handling [puppet] - 10https://gerrit.wikimedia.org/r/278944 (https://phabricator.wikimedia.org/T130583) [18:10:31] urandom: eh, too late:) what i want to say is, you have been added to maps-admins [18:11:47] mutante: thanks! [18:13:33] (03CR) 10Dzahn: "the ticket meanwhile says "@ArielGlenn the extension is now deployed in read-only mode.". Does that mean it's good to go?" 
[puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [18:13:39] urandom: np [18:14:06] (03CR) 10Rush: [C: 032] ssh-key-ldap-lookup multiple server array handling [puppet] - 10https://gerrit.wikimedia.org/r/278944 (https://phabricator.wikimedia.org/T130583) (owner: 10Rush) [18:14:33] (03PS1) 10Ema: VTC tests compatible with Varnish 3 and 4 [puppet] - 10https://gerrit.wikimedia.org/r/278948 (https://phabricator.wikimedia.org/T128188) [18:18:02] (03PS1) 10Alexandros Kosiaris: ores::base: Remove data_path [puppet] - 10https://gerrit.wikimedia.org/r/278950 [18:18:04] (03PS1) 10Alexandros Kosiaris: ores::base: Remove handling of /srv [puppet] - 10https://gerrit.wikimedia.org/r/278951 [18:19:31] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: puppet fail [18:22:23] (03PS1) 10Yuvipanda: docker: Have registry listen on 443 port [puppet] - 10https://gerrit.wikimedia.org/r/278952 [18:24:19] (03CR) 10Yuvipanda: [C: 032 V: 032] docker: Have registry listen on 443 port [puppet] - 10https://gerrit.wikimedia.org/r/278952 (owner: 10Yuvipanda) [18:27:31] bblack: in case you haven't seen it yet: https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-continent , https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-geolocation [18:29:01] (03CR) 10Mobrovac: [C: 031] Add DC named topics to event bus topic config [puppet] - 10https://gerrit.wikimedia.org/r/278752 (https://phabricator.wikimedia.org/T127718) (owner: 10Ottomata) [18:29:27] mobrovac: we actually might not need that after all. might just do this: https://phabricator.wikimedia.org/T130562 [18:29:44] not yet sure where the topic config should be enforced [18:29:55] ori: nice! [18:30:21] am currently thinking on whatever topic the user requests in the message, not the full prefixed topic [18:30:23] wills ee [18:30:36] mutante: how long should i wait on that? 
i don't seem to be able to access maps-test2004 [18:30:45] ottomata: yeah, saw that and it's a great way to get around the fact that MW has no idea where it is located, but the topic name itself still has to contain the name of the DC [18:31:13] mutante: never mind :) [18:31:13] yes [18:31:19] but, the message won't contain that [18:31:27] actually yeah, mobrovac what do you thikn [18:31:27] urandom: :) ok [18:31:27] so [18:31:34] if we do --topic=prefix [18:31:41] should the service alter the meta.topic field? [18:31:42] or [18:31:52] should it just produce to the topic prefix without altering the field? [18:31:55] ori: we can take a week avg of that data, scale x population (or x internet-population) and get a sort of weighting of "how important is it to improve navtiming for a given continent?" [18:32:15] sorry, --topic-prefix='eqiad.' [18:32:16] etc. [18:32:28] (which I suspect still supports Asia as the first choice, given we can't do them all at once) [18:32:53] an asia DC helps Oceania too in the continent view, of course [18:32:57] ottomata: we can also consider meta.topic as the "canonical name" (so no dc prefix) [18:33:17] or the service could add it, not sure which would be better [18:33:19] yeah, i'm not sure which is better [18:33:24] it seems best to not alter [18:33:29] bblack: yes, I think so (as in: yes, we can do that, and yes, it would be a good idea) [18:34:27] (03PS1) 10Jcrespo: Increase weight of db1024, db1074 and db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278955 (https://phabricator.wikimedia.org/T130351) [18:34:50] of course there's always additional interpretation to make. 
for some areas, it's the last mile that hurts far more than the closeness of our edge site [18:35:21] (03PS2) 10Jcrespo: Increase weight of db1024, db1074 and db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278955 (https://phabricator.wikimedia.org/T130351) [18:35:22] (so placing a DC in asia might substantial improve navtiming there, whereas it might make a smaller diff placing one or two somewhere in Africa directly) [18:36:26] (03CR) 10Jcrespo: [C: 032] Increase weight of db1024, db1074 and db1076 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278955 (https://phabricator.wikimedia.org/T130351) (owner: 10Jcrespo) [18:37:04] on the other other hand, there's also the whole "build it and they will come" angle: when latency for services sucks no matter what the last mile looks like, there's less incentive on operators to improve the last mile) [18:38:26] (03PS2) 10Muehlenhoff: Add ferm rules for redis access on maps cluster [puppet] - 10https://gerrit.wikimedia.org/r/277197 [18:38:38] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for redis access on maps cluster [puppet] - 10https://gerrit.wikimedia.org/r/277197 (owner: 10Muehlenhoff) [18:39:03] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Increase weight of db1024, db1074 and db1076 (duration: 01m 34s) [18:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:28] yuvipanda: I've puppet-merged your docket change along [18:39:39] !log Rolling restart of Cassandra on maps cluster : T130393, T128787 [18:39:40] T130393: Create separate Kibana dashboards for production Cassandra clusters - https://phabricator.wikimedia.org/T130393 [18:39:40] T128787: Reduce Cassandra logstash output - https://phabricator.wikimedia.org/T128787 [18:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:57] morebots: woops, thanks [18:39:57] I am a logbot running on tools-exec-1215. 
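Note on the --topic-prefix exchange between ottomata and mobrovac (18:31 to 18:33): the option they lean towards, producing to the prefixed topic while leaving meta.topic as the unprefixed "canonical name", can be sketched roughly as below. This is only an illustration under assumptions: the function name, the event layout beyond meta.topic, and the example topic name are hypothetical and are not taken from the eventbus code.

```python
import copy


def destination_topic(event, topic_prefix=""):
    """Return the topic to produce to, without modifying the event.

    meta.topic stays the canonical (unprefixed) name; the datacenter
    prefix is applied only when choosing the destination topic.
    """
    return topic_prefix + event["meta"]["topic"]


# Hypothetical event: the topic name is made up for illustration.
event = {"meta": {"topic": "example.topic"}, "payload": {}}
before = copy.deepcopy(event)

target = destination_topic(event, topic_prefix="eqiad.")
print(target)           # destination includes the DC prefix
print(event == before)  # the message itself was not altered
```

With this shape the consumer side can still match on meta.topic regardless of which datacenter produced the message; whether topic config should be enforced against the prefixed or unprefixed name is left open, as in the discussion.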
[18:39:57] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [18:39:57] To log a message, type !log . [18:40:14] moritzm: ^ [18:40:17] bblack: Right. But getting a PoP set up in Africa might be more challenging, because the internet infrastructure is generally poorer. The impact of the next PoP (IMO) will be measured not only by its direct impact, but also its indirect impact, in terms of earning the confidence of the board and the community to provision additional PoPs. [18:40:32] so it's practical to go with a location that is simpler to operate in [18:40:37] moritzm: is there a way for me to allow a non-root process to bind to a 443 port? [18:40:57] port forwarding using iptables? [18:40:58] moritzm: the default docker-registry systemd unit runs as docker-registry user, but this means I can't set it to listen to 443 [18:41:11] !log Aborting restart of Cassandra on maps cluster : T130393, T128787 [18:41:13] T130393: Create separate Kibana dashboards for production Cassandra clusters - https://phabricator.wikimedia.org/T130393 [18:41:13] T128787: Reduce Cassandra logstash output - https://phabricator.wikimedia.org/T128787 [18:41:15] ori: yeah, or nginx, but I'm hoping to find some other trick that renders that unnecessary. [18:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:27] ori: maybe systemd can grant a capability temporarily [18:41:38] * yuvipanda is looking [18:41:51] PROBLEM - Apache HTTP on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:42:01] akosiaris: ping? [18:42:05] yuvipanda: http://stackoverflow.com/a/414258/582542 ? [18:42:41] PROBLEM - HHVM rendering on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:43:02] PROBLEM - nutcracker port on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:43:13] yuvipanda: after a lot of trial and error it seems that the person settled on iptables: http://stackoverflow.com/a/21653102/582542 [18:43:15] (03PS3) 10Rush: Set exim environment for labs instances [puppet] - 10https://gerrit.wikimedia.org/r/278899 (owner: 10Muehlenhoff) [18:43:21] PROBLEM - Disk space on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:29] (03CR) 10Rush: [C: 031] "thanks moritzm for doing all the exim work" [puppet] - 10https://gerrit.wikimedia.org/r/278899 (owner: 10Muehlenhoff) [18:43:32] PROBLEM - Check size of conntrack table on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:52] PROBLEM - DPKG on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:01] PROBLEM - HHVM processes on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:13] PROBLEM - nutcracker process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:13] PROBLEM - salt-minion processes on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:22] PROBLEM - RAID on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:28] ori: reading [18:44:31] PROBLEM - SSH on mw1125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:44:42] PROBLEM - configured eth on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:44:43] PROBLEM - dhclient process on mw1125 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:47:11] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [18:47:12] yuvipanda: strictly speaking you don't need root, there's a Linux capability for that [18:48:07] moritzm: right, but http://stackoverflow.com/questions/413807/is-there-a-way-for-non-root-processes-to-bind-to-privileged-ports-1024-on-l/21653102#21653102 suggests that doesn't exactly work [18:48:26] ori: yeah, agreed :) [18:56:01] (03PS1) 10Yuvipanda: k8s: Allow docker registry to bind to lower ports [puppet] - 10https://gerrit.wikimedia.org/r/278958 [18:56:10] ori: moritzm ^ seems to work (tested it locally) [18:57:12] (03PS1) 10Dereckson: Allow bureaucrat group to add/remove translationadmin on species [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278959 (https://phabricator.wikimedia.org/T129888) [18:59:50] (03CR) 10jenkins-bot: [V: 04-1] k8s: Allow docker registry to bind to lower ports [puppet] - 10https://gerrit.wikimedia.org/r/278958 (owner: 10Yuvipanda) [18:59:52] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 110 failures [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160322T1900). Please do the needful. [19:00:14] !log mw1125 - powercycle [19:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:40] * thcipriani continues doing the needful. 
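Note on the port 443 thread (18:40 to 18:56): on Linux, binding a TCP port below 1024 by default requires either root or the CAP_NET_BIND_SERVICE capability, which is what the "setcap patch" mentioned here is understood to grant to the registry binary. The sketch below is not that patch; it is a small, assumption-laden helper for checking from an unprivileged process whether a given port is in the privileged range and whether a bind attempt actually succeeds. The helper names are hypothetical.

```python
import socket

# Linux default: ports below this value need root or CAP_NET_BIND_SERVICE.
PRIVILEGED_PORT_LIMIT = 1024


def needs_bind_capability(port):
    """True if the port falls in the default privileged range."""
    return 0 < port < PRIVILEGED_PORT_LIMIT


def try_bind(port, host="127.0.0.1"):
    """Attempt a TCP bind; return (ok, reason) instead of raising."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind((host, port))
        return True, ""
    except OSError as exc:
        # Typically "Permission denied" for a privileged port without
        # the capability, or "Address already in use" if it is taken.
        return False, str(exc)


if __name__ == "__main__":
    for port in (443, 8443):
        ok, reason = try_bind(port)
        print(port, "privileged" if needs_bind_capability(port) else "unprivileged",
              "bind ok" if ok else "bind failed: " + reason)
```

Running this as the service user before and after a capability change gives a quick check of whether the change took effect, without starting the registry itself.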
[19:00:59] (03PS2) 10Yuvipanda: k8s: Allow docker registry to bind to lower ports [puppet] - 10https://gerrit.wikimedia.org/r/278958 [19:01:03] needful -v [19:02:22] RECOVERY - SSH on mw1125 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [19:02:32] RECOVERY - configured eth on mw1125 is OK: OK - interfaces up [19:02:33] RECOVERY - dhclient process on mw1125 is OK: PROCS OK: 0 processes with command name dhclient [19:02:42] RECOVERY - nutcracker port on mw1125 is OK: TCP OK - 0.000 second response time on port 11212 [19:03:03] RECOVERY - Disk space on mw1125 is OK: DISK OK [19:03:17] (03PS3) 10Yuvipanda: k8s: Allow docker registry to bind to lower ports [puppet] - 10https://gerrit.wikimedia.org/r/278958 [19:03:19] 6Operations, 6Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2142031 (10MoritzMuehlenhoff) Moving to a backport of 2.4.41 is probably the better solution, though. [19:03:22] RECOVERY - Check size of conntrack table on mw1125 is OK: OK: nf_conntrack is 7 % full [19:03:22] RECOVERY - Apache HTTP on mw1125 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.138 second response time [19:03:33] RECOVERY - HHVM processes on mw1125 is OK: PROCS OK: 6 processes with command name hhvm [19:03:33] RECOVERY - DPKG on mw1125 is OK: All packages OK [19:03:53] RECOVERY - nutcracker process on mw1125 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [19:03:53] RECOVERY - salt-minion processes on mw1125 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:04:02] RECOVERY - RAID on mw1125 is OK: OK: no RAID installed [19:04:11] RECOVERY - HHVM rendering on mw1125 is OK: HTTP OK: HTTP/1.1 200 OK - 67435 bytes in 1.164 second response time [19:04:33] moritzm: I added you to the setcap patch :) [19:06:06] yuvipanda: sure, I'll have a look tomorrow morning [19:06:39] 6Operations: Migrate pool counters to trusty/jessie - 
https://phabricator.wikimedia.org/T123734#2142055 (10Joe) For the record, a poolcounter server failing should not cause downtime anymore. [19:06:53] RECOVERY - puppet last run on mw1125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:07:19] 6Operations: Maps Cassandra cluster discrepencies - https://phabricator.wikimedia.org/T130654#2142059 (10Eevans) [19:07:37] 6Operations: Maps Cassandra cluster discrepencies - https://phabricator.wikimedia.org/T130654#2142078 (10Eevans) p:5Triage>3High [19:08:18] (03PS4) 10Ema: Port varnishreqstats and varnishstatsd to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/277790 (https://phabricator.wikimedia.org/T128788) [19:08:47] (03CR) 10Ema: [C: 032 V: 032] Port varnishreqstats and varnishstatsd to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/277790 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [19:10:09] (03CR) 10Dzahn: "could we get this to run in compiler? http://puppet-compiler.wmflabs.org/2119/" [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [19:15:58] (03CR) 10Dzahn: "added in labs/private https://gerrit.wikimedia.org/r/#/c/278961/" [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [19:17:20] 6Operations, 6Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2142143 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [19:19:53] moritzm: cool. 
I've done it manually for now [19:22:28] (03CR) 10MarcoAurelio: [C: 031] Allow bureaucrat group to add/remove translationadmin on species [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278959 (https://phabricator.wikimedia.org/T129888) (owner: 10Dereckson) [19:23:18] 6Operations, 6Performance-Team, 6Release-Engineering-Team, 7Availability, and 3 others: Dig through logs from 15 Mar 2016 read-only test and file bugs - https://phabricator.wikimedia.org/T129973#2142180 (10aaron) 5Open>3Resolved I did a comb through. AFAIK, everything I saw has a patch or a bug on the... [19:23:25] twentyafterfour: here and ready to get that phab change merged? [19:23:37] i added fake key to labs/private, now it compiles [19:23:56] we can see the diff on tin/mira .. bla [19:24:17] (03PS1) 10Yuvipanda: Revert "labs: Add support for custom cnames in labs recursor" [puppet] - 10https://gerrit.wikimedia.org/r/278964 [19:24:55] (03CR) 10MarcoAurelio: [C: 031] Disable upload on ia.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278411 (https://phabricator.wikimedia.org/T130425) (owner: 10Dereckson) [19:24:57] (03PS1) 10Thcipriani: Group0 to 1.27.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278965 [19:25:06] (03PS1) 10Ottomata: Use non bash specifc output redirect, and redirect full flock command output in mylvmbackup job [puppet] - 10https://gerrit.wikimedia.org/r/278966 [19:25:43] (03CR) 10Dzahn: [C: 031] "compiles on tin/mira too now http://puppet-compiler.wmflabs.org/2121/" [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [19:25:52] (03PS10) 10Dzahn: Add a deployment source & target class for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [19:26:17] (03CR) 10Ottomata: [C: 032] Use non bash specifc output redirect, and redirect full flock command output in mylvmbackup job [puppet] - 
10https://gerrit.wikimedia.org/r/278966 (owner: 10Ottomata) [19:29:06] jouncebot: next [19:29:06] In 3 hour(s) and 30 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160322T2300) [19:29:16] jouncebot: do you also know puppet swats? [19:31:35] it should [19:31:36] (03CR) 10Alex Monk: [C: 04-1] "PS4 actually dealt with Jenkins, but not the issues that Andrew raised" [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [19:32:11] !log thcipriani@tin Started scap: testwiki to php-1.27.0-wmf.18 and rebuild l10n cache [19:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:38:34] (03PS22) 10Ema: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [19:40:13] PROBLEM - cassandra CQL 10.64.32.159:9042 on restbase1003 is CRITICAL: Connection refused [19:41:19] (03CR) 10Alex Monk: [WIP] openstack: Add proxy panel files (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [19:43:53] (03PS5) 10Alex Monk: [WIP] openstack: Add proxy panel files [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) [19:45:16] hmm. scap is moving weirdly quickly. [19:50:54] (03CR) 10Aaron Schulz: [C: 04-2] Made the session/main stashes write to both DCs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247325 (https://phabricator.wikimedia.org/T111575) (owner: 10Aaron Schulz) [19:52:43] ACKNOWLEDGEMENT - cassandra CQL 10.64.32.159:9042 on restbase1003 is CRITICAL: Connection refused eevans Node has been decommissioned. 
[19:53:30] !log Rolling restart of RESTBase Cassandra cluster : T130393, T128787 [19:53:32] T130393: Create separate Kibana dashboards for production Cassandra clusters - https://phabricator.wikimedia.org/T130393 [19:53:32] T128787: Reduce Cassandra logstash output - https://phabricator.wikimedia.org/T128787 [19:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:56:22] PROBLEM - Apache HTTP on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:04] PROBLEM - HHVM rendering on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:53] PROBLEM - RAID on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:12] PROBLEM - dhclient process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:58:33] PROBLEM - DPKG on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:59:02] PROBLEM - Disk space on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:59:12] PROBLEM - configured eth on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:59:22] PROBLEM - Check size of conntrack table on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:59:34] PROBLEM - HHVM processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:59:34] PROBLEM - SSH on mw1139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:00:18] (03PS6) 10Alex Monk: [WIP] openstack: Add proxy panel files [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) [20:02:02] hmmm. mw1139 is just hanging sync-common :\ [20:02:11] PROBLEM - nutcracker port on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:02:13] PROBLEM - salt-minion processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:02:31] PROBLEM - nutcracker process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:02:53] (03PS2) 10Aaron Schulz: [WIP] Switched to pt-heartbeat lag detection [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) [20:03:23] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [100000000.0] [20:10:53] (03PS1) 10Andrew Bogott: Clean up some meaningless lines in makedomain script [puppet] - 10https://gerrit.wikimedia.org/r/278970 [20:11:01] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: Connection refused [20:11:28] (03PS2) 10Yuvipanda: Revert "labs: Add support for custom cnames in labs recursor" [puppet] - 10https://gerrit.wikimedia.org/r/278964 (https://phabricator.wikimedia.org/T118758) [20:12:08] (03PS3) 10Yuvipanda: Revert "labs: Add support for custom cnames in labs recursor" [puppet] - 10https://gerrit.wikimedia.org/r/278964 (https://phabricator.wikimedia.org/T118758) [20:12:15] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "labs: Add support for custom cnames in labs recursor" [puppet] - 10https://gerrit.wikimedia.org/r/278964 (https://phabricator.wikimedia.org/T118758) (owner: 10Yuvipanda) [20:12:35] ^^^ that's me [20:13:16] (03CR) 10Alex Monk: "Ic881fc13 already does this" [puppet] - 10https://gerrit.wikimedia.org/r/278970 (owner: 10Andrew Bogott) [20:14:32] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.010 second response time on port 9042 [20:16:49] (03CR) 10Alex Monk: "I put PS5 on labtestweb2001 and it promptly broke, then I fixed it in PS6." [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [20:18:21] !log thcipriani@tin Finished scap: testwiki to php-1.27.0-wmf.18 and rebuild l10n cache (duration: 46m 09s) [20:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:55] (03CR) 10Andrew Bogott: "So it does!" 
[puppet] - 10https://gerrit.wikimedia.org/r/278970 (owner: 10Andrew Bogott) [20:19:02] (03Abandoned) 10Andrew Bogott: Clean up some meaningless lines in makedomain script [puppet] - 10https://gerrit.wikimedia.org/r/278970 (owner: 10Andrew Bogott) [20:19:24] (03CR) 10Andrew Bogott: [C: 032] openstack: clean up a couple of trivial things in makedomain [puppet] - 10https://gerrit.wikimedia.org/r/278539 (owner: 10Alex Monk) [20:20:28] (03PS2) 10Andrew Bogott: Add wmflabsdotorg credentials to horizon config [puppet] - 10https://gerrit.wikimedia.org/r/278538 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [20:20:39] greg-g: Ok with you, if I push out a small JS-only fix? [20:20:57] (talking about https://gerrit.wikimedia.org/r/277782) [20:21:01] RECOVERY - MariaDB Slave IO: s3 on db1075 is OK: OK slave_io_state not a slave [20:21:49] RECOVERY - MariaDB Slave SQL: s3 on db1075 is OK: OK slave_sql_state Slave_SQL_Running: Yes [20:21:50] hoo: I'm in the midst of running the Train deployment. [20:22:03] I read that as ruining [20:22:07] * Reedy hugs thcipriani [20:22:19] !log deployed latest version of WDQS [20:22:19] hahaha, that too :P [20:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:23] thcipriani: No worries [20:22:34] (03CR) 10Thcipriani: [C: 032] Group0 to 1.27.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278965 (owner: 10Thcipriani) [20:22:56] I needs Greg's +1 first, then I will coordinate with whoever is currently deploying [20:22:59] (03Merged) 10jenkins-bot: Group0 to 1.27.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278965 (owner: 10Thcipriani) [20:23:14] okie doke. 
I should be done here shortly (hopefully) [20:23:41] RECOVERY - dhclient process on mw1139 is OK: PROCS OK: 0 processes with command name dhclient [20:23:44] (03CR) 10Andrew Bogott: [C: 04-1] Add wmflabsdotorg credentials to horizon config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/278538 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [20:25:31] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.18 [20:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:37] bd808: those chatty Cassandra StatusLogger messages should be pretty much gone, as of about 4 hours ago [20:26:54] urandom: cool. thanks for working on that [20:27:00] bd808: no worries [20:27:23] bd808: do i need to do something special for the new cluster attribute? [20:27:33] i don't see them in kibana [20:27:47] bd808: https://gerrit.wikimedia.org/r/#/c/278330/ <-- that [20:28:32] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [5000000.0] [20:28:42] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [20:28:50] urandom: I see them at least on some messages in https://logstash.wikimedia.org/#/dashboard/elasticsearch/default [20:29:11] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [5000000.0] [20:29:11] PROBLEM - dhclient process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:29:51] mutante: I'm here [20:29:57] (03PS23) 10Ema: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) [20:30:45] bd808: yeah, so do i... 
on that dashboard anyway [20:30:48] (03PS3) 10Alex Monk: Add wmflabsdotorg credentials to horizon config [puppet] - 10https://gerrit.wikimedia.org/r/278538 (https://phabricator.wikimedia.org/T129245) [20:31:45] twentyafterfour: ok, let's go [20:31:52] (03PS11) 10Dzahn: Add a deployment source & target class for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [20:32:06] (03CR) 10Dzahn: [C: 032] Add a deployment source & target class for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 1020after4) [20:32:18] urandom: the last hour of https://logstash.wikimedia.org/#/dashboard/elasticsearch/cassandra has it for all records too [20:32:25] hoo: yeah, fine :) [20:33:12] bd808: damn, sorry, there they are... [20:33:19] not sure what i was looking at [20:33:22] RECOVERY - nutcracker process on mw1139 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [20:33:23] RECOVERY - DPKG on mw1139 is OK: All packages OK [20:33:43] bd808: sorry for the noise! [20:34:01] PROBLEM - NTP on mw1139 is CRITICAL: NTP CRITICAL: No response from NTP server [20:34:34] urandom: hey no worries. The top hit on the cassandra dashboard before I messed with the time period didn't have one. Not sure why that was [20:36:47] greg-g: Great, thanks [20:37:33] (03PS1) 10Yuvipanda: k8s: Extract basic docker engine setup [puppet] - 10https://gerrit.wikimedia.org/r/278976 [20:37:59] (03PS2) 10Yuvipanda: k8s: Extract basic docker engine setup [puppet] - 10https://gerrit.wikimedia.org/r/278976 [20:38:00] mutante: should I run puppet manually? [20:38:08] twentyafterfour: done [20:38:11] phab-deploy ALL=(root) NOPASSWD: /usr/sbin/service phd * [20:38:11] +phab-deploy ALL=(root) NOPASSWD: /usr/sbin/service apache2 * [20:38:13] etc... [20:38:17] cool [20:38:21] thanks! 
[20:38:29] i also run it on tin/mira [20:38:33] urandom: I am around [20:38:52] PROBLEM - DPKG on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:39:02] akosiaris: mmm, i opened you a ticket [20:39:05] mutante: ok I'll run a couple of tests, real deployment is tomorrow night though. Thank you! [20:39:15] * urandom tries to find the ticket [20:39:18] twentyafterfour: ok, cool. np [20:39:33] I'll also work on hieraizing the configs [20:39:39] twentyafterfour: i added a fake key in labs/private [20:39:48] twentyafterfour: great [20:39:50] akosiaris: https://phabricator.wikimedia.org/T130654 [20:40:22] mira/tin: [20:40:23] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [20:40:25] Notice: /Stage[main]/Phabricator::Deployment::Source/Keyholder::Agent[phabricator]/File[/etc/keyholder-auth.d/phabricator.yml]/ensure: defined content [20:40:36] akosiaris: tl;dr those nodes are missing stuff in /srv/deployment, and i wonder if Cassandra won't fail to start next time it is bounced [20:40:40] a [20:40:55] akosiaris: a change went out today that relies on some jars in the classpath [20:40:57] (03CR) 10Yuvipanda: [C: 032] "Tested" [puppet] - 10https://gerrit.wikimedia.org/r/278976 (owner: 10Yuvipanda) [20:41:16] urandom: how are those jars shipped into production ? [20:41:23] deployment ? [20:41:26] yeah [20:41:29] (03CR) 10Jcrespo: [C: 031] "I have 2 conditions-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [20:41:31] ah, that explains it [20:41:36] so trebuchet I assume [20:41:42] akosiaris: yes. :/ [20:41:53] #fml [20:42:21] well, the maps-cluster never got the change to make those boxes targets for that repo. It should be fixable rather easy [20:42:32] PROBLEM - nutcracker process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[20:42:36] * akosiaris famous last words [20:42:48] (03PS1) 10Yuvipanda: tools: Add a docker builder role [puppet] - 10https://gerrit.wikimedia.org/r/278977 [20:43:03] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:43:36] urandom: class { '::cassandra::metrics': } [20:43:36] class { '::cassandra::logging': } [20:43:39] I assume you need both [20:43:42] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:43:43] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [20:44:00] akosiaris: for some value of "need", I guess :) [20:44:14] yeah ok, lemme add them to the maps role [20:44:32] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [20:44:33] RECOVERY - NTP on mw1139 is OK: NTP OK: Offset 0.002876162529 secs [20:44:33] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:33] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:44:58] ok, so the keyholder thing [20:45:08] that is suspiciously close to the phab-related change [20:45:18] twentyafterfour [20:45:35] that is what made me not merge it last week [20:45:38] !log Rolling restart of RESTBase Cassandra cluster complete : T130393, T128787 [20:45:39] T130393: Create separate Kibana dashboards for production Cassandra clusters - https://phabricator.wikimedia.org/T130393 [20:45:39] T128787: Reduce Cassandra logstash output - https://phabricator.wikimedia.org/T128787 [20:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:45:43] (03CR) 10Aaron Schulz: "Switch-over should be fine (the jobs will just skip the wait checks). 
I was going to amend this to use the shard column, but that might ne" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [20:45:51] mutante: what thing? [20:45:52] PROBLEM - cxserver endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:45:52] PROBLEM - cxserver endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:10] IIRC, back in July when I was setting up the maps-test cluster, I went for the KISS principle and avoided those 2 classes [20:46:11] < icinga-wm> PROBLEM - Keyholder SSH agent on mira is CRITICAL: [20:46:11] !log decommissioning restbase1004-a.eqiad.wmnet : T125842 [20:46:12] T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842 [20:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:22] PROBLEM - Disk space on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:23] PROBLEM - puppet last run on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:23] PROBLEM - configured eth on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:46:28] hmm [20:46:39] thcipriani: Are you done? [20:46:44] hoo: yup. [20:46:51] sorry, should've poked you. [20:46:54] 6Operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2142471 (10Krinkle) @hashar I'm sorry, but I can't work with that. It passing doesn't mean it worked. Running an old npm version may mean certain hooks/tests never run in the first place, giving the false illusion of a pass.... 
[20:46:55] Will go ahead then [20:46:56] thanks [20:47:10] i kind of expect an issue with that now [20:47:10] 6Operations, 10Continuous-Integration-Infrastructure, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2142472 (10Krinkle) [20:50:46] !log Stopping restbase2004.codfw.wmnet for offline sstablescrub : T130254 [20:50:47] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254 [20:50:47] [tin:~] $ sudo keyholder status [20:50:47] keyholder-agent start/running, process 2168 [20:50:47] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:47] mutante: it looks like keyholder is fine on tin, not sure why not on mira? [20:50:47] [mira:~] $ sudo keyholder status [20:50:47] keyholder-agent start/running, process 2322 [20:50:47] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:47] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:47] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:50:48] 6Operations, 6Labs, 13Patch-For-Review: Setup private docker registry with authentication support in tools - https://phabricator.wikimedia.org/T118758#2142478 (10yuvipanda) 5Open>3Resolved I've reverted all the CNAME work - @Joe pointed out that we'll want to have the registry available (in a readonly mo... [20:50:49] RECOVERY - cxserver endpoints health on scb2001 is OK: All endpoints are healthy [20:50:49] RECOVERY - cxserver endpoints health on scb2002 is OK: All endpoints are healthy [20:50:49] twentyafterfour: no, it's the same on both tin and mira. both CRIT [20:50:49] (03CR) 10Andrew Bogott: [C: 032] "I tested this with the puppet compiler and it worked as expected. Will rebase and merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/278538 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [20:50:50] (03PS4) 10Andrew Bogott: Add wmflabsdotorg credentials to horizon config [puppet] - 10https://gerrit.wikimedia.org/r/278538 (https://phabricator.wikimedia.org/T129245) (owner: 10Alex Monk) [20:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:50:50] * twentyafterfour doesn't know how the icinga check works [20:50:50] keyholder seems to be functional [20:52:27] keyholder definitely works on tin. I wonder if it's critical because there's now a key that needs to be added to the keyholder. [20:52:28] twentyafterfour: does the new phab key have a passphrase? [20:52:32] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [20:52:39] no, it's running but not armed [20:52:43] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [20:52:45] keys ask for passphrases [20:52:50] some have one, some dont [20:53:18] (03PS2) 10Andrew Bogott: openstack: clean up a couple of trivial things in makedomain [puppet] - 10https://gerrit.wikimedia.org/r/278539 (owner: 10Alex Monk) [20:53:19] [mira:~] $ sudo keyholder arm [20:53:19] Enter passphrase for /etc/keyholder.d/eventlogging_rsa: [20:53:19] Enter passphrase for /etc/keyholder.d/mwdeploy_rsa: [20:53:19] /etc/keyholder.d/phabricator_rsa is not an acceptable key. Does it have a passphrase? 
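A minimal sketch related to the arm output above, where /etc/keyholder.d/phabricator_rsa is rejected with "Does it have a passphrase?". This is not how keyholder itself checks keys. Assuming the file is an ordinary PEM or OpenSSH private key, the hypothetical helper `has_passphrase` below inspects only the key's header to report whether it is passphrase-protected:

```python
import base64
import struct

# Magic prefix of the new-style OpenSSH private key blob.
OPENSSH_MAGIC = b"openssh-key-v1\x00"


def has_passphrase(key_text: str) -> bool:
    """Return True if the private key text looks passphrase-protected.

    Hypothetical helper for illustration; not part of keyholder.
    """
    lines = [line.strip() for line in key_text.strip().splitlines()]
    if not lines or not lines[0].startswith("-----BEGIN "):
        raise ValueError("not a PEM-style private key")
    header = lines[0]

    # PKCS#8 encrypted keys say so in the BEGIN line.
    if "ENCRYPTED PRIVATE KEY" in header:
        return True

    # New-style OpenSSH keys store the cipher name inside the blob;
    # the cipher "none" means no passphrase.
    if "OPENSSH PRIVATE KEY" in header:
        body = "".join(l for l in lines[1:] if not l.startswith("-----"))
        blob = base64.b64decode(body)
        if not blob.startswith(OPENSSH_MAGIC):
            raise ValueError("unrecognised OpenSSH key blob")
        offset = len(OPENSSH_MAGIC)
        (name_len,) = struct.unpack(">I", blob[offset:offset + 4])
        cipher = blob[offset + 4:offset + 4 + name_len]
        return cipher != b"none"

    # Legacy PEM keys (e.g. "RSA PRIVATE KEY") carry a Proc-Type header
    # when they are encrypted.
    return any(l.startswith("Proc-Type:") and "ENCRYPTED" in l for l in lines[1:])
```

Running this over each file in /etc/keyholder.d before arming would flag an unprotected key early, under the assumption stated in the log that keyholder wants every key to have a passphrase.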
[20:53:39] also need the passphrases for the non-mwdeploy keys [20:53:46] which i got from pwstore [20:54:07] "sudo keyholder arm" [20:54:26] mutante: shouldn't be passphrased [20:54:36] keyholder does not like that [20:54:49] oh [20:54:52] it wants all keys to have one apparently [20:54:54] I didn't know this [20:55:05] (03PS1) 10Alexandros Kosiaris: maps: Add cassandra helper classes [puppet] - 10https://gerrit.wikimedia.org/r/278981 (https://phabricator.wikimedia.org/T130654) [20:55:23] so what i have is: [20:55:31] mediawiki-deployment-key-passphrase [20:55:34] (03CR) 10Jcrespo: "I am not worried about that, that is supposed to be fixed. I was giving my ok to deploy it as the infrastructure side of things has worked" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [20:55:46] and we need that for eventlogging and phabricator and services [20:56:02] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [20:56:03] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [20:56:13] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:56:17] ^^^^ not sure what this is about [20:56:31] https://wikitech.wikimedia.org/wiki/Keyholder [20:56:37] (03CR) 10Jcrespo: "My +1 was based on the assumption that shard was being used (and had already been deployed)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [20:57:10] mutante: ^ that wiki page shows where the passwords are supposed to go [20:57:53] that's weird that it requires manual arming with manual passphrase entry [20:58:02] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [20:58:02] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [20:58:10] twentyafterfour: ah! 
that helps, thanks [20:58:26] I tested this stuff in beta and didn't run into that ... I guess it only requires the passphrase on prod? [20:59:02] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. [20:59:23] !log tin - arm keyholder with deployment keys [20:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:00:55] mutante: if you are poking around in there [21:00:56] https://gerrit.wikimedia.org/r/#/c/259596/ [21:01:03] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [21:01:04] i have been meaning to merge that the next time i had to get in there [21:01:22] !log mira - arm keyholder [21:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:01:32] don't worry about it if you don't feel like bothering atm [21:01:47] ok, lol, that just takes typing the passphrase 8 times [21:01:51] after you get them from pws [21:01:56] yeah [21:01:58] where you had to type a GPG passphrase [21:02:08] also each time [21:02:09] yeah its pretty annoying [21:02:45] wow.. AND the service restart is still needed? [21:02:53] well that part icinga doesnt even know [21:03:01] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [21:03:24] wait, is it really needed? [21:03:42] how can we tell if it's fine now [21:03:44] (03CR) 10Ottomata: "Hmm, does look like it! This def was a problem once..." 
[puppet] - 10https://gerrit.wikimedia.org/r/259596 (owner: 10Ottomata) [21:03:56] mutante: i'd leave that patch alone, i just saw tyler's comment [21:03:59] "keyholder status" was already ok before arming it [21:04:06] i'll mess with it the next time i'm in there [21:04:08] ottomata: alright [21:04:28] ottomata: btw, you were right about not adding even more admin groups [21:04:32] and that is done now [21:04:35] aye, cool [21:04:37] ja saw thank you [21:04:46] i'll work with them to make sure the file perms are what they should be [21:04:54] great, yw [21:05:29] twentyafterfour: yea, so let's just replace that phab key [21:06:00] mutante: ok should I generate it on iridium or do you wanna do it? [21:06:02] besides that it seems all good now [21:06:08] ok [21:06:44] i can do it, since i need to add the passphrase in pws anyways [21:06:50] sorry about the oversight, that's the problem with testing on labs - some stuff falls short of production-parity even by design [21:06:56] mutante: ok [21:06:57] just first going for lunch [21:07:07] oh, no problem [21:07:24] cool, have a good lunch, I should be around if you need me when you get back [21:07:37] thanks, bbiaw [21:07:54] https://it.wiktionary.org/wiki/advert <-- anyone can check this on chrome? 
[21:08:02] it seems to appear blank [21:09:42] RECOVERY - nutcracker process on mw1139 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:09:42] RECOVERY - DPKG on mw1139 is OK: All packages OK [21:09:51] Vito, with that name, I would not be surprised that it is an ad blocker thing [21:09:55] Vito: blank on firefox, works on chrome [21:10:11] RECOVERY - Disk space on mw1139 is OK: DISK OK [21:10:13] RECOVERY - configured eth on mw1139 is OK: OK - interfaces up [21:10:19] jynus: good thought, I bet that's it (my firefox is adbloked and my chrome isn't) [21:10:43] he [21:10:52] works for me on both browsers [21:11:12] RECOVERY - dhclient process on mw1139 is OK: PROCS OK: 0 processes with command name dhclient [21:11:12] RECOVERY - nutcracker port on mw1139 is OK: TCP OK - 0.000 second response time on port 11212 [21:11:22] RECOVERY - salt-minion processes on mw1139 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:11:27] yep it's adblock [21:12:09] Vito: jynus is right, try without ad blocker, then it works [21:12:30] you should disable adblock on wikipedia, we need that extra revenue! Oh wait [21:12:38] lol [21:12:57] so spot on with all the adblock fighting these days [21:15:11] PROBLEM - DPKG on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:17:19] PROBLEM - configured eth on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:17:19] PROBLEM - dhclient process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:17:19] PROBLEM - nutcracker port on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:17:19] PROBLEM - salt-minion processes on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:17:19] PROBLEM - nutcracker process on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:17:22] PROBLEM - Disk space on mw1139 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:17:45] jynus, CreepyEyedJimboBlock [21:18:33] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:21:39] (03PS2) 10Yuvipanda: tools: Add a docker builder role [puppet] - 10https://gerrit.wikimedia.org/r/278977 [21:22:02] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add a docker builder role [puppet] - 10https://gerrit.wikimedia.org/r/278977 (owner: 10Yuvipanda) [21:22:52] waiting for mw1139 [21:23:43] !log hoo@tin Synchronized php-1.27.0-wmf.17/extensions/Wikidata: Update Wikibase: Fix add qualifier link not getting disabled (duration: 07m 30s) [21:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:23:47] is there a known issue with logstash? I'm seeing no restbase logs: https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase [21:24:08] (03PS1) 10Alexandros Kosiaris: Add role::ores::web and roles::ores::worker [puppet] - 10https://gerrit.wikimedia.org/r/278989 (https://phabricator.wikimedia.org/T124201) [21:24:09] nothing for the last couple of hours, at least [21:24:10] (03PS1) 10Alexandros Kosiaris: Assign roles::ores::web, roles::ores::worker to SCB [puppet] - 10https://gerrit.wikimedia.org/r/278990 (https://phabricator.wikimedia.org/T124201) [21:24:30] Verified my fix [21:25:05] (03PS1) 10Yuvipanda: docker: Restart docker service, rather than service_unit [puppet] - 10https://gerrit.wikimedia.org/r/278991 [21:25:25] urandom: hmm... so no restbase logs? 
Usually that means the logstas100x instance restbase is pointed at is busted [21:25:28] * bd808 looks [21:26:03] (03CR) 10Alexandros Kosiaris: [C: 032] maps: Add cassandra helper classes [puppet] - 10https://gerrit.wikimedia.org/r/278981 (https://phabricator.wikimedia.org/T130654) (owner: 10Alexandros Kosiaris) [21:26:07] (03PS2) 10Alexandros Kosiaris: maps: Add cassandra helper classes [puppet] - 10https://gerrit.wikimedia.org/r/278981 (https://phabricator.wikimedia.org/T130654) [21:26:12] (03CR) 10Alexandros Kosiaris: [V: 032] maps: Add cassandra helper classes [puppet] - 10https://gerrit.wikimedia.org/r/278981 (https://phabricator.wikimedia.org/T130654) (owner: 10Alexandros Kosiaris) [21:26:28] (03PS2) 10Yuvipanda: docker: Restart docker service, rather than service_unit [puppet] - 10https://gerrit.wikimedia.org/r/278991 [21:27:11] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0] [21:27:44] !log hoo@tin Synchronized php-1.27.0-wmf.18/extensions/Wikidata: Update Wikibase: Fix add qualifier link not getting disabled (duration: 03m 43s) [21:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:58] Please run sync-common on mw1139 once it's back online [21:28:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [5000000.0] [21:28:03] (whoever is looking) [21:28:44] !log Restarted logstash process on logstash1002; dead from OOM since 2016-03-18T11:47:12 [21:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:28:50] urandom: ^ [21:29:02] RECOVERY - logstash process on logstash1002 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [21:29:03] bd808: thank you! 
[21:29:28] We are working on a new systemd unit to fix that :/ [21:29:34] (03PS1) 10Yuvipanda: docker: Start the docker service by default [puppet] - 10https://gerrit.wikimedia.org/r/279019 [21:31:07] (03PS3) 10Aaron Schulz: [WIP] Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) [21:31:21] RECOVERY - nutcracker process on mw1139 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [21:31:21] RECOVERY - DPKG on mw1139 is OK: All packages OK [21:31:22] (03CR) 10Aaron Schulz: [C: 04-2] "Blocked on two core patches" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [21:31:30] (03PS1) 10Jcrespo: Pool db1044 and db1075 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279031 (https://phabricator.wikimedia.org/T130351) [21:31:42] RECOVERY - Disk space on mw1139 is OK: DISK OK [21:31:43] RECOVERY - configured eth on mw1139 is OK: OK - interfaces up [21:31:52] RECOVERY - Check size of conntrack table on mw1139 is OK: OK: nf_conntrack is 0 % full [21:32:12] RECOVERY - HHVM processes on mw1139 is OK: PROCS OK: 6 processes with command name hhvm [21:32:22] RECOVERY - RAID on mw1139 is OK: OK: no RAID installed [21:32:22] RECOVERY - SSH on mw1139 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [21:32:24] twentyafterfour, jynus: oh well, ty both [21:32:28] I'm too used to adblock :'D [21:32:42] RECOVERY - dhclient process on mw1139 is OK: PROCS OK: 0 processes with command name dhclient [21:32:42] RECOVERY - nutcracker port on mw1139 is OK: TCP OK - 0.000 second response time on port 11212 [21:32:43] RECOVERY - Apache HTTP on mw1139 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.856 second response time [21:32:51] RECOVERY - salt-minion processes on mw1139 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/salt-minion [21:33:04] (03CR) 10jenkins-bot: [V: 04-1] docker: Start the docker service by default [puppet] - 10https://gerrit.wikimedia.org/r/279019 (owner: 10Yuvipanda) [21:33:31] RECOVERY - HHVM rendering on mw1139 is OK: HTTP OK: HTTP/1.1 200 OK - 67468 bytes in 0.226 second response time [21:33:46] (03CR) 10Yuvipanda: [C: 032] docker: Restart docker service, rather than service_unit [puppet] - 10https://gerrit.wikimedia.org/r/278991 (owner: 10Yuvipanda) [21:34:01] (03PS2) 10Yuvipanda: docker: Start the docker service by default [puppet] - 10https://gerrit.wikimedia.org/r/279019 [21:35:21] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [21:36:12] PROBLEM - puppet last run on restbase2004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:36:15] (03CR) 10Yuvipanda: [C: 032] docker: Start the docker service by default [puppet] - 10https://gerrit.wikimedia.org/r/279019 (owner: 10Yuvipanda) [21:37:07] (03CR) 10Yuvipanda: [V: 032] docker: Start the docker service by default [puppet] - 10https://gerrit.wikimedia.org/r/279019 (owner: 10Yuvipanda) [21:38:23] (03CR) 10Aaron Schulz: [C: 04-2] [WIP] Switched to pt-heartbeat lag detection on s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/243116 (https://phabricator.wikimedia.org/T111266) (owner: 10Aaron Schulz) [21:38:52] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [21:41:03] !log Ran sync-common on mw1139 (which missed two deploys) [21:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:33] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [21:46:06] 6Operations, 13Patch-For-Review: Maps Cassandra cluster discrepencies - https://phabricator.wikimedia.org/T130654#2142765 (10akosiaris) 5Open>3Resolved a:3akosiaris So, after the above patch got merged and puppet ran 
we got ``` akosiaris@maps-test2001:/srv/deployment/cassandra$ ls -1 logstash-logback-en... [21:46:10] (03CR) 10Jcrespo: "I am ok with the config, I am not sure I see clearly the plan. Apply it to all, then restart the servers one by one?" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/278042 (https://phabricator.wikimedia.org/T111654) (owner: 10Volans) [21:46:11] urandom: ^ [21:46:39] akosiaris: \o/ [22:10:17] 6Operations, 10Continuous-Integration-Infrastructure, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2142877 (10hashar) I have looked at Debian source package from https://anonscm.debian.org/git/collab-maint/nodejs.git branch `master-lts` and it pass the config flag `--without-npm` .... [22:26:42] (03PS7) 10Alex Monk: [WIP] openstack: Add proxy panel files [puppet] - 10https://gerrit.wikimedia.org/r/278871 (https://phabricator.wikimedia.org/T129245) [22:33:23] twentyafterfour: hey, so i'm meanwhile back and made a new key. now i wonder where i put the public part of it [22:33:46] and i can give it a new name or overwrite the existing one. 
whats better [22:46:52] jouncebot: next [22:46:52] In 0 hour(s) and 13 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160322T2300) [22:48:11] (03CR) 10Aaron Schulz: [C: 032] Remove obsolete comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277359 (owner: 10Aaron Schulz) [22:48:56] (03Merged) 10jenkins-bot: Remove obsolete comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277359 (owner: 10Aaron Schulz) [22:50:14] !log replacing phab deploy key with a new one [22:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:50:37] !log aaron@tin Synchronized wmf-config/filebackend-production.php: comment tweak (duration: 00m 35s) [22:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:50:57] (03CR) 10Aaron Schulz: [C: 032] Bump jobqueue "connectTimeout" to 300ms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278090 (owner: 10Aaron Schulz) [22:51:29] (03Merged) 10jenkins-bot: Bump jobqueue "connectTimeout" to 300ms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278090 (owner: 10Aaron Schulz) [22:52:33] !log aaron@tin Synchronized wmf-config/jobqueue-eqiad.php: Bump timeout to 300ms (duration: 00m 29s) [22:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:54:48] (03CR) 10Aaron Schulz: [C: 032] Lower "max lag" and $wgAPIMaxLagThreshold to 8/6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275739 (owner: 10Aaron Schulz) [22:55:39] (03Merged) 10jenkins-bot: Lower "max lag" and $wgAPIMaxLagThreshold to 8/6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275739 (owner: 10Aaron Schulz) [22:56:23] !log aaron@tin Synchronized wmf-config: Lower "max lag" and $wgAPIMaxLagThreshold to 8/6 (duration: 00m 29s) [22:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:57:39] (03Abandoned) 10Aaron Schulz: [WIP] Lowered 
"max lag" setting to 5 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/242814 (https://phabricator.wikimedia.org/T95501) (owner: 10Aaron Schulz) [23:00:04] RoanKattouw ostriches Krenair MaxSem: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160322T2300). Please do the needful. [23:00:05] jan_drewniak Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:40] Hi. [23:01:57] o/ [23:02:33] !log mira re-arm keyholder for new phab key [23:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:02:47] eh, whoever is deploying, can i do this on tin really quick before you start [23:02:52] hmm, I guess I'll do it [23:03:01] MaxSem: can i have 1 minute please [23:03:03] mutante, go ahead [23:03:07] ok [23:03:21] hey I have a general SWAT question. who should plus 2 my mediawiki-config patches? [23:03:29] the deployer [23:04:02] !log re-arm keyholder on tin for new phab key [23:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:04:09] MaxSem: done. green lights [23:04:38] odder's not here? [23:05:41] @seen odder [23:05:41] mutante: Last time I saw odder they were quitting the network with reason: Quit: leaving N/A at 3/19/2016 11:13:39 PM (2d23h52m1s ago) [23:06:04] MaxSem: I can test the patches [23:06:15] just wanted to ask if the images were properly minimized [23:06:23] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. [23:06:33] mutante, ^ [23:06:59] bah, i literally just did that..double checking [23:07:12] mira/tin [23:07:16] it's running and armed [23:07:21] keyholder-agent start/running, process 2322 [23:08:43] PROBLEM - Keyholder SSH agent on tin is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
[23:09:20] oh come on [23:10:42] (03PS2) 10Dereckson: Create a HiDPI logo for the Czech Wikipedia (cswiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278334 (https://phabricator.wikimedia.org/T130392) (owner: 10Odder) [23:11:29] (03CR) 10Dereckson: "PS2: optipng -o7" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278334 (https://phabricator.wikimedia.org/T130392) (owner: 10Odder) [23:11:34] MaxSem: ^ here you are [23:11:44] \o/ [23:12:04] mutante, is all ok for me to deploy? [23:12:30] no, i'll do it one more time :/ [23:14:05] !log tin - restarting and re-arming keyholder [23:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:11] RECOVERY - Keyholder SSH agent on tin is OK: OK: Keyholder is armed with all configured keys. [23:15:12] !log mira - restarting and re-arming keyholder [23:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:23] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys. [23:15:31] MaxSem: ^ much better. please feel free [23:15:38] weeeee [23:16:03] i had to stop the service apparently to remove an old key. 
sorry for the delay [23:16:37] (03CR) 10Aaron Schulz: "FYI, I see 6 errors in the last 24 hours" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (owner: 10Aaron Schulz) [23:18:01] (03CR) 10MaxSem: [C: 032] Update favicon for Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278321 (https://phabricator.wikimedia.org/T70728) (owner: 10Odder) [23:18:57] (03Merged) 10jenkins-bot: Update favicon for Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278321 (https://phabricator.wikimedia.org/T70728) (owner: 10Odder) [23:20:19] !log maxsem@tin Synchronized static/favicon/testwikidata.ico: https://gerrit.wikimedia.org/r/#/c/278321/ (duration: 00m 31s) [23:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:08] (03CR) 10MaxSem: [C: 032] Create a HiDPI logo for the Czech Wikipedia (cswiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278334 (https://phabricator.wikimedia.org/T130392) (owner: 10Odder) [23:21:21] MaxSem: you've already purged it? 
https://test.wikidata.org/static/favicon/testwikidata.ico is still 32x32 / https://test.wikidata.org/static/favicon/testwikidata.ico?debug=true ok [23:21:44] (03Merged) 10jenkins-bot: Create a HiDPI logo for the Czech Wikipedia (cswiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278334 (https://phabricator.wikimedia.org/T130392) (owner: 10Odder) [23:23:07] !log maxsem@tin Synchronized static/images/project-logos: https://gerrit.wikimedia.org/r/#/c/278334/ (duration: 00m 26s) [23:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:23:49] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/278334/ (duration: 00m 27s) [23:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:10] (03Restored) 10ArielGlenn: Move default config into a file [dumps] - 10https://gerrit.wikimedia.org/r/43156 (owner: 10Awight) [23:26:43] !log ran echo 'https://test.wikidata.org/static/favicon/testwikidata.ico' | mwscript purgeList.php [23:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:58] Dereckson, ^^^ [23:27:08] it's echo 'https://www.wikimedia.org/static/favicon/testwikidata.ico' | mwscript purgeList.php [23:27:13] Since https://gerrit.wikimedia.org/r/#/c/265623/ [23:27:32] er wrong URL [23:27:57] (03CR) 10MaxSem: [C: 032] Bumping portals to master. Deploying A/B test T124111 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278902 (https://phabricator.wikimedia.org/T124112) (owner: 10Jdrewniak) [23:28:15] it's now on en.wikipedia.org the master site, not www.wikimedia.org [23:28:31] (03Merged) 10jenkins-bot: Bumping portals to master. 
Deploying A/B test T124111 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278902 (https://phabricator.wikimedia.org/T124112) (owner: 10Jdrewniak) [23:28:51] MaxSem: so echo 'https://en.wikipedia.org/static/favicon/testwikidata.ico' | mwscript purgeList.php [23:29:24] (03CR) 10ArielGlenn: "as soon as there's data to dump (i.e. as soon as the extension is deployed with the ability to actually add url shorteners) this can go ou" [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [23:29:29] owchie [23:30:18] dear roots, could you please disown stuff in /srv/mediawiki-staging/.git on tin? ping ori, mutante [23:35:15] !log maxsem@tin Synchronized portals/prod/wikipedia.org/assets: (no message) (duration: 00m 26s) [23:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:35:41] !log maxsem@tin Synchronized portals: (no message) (duration: 00m 25s) [23:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:35:51] jan_drewniak, ^^^ [23:36:17] MaxSem: okay so I confirm: since https://gerrit.wikimedia.org/r/#/c/269149/ the server to purge is the one defined in $static_host @ puppet modules/role/manifests/cache/base.pp. And currently, $static_host = 'en.wikipedia.org'. So we need to purge https://en.wikipedia.org/static/favicon/testwikidata.ico [23:36:33] (03CR) 10MaxSem: [C: 032] Disable upload on ia.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278411 (https://phabricator.wikimedia.org/T130425) (owner: 10Dereckson) [23:36:45] MaxSem: hurra! 
thanks [23:37:12] Dereckson, done [23:37:21] (03Merged) 10jenkins-bot: Disable upload on ia.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278411 (https://phabricator.wikimedia.org/T130425) (owner: 10Dereckson) [23:37:30] MaxSem: icon works o/ [23:37:33] (03CR) 10MaxSem: [C: 032] Allow bureaucrat group to add/remove translationadmin on species [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278959 (https://phabricator.wikimedia.org/T129888) (owner: 10Dereckson) [23:38:17] (03Merged) 10jenkins-bot: Allow bureaucrat group to add/remove translationadmin on species [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278959 (https://phabricator.wikimedia.org/T129888) (owner: 10Dereckson) [23:38:24] MaxSem: oh, i did not get that ping when it's at the end of the line [23:38:33] (03CR) 10MaxSem: [C: 032] Revert "Remove extraneous namespace shorcut alias on ru.wikibooks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278213 (owner: 10Dereckson) [23:39:10] (03Merged) 10jenkins-bot: Revert "Remove extraneous namespace shorcut alias on ru.wikibooks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278213 (owner: 10Dereckson) [23:39:27] MaxSem: disown = everything mwdeploy:wikidev ? [23:39:49] mutante, yep but o/ri already done that [23:40:29] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: SWAT x 3 (duration: 00m 26s) [23:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:40:40] MaxSem: ok, alright [23:40:56] 1 maxsem wikidev 259K Mar 22 23:39 index [23:41:10] Testing. 
[23:42:42] MaxSem: all three changes tested, thanks for the deploy [23:43:06] thanks Dereckson [23:46:25] !log maxsem@tin Synchronized php-1.27.0-wmf.17/extensions/WikimediaEvents/: https://gerrit.wikimedia.org/r/#/c/279066/ (duration: 00m 28s) [23:46:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:38] !log maxsem@tin Synchronized php-1.27.0-wmf.18/extensions/WikimediaEvents/: https://gerrit.wikimedia.org/r/#/c/279066/ (duration: 00m 27s) [23:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
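The favicon purge exchange above can be sketched as follows. Per the discussion, static assets are cached under the host set in $static_host (en.wikipedia.org at the time), so the URL to purge uses that host rather than the wiki's own. `STATIC_HOST` and `static_purge_url` are hypothetical names used only for illustration; the resulting URL would then be piped to `mwscript purgeList.php` as shown in the log.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: mirrors the $static_host value quoted in the discussion.
STATIC_HOST = "en.wikipedia.org"


def static_purge_url(asset_url: str, static_host: str = STATIC_HOST) -> str:
    """Rewrite a /static/ asset URL so it points at the static host.

    Hypothetical helper; any query string (e.g. ?debug=true) is dropped
    so that the plain asset URL is the one purged.
    """
    parts = urlsplit(asset_url)
    if not parts.path.startswith("/static/"):
        raise ValueError("not a /static/ asset URL")
    return urlunsplit((parts.scheme, static_host, parts.path, "", ""))


if __name__ == "__main__":
    # URL taken from the log; the printed line is what would be fed to purgeList.php.
    print(static_purge_url("https://test.wikidata.org/static/favicon/testwikidata.ico"))
```

Keeping the host in one constant means a later change to $static_host needs only one edit in a helper like this.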