[00:16:38] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [5000000.0] [00:30:39] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [01:35:39] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [5000000.0] [01:37:09] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [5000000.0] [01:49:30] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [01:51:00] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [02:26:28] PROBLEM - Apache HTTP on mw1253 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.003 second response time [02:26:34] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.18) (duration: 11m 32s) [02:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:09] RECOVERY - Apache HTTP on mw1253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.030 second response time [02:30:41] Operations, ops-requests, Patch-For-Review: remove (was: Install) python-mysqldb on terbium - https://phabricator.wikimedia.org/T84075#2154270 (Legoktm) Nope, I no longer need the python-mysqldb package. 
[02:35:00] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Mar 28 02:35:00 UTC 2016 (duration 8m 26s) [02:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:58:08] PROBLEM - puppet last run on es2014 is CRITICAL: CRITICAL: puppet fail [04:26:00] RECOVERY - puppet last run on es2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:59:09] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 18.52% of data above the critical threshold [100000000.0] [06:23:38] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:29:48] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: puppet fail [06:29:59] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:09] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail [06:30:09] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:18] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:18] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail [06:30:29] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: puppet fail [06:30:30] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: puppet fail [06:30:38] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: puppet fail [06:30:39] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail [06:31:09] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:09] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:39] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:49] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:29] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: 
Puppet has 1 failures [06:33:38] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:10] PROBLEM - puppet last run on mw2166 is CRITICAL: CRITICAL: Puppet has 1 failures [06:51:30] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:51:50] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: Puppet has 1 failures [06:52:19] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:55:58] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:56:18] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:38] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:56:39] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:56:40] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:57:18] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:19] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:48] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:49] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:57:58] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:58] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [06:58:08] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures [06:58:09] RECOVERY - puppet last run on es2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:09] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:18] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:30] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:30] RECOVERY - puppet last run on mw2166 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:16:29] RECOVERY - puppet last run on db1027 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:47:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [5000000.0] [08:01:40] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [08:21:40] !log powercycling ms-be2016, down for ~1d, unresponsive on console [08:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:22:39] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 13.79% of data above the critical threshold [100000000.0] [08:26:39] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [08:26:39] RECOVERY - very high load average likely xfs on ms-be2016 is OK: OK - load average: 11.72, 3.79, 1.36 [08:26:39] RECOVERY - swift-account-auditor on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [08:26:39] RECOVERY - swift-container-server on ms-be2016 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [08:26:39] RECOVERY - Check size of conntrack table on ms-be2016 is OK: OK: nf_conntrack is 5 % full 
[08:26:39] RECOVERY - swift-account-replicator on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [08:26:39] RECOVERY - DPKG on ms-be2016 is OK: All packages OK [08:26:40] RECOVERY - swift-account-reaper on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [08:26:40] RECOVERY - salt-minion processes on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:26:41] RECOVERY - Disk space on ms-be2016 is OK: DISK OK [08:26:41] RECOVERY - configured eth on ms-be2016 is OK: OK - interfaces up [08:26:42] RECOVERY - swift-object-updater on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [08:34:50] Operations, Analytics, Patch-For-Review: nf_conntrack warnings for kafka hosts - https://phabricator.wikimedia.org/T131028#2154356 (elukey) ```ipv4 2 tcp 6 73 TIME_WAIT src=10.64.16.127 dst=10.64.53.12``` TIME_WAIT is a state for the TCP actor that does the active close, in this case the src... [08:40:05] akosiaris: any clue of what's up with the scb2001 changeprop alert? 
[08:40:29] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 72128 bytes in 0.157 second response time [08:40:49] RECOVERY - HHVM jobrunner on mw1008 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time [08:40:49] RECOVERY - Apache HTTP on mw1197 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.034 second response time [08:40:49] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 72128 bytes in 0.175 second response time [08:41:08] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.042 second response time [08:44:04] !log force-rebooting ms-be2008 to fix hundreds of unkillable mkfs.xfs stuck due to a hosed disk [08:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:47:08] PROBLEM - Host ms-be2008 is DOWN: PING CRITICAL - Packet loss = 100% [08:47:12] paravoid: other than this ? https://phabricator.wikimedia.org/T130948 ? nope [08:47:18] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [08:49:59] RECOVERY - changeprop endpoints health on scb2001 is OK: All endpoints are healthy [08:50:12] paravoid: a restart seemed to fix it [08:50:18] maybe never really deployed ? [08:50:23] mobrovac: ^ ? [09:05:09] akosiaris: he's probably not working today :) [09:42:10] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: puppet fail [09:47:36] Operations, Traffic, Wikimedia-Blog, HTTPS: Switch blog to HTTPS-only - https://phabricator.wikimedia.org/T105905#2154394 (Tbayer) Update: After some more detours, all posts that embedded HTTP content should now have been converted to HTTPS. In the end Automattic ran two conversion scripts for us,... 
[09:50:19] RECOVERY - Host ms-be2008 is UP: PING OK - Packet loss = 0%, RTA = 36.39 ms [09:50:49] RECOVERY - very high load average likely xfs on ms-be2008 is OK: OK - load average: 9.67, 2.97, 1.03 [09:50:49] RECOVERY - RAID on ms-be2008 is OK: OK: optimal, 13 logical, 13 physical [10:08:39] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [10:23:59] (PS1) Florianschmidtwelzow: Enable mobile link preview (popups) on beta labs wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/279934 (https://phabricator.wikimedia.org/T113243) [12:33:16] (PS1) Alexandros Kosiaris: ores: Move redisproxy into the labs role module [puppet] - https://gerrit.wikimedia.org/r/279939 [12:59:59] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [13:04:40] (PS2) Alexandros Kosiaris: ores: Move redisproxy into the labs role module [puppet] - https://gerrit.wikimedia.org/r/279939 [13:04:46] (CR) Alexandros Kosiaris: [C: 2 V: 2] ores: Move redisproxy into the labs role module [puppet] - https://gerrit.wikimedia.org/r/279939 (owner: Alexandros Kosiaris) [13:10:58] PROBLEM - Host ns0-v4 is DOWN: PING CRITICAL - Packet loss = 100% [13:11:18] PROBLEM - Host radon is DOWN: PING CRITICAL - Packet loss = 100% [13:12:18] PROBLEM - Host ns0-v6 is DOWN: PING CRITICAL - Packet loss = 100% [13:24:59] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [13:25:10] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [13:28:19] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:30:43] is radon down for a reason? [13:31:41] looking... 
[13:32:31] !log rebooting radon (unresponsive console) [13:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:08] (PS2) Reedy: Remove old throttle rules. Swap array() -> [] [mediawiki-config] - https://gerrit.wikimedia.org/r/278454 [13:34:23] (PS3) Reedy: Remove old throttle rules. Swap array() -> [] [mediawiki-config] - https://gerrit.wikimedia.org/r/278454 [13:34:44] oh [13:34:47] sorry, missed it [13:34:49] RECOVERY - Host ns0-v6 is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [13:35:08] if we're paging for slave lag we should probably page for nsN checks too [13:35:53] akosiaris: your change was not merged on the puppetmasters [13:36:06] labs I see, so I'll merge [13:37:38] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [13:37:49] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [13:38:48] RECOVERY - Host radon is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [13:39:59] RECOVERY - Host ns0-v4 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [13:40:36] bblack: I spent about 30 seconds thinking about how to rename labs-ns2 and then that ns0 alert fired and I spent a minute or two thinking that somehow I had broken dns just by thinking about it. [13:41:18] it's still possible [13:41:33] (PS1) Faidon Liambotis: authdns: page on global alerts [puppet] - https://gerrit.wikimedia.org/r/279941 [13:41:45] radon behaved strangely on restart, it couldn't look up names via resolver for a while after boot... [13:42:02] it's still having some issues on that front [13:42:15] bblack: ^ [13:42:47] paravoid: I guess we don't still have the x-dc authdns IP failover at the router thing anymore? 
[13:43:02] (CR) Andrew Bogott: [C: 1] "heck yes" [puppet] - https://gerrit.wikimedia.org/r/279941 (owner: Faidon Liambotis) [13:43:04] (CR) BBlack: [C: 1] authdns: page on global alerts [puppet] - https://gerrit.wikimedia.org/r/279941 (owner: Faidon Liambotis) [13:43:16] (PS4) Ottomata: Increase log verbosity on reportupdater cron jobs [puppet] - https://gerrit.wikimedia.org/r/275508 (https://phabricator.wikimedia.org/T126058) (owner: Mforns) [13:43:51] (CR) Ottomata: [C: 2 V: 2] Increase log verbosity on reportupdater cron jobs [puppet] - https://gerrit.wikimedia.org/r/275508 (https://phabricator.wikimedia.org/T126058) (owner: Mforns) [13:44:04] bad otto :P [13:44:08] (PS2) Faidon Liambotis: authdns: page on global alerts [puppet] - https://gerrit.wikimedia.org/r/279941 [13:44:14] (CR) Faidon Liambotis: [C: 2] authdns: page on global alerts [puppet] - https://gerrit.wikimedia.org/r/279941 (owner: Faidon Liambotis) [13:44:23] wha? [13:44:39] kidding [13:44:52] ooooooooook oh about V 2? :) [13:44:55] commit merge race :) [13:44:58] ohhh [13:44:59] haha [13:46:25] Operations, Puppet, Beta-Cluster-Infrastructure, Traffic, Patch-For-Review: Fix puppet on deployment-cache* hosts in beta cluster - https://phabricator.wikimedia.org/T129270#2154574 (Ottomata) Thanks yall! [13:50:40] Operations, ops-eqiad, Traffic: investigate radon crash - https://phabricator.wikimedia.org/T131053#2154606 (BBlack) [13:51:32] (PS1) Alexandros Kosiaris: ores: fix puppet breakage in staging [puppet] - https://gerrit.wikimedia.org/r/279942 [13:55:34] (CR) Ottomata: [C: 1] "+1 from me!" 
[puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948) (owner: Mobrovac) [13:57:40] (PS1) Faidon Liambotis: nagios: amend check_bgp to emit CRITs for transits [puppet] - https://gerrit.wikimedia.org/r/279943 [14:00:35] (PS2) Alexandros Kosiaris: ores: fix puppet breakage in staging [puppet] - https://gerrit.wikimedia.org/r/279942 [14:00:37] (CR) Alexandros Kosiaris: [C: 2 V: 2] ores: fix puppet breakage in staging [puppet] - https://gerrit.wikimedia.org/r/279942 (owner: Alexandros Kosiaris) [14:02:30] Operations, Analytics, Patch-For-Review: nf_conntrack warnings for kafka hosts - https://phabricator.wikimedia.org/T131028#2154609 (Ottomata) Last week, ApiAction Kafka logging was turned on from MW. I'm not sure how much this would have increased the number of connections from MW. Is it possible th... [14:07:53] (CR) Ottomata: [C: 1] "+1, but before merge, could you add some detailed usage examples / docs on how to use the functions somewhere? 
In the comment headers of " [puppet] - https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: Mobrovac) [14:14:04] (PS1) Andrew Bogott: Stop using old labs-ns0 and labs-ns1, move ns2/ns3 to ns0/ns1 [puppet] - https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) [14:14:20] (CR) Andrew Bogott: [C: -2] "This needs to be carefully staged" [puppet] - https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) (owner: Andrew Bogott) [14:14:29] PROBLEM - NTP on radon is CRITICAL: NTP CRITICAL: Offset unknown [14:15:17] (PS1) Andrew Bogott: Rename labs-ns2/ns3 to labs-ns0/ns1, set aside old labs-ns0/ns1 [dns] - https://gerrit.wikimedia.org/r/279946 [14:15:36] (CR) Andrew Bogott: [C: -2] "to be merged during a migration window" [dns] - https://gerrit.wikimedia.org/r/279946 (owner: Andrew Bogott) [14:49:42] RECOVERY - NTP on radon is OK: NTP OK: Offset 0.05194473267 secs [14:53:04] (PS1) Ottomata: Prefix eventbus Kafka topics with datacenter name [puppet] - https://gerrit.wikimedia.org/r/279951 (https://phabricator.wikimedia.org/T130562) [14:55:15] (CR) jenkins-bot: [V: -1] Prefix eventbus Kafka topics with datacenter name [puppet] - https://gerrit.wikimedia.org/r/279951 (https://phabricator.wikimedia.org/T130562) (owner: Ottomata) [14:56:01] (PS2) Ottomata: Prefix eventbus Kafka topics with datacenter name [puppet] - https://gerrit.wikimedia.org/r/279951 (https://phabricator.wikimedia.org/T130562) [14:59:51] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [5000000.0] [15:00:04] anomie ostriches thcipriani marktraceur aude: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160328T1500). 
[15:00:04] andrewbogott: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:01:12] I can SWAT. andrewbogott around for SWAT? [15:03:15] thcipriani: I think he is having a logistics problem on his end for being online, I imagine he pops up in a few? [15:03:23] (PS8) Rush: shinken: Only regenerate configuration when there are changes [puppet] - https://gerrit.wikimedia.org/r/267423 (owner: Tim Landscheidt) [15:04:08] chasemp: cool. thanks for the info. looks like a simple patch, can get it out anytime during the window. [15:06:35] (CR) Rush: [C: 2] "seems good thanks Tim" [puppet] - https://gerrit.wikimedia.org/r/267423 (owner: Tim Landscheidt) [15:06:53] thcipriani: I'm back [15:06:55] (PS1) BBlack: tlsproxy: nginx security restrictions via systemd unit frag [puppet] - https://gerrit.wikimedia.org/r/279952 [15:07:06] andrewbogott: okie doke. [15:07:43] I don't understand what happened with that patch over the weekend but I think it's back to its original state [15:07:44] ? [15:08:16] (CR) jenkins-bot: [V: -1] tlsproxy: nginx security restrictions via systemd unit frag [puppet] - https://gerrit.wikimedia.org/r/279952 (owner: BBlack) [15:08:28] andrewbogott: yeah, looks like commit message stuff? 
[15:08:41] yeah, I know not why [15:08:51] (PS2) BBlack: tlsproxy: nginx security restrictions via systemd unit frag [puppet] - https://gerrit.wikimedia.org/r/279952 [15:09:52] (CR) jenkins-bot: [V: -1] tlsproxy: nginx security restrictions via systemd unit frag [puppet] - https://gerrit.wikimedia.org/r/279952 (owner: BBlack) [15:10:29] /usr/lib/ruby/vendor_ruby/puppet-lint/bin.rb:78:in `block in run': invalid option: --no-puppet_url_without_modules-check (OptionParser::InvalidOption) [15:10:32] ^ [15:10:54] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:12:05] (CR) BBlack: "recheck" [puppet] - https://gerrit.wikimedia.org/r/279952 (owner: BBlack) [15:12:35] maybe just integration-slave-trusty-1024 has issues but integration-slave-trusty-1006 doesn't [15:22:52] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [15:23:24] Operations, hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2154708 (Ottomata) Bump. [15:23:52] Operations, hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2154709 (Ottomata) Bump [15:24:08] Operations, Analytics, hardware-requests, Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2154710 (Ottomata) Bump [15:26:30] (PS2) Ottomata: Removed page_edit topic. 
[puppet] - https://gerrit.wikimedia.org/r/274805 (https://phabricator.wikimedia.org/T126220) (owner: Ppchelko) [15:31:40] !log thcipriani@tin Synchronized php-1.27.0-wmf.18/extensions/OpenStackManager/OpenStackManager.php: SWAT: Wikitech: Remove address, domain, proxy special pages [[gerrit:279569]] (duration: 00m 31s) [15:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:31:44] ^ andrewbogott check please [15:33:01] thcipriani: looks right to me. Thanks! [15:33:20] andrewbogott: awesome, thanks for checking! [15:34:01] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:48:32] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: puppet fail [15:57:11] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:02:36] Operations, Traffic, Mobile, Patch-For-Review: Investigate if login.m.wikimedia.org needs to stay around - https://phabricator.wikimedia.org/T123431#2154832 (csteipp) The only danger would be if MobileFrontend suddenly started hooking earlier in the login process, and redirected mobile users from l... [16:05:42] (PS5) Rush: hiera_lookup: recognize labs project and site [puppet] - https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092) (owner: Hashar) [16:07:43] Operations, Analytics-Kanban, Patch-For-Review: nf_conntrack warnings for kafka hosts - https://phabricator.wikimedia.org/T131028#2154847 (Nuria) [16:12:35] (CR) Mobrovac: [C: 1] ruthenium services: tweaks based on changes to Parsoid & testreduce [puppet] - https://gerrit.wikimedia.org/r/268141 (owner: Subramanya Sastry) [16:14:26] ori, mutante|away when you get a chance, could one of you take a look at https://gerrit.wikimedia.org/r/268141 ? 
[16:17:52] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:22:09] (CR) Mobrovac: Prefix eventbus Kafka topics with datacenter name (1 comment) [puppet] - https://gerrit.wikimedia.org/r/279951 (https://phabricator.wikimedia.org/T130562) (owner: Ottomata) [16:22:11] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 1 failures [16:24:35] (CR) Chad: [C: 1] Beta: Add mira to deployment_hosts [puppet] - https://gerrit.wikimedia.org/r/279392 (owner: Thcipriani) [16:25:27] (CR) Chad: [C: 1] Use repo_path instead of repo for deploy-local [puppet] - https://gerrit.wikimedia.org/r/275905 (owner: Thcipriani) [16:27:05] (PS2) Andrew Bogott: Rename labs-ns2/ns3 to labs-ns0/ns1, set aside old labs-ns0/ns1 [dns] - https://gerrit.wikimedia.org/r/279946 [16:27:32] (CR) jenkins-bot: [V: -1] Rename labs-ns2/ns3 to labs-ns0/ns1, set aside old labs-ns0/ns1 [dns] - https://gerrit.wikimedia.org/r/279946 (owner: Andrew Bogott) [16:27:45] (PS1) Faidon Liambotis: nagios: remove broken WHOIS support from check_bgp [puppet] - https://gerrit.wikimedia.org/r/279958 [16:27:47] (PS1) Faidon Liambotis: nagios: use SNMP v2c for check_bgp [puppet] - https://gerrit.wikimedia.org/r/279959 [16:27:49] (PS1) Faidon Liambotis: network: enable check_bgp for cr1-esams/cr2-knams [puppet] - https://gerrit.wikimedia.org/r/279960 [16:27:52] bblack: you'll love this :) [16:28:14] I removed an XXX of yours, for starters :) [16:28:22] with a one-liner no less :P [16:28:33] (that took me quite a while to figure out) [16:29:28] Operations, Analytics, hardware-requests, Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2154934 (RobH) a:RobH>mark [16:29:56] (PS3) Andrew Bogott: Rename labs-ns2/ns3 to labs-ns0/ns1, set aside old labs-ns0/ns1 [dns] - https://gerrit.wikimedia.org/r/279946 [16:30:08] (CR) 
Faidon Liambotis: [C: 2] nagios: remove broken WHOIS support from check_bgp [puppet] - https://gerrit.wikimedia.org/r/279958 (owner: Faidon Liambotis) [16:30:20] (CR) jenkins-bot: [V: -1] Rename labs-ns2/ns3 to labs-ns0/ns1, set aside old labs-ns0/ns1 [dns] - https://gerrit.wikimedia.org/r/279946 (owner: Andrew Bogott) [16:30:35] (CR) Faidon Liambotis: [C: 2] nagios: use SNMP v2c for check_bgp [puppet] - https://gerrit.wikimedia.org/r/279959 (owner: Faidon Liambotis) [16:31:06] Operations, Analytics, hardware-requests, Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#1970805 (RobH) I accidentally assigned in another person, and didn't assign the task to mark. Not sure how I did that, but it didn't reveal any private info. E... [16:31:18] (CR) Faidon Liambotis: [C: 2] network: enable check_bgp for cr1-esams/cr2-knams [puppet] - https://gerrit.wikimedia.org/r/279960 (owner: Faidon Liambotis) [16:32:59] (PS3) Ottomata: Prefix eventbus Kafka topics with datacenter name [puppet] - https://gerrit.wikimedia.org/r/279951 (https://phabricator.wikimedia.org/T130562) [16:33:01] (CR) Ottomata: Prefix eventbus Kafka topics with datacenter name (1 comment) [puppet] - https://gerrit.wikimedia.org/r/279951 (https://phabricator.wikimedia.org/T130562) (owner: Ottomata) [16:33:03] paravoid: awesome :) [16:36:49] SNMPv2c has only been around for 2 decades, I guess it's about time we upgraded anyways :) [16:43:02] PROBLEM - Apache HTTP on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [16:43:32] PROBLEM - HHVM rendering on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.020 second response time [16:44:00] (PS3) BBlack: tlsproxy: nginx security restrictions via systemd unit frag [puppet] - https://gerrit.wikimedia.org/r/279952 [16:44:47] (PS1) Yuvipanda: docker: 
Ensure that apt repository is present before installing [puppet] - https://gerrit.wikimedia.org/r/279963 [16:48:37] (CR) Yuvipanda: [C: 2] docker: Ensure that apt repository is present before installing [puppet] - https://gerrit.wikimedia.org/r/279963 (owner: Yuvipanda) [16:49:22] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:50:33] PROBLEM - Apache HTTP on mw1097 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50395 bytes in 0.760 second response time [16:52:02] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.065 second response time [16:54:21] (CR) Thcipriani: [C: 1] scap::target: Allow scap's user to restart all services on a node [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948) (owner: Mobrovac) [16:56:32] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0] [16:57:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [16:59:22] PROBLEM - Host radon is DOWN: PING CRITICAL - Packet loss = 100% [17:01:33] PROBLEM - Host ns0-v6 is DOWN: PING CRITICAL - Packet loss = 100% [17:01:33] PROBLEM - Host ns0-v4 is DOWN: PING CRITICAL - Packet loss = 100% [17:09:08] (PS2) Andrew Bogott: Stop using old labs-ns0 and labs-ns1, move ns2/ns3 to ns0/ns1 [puppet] - https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) [17:09:10] (PS1) Andrew Bogott: Temporarily remove horizin GUI for proxies and DNS [puppet] - https://gerrit.wikimedia.org/r/279967 [17:10:21] (PS2) Andrew Bogott: Temporarily remove horizon GUI for proxies and DNS [puppet] - https://gerrit.wikimedia.org/r/279967 [17:12:18] (PS4) Andrew Bogott: Rename labs-ns2/ns3 to labs-ns0/ns1, set aside old labs-ns0/ns1 
[dns] - https://gerrit.wikimedia.org/r/279946 [17:13:55] (CR) Andrew Bogott: [C: 2] Temporarily remove horizon GUI for proxies and DNS [puppet] - https://gerrit.wikimedia.org/r/279967 (owner: Andrew Bogott) [17:14:45] seems nameserver paging didn't work [17:18:03] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [17:18:14] !log rebooting radon again [17:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:18:32] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [17:20:22] RECOVERY - Host radon is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [17:20:48] !log cr1/2-eqiad: deactivating static routes for ns0/ns1 (ipv4/ipv6) pointing to radon [17:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:21:12] RECOVERY - Host ns0-v4 is UP: PING OK - Packet loss = 0%, RTA = 36.03 ms [17:21:21] RECOVERY - Host ns0-v6 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [17:25:57] Operations, ops-eqiad: db1067 degraded RAID - https://phabricator.wikimedia.org/T130517#2155135 (Cmjohnson) Open>Resolved Fixed [17:26:03] (PS1) Faidon Liambotis: nagios: rewrite check_bgp from scratch [puppet] - https://gerrit.wikimedia.org/r/279971 [17:27:27] Operations, ops-eqiad, Traffic: investigate radon crash - https://phabricator.wikimedia.org/T131053#2154594 (Cmjohnson) I see CPU heat warnings...could that be a cause or just a symptom? [17:32:26] Operations, ops-eqiad, Traffic: investigate radon crash - https://phabricator.wikimedia.org/T131053#2155164 (BBlack) note: it crashed a second time just now. we're looking into dealing with the hardware issue now. @faidon has re-routed the requests to baham in codfw (ns1), so from an external perspe... 
[17:33:54] (PS3) Andrew Bogott: Stop using old labs-ns0 and labs-ns1, move ns2/ns3 to ns0/ns1 [puppet] - https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) [17:33:56] (PS1) Andrew Bogott: Keystone totp: If the totp plugin is invoked, require 2fa to be enabled. [puppet] - https://gerrit.wikimedia.org/r/279972 [17:35:47] paravoid: hi there [17:36:02] PROBLEM - BGP status on cr2-esams is CRITICAL: CRITICAL: The requested table is empty or does not exist [17:36:25] some time ago i requested access to prod to do server side uploads [17:36:47] (PS1) Dzahn: mediawiki: remove python-mysqldb from maintenance servers [puppet] - https://gerrit.wikimedia.org/r/279973 (https://phabricator.wikimedia.org/T84075) [17:37:07] you and mark agreed i will not do harm, but it didn't move forward, is it likely to change anytime ? [17:37:12] (PS4) Andrew Bogott: Stop using old labs-ns0 and labs-ns1, move ns2/ns3 to ns0/ns1 [puppet] - https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) [17:38:53] (CR) Dzahn: [C: 2] "was only terbium, confirmed not used" [puppet] - https://gerrit.wikimedia.org/r/279973 (https://phabricator.wikimedia.org/T84075) (owner: Dzahn) [17:40:21] Operations, ops-requests, Patch-For-Review: remove (was: Install) python-mysqldb on terbium - https://phabricator.wikimedia.org/T84075#2155210 (Dzahn) @Legoktm Ok, thank you for confirming. I removed the package from the maintenance server role. [17:42:52] Operations, Traffic: Set up LVS for current AuthDNS - https://phabricator.wikimedia.org/T101525#2155228 (BBlack) Bump - T131053 reminds us that it's uncomfortable to lose even 1/3 authdns servers when hardware fails, as that places us within 1 more random hardware failure of "oh crap". We really should r... 
[17:44:47] !log manually fixed T129881 rename [17:44:48] T129881: Special:GlobalRenameProgress/GaryMoore - https://phabricator.wikimedia.org/T129881 [17:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:49] !log mw1152,mw2090,terbium apt-get remove python-mysqldb (T84075) [17:45:50] T84075: remove (was: Install) python-mysqldb on terbium - https://phabricator.wikimedia.org/T84075 [17:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:46:12] 6Operations, 10ops-requests, 13Patch-For-Review: remove (was: Install) python-mysqldb on terbium - https://phabricator.wikimedia.org/T84075#2155262 (10Dzahn) 5Open>3Resolved [17:46:14] 6Operations, 10ops-codfw, 6DC-Ops, 13Patch-For-Review: setup new mw maint host - wasat - https://phabricator.wikimedia.org/T129930#2155263 (10Dzahn) [17:48:05] (03PS1) 10Andrew Bogott: Explicitly set Horizon session lengths to two hours. [puppet] - 10https://gerrit.wikimedia.org/r/279975 (https://phabricator.wikimedia.org/T130621) [17:48:17] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review, 15User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2155280 (10Dzahn) @Dereckson This week the ops meeting is on Tuesday instead of Monday. So in about 24 hours this should be on the m... [17:50:32] (03CR) 10Andrew Bogott: [C: 032] Explicitly set Horizon session lengths to two hours. [puppet] - 10https://gerrit.wikimedia.org/r/279975 (https://phabricator.wikimedia.org/T130621) (owner: 10Andrew Bogott) [17:50:46] (03PS2) 10Dzahn: remove login.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/276385 (https://phabricator.wikimedia.org/T123431) [17:51:33] (03CR) 10Dzahn: "@Luke081515 yes, but besides that +1? 
the rebase was actually just clicking the rebase button" [dns] - 10https://gerrit.wikimedia.org/r/276385 (https://phabricator.wikimedia.org/T123431) (owner: 10Dzahn) [17:52:22] !log shutting down radon to re-apply thermal paste: [17:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:52:57] paravoid: for the record i am talking about https://phabricator.wikimedia.org/T106447 [17:52:59] (03PS3) 10Dzahn: remove login.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/276385 (https://phabricator.wikimedia.org/T123431) [17:56:09] (03PS4) 10Dzahn: remove login.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/276385 (https://phabricator.wikimedia.org/T123431) [17:58:30] (03CR) 10Nschaaf: "It looks like the jenkins build hasn't been successful for nearly 2 months." [puppet] - 10https://gerrit.wikimedia.org/r/278555 (owner: 10Sabya) [17:58:57] (03CR) 10Dzahn: [C: 032] "quote bblack from T111967 "I've watched one mobile cache for hours now, and the only single request to login.m.wikimedia.org was a GET / f" [dns] - 10https://gerrit.wikimedia.org/r/276385 (https://phabricator.wikimedia.org/T123431) (owner: 10Dzahn) [18:00:04] andrewbogott krenair: Respected human, time to deploy Labs DNS & Horizon switch-over (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160328T1800). Please do the needful. [18:00:52] PROBLEM - HHVM rendering on mw1140 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.012 second response time [18:00:56] mutante, could you take a look at https://gerrit.wikimedia.org/r/#/c/268141/ ? i cannot update parsoid on ruthenium till that puppet patch is merged. 
[18:01:02] (03CR) 10Andrew Bogott: [C: 032] Rename labs-ns2/ns3 to labs-ns0/ns1, set aside old labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/279946 (owner: 10Andrew Bogott) [18:02:11] PROBLEM - Apache HTTP on mw1140 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [18:02:36] subbu: yes, already saw it. i will get to it soon [18:02:51] thanks. [18:03:29] (03PS5) 10Andrew Bogott: Rename labs-ns2/ns3 to labs-ns0/ns1, set aside old labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/279946 [18:03:33] !log mw1140 restart hhvm [18:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:03:53] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.426 second response time [18:04:32] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 74518 bytes in 0.371 second response time [18:06:04] 6Operations, 10ops-eqiad, 10Traffic: investigate radon crash - https://phabricator.wikimedia.org/T131053#2155361 (10Cmjohnson) Not sure if it's the culprit yet but re-applied thermal paste to the cpu. It was dry and brittle. 
[18:09:12] (03CR) 10Andrew Bogott: [C: 032] Rename labs-ns2/ns3 to labs-ns0/ns1, set aside old labs-ns0/ns1 [dns] - 10https://gerrit.wikimedia.org/r/279946 (owner: 10Andrew Bogott) [18:10:22] (03CR) 10Andrew Bogott: [C: 032] Stop using old labs-ns0 and labs-ns1, move ns2/ns3 to ns0/ns1 [puppet] - 10https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) (owner: 10Andrew Bogott) [18:10:28] (03PS5) 10Andrew Bogott: Stop using old labs-ns0 and labs-ns1, move ns2/ns3 to ns0/ns1 [puppet] - 10https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) [18:20:51] PROBLEM - puppet last run on mw2083 is CRITICAL: CRITICAL: Puppet has 1 failures [18:22:31] 6Operations, 10Traffic, 7Mobile, 13Patch-For-Review: Investigate if login.m.wikimedia.org needs to stay around - https://phabricator.wikimedia.org/T123431#2155461 (10Dzahn) also see comment from bblack on: T111967#1622278 and i checked [oxygen:/srv/log/webrequest] $ jq .uri_host /srv/log/webrequest/sample... [18:23:53] 6Operations, 10Traffic, 7Mobile: Investigate if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#2155477 (10Dzahn) [18:23:55] 6Operations, 10Traffic, 7Mobile, 13Patch-For-Review: Investigate if login.m.wikimedia.org needs to stay around - https://phabricator.wikimedia.org/T123431#2155475 (10Dzahn) 5Open>3Resolved https://gerrit.wikimedia.org/r/#/c/276385/ merged [18:25:30] 6Operations, 10Traffic, 7HTTPS: Preload HSTS for select hostnames within wikimedia.org - https://phabricator.wikimedia.org/T111967#1621104 (10Dzahn) login.m.wikimedia.org has been removed now with T123431 / https://gerrit.wikimedia.org/r/#/c/276385/. 
[18:25:39] (03CR) 10Andrew Bogott: [C: 04-2] "On second glance, this should wait for the updates with markmonitor to happen, otherwise we'll get into an uncomfortable oscillating state" [puppet] - 10https://gerrit.wikimedia.org/r/279945 (https://phabricator.wikimedia.org/T131052) (owner: 10Andrew Bogott) [18:28:24] (03PS3) 10Dzahn: ruthenium services: tweaks based on changes to Parsoid & testreduce [puppet] - 10https://gerrit.wikimedia.org/r/268141 (owner: 10Subramanya Sastry) [18:29:27] 6Operations, 10Traffic, 7Mobile: Investigate if login.m.wikimedia.org needs to stay around - https://phabricator.wikimedia.org/T123431#2155508 (10Dzahn) [18:29:47] (03CR) 10Dzahn: [C: 032] ruthenium services: tweaks based on changes to Parsoid & testreduce [puppet] - 10https://gerrit.wikimedia.org/r/268141 (owner: 10Subramanya Sastry) [18:30:21] subbu|lunch: ^ so when you get back from lunch, go ahead on ruthenium [18:30:42] !log labs public DNS will remain in a transitional state until MarkMonitor updates their records and the TTL expires. DNS resolution should proceed as usual in the meantime. [18:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:34:39] andrewbogott: throw a notice in -labs irc banner? [18:35:27] mutante, thanks. [18:35:43] done [18:36:23] (03PS6) 10Chad: Update debian package for gerrit [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 [18:38:37] (03CR) 10Chad: "Using vanilla gerrit 2.12.2 from upstream. Does not bundle delete-project for simplicity, can be revisited later." 
[debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [18:40:20] (03CR) 10Paladox: [C: 031] Update debian package for gerrit [debs/gerrit] - 10https://gerrit.wikimedia.org/r/263631 (owner: 10Chad) [18:41:04] (03PS1) 10BBlack: authdns: allow authdns-update through ferm [puppet] - 10https://gerrit.wikimedia.org/r/279978 [18:42:10] (03CR) 10jenkins-bot: [V: 04-1] authdns: allow authdns-update through ferm [puppet] - 10https://gerrit.wikimedia.org/r/279978 (owner: 10BBlack) [18:46:02] (03PS2) 10BBlack: authdns: allow authdns-update through ferm [puppet] - 10https://gerrit.wikimedia.org/r/279978 [18:47:32] RECOVERY - puppet last run on mw2083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:50:49] (03CR) 10BBlack: [C: 032] authdns: allow authdns-update through ferm [puppet] - 10https://gerrit.wikimedia.org/r/279978 (owner: 10BBlack) [18:58:37] (03CR) 10CSteipp: [C: 031] Keystone totp: If the totp plugin is invoked, require 2fa to be enabled. [puppet] - 10https://gerrit.wikimedia.org/r/279972 (owner: 10Andrew Bogott) [19:00:18] (03PS1) 10BBlack: authdns+ferm: ipv6 rules for authdns-update too [puppet] - 10https://gerrit.wikimedia.org/r/279983 [19:00:45] (03CR) 10BBlack: [C: 032 V: 032] authdns+ferm: ipv6 rules for authdns-update too [puppet] - 10https://gerrit.wikimedia.org/r/279983 (owner: 10BBlack) [19:01:04] (03PS2) 10Andrew Bogott: Keystone totp: If the totp plugin is invoked, require 2fa to be enabled. [puppet] - 10https://gerrit.wikimedia.org/r/279972 [19:02:46] (03CR) 10Andrew Bogott: [C: 032] Keystone totp: If the totp plugin is invoked, require 2fa to be enabled. 
[puppet] - 10https://gerrit.wikimedia.org/r/279972 (owner: 10Andrew Bogott) [19:09:29] (03PS1) 10Andrew Bogott: Keystone totp: Backport liberty change to kilo [puppet] - 10https://gerrit.wikimedia.org/r/279985 [19:10:32] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 28168 bytes in 0.174 second response time [19:10:47] 6Operations, 6Performance-Team, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#2155649 (10Krinkle) @Aaron says we recently upgraded to HHVM 3.12 which presumably contains the fix for 3.11 as well. [19:11:13] (03CR) 10Andrew Bogott: [C: 032] Keystone totp: Backport liberty change to kilo [puppet] - 10https://gerrit.wikimedia.org/r/279985 (owner: 10Andrew Bogott) [19:11:35] 6Operations, 6Performance-Team, 10Wikimedia-General-or-Unknown: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1895546 (10Krinkle) a:3aaron [19:14:18] (03PS1) 10Subramanya Sastry: Fix syntax error from f1456a74 [puppet] - 10https://gerrit.wikimedia.org/r/279987 [19:15:27] mutante, sorry .. one more ^ i missed this during review from 2 weeks back during the dc switchover test. 
[19:18:54] (03PS1) 10BBlack: Import Upstream version 1.9.12 [software/nginx] (wmf-1.9.12-1) - 10https://gerrit.wikimedia.org/r/279989 [19:18:56] (03PS1) 10BBlack: multicert + libssl1.0.2 patches for 1.9.12 [software/nginx] (wmf-1.9.12-1) - 10https://gerrit.wikimedia.org/r/279990 [19:18:58] (03PS1) 10BBlack: nginx (1.9.12-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.9.12-1) - 10https://gerrit.wikimedia.org/r/279991 [19:23:41] (03PS1) 10Yuvipanda: docker: Install debootstrap in base image builder [puppet] - 10https://gerrit.wikimedia.org/r/279994 [19:24:05] (03PS2) 10Yuvipanda: docker: Install debootstrap in base image builder [puppet] - 10https://gerrit.wikimedia.org/r/279994 [19:24:07] (03PS2) 10Dzahn: Fix syntax error from f1456a74 [puppet] - 10https://gerrit.wikimedia.org/r/279987 (owner: 10Subramanya Sastry) [19:24:16] (03CR) 10Dzahn: [C: 032 V: 032] Fix syntax error from f1456a74 [puppet] - 10https://gerrit.wikimedia.org/r/279987 (owner: 10Subramanya Sastry) [19:24:34] subbu|lunch: done [19:25:40] (03PS2) 10Dzahn: mw:maintenance: add role to wasat [puppet] - 10https://gerrit.wikimedia.org/r/279659 (https://phabricator.wikimedia.org/T129930) [19:25:52] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0] [19:26:14] (03CR) 10Dzahn: [C: 032] mw:maintenance: add role to wasat [puppet] - 10https://gerrit.wikimedia.org/r/279659 (https://phabricator.wikimedia.org/T129930) (owner: 10Dzahn) [19:27:33] thanks [19:28:46] (03PS3) 10Yuvipanda: docker: Install debootstrap in base image builder [puppet] - 10https://gerrit.wikimedia.org/r/279994 [19:28:55] (03CR) 10Yuvipanda: [C: 032 V: 032] docker: Install debootstrap in base image builder [puppet] - 10https://gerrit.wikimedia.org/r/279994 (owner: 10Yuvipanda) [19:36:03] (03PS1) 10Rush: create-dbuser change how user grant files are created [puppet] - 10https://gerrit.wikimedia.org/r/279995 [19:38:15] (03CR) 10Yuvipanda: [C: 04-1] 
"for shell=True :) use subprocess.check_output instead?" [puppet] - 10https://gerrit.wikimedia.org/r/279995 (owner: 10Rush) [19:40:12] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [19:43:44] 6Operations, 10ops-eqiad, 10Traffic: investigate radon crash - https://phabricator.wikimedia.org/T131053#2155703 (10BBlack) Status update: radon's been back up for 1h40m now and is theoretically functional. Waiting a bit to see if it seems stable. ns0 traffic is still routed to baham. If for some reason w... [19:49:10] (03CR) 10Yuvipanda: "It would also fail since you aren't importing subprocess." [puppet] - 10https://gerrit.wikimedia.org/r/279995 (owner: 10Rush) [19:54:13] 6Operations, 6Discovery, 10MediaWiki-Vendor, 10Wikimedia-Logstash, and 3 others: Upgrade ruflin/elastica to 2.3.1 - https://phabricator.wikimedia.org/T127831#2055202 (10EBernhardson) a:5dcausse>3EBernhardson [19:54:58] 6Operations, 6Discovery, 10MediaWiki-Vendor, 10Wikimedia-Logstash, and 3 others: Upgrade ruflin/elastica to 2.3.1 - https://phabricator.wikimedia.org/T127831#2055202 (10EBernhardson) Tested with the above two patches and things look to be working, or at least it passes the browser test suite and I couldn't... [19:55:34] (03PS2) 10Rush: create-dbuser change how user grant files are created [puppet] - 10https://gerrit.wikimedia.org/r/279995 [19:58:35] (03CR) 10Tim Landscheidt: create-dbuser change how user grant files are created (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279995 (owner: 10Rush) [20:00:39] (03CR) 10Yuvipanda: create-dbuser change how user grant files are created (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279995 (owner: 10Rush) [20:00:39] 6Operations, 6Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2155759 (10BBlack) 1.9.12 porting/packaging was fairly trivial so far. 
Testing on cp1008/pinkunicorn to happen later today or tomorrow. [20:00:39] (03CR) 10Rush: [C: 031] create-dbuser change how user grant files are created (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279995 (owner: 10Rush) [20:00:50] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160328T2000). Please do the needful. [20:01:20] (03PS2) 10BBlack: Import Upstream version 1.9.12 [software/nginx] (wmf-1.9.12-1) - 10https://gerrit.wikimedia.org/r/279989 (https://phabricator.wikimedia.org/T96848) [20:01:22] (03PS2) 10BBlack: multicert + libssl1.0.2 patches for 1.9.12 [software/nginx] (wmf-1.9.12-1) - 10https://gerrit.wikimedia.org/r/279990 (https://phabricator.wikimedia.org/T96848) [20:01:24] (03PS2) 10BBlack: nginx (1.9.12-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.9.12-1) - 10https://gerrit.wikimedia.org/r/279991 (https://phabricator.wikimedia.org/T96848) [20:02:31] no parsoid deploy today [20:04:08] (03PS3) 10Rush: create-dbuser change how user grant files are created [puppet] - 10https://gerrit.wikimedia.org/r/279995 [20:04:32] (03CR) 10Yuvipanda: [C: 031] create-dbuser change how user grant files are created [puppet] - 10https://gerrit.wikimedia.org/r/279995 (owner: 10Rush) [20:05:40] (03PS4) 10Rush: create-dbuser change how user grant files are created [puppet] - 10https://gerrit.wikimedia.org/r/279995 [20:08:05] (03CR) 10Rush: [C: 032] create-dbuser change how user grant files are created [puppet] - 10https://gerrit.wikimedia.org/r/279995 (owner: 10Rush) [20:09:47] (03PS2) 10Ottomata: Drop wmf prefix for resource_change topic [puppet] - 10https://gerrit.wikimedia.org/r/279685 (owner: 10GWicke) [20:11:27] (03CR) 10Ottomata: [C: 032] Drop wmf prefix for resource_change topic [puppet] - 10https://gerrit.wikimedia.org/r/279685 (owner: 10GWicke) [20:11:38] (03PS3) 10Ottomata: Removed 
page_edit topic. [puppet] - 10https://gerrit.wikimedia.org/r/274805 (https://phabricator.wikimedia.org/T126220) (owner: 10Ppchelko) [20:11:53] (03CR) 10Ottomata: [C: 032 V: 032] Removed page_edit topic. [puppet] - 10https://gerrit.wikimedia.org/r/274805 (https://phabricator.wikimedia.org/T126220) (owner: 10Ppchelko) [20:21:18] (03PS1) 10Rush: create-dbusers notify service on script change [puppet] - 10https://gerrit.wikimedia.org/r/280001 [20:22:10] (03CR) 10Yuvipanda: [C: 031] create-dbusers notify service on script change [puppet] - 10https://gerrit.wikimedia.org/r/280001 (owner: 10Rush) [20:22:48] (03PS2) 10Rush: create-dbusers notify service on script change [puppet] - 10https://gerrit.wikimedia.org/r/280001 [20:23:12] (03PS4) 10Ottomata: Prefix eventbus Kafka topics with datacenter name [puppet] - 10https://gerrit.wikimedia.org/r/279951 (https://phabricator.wikimedia.org/T130562) [20:24:13] (03CR) 10Rush: [C: 032] create-dbusers notify service on script change [puppet] - 10https://gerrit.wikimedia.org/r/280001 (owner: 10Rush) [20:25:06] (03CR) 10Ottomata: [C: 032] Prefix eventbus Kafka topics with datacenter name [puppet] - 10https://gerrit.wikimedia.org/r/279951 (https://phabricator.wikimedia.org/T130562) (owner: 10Ottomata) [20:25:19] (03CR) 10Ottomata: "tested in mw-vagrant and beta labs" [puppet] - 10https://gerrit.wikimedia.org/r/279951 (https://phabricator.wikimedia.org/T130562) (owner: 10Ottomata) [20:25:51] (03PS1) 10Jforrester: Apply rate limit to edits for normal users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280002 [20:25:53] (03PS1) 10Jforrester: [WIP] Let Wikidata editors edit at a higher rate than on other wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280003 [20:32:41] (03CR) 10Jforrester: "Lydia," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280003 (owner: 10Jforrester) [20:35:19] !log starting mobileapps deployment [20:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master 
[20:37:43] !log restbase canary deploy of 78e6eab37c to restbase1005 [20:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:44:43] (03PS1) 10Dereckson: Logo for ast.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280057 (https://phabricator.wikimedia.org/T131027) [20:48:09] (03CR) 10MarcoAurelio: "Have you run optiPNG? :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280057 (https://phabricator.wikimedia.org/T131027) (owner: 10Dereckson) [20:48:48] (03CR) 10MarcoAurelio: "s/have you/did you" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280057 (https://phabricator.wikimedia.org/T131027) (owner: 10Dereckson) [20:49:14] (03CR) 10Dereckson: "/home/dereckson/dev/mediawiki/operations/mediawiki-config/static/images/project-logos ] optipng -o7 astwiktionary.png" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280057 (https://phabricator.wikimedia.org/T131027) (owner: 10Dereckson) [20:49:51] (03CR) 10Dereckson: "(Yes, I have.)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280057 (https://phabricator.wikimedia.org/T131027) (owner: 10Dereckson) [20:53:01] (03CR) 10MarcoAurelio: [C: 031] Logo for ast.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280057 (https://phabricator.wikimedia.org/T131027) (owner: 10Dereckson) [20:58:38] 6Operations, 6Discovery, 10MediaWiki-Vendor, 10Wikimedia-Logstash, and 3 others: Upgrade ruflin/elastica to 2.3.1 - https://phabricator.wikimedia.org/T127831#2155978 (10Reedy) Let me know if you want me to rebase/reapply https://gerrit.wikimedia.org/r/#/c/260159/ [21:04:00] !log mobileapps deployed 90cfdcd [21:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:04:26] (03Abandoned) 10Ottomata: Add DC named topics to event bus topic config [puppet] - 10https://gerrit.wikimedia.org/r/278752 (https://phabricator.wikimedia.org/T127718) (owner: 10Ottomata) [21:04:30] !log restbase start deploy of 78e6eab37c 
[21:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:04:35] (03PS3) 10Dzahn: mw:maintenance: add role to wasat [puppet] - 10https://gerrit.wikimedia.org/r/279659 (https://phabricator.wikimedia.org/T129930) [21:07:03] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 15.38% of data above the critical threshold [100000000.0] [21:12:37] !log restbase finised deploy of 78e6eab37c [21:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:13:12] (03Abandoned) 10Dzahn: move mv maintenance host from mw2090 to wasat [puppet] - 10https://gerrit.wikimedia.org/r/279444 (https://phabricator.wikimedia.org/T129930) (owner: 10Dzahn) [21:18:11] PROBLEM - DPKG on wasat is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:19:14] no, it's just installing things [21:19:53] RECOVERY - DPKG on wasat is OK: All packages OK [21:26:00] !log http/2 enabled on pinkunicorn.wikimedia.org for testing [21:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:41] !log http/2 enabled on pinkunicorn.wikimedia.org for testing - T96848 [21:26:42] T96848: Support HTTP/2 - https://phabricator.wikimedia.org/T96848 [21:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:34:08] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [21:36:43] (03PS1) 10Yuvipanda: docker: Fix format of auth config file [puppet] - 10https://gerrit.wikimedia.org/r/280084 [21:36:54] (03PS2) 10Yuvipanda: docker: Fix format of auth config file [puppet] - 10https://gerrit.wikimedia.org/r/280084 [21:36:58] PROBLEM - mediawiki-installation DSH group on wasat is CRITICAL: Host wasat is not in mediawiki-installation dsh group [21:37:07] bblack: Hm.. will pinkunicorn be a debug_backend? 
[21:37:11] Could be one way of testing it [21:37:29] Oh right, this is before that [21:37:37] Not sure we can parse X-Wikimedia-Debug at that level [21:39:19] ACKNOWLEDGEMENT - mediawiki-installation DSH group on wasat is CRITICAL: Host wasat is not in mediawiki-installation dsh group daniel_zahn https://phabricator.wikimedia.org/T129930 [21:41:45] mutante: hmm, where did it say it? [21:41:51] mutante: did that page people?! [21:42:04] yuvipanda: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=paws.wmflabs.org&service=PAWS+Main+page [21:42:08] I got paged, but that is the only thing that's supposed to happen. [21:42:23] yuvipanda: no, i'm looking at the web ui [21:42:26] for other things [21:42:31] and just see it there [21:42:33] aaaaah, ok :) It'll be back up shortly [21:42:48] 'k :) [21:43:11] mutante: thanks for noticing! [21:44:18] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Puppet has 2 failures [21:44:53] yw [21:45:08] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 72281 bytes in 0.335 second response time [21:45:18] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.048 second response time [21:46:08] RECOVERY - puppet last run on wasat is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [21:46:13] !log mw1119 restarted hhvm [21:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:47:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [5000000.0] [21:50:12] (03PS2) 10Dzahn: remove mw-maintenance, LDAP client from mw2090 [puppet] - 10https://gerrit.wikimedia.org/r/279680 [21:50:41] (03CR) 10Dzahn: [C: 032] remove mw-maintenance, LDAP client from mw2090 [puppet] - 10https://gerrit.wikimedia.org/r/279680 (owner: 10Dzahn) [21:51:32] Krinkle: pinkunicorn is just for TLS termination testing, it's not really valid for testing 
deeper stuff in the stack [21:51:44] bblack: k [21:51:53] bblack: I meant for testing a full wiki page load [21:51:58] and somehow forcing it to proxy through there [21:52:12] since it does serve from text caches and app servers [21:52:26] To measure performance accross browsers and more closely observe its behaviour. [21:52:30] well, you can fake that with DNS hacks, but it's not "supported" or automate-able [21:52:42] Implementing HTTP/2 isn't trivial and I wouldn't be surprised if there were some gotchas in how nginx does it. [21:52:42] and will behave a bit differently on cache hotness, too [21:53:05] with regards to how it prioritises the streams and how smart it is etc. [21:53:07] (as in hack your local client/DNS to think pinkunicorn's IP is the IP for whatever.wikipedia.org) [21:53:30] gotta run, back in a couple hours [22:03:47] !log mw2090 - disable puppet/cron, crons now on wasat [22:03:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:07:19] 6Operations, 10ops-codfw, 6DC-Ops, 13Patch-For-Review: setup new mw maint host - wasat - https://phabricator.wikimedia.org/T129930#2156233 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/279680 [22:08:42] (03PS1) 10Dzahn: mw:maintenance: also move mariadb maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/280089 (https://phabricator.wikimedia.org/T129930) [22:08:58] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:12:08] (03CR) 10Dzahn: [C: 032] mw:maintenance: also move mariadb maintenance class [puppet] - 10https://gerrit.wikimedia.org/r/280089 (https://phabricator.wikimedia.org/T129930) (owner: 10Dzahn) [22:14:32] (03PS2) 10Dzahn: admin: create shell account for dereckson [puppet] - 10https://gerrit.wikimedia.org/r/279565 (https://phabricator.wikimedia.org/T129365) [22:16:17] (03CR) 10Dzahn: "this creates cron jobs for tendril and percona-toolkit. 
also see the "# TODO: check if both of these are still needed"" [puppet] - 10https://gerrit.wikimedia.org/r/280089 (https://phabricator.wikimedia.org/T129930) (owner: 10Dzahn) [22:17:11] (03PS1) 10Ppchelko: Bump s-maxage for purged endpoints. [puppet] - 10https://gerrit.wikimedia.org/r/280091 [22:17:57] PROBLEM - puppet last run on mw1089 is CRITICAL: CRITICAL: Puppet has 1 failures [22:24:38] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:24:57] andrewbogott: ^? [22:31:32] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.077 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135 [22:33:33] (03PS1) 10Ottomata: Use eventbus topic config from mediawiki/event-schemas repo [puppet] - 10https://gerrit.wikimedia.org/r/280097 [22:43:31] RECOVERY - puppet last run on mw1089 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:44:10] (03PS1) 10ArielGlenn: hackdeploy pylint cleanup: invalid names, indentation, docstrings mostly [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280100 [22:44:12] (03PS1) 10ArielGlenn: prep-dumps-deploy full pylint and pep8 cleanup [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280101 [22:44:14] (03PS1) 10ArielGlenn: archiveloader full pylint and pep8 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280102 [22:44:16] (03PS1) 10ArielGlenn: fixup-interwikis: full pylint and pep8 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280103 [22:44:18] sorry for the spam [22:44:18] (03PS1) 10ArielGlenn: thumbDateAnalysis full pylint and pep8 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280104 [22:44:20] (03PS1) 10ArielGlenn: thumbFilesSizesCounts: full pylint and pep8 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280105 [22:44:22] (03PS1) 10ArielGlenn: thumbPxSize: full pylint and pep8 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280106 [22:44:24] (03PS1) 10ArielGlenn: misc tiny thumb scripts: 
pylint, pep8 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280107 [22:44:26] didnt' realize I had quite so many of these [22:44:26] (03PS1) 10ArielGlenn: pylint and pep8 for scripts related to media tarball creation [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280108 [22:44:28] (03PS1) 10ArielGlenn: toy offline reader: pylint and pep8 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280109 [22:44:30] (03PS1) 10ArielGlenn: pylint and pep8 for wikiqueries scripts, runphpscriptlet, rsyncmedia [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280110 [22:51:21] (03PS1) 10Catrope: Enable Echo footer notice on all English language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280115 (https://phabricator.wikimedia.org/T128937) [22:53:51] 6Operations, 10Continuous-Integration-Config, 10Dumps-Generation, 13Patch-For-Review, 7WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#2156425 (10ArielGlenn) [22:56:56] 6Operations, 10Continuous-Integration-Config, 10Dumps-Generation, 13Patch-For-Review, 7WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#2156435 (10ArielGlenn) I have some changesets in but I need to do testing against them first before merges. The... [22:58:26] (03CR) 10Paladox: "recheck" [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/280101 (owner: 10ArielGlenn) [23:00:04] RoanKattouw ostriches Krenair MaxSem: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160328T2300). Please do the needful. [23:00:04] bearND MatmaRex RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:25] hello. [23:00:41] hi. I'm a little busy but can probably handle swat if necessary [23:00:48] but RoanKattouw are you ok with doing this? 
[23:00:49] I can do it [23:00:55] ok [23:01:19] hi [23:02:05] Thanks [23:02:08] I'd normally do it but not today [23:02:58] (03PS1) 10Dzahn: admin: remove robla from ldap admins [puppet] - 10https://gerrit.wikimedia.org/r/280119 [23:03:38] okay, sent everything non-config to Zuul [23:04:46] ugh, more qunit failures [23:04:58] 6Operations, 10Dumps-Generation, 7HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2156496 (10ArielGlenn) [23:05:27] RoanKattouw, https://gerrit.wikimedia.org/r/#/c/280112/ blows up again [23:06:23] * RoanKattouw looks [23:06:29] (03PS1) 10Dzahn: admin: remove svn-group perms from ldap-admins [puppet] - 10https://gerrit.wikimedia.org/r/280123 [23:06:46] MaxSem: That's a bogus failure, and it's in the test job, not the gate-and-submit job [23:06:49] Hopefully it'll still merge [23:06:53] :P [23:06:55] If not, it just needs to be re-+2-ed [23:07:31] (03CR) 10Dzahn: [C: 032] admin: remove robla from ldap admins [puppet] - 10https://gerrit.wikimedia.org/r/280119 (owner: 10Dzahn) [23:09:18] (03CR) 10Dzahn: [C: 032] admin: remove svn-group perms from ldap-admins [puppet] - 10https://gerrit.wikimedia.org/r/280123 (owner: 10Dzahn) [23:09:55] (03CR) 10GWicke: [C: 031] Use eventbus topic config from mediawiki/event-schemas repo [puppet] - 10https://gerrit.wikimedia.org/r/280097 (owner: 10Ottomata) [23:16:24] MaxSem: It merged this time [23:16:29] :P [23:16:49] !log maxsem@tin Synchronized php-1.27.0-wmf.18/extensions/MobileApp/: https://gerrit.wikimedia.org/r/#q,279980,n,z (duration: 01m 07s) [23:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:00] bearND, ^ [23:17:25] MaxSem: thanks, purging cache [23:18:59] !log maxsem@tin Synchronized php-1.27.0-wmf.18/extensions/Echo/: https://gerrit.wikimedia.org/r/#/c/280112/ (duration: 00m 43s) [23:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:19:11] 
RoanKattouw, ^ [23:19:20] * MatmaRex away, brb [23:20:14] Krenair: RoanKattouw or MaxSem : Do you happen to know if the purge command needs to change again? I'm running it against enwiki domain but don't see the update. [23:20:49] what exact file are you trying to purge and what command are you running to do it? [23:21:25] (PS1) Dzahn: mediawiki: make mw2090 normal appserver again [puppet] - https://gerrit.wikimedia.org/r/280130 (https://phabricator.wikimedia.org/T129930) [23:21:43] Krenair: https://meta.wikimedia.org/static/current/extensions/MobileApp/config/android.json. I'm running: [23:21:44] echo "https://en.wikimedia.org/static/current/extensions/MobileApp/config/android.json" | mwscript purgeList.php --wiki=aawiki [23:22:06] on tin [23:22:21] how do I tell whether I see the new or old version? [23:22:31] the last value should be 25 instead of 2 [23:22:34] you don't need to supply --wiki to purgeList [23:22:43] I see 25 though [23:23:00] en.wikipedia.org bearND [23:23:01] Krenair: ugh, i see it now, too. Maybe it took a while [23:23:02] not en.wikimedia [23:23:26] Dereckson: thanks. I'll update the instructions [23:23:52] wait you really had en.wikimedia in the instructions? [23:24:02] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [23:24:03] copy&paste error [23:24:43] (CR) MaxSem: [C: 2] Enable Echo footer notice on all English language wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/280115 (https://phabricator.wikimedia.org/T128937) (owner: Catrope) [23:25:39] Krenair: so, like this? 
echo "https://en.wikipedia.org/static/current/extensions/MobileApp/config/android.json" | mwscript purgeList.php [23:25:56] (CR) Dzahn: [C: 2] mediawiki: make mw2090 normal appserver again [puppet] - https://gerrit.wikimedia.org/r/280130 (https://phabricator.wikimedia.org/T129930) (owner: Dzahn) [23:26:02] (Merged) jenkins-bot: Enable Echo footer notice on all English language wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/280115 (https://phabricator.wikimedia.org/T128937) (owner: Catrope) [23:27:33] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:27:35] bearND, looks right [23:27:47] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/280115/ (duration: 00m 35s) [23:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:28:00] RoanKattouw, ^ [23:28:10] Great Thank you, MaxSem, Krenair, Dereckson! [23:28:15] MaxSem: Thanks. Confirmed the other one works [23:30:01] Yup, that one works too [23:30:51] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [23:31:14] that was me hitting ctrl +c [23:32:41] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [23:33:26] * MatmaRex back [23:33:59] * MatmaRex pokes MaxSem [23:34:51] !log mw2090 - reboot, reinstall as regular appserver [23:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:37:31] PROBLEM - Host mw2090 is DOWN: PING CRITICAL - Packet loss = 100% [23:38:34] !log maxsem@tin Synchronized php-1.27.0-wmf.18/includes/api/ApiUpload.php: https://gerrit.wikimedia.org/r/#/c/280092/ (duration: 02m 26s) [23:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:48] MatmaRex, ^ [23:39:47] thanks MaxSem, it works [23:40:33] thus ends our SWAT [23:41:23] PROBLEM - puppet last run on ms-be3004 is 
CRITICAL: CRITICAL: puppet fail [23:41:54] Operations, Performance-Team, scap, HHVM, and 2 others: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#2156608 (greg) [23:43:02] RECOVERY - Host mw2090 is UP: PING OK - Packet loss = 0%, RTA = 36.37 ms [23:47:55] Operations, Traffic, Wikimedia-Fundraising, HTTPS, Patch-For-Review: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2156614 (CCogdill_WMF) @Dzahn I've confirmed it's fine on our side to delete all the subdomains you... [23:55:23] (CR) Reedy: [C: -1] "Needs expanding for the rest of the similar domains" [dns] - https://gerrit.wikimedia.org/r/278353 (https://phabricator.wikimedia.org/T130414) (owner: Dzahn) [23:59:20] MaxSem: i'd appreciate if you also merged https://gerrit.wikimedia.org/r/280083 in master :D (and its dependency)