[00:04:40] Operations, Labs, wikitech.wikimedia.org: decom old wikitech-static machine - https://phabricator.wikimedia.org/T129391#2152577 (Dzahn) I deleted the VM in rackspace web UI, it's gone.
[00:06:03] Operations, Labs, wikitech.wikimedia.org: decom old wikitech-static machine - https://phabricator.wikimedia.org/T129391#2152580 (Dzahn)
[00:06:05] Operations, Patch-For-Review, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2152579 (Dzahn)
[00:06:20] Operations, Labs, wikitech.wikimedia.org: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2152582 (Dzahn)
[00:06:22] Operations, Patch-For-Review, Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1931694 (Dzahn)
[00:06:24] Operations, Labs, wikitech.wikimedia.org: decom old wikitech-static machine - https://phabricator.wikimedia.org/T129391#2104237 (Dzahn) Open>Resolved
[00:07:33] have a nice weekend
[00:11:56] !log scb[12]00[12] - delete changeprop main.log per request
[00:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:12:30] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy
[00:12:48] RECOVERY - changeprop endpoints health on scb2002 is OK: All endpoints are healthy
[00:17:52] Blocked-on-Operations, Operations, EventBus, Services, and 3 others: New Service Request - Change Propagation - https://phabricator.wikimedia.org/T128463#2152606 (mobrovac) The service is now operational in production, but we are not able to reliably redeploy it due to {T130948}. This can be conside...
[01:58:28] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[02:00:28] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 3.62 ms
[02:01:33] Operations, Labs: Labtest designate giving out Forbidden exceptions when trying to list domains - https://phabricator.wikimedia.org/T130979#2152683 (AlexMonk-WMF)
[02:22:39] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.18) (duration: 10m 06s)
[02:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:13] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Mar 26 02:31:13 UTC 2016 (duration 8m 34s)
[02:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:59:19] why is "gerrit query" via ssh so slow?
[03:07:45] (CR) Tim Landscheidt: [C: -1] "| Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Failed to parse template ldap/open_ldap.erb:" [puppet] - https://gerrit.wikimedia.org/r/279682 (owner: Dzahn)
[03:19:46] Transferred: sent 3568, received 5504 bytes, in 708.9 seconds
[03:19:46] Bytes per second: sent 5.0, received 7.8
[03:19:59] Ran it again and got:
[03:20:00] Transferred: sent 3568, received 5496 bytes, in 0.3 seconds
[03:20:00] Bytes per second: sent 11577.9, received 17834.1
[03:59:28] Operations: Migrate argon to jessie - https://phabricator.wikimedia.org/T123729#1936574 (Krenair) Downtime on this system is rather problematic for anti-vandalism because it hosts the IRC RC feed. Could probably be replaced (temporarily?) by a VM - MW has been able to send that data to multiple systems for a...
[04:17:36] Operations, Traffic, HTTPS: irc.wikimedia.org talks HTTP but not HTTPS - https://phabricator.wikimedia.org/T130981#2152741 (Krenair)
[06:07:49] PROBLEM - puppet last run on elastic2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:38] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:49] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:59] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:19] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:29] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:10] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:10] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:49] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:48] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:59] RECOVERY - puppet last run on elastic2018 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:50:39] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[06:51:20] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[06:56:29] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:56:49] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:56:58] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:56:58] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:57:19] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:57:29] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:57:49] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:58:08] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:09] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:01:10] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:01:59] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:06:40] Operations, Continuous-Integration-Infrastructure, Labs, Packaging: Update phantomjs to 2.1.1 on trusty - https://phabricator.wikimedia.org/T130940#2152835 (Dereckson)
[08:06:57] Operations, Continuous-Integration-Infrastructure, Labs, Packaging: Update phantomjs to 2.1.1 on trusty - https://phabricator.wikimedia.org/T130940#2151512 (Dereckson) We would need a custom Debian package, as Ubuntu provides < Xenial 1.9
[08:49:17] Operations, DNS, Fundraising-Backlog, Traffic, fundraising-tech-ops: Updating DNS records for Major Gifts subdomain (benefactors.wikimedia.org) - https://phabricator.wikimedia.org/T130937#2152884 (Peachey88)
[08:55:40] PROBLEM - RAID on restbase2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:09:38] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[09:14:49] PROBLEM - puppet last run on mw2189 is CRITICAL: CRITICAL: puppet fail
[09:43:08] RECOVERY - puppet last run on mw2189 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:18:19] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0]
[10:18:59] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [5000000.0]
[10:26:38] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:28:59] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[10:33:19] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[10:39:52] (PS5) Alexandros Kosiaris: ores::base: Remove handling of /srv [puppet] - https://gerrit.wikimedia.org/r/278951
[10:41:33] (CR) jenkins-bot: [V: -1] ores::base: Remove handling of /srv [puppet] - https://gerrit.wikimedia.org/r/278951 (owner: Alexandros Kosiaris)
[10:53:19] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[10:53:44] (PS6) Alexandros Kosiaris: ores::base: Remove handling of /srv [puppet] - https://gerrit.wikimedia.org/r/278951
[11:02:09] (CR) Alexandros Kosiaris: [C: 2] ores::base: Remove handling of /srv [puppet] - https://gerrit.wikimedia.org/r/278951 (owner: Alexandros Kosiaris)
[11:07:14] (CR) Alexandros Kosiaris: [C: 2] hiera_lookup: support 'labs' realm [puppet] - https://gerrit.wikimedia.org/r/276345 (https://phabricator.wikimedia.org/T129092) (owner: Hashar)
[11:07:19] (PS4) Alexandros Kosiaris: hiera_lookup: support 'labs' realm [puppet] - https://gerrit.wikimedia.org/r/276345
(https://phabricator.wikimedia.org/T129092) (owner: Hashar)
[11:24:50] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[11:25:09] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[11:26:14] (PS1) Volans: Refactor of the CAs certificate genearation [puppet/mariadb] - https://gerrit.wikimedia.org/r/279694 (https://phabricator.wikimedia.org/T111654)
[11:30:39] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[11:32:09] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[12:05:52] (PS1) Alexandros Kosiaris: ores: Move the lb functionality into the role [puppet] - https://gerrit.wikimedia.org/r/279695
[12:08:16] (CR) Alexandros Kosiaris: [C: 2] ores: Move the lb functionality into the role [puppet] - https://gerrit.wikimedia.org/r/279695 (owner: Alexandros Kosiaris)
[13:07:45] (PS1) Alexandros Kosiaris: ores: Use str2bool the check presence of cache [puppet] - https://gerrit.wikimedia.org/r/279698
[13:09:30] (CR) Alexandros Kosiaris: [C: 2] ores: Use str2bool the check presence of cache [puppet] - https://gerrit.wikimedia.org/r/279698 (owner: Alexandros Kosiaris)
[13:26:56] !log Stopping Cassandra on restbase2004.codfw.wmnet
[13:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:28:31] https://upload.wikimedia.org/wikipedia/commons/8/85/The_Peanuts_Movie_Uploaded_by_Hossain_(Download_Group_BD).ogv <-- wtf is this still downloadable?
[13:28:37] please remove asap
[13:29:19] also this https://upload.wikimedia.org/wikipedia/commons/8/87/Gullivers_Travels_%281939%29.webm
[13:29:51] oh that's legit
[13:30:29] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[13:30:40] ^^^ on this.
[13:33:24] godog: can you look into why https://upload.wikimedia.org/wikipedia/commons/8/85/The_Peanuts_Movie_Uploaded_by_Hossain_(Download_Group_BD).ogv is still there when https://commons.wikimedia.org/wiki/File:The_Peanuts_Movie_Uploaded_by_Hossain_(Download_Group_BD).ogv was deleted? (and nuke it?)
[13:34:55] Vito: i don't think many people are around on saturday, and this is probably not severe enough to warrant pulling people in. if godog is not around, can you file a task?
[13:35:35] AaronSchulz: or maybe you're around and can kill that file? ^^^
[13:40:02] yeah I'll take a look
[13:46:12] so it appears to be cache pollution, Vito would you mind filing a task? I don't feel confident enough with varnish bans to do that on saturday with nobody around
[13:46:46] !log Cassandra on restbase2004.codfw.wmnet shut down, hardware failure; Down for the weekend : T130990
[13:46:47] T130990: restbase2004.codfw.wmnet: Failed disk/RAID - https://phabricator.wikimedia.org/T130990
[13:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:54:32] Operations, ops-codfw, RESTBase-Cassandra: restbase2004.codfw.wmnet: Failed disk/RAID - https://phabricator.wikimedia.org/T130990#2153009 (Krenair)
[13:57:09] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail
[14:00:28] Operations, Performance-Team, Traffic, Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2153017 (BBlack) On looking into the spdy/3 stats drop, my suspicion is something has changed with IE11 (that it has dropped SPDY/3 for http/[12]), but I still haven't found a good sou...
[14:01:17] Operations, Performance-Team, Traffic, Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2153018 (BBlack) The basic plan for moving forward is to start from debian's current 1.9.11 packaging, update it to 1.9.12 (because .12 upstreams our temporary fix for openssl "shutdow...
[14:04:40] yep godog
[14:08:32] Vito: I'm executing a ban for it now, will take a few more minutes
[14:09:53] ACKNOWLEDGEMENT - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused eevans Hardware failure Down for the duration.
[14:10:29] Vito: ban done, it should be gone from caches
[14:17:31] yeah 404s for me, thanks bblack !
[14:24:30] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[14:30:13] great bblack
[14:30:14] ty
[14:46:14] MatmaRex: btw there's should be a deeper meaning in my typo "undeleted" instead of "deleted"
[14:46:32] -s
[15:37:17] (CR) Tim Landscheidt: ores: Move the lb functionality into the role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/279695 (owner: Alexandros Kosiaris)
[15:43:00] Operations, Labs: Labtest designate giving out Forbidden exceptions when trying to list domains - https://phabricator.wikimedia.org/T130979#2152683 (Andrew) I see the same behavior you're seeing if I visit the DNS panel for a project that I'm a 'user' in but not a 'projectadmin' in. If I am a projectadmin...
[16:21:14] !log krenair@tin Synchronized php-1.27.0-wmf.18/extensions/SemanticForms: https://gerrit.wikimedia.org/r/#/c/279701/ (duration: 00m 32s)
[16:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:52:59] (PS1) Andrew Bogott: Designate config: Allow more than 20 records for v2 queries. [puppet] - https://gerrit.wikimedia.org/r/279714 (https://phabricator.wikimedia.org/T130976)
[16:54:06] Krenair: ^ fixes the deletion issue. No real way we could have caught this on labtest.
[16:54:25] Or, well, I guess if we'd imported a massive dataset we would've caught it. *shrug*
[16:54:49] (CR) Andrew Bogott: [C: 2] Designate config: Allow more than 20 records for v2 queries. [puppet] - https://gerrit.wikimedia.org/r/279714 (https://phabricator.wikimedia.org/T130976) (owner: Andrew Bogott)
[16:55:28] andrewbogott, how many domains do we have under wmflabs.org directly?
[16:57:23] less than 1000, but let me check...
[16:57:38] 507
[16:59:01] Krenair: what do you think, 1000 is enough room to grow?
[16:59:59] should give us 3-4 years
[17:00:13] I don't like arbitrary limits much though... do you guys add icinga checks for this sort of thing?
[17:04:56] (PS1) Mobrovac: scap::target: Allow scap's user to restart all services on a node [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948)
[17:18:19] Krenair: I can set it to 'None' which would be unlimited. There's vague denial-of-service risk for apis without size limits.
[17:18:29] There's not really a tradition of icinga checking such things, although it's not a bad idea
[17:21:55] yeah, I know they're necessary, I'm still not a huge fan
[17:23:33] I'd just prefer to not have to re-solve the same problem in a few years time without any documentation on the issue
[17:24:51] (PS2) Mobrovac: scap::target: Allow scap's user to restart all services on a node [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948)
[17:41:06] (PS3) Mobrovac: scap::target: Allow scap's user to restart all services on a node [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948)
[17:52:28] (PS4) Mobrovac: scap::target: Allow scap's user to restart all services on a node [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948)
[17:58:22] (CR) Mobrovac: "All known scap targets (iridium, aqs1002, kafka1001, scb1001) are happy with this change: https://puppet-compiler.wmflabs.org/2186/" [puppet] - https://gerrit.wikimedia.org/r/279717 (https://phabricator.wikimedia.org/T130948) (owner: Mobrovac)
[18:00:21] (CR) Tim Landscheidt: [C: -1] "The patch changes inter alia "role […], labs::dnsrecursor" to "role […], labs::dns::recursor", but does not split up role::labs::dnsrecurs" [puppet] - https://gerrit.wikimedia.org/r/271735 (owner: Dzahn)
[18:03:33] (CR) Mobrovac: "Waiting for Id13f35ec2cf4e32e4931ffdc9df69425d433aad8 to land before continuing work here." [puppet] - https://gerrit.wikimedia.org/r/279415 (owner: Mobrovac)
[18:35:09] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old.
[18:44:09] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old.
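Note on the Designate limit discussion (16:55 to 17:23): the idea of an icinga check for a hard record limit could be sketched roughly as below. This is a minimal illustration only, assuming the usual Nagios/Icinga exit-code convention (0 OK, 1 WARNING, 2 CRITICAL). The function name `check_record_limit` and the 80%/95% thresholds are hypothetical and not taken from the Wikimedia puppet repository, and fetching the live count from Designate is left out.

```python
# Hypothetical Nagios/Icinga-style threshold check. The name and the
# warn/crit ratios are illustrative assumptions, not an existing check.
OK, WARNING, CRITICAL = 0, 1, 2


def check_record_limit(count, limit, warn_ratio=0.8, crit_ratio=0.95):
    """Return (exit_code, message) for `count` records against a hard `limit`."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    usage = count / limit
    detail = f"{count}/{limit} records ({usage:.0%} of limit)"
    if usage >= crit_ratio:
        return CRITICAL, f"CRITICAL: {detail}"
    if usage >= warn_ratio:
        return WARNING, f"WARNING: {detail}"
    return OK, f"OK: {detail}"
```

With the figures mentioned in the channel, `check_record_limit(507, 1000)` stays in the OK range; a wrapper script would print the message and exit with the returned code so the monitoring system can pick it up.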
[19:50:26] (PS1) MarcoAurelio: Setting $wgMetaNamespaces for an.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/279724 (https://phabricator.wikimedia.org/T131006)
[20:18:11] (CR) Dereckson: [C: 1] "Looks technically good to me, alias not needed as Wikipedia is already provided for all the wikipedia dblist." [mediawiki-config] - https://gerrit.wikimedia.org/r/279724 (https://phabricator.wikimedia.org/T131006) (owner: MarcoAurelio)
[21:08:26] (CR) Luke081515: [C: 1] Setting $wgMetaNamespaces for an.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/279724 (https://phabricator.wikimedia.org/T131006) (owner: MarcoAurelio)
[21:11:26] !log removed 2FA from wikitech accounts that looked to be affected by T130892
[21:11:27] T130892: wikitech 2fa provisioning form does so without confirmation - https://phabricator.wikimedia.org/T130892
[21:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:32:04] (PS1) Andrew Bogott: Designate policy: Allow non-projectadmins to view and list domains and records. [puppet] - https://gerrit.wikimedia.org/r/279729
[22:33:40] Operations, Labs: Labtest designate giving out Forbidden exceptions when trying to list domains - https://phabricator.wikimedia.org/T130979#2153502 (Andrew) ok, actually preventing the tab from showing up for non-projectadmins turns out to be hard (the code doesn't do policy checks properly). And, anyway,...
[22:33:50] (PS2) Andrew Bogott: Designate policy: Allow non-projectadmins to view and list domains and records. [puppet] - https://gerrit.wikimedia.org/r/279729 (https://phabricator.wikimedia.org/T130979)
[22:42:59] (CR) Andrew Bogott: [C: 2] Designate policy: Allow non-projectadmins to view and list domains and records.
[puppet] - https://gerrit.wikimedia.org/r/279729 (https://phabricator.wikimedia.org/T130979) (owner: Andrew Bogott)
[22:54:45] (PS1) Andrew Bogott: Designate policy: redefine "admin_or_member" rule [puppet] - https://gerrit.wikimedia.org/r/279731
[22:55:43] (CR) Andrew Bogott: [C: 2] Designate policy: redefine "admin_or_member" rule [puppet] - https://gerrit.wikimedia.org/r/279731 (owner: Andrew Bogott)
[22:56:33] (PS1) Andrew Bogott: Designate: Backport liberty policy changes to kilo [puppet] - https://gerrit.wikimedia.org/r/279732
[22:57:26] (CR) jenkins-bot: [V: -1] Designate: Backport liberty policy changes to kilo [puppet] - https://gerrit.wikimedia.org/r/279732 (owner: Andrew Bogott)
[23:01:36] (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/279732 (owner: Andrew Bogott)
[23:02:41] (CR) Andrew Bogott: [C: 2] Designate: Backport liberty policy changes to kilo [puppet] - https://gerrit.wikimedia.org/r/279732 (owner: Andrew Bogott)
[23:36:24] (PS7) Tim Landscheidt: shinken: Only regenerate configuration when there are changes [puppet] - https://gerrit.wikimedia.org/r/267423
[23:57:58] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail