[00:00:04] started right when it got deployed
[00:00:18] jdlrobson: agreed, it doesn't make sense to me either, but I'm sure we'll figure it out
[00:01:07] but it sounds like it's not applying.. which at least gives me a little more confidence that when i do this again it might work
[00:01:41] anyway i gotta go. Feel free to revert the two relating to wmgRelatedArticlesFooterWhitelistedSkins if necessary
[00:11:14] PROBLEM - nova-compute process on labvirt1003 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[00:11:47] (PS1) 20after4: Paper over "Notice: Undefined variable: wmgRelatedArticlesFooterWhitelistedSkins" [mediawiki-config] - https://gerrit.wikimedia.org/r/350099 (https://phabricator.wikimedia.org/T162941)
[00:12:10] (CR) 20after4: [C: 2] Paper over "Notice: Undefined variable: wmgRelatedArticlesFooterWhitelistedSkins" [mediawiki-config] - https://gerrit.wikimedia.org/r/350099 (https://phabricator.wikimedia.org/T162941) (owner: 20after4)
[00:12:14] RECOVERY - nova-compute process on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[00:13:24] (Merged) jenkins-bot: Paper over "Notice: Undefined variable: wmgRelatedArticlesFooterWhitelistedSkins" [mediawiki-config] - https://gerrit.wikimedia.org/r/350099 (https://phabricator.wikimedia.org/T162941) (owner: 20after4)
[00:13:35] (CR) jenkins-bot: Paper over "Notice: Undefined variable: wmgRelatedArticlesFooterWhitelistedSkins" [mediawiki-config] - https://gerrit.wikimedia.org/r/350099 (https://phabricator.wikimedia.org/T162941) (owner: 20after4)
[00:14:21] twentyafterfour: the phd service on phab2001 needs to be stopped permanently still, right
[00:15:10] (we fixed the monitoring check to not create a false positive, but there's a second check that gets triggered, the general "Check systemd state")
[00:15:27] i guess we should turn "failed" into "disabled"
[00:15:30] mutante: for now, yeah
[00:15:34] "masked"
[00:16:00] twentyafterfour: ok, looking at a way to keep it stopped while it's still happy about the general "state"
[00:16:14] that generic icinga check that is
[00:16:20] !log twentyafterfour@naos Synchronized wmf-config/CommonSettings.php: fix "Notice: Undefined variable: wmgRelatedArticlesFooterWhitelistedSkins" (duration: 01m 11s)
[00:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:53] MaxSem: go ahead with your deploy, the other patch I had isn't passing CI and it's not worth the risk right now
[00:17:04] thanks twentyafterfour
[00:17:37] mutante: I'm not sure, maybe we should add a param to the service unit in puppet for services that shouldn't necessarily be running?
[00:17:45] a skip_checks parameter?
[00:18:05] if it doesn't already have something like that
[00:19:03] jdlrobson: I patched the error with https://gerrit.wikimedia.org/r/#/c/350099/, I'll leave it to you from here out
[00:19:19] hope your migraine goes away quickly
[00:20:52] (CR) 20after4: "oops I got the wrong task in the Bug: field..." [mediawiki-config] - https://gerrit.wikimedia.org/r/350099 (https://phabricator.wikimedia.org/T162941) (owner: 20after4)
[00:21:59] twentyafterfour: yea, that or fix the Icinga "systemd" check to be smart about it and skip some specific services
[00:22:05] i'll look into it
[00:22:26] maybe it'd be happy already if we mask the service
[00:25:51] !log restarted apache2 on iridium to tune rate limiting value
[00:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:03] Operations, ops-eqiad, netops, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3208453 (Cmjohnson) @ayounsi I would like to get an early start on this NLT than 0930 EST. Will that be possible?
Thanks
[00:33:20] Operations, Phabricator: Intermittent DB connectivity problem on phabricator, needs investigation - https://phabricator.wikimedia.org/T163507#3208455 (mmodell) @epriestley: Thanks for the very helpful and detailed response. I'd like to hear what @faidon thinks about all of that before I chime in too muc...
[00:38:11] !log ocg1001 - powercycle into installer, was sitting at partman step with "failure to read from sda"...
[00:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:39:54] PROBLEM - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100%
[00:40:33] ACKNOWLEDGEMENT - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn reinstall
[00:41:20] ACKNOWLEDGEMENT - Check systemd state on phab2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn phd service should not be running on passive server
[00:41:34] heads up, I'm going to deploy some security stuff
[00:41:36] Operations, Release-Engineering-Team, vm-requests, Security-General: New ganeti VM for MW release pipeline work - https://phabricator.wikimedia.org/T163743#3208457 (RobH) While it needs gerrit for basic cloning/fetch, it won't care if it is within its same datacenter, correct? If it is fetching...
[00:41:53] RainbowSprinkles: ok, i think that makes sense
[00:42:03] ive updated and ill spin up a vm tomorrow for it =]
[00:42:10] MaxSem: are you done with the thing you were deploying?
[00:42:14] RECOVERY - Host ocg1001 is UP: PING OK - Packet loss = 0%, RTA = 36.33 ms
[00:43:14] Operations, ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3208460 (Dzahn) @cmjohnson I attempted a reinstall but it consistently fails at the partitioning step with: ``` ┌─────────────┤ [!!] Partition disks ├─────────────┐ │...
[00:43:50] Operations, Release-Engineering-Team, vm-requests, Security-General: New ganeti VM for MW release pipeline work - https://phabricator.wikimedia.org/T163743#3208461 (demon) >>! In T163743#3208457, @RobH wrote: > While it needs gerrit for basic cloning/fetch, it won't care if it is within its same...
[00:45:15] PROBLEM - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100%
[00:48:23] bawolff, not deploying, go ahead
[00:48:31] thanks
[00:48:50] (still prodding my patches with a stick)
[00:50:24] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:52:14] RECOVERY - Host ocg1001 is UP: PING OK - Packet loss = 0%, RTA = 37.41 ms
[00:53:01] !log unconfirming emails associated with T163477
[00:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:28] Operations, ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3208480 (Dzahn) I changed the boot order in BIOS (port A was still first, switched to port B), did not change the error. Still "during read on /dev/sda" at partitioning step. Both identical drives are detected du...
[00:59:54] PROBLEM - Host ocg1001 is DOWN: PING CRITICAL - Packet loss = 100%
[01:02:14] RECOVERY - Host ocg1001 is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms
[01:02:18] bawolff, is the throne occupied?
[01:02:28] not quite yet
[01:02:45] my internet hiccuped there for about 30 seconds, so I'm still deploying stuff
[01:04:20] MaxSem: I'm done now
[01:08:49] Operations, ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3208499 (Dzahn) @cmjohnson Somehow the new /dev/sda also seems to be broken. Maybe it was used in something else before? Or it was this disk that was broken the whole time and we replaced the wrong one? I dunno...
[01:09:34] Operations, ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3208501 (Dzahn) a:Dzahn>Cmjohnson Any other disk to try? can we replace sda one more time?
[01:13:24] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[01:14:08] ^ simply ran puppet there
[01:14:48] (PS6) Dzahn: add netmon1002 to site [puppet] - https://gerrit.wikimedia.org/r/333780 (https://phabricator.wikimedia.org/T159756)
[01:15:17] (Abandoned) Krinkle: Interwiki map update [mediawiki-config] - https://gerrit.wikimedia.org/r/348893 (https://phabricator.wikimedia.org/T145337) (owner: Krinkle)
[01:17:17] (CR) Dzahn: [C: 2] add netmon1002 to site [puppet] - https://gerrit.wikimedia.org/r/333780 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn)
[01:20:18] (CR) Dzahn: "i was wondering why i could not find netmon1001.mgmt. Turns out there was a typo in this change. "netmont" with a "t" at the end." [dns] - https://gerrit.wikimedia.org/r/350009 (owner: Cmjohnson)
[01:22:33] (PS1) Dzahn: fix typo, "netmont1002.mgmt" -> "netmon1002.mgmt" [dns] - https://gerrit.wikimedia.org/r/350103 (https://phabricator.wikimedia.org/T159756)
[01:22:46] (CR) Dzahn: "https://gerrit.wikimedia.org/r/#/c/350103/" [dns] - https://gerrit.wikimedia.org/r/350009 (owner: Cmjohnson)
[01:23:36] (CR) Dzahn: [C: 2] fix typo, "netmont1002.mgmt" -> "netmon1002.mgmt" [dns] - https://gerrit.wikimedia.org/r/350103 (https://phabricator.wikimedia.org/T159756) (owner: Dzahn)
[01:36:29] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3208567 (Dzahn) @cmjohnson I fixed the typo above in mgmt DNS entry, but i still can't get on mgmt console after that. root@wmf7042.mgmt.eqiad.wmnet times out and ``` ssh root@netmon1002.mgmt.eq...
[01:36:57] Operations, Patch-For-Review: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3208568 (Dzahn) a:Dzahn>Cmjohnson
[01:42:15] !log Deployed security patches for T163166
[01:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:42:43] (PS1) Dzahn: repeat hostname for each record where missing in server list [dns] - https://gerrit.wikimedia.org/r/350104
[01:45:57] I'm off, see ya guys tomorrow have a good night! o/
[01:53:06] (PS1) Dzahn: site/icinga: unify einsteinium/tegmen in single node section [puppet] - https://gerrit.wikimedia.org/r/350107
[01:55:22] (PS2) Dzahn: site/icinga: unify einsteinium/tegmen in single node section [puppet] - https://gerrit.wikimedia.org/r/350107
[02:00:27] (PS3) Dzahn: yubiauth: convert to profile/role structure [puppet] - https://gerrit.wikimedia.org/r/345085
[02:00:42] (CR) jerkins-bot: [V: -1] yubiauth: convert to profile/role structure [puppet] - https://gerrit.wikimedia.org/r/345085 (owner: Dzahn)
[02:01:10] (CR) Dzahn: [C: 1] "was already compiled as no-op on iron in the past. removed "WIP" now." [puppet] - https://gerrit.wikimedia.org/r/345085 (owner: Dzahn)
[02:04:09] (PS4) Dzahn: yubiauth: convert to profile/role structure [puppet] - https://gerrit.wikimedia.org/r/345085
[02:04:34] (CR) Dzahn: [C: 1] "PS4: manual rebase for "role::backup" -> "profile::backup" change" [puppet] - https://gerrit.wikimedia.org/r/345085 (owner: Dzahn)
[02:06:55] (PS8) Dzahn: deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728
[02:07:08] (CR) jerkins-bot: [V: -1] deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728 (owner: Dzahn)
[02:09:44] (CR) Dzahn: "tried to compile, fails on all 3 but i believe all 3 are not actually related to this change but compiler issues. http://puppet-compiler."
[puppet] - https://gerrit.wikimedia.org/r/344728 (owner: Dzahn)
[02:11:08] (PS9) Dzahn: deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728
[02:14:08] (PS10) Dzahn: deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728
[02:14:46] I'm going to deploy a security thing shortly
[02:15:20] thanks for the heads-up
[02:16:36] (CR) Dzahn: [C: 1] "actually it _was_ related to this change. here, looks much better now: http://puppet-compiler.wmflabs.org/6219/" [puppet] - https://gerrit.wikimedia.org/r/344728 (owner: Dzahn)
[02:17:13] (CR) Dzahn: [C: 1] "(also amended and rebased to include "naos", just compiler doesn't know about it yet)" [puppet] - https://gerrit.wikimedia.org/r/344728 (owner: Dzahn)
[02:17:29] (PS11) Dzahn: deployment::server: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/344728
[02:17:37] by shortly I mean right now, as long as I'm not in anyone's way
[02:18:11] (CR) Dzahn: [C: 1] yubiauth: convert to profile/role structure [puppet] - https://gerrit.wikimedia.org/r/345085 (owner: Dzahn)
[02:20:22] (PS6) Dzahn: mediawiki::maintenance: convert to profile/role (WIP) [puppet] - https://gerrit.wikimedia.org/r/342777
[02:21:18] (CR) Dzahn: [C: -1] "http://puppet-compiler.wmflabs.org/6220/ not yet" [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[02:22:57] !log deployed patch for T163477
[02:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:57:24] I'm going to deploy another thing to add better logging to my previous thing to try and figure out why it doesn't work
[03:08:56] (PS7) Dzahn: mediawiki::maintenance: convert to profile/role (WIP) [puppet] - https://gerrit.wikimedia.org/r/342777
[03:23:29] (CR) Krinkle: Use EtcdConfig (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/347537
(https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[03:23:41] (CR) Krinkle: "(See also PS2 comment about labs hostname)" [mediawiki-config] - https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: Tim Starling)
[03:25:01] (PS1) Dzahn: remove barium.frack.eqiad incl. mgmt [dns] - https://gerrit.wikimedia.org/r/350113 (https://phabricator.wikimedia.org/T162952)
[03:26:54] (PS2) Dzahn: remove barium.frack.eqiad incl. mgmt [dns] - https://gerrit.wikimedia.org/r/350113 (https://phabricator.wikimedia.org/T162952)
[03:41:02] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops, Patch-For-Review: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3208700 (Dzahn) @Robh @Jgreen see DNS change above. in this case it removes both main IP and mgmt at once, realizing that normally we do it seperat...
[03:43:19] (PS8) Dzahn: mediawiki::maintenance: convert to profile/role (WIP) [puppet] - https://gerrit.wikimedia.org/r/342777
[03:49:53] (PS9) Dzahn: mediawiki::maintenance: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/342777
[03:51:01] (PS10) Dzahn: mediawiki::maintenance: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/342777
[03:51:15] (CR) Dzahn: [C: 1] mediawiki::maintenance: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[03:51:56] (CR) Dzahn: "@DBA let me know when is a good time for this, but no rush" [puppet] - https://gerrit.wikimedia.org/r/348565 (owner: Dzahn)
[03:52:55] (CR) Dzahn: "we should merge it after the other change to give more permission to phab stats user. anytime you guys have a moment for it..
i'll rebase " [puppet] - https://gerrit.wikimedia.org/r/348779 (owner: Dzahn)
[03:53:21] (PS4) Dzahn: mariadb: grant user 'phstats' additional select on differential db [puppet] - https://gerrit.wikimedia.org/r/348565
[03:53:52] (CR) Dzahn: [C: 1] mariadb: grant user 'phstats' additional select on differential db [puppet] - https://gerrit.wikimedia.org/r/348565 (owner: Dzahn)
[03:54:44] (CR) Dzahn: [C: 1] "+1, just needs to go after https://gerrit.wikimedia.org/r/348565 and rebase" [puppet] - https://gerrit.wikimedia.org/r/348779 (owner: Dzahn)
[03:55:57] (CR) Dzahn: [C: -1] "WIP" [puppet] - https://gerrit.wikimedia.org/r/345791 (https://phabricator.wikimedia.org/T134271) (owner: Dzahn)
[03:56:15] (CR) Dzahn: [C: -1] "WIP" [puppet] - https://gerrit.wikimedia.org/r/320698 (owner: Dzahn)
[04:02:01] (PS2) Dzahn: base::service_unit: add symlink from /etc into /var for systemd units [puppet] - https://gerrit.wikimedia.org/r/348665
[04:04:05] (CR) Dzahn: [C: 1] site/icinga: unify einsteinium/tegmen in single node section [puppet] - https://gerrit.wikimedia.org/r/350107 (owner: Dzahn)
[04:04:29] (CR) Dzahn: [C: 1] repeat hostname for each record where missing in server list [dns] - https://gerrit.wikimedia.org/r/350104 (owner: Dzahn)
[04:04:50] (CR) Dzahn: [C: 1] remove barium.frack.eqiad incl.
mgmt [dns] - https://gerrit.wikimedia.org/r/350113 (https://phabricator.wikimedia.org/T162952) (owner: Dzahn)
[04:05:33] and ../away zzzz
[04:10:54] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=954.50 Read Requests/Sec=561.40 Write Requests/Sec=0.60 KBytes Read/Sec=38258.80 KBytes_Written/Sec=10.40
[04:20:54] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.50 Read Requests/Sec=0.20 Write Requests/Sec=126.00 KBytes Read/Sec=1.20 KBytes_Written/Sec=707.20
[04:54:12] Wow, so js on the create account page, that gets triggered by keypress does a db query on every wiki in the cluster to see if the username is registered anywhere (If I'm reading this right). That seems kind of crazy
[05:09:14] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]
[05:09:24] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/2/1: down - Core: cr1-eqord:xe-0/0/0 (Telia, IC-314534, 24ms) {#10694} [10Gbps wave]
[05:18:14] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[05:22:14] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-314534, 29ms) {#11375} [10Gbps wave]
[05:23:14] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[05:41:03] !log Deploy alter table enwiki.revision on labsdb1009 and labsdb1010 - T132416
[05:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:41:12] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[05:44:42] (CR) Thiemo Mättig (WMDE): "I noticed similar problems with that aggressive SVG compressor before: https://phabricator.wikimedia.org/F7748897" [mediawiki-config] - https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: Ladsgroup)
[05:46:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[05:46:44] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0]
[05:52:26] (CR) Ladsgroup: "Per what has been discussed with Lydia (see https://phabricator.wikimedia.org/T142104#3205378) We want the db disk icon but we need other " [mediawiki-config] - https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: Ladsgroup)
[05:56:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[05:58:40] (PS1) Marostegui: db-eqiad.php: Depool db1071 and db1026 [mediawiki-config] - https://gerrit.wikimedia.org/r/350119 (https://phabricator.wikimedia.org/T163548)
[06:00:57] (PS2) Marostegui: db-eqiad.php: Depool db1071 and db1026 [mediawiki-config] - https://gerrit.wikimedia.org/r/350119 (https://phabricator.wikimedia.org/T163548)
[06:01:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:02:38] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1071 and db1026 [mediawiki-config] - https://gerrit.wikimedia.org/r/350119 (https://phabricator.wikimedia.org/T163548) (owner: Marostegui)
[06:04:20] (Merged) jenkins-bot: db-eqiad.php: Depool db1071 and db1026 [mediawiki-config] - https://gerrit.wikimedia.org/r/350119 (https://phabricator.wikimedia.org/T163548) (owner: Marostegui)
[06:05:41]
(CR) jenkins-bot: db-eqiad.php: Depool db1071 and db1026 [mediawiki-config] - https://gerrit.wikimedia.org/r/350119 (https://phabricator.wikimedia.org/T163548) (owner: Marostegui)
[06:06:12] !log marostegui@naos Synchronized wmf-config/db-eqiad.php: Repool db1071, depool db1026 - T162539 T163548 (duration: 01m 17s)
[06:06:20] (PS1) Marostegui: db-codfw.php: Restore db2061 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/350120
[06:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:21] T162539: Deploy schema change for adding term_full_entity_id column to wb_terms table - https://phabricator.wikimedia.org/T162539
[06:06:21] T163548: Drop the useless wb_terms keys "wb_terms_entity_type" and "wb_terms_type" on "wb_terms" table - https://phabricator.wikimedia.org/T163548
[06:08:08] (CR) Marostegui: [C: 2] db-codfw.php: Restore db2061 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/350120 (owner: Marostegui)
[06:09:21] (Merged) jenkins-bot: db-codfw.php: Restore db2061 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/350120 (owner: Marostegui)
[06:09:27] (CR) Thiemo Mättig (WMDE): [C: -1] "Now I'm entirely confused. The comment you are linking to says "we will use the Wikidata Logo […] for the installation in Wikimedia".
Isn'" [mediawiki-config] - https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: Ladsgroup)
[06:09:31] (CR) jenkins-bot: db-codfw.php: Restore db2061 original weight [mediawiki-config] - https://gerrit.wikimedia.org/r/350120 (owner: Marostegui)
[06:10:39] !log marostegui@naos Synchronized wmf-config/db-codfw.php: Restore db2061 original weight (duration: 00m 57s)
[06:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:34] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:13:44] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:21:56] (CR) Ladsgroup: "It seems I'm confused too but to my understanding what Lydia meant was that we use this logo (with Wikidata colors) for Wikimedia installa" [mediawiki-config] - https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: Ladsgroup)
[06:24:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:27:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:34:12] !log Deploy alter table on s3, all the wikis to the watchlist table on db1075, eqiad master - T130067
[06:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:21] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067
[06:35:02] the 503 spikes above are due to cp3033, (high mailbox lag). It grew too fast for our icinga check to catch it yet
[06:35:53] s/yet//, it recovered on its own
[06:40:44] PROBLEM - MariaDB Slave SQL: s3 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bawiktionary.watchlist doesnt exist on query.
Default database: bawiktionary. [Query snipped]
[06:41:00] That is "expected" I will fix it
[06:43:44] RECOVERY - MariaDB Slave SQL: s3 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[06:44:03] (CR) WMDE-leszek: "I was about to say how I interpret the linked comment, but let me check with the designer, and get back here to comment." [mediawiki-config] - https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: Ladsgroup)
[07:10:55] (CR) Ladsgroup: "I uncherry-picked this in beta. Let's see if it starts to work or there is something else going on." [puppet] - https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) (owner: Ladsgroup)
[07:11:45] (CR) Ladsgroup: "(I uncherry-picked another patch too) I think we need to cherry-pick this again to see if that patch was the cause or this one." [puppet] - https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) (owner: Ladsgroup)
[07:12:13] (PS2) Ladsgroup: Enable echo notification for wikibase clients in beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104)
[07:14:40] (CR) Ladsgroup: "I changed it to wikidata logo with one color (after some talks with Lydia) but since it should be 30px by 30px. It looks a little bit weir" [mediawiki-config] - https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: Ladsgroup)
[07:14:53] !log upgrade cp3033 varnish-be to varnish 4.1.5-1wm2, expiry thread lock/priority workaround T145661
[07:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:01] T145661: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661
[07:23:26] (CR) Thiemo Mättig (WMDE): [C: -1] "This is becoming a bit absurd.
Can we please first decide on an icon, and create it in the proper 30 by 30 pixels resolution, before uploa" [mediawiki-config] - https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: Ladsgroup)
[07:32:09] (CR) Ladsgroup: "I took your example and changed it 30px by 30px and then I changed the color to 36c given the color schema in WikimediaUI and color of oth" [mediawiki-config] - https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: Ladsgroup)
[07:33:23] (CR) Filippo Giunchedi: [C: 1] site/icinga: unify einsteinium/tegmen in single node section [puppet] - https://gerrit.wikimedia.org/r/350107 (owner: Dzahn)
[07:39:14] (CR) Muehlenhoff: site/icinga: unify einsteinium/tegmen in single node section (1 comment) [puppet] - https://gerrit.wikimedia.org/r/350107 (owner: Dzahn)
[07:41:33] Deploy alter table on s6, all the wikis to the watchlist table on db1050, eqiad master - https://phabricator.wikimedia.org/T130067
[07:46:21] Operations, ops-eqiad, netops, Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3208907 (ayounsi) @Cmjohnson unfortunately there is another maintenance scheduled to end at 10:00 EST (14:00 UTC), doing the maintenance after...
[07:47:28] Operations, Commons, Multimedia, media-storage, User-fgiunchedi: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#3208909 (fgiunchedi)
[07:48:35] (PS1) ArielGlenn: set up gitignore so we don't have an empty repo [dumps/import-tools] - https://gerrit.wikimedia.org/r/350121
[07:52:24] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[07:55:08] Operations, Release-Engineering-Team, vm-requests, Security-General: New ganeti VM for MW release pipeline work - https://phabricator.wikimedia.org/T163743#3207976 (fgiunchedi) >>! In T163743#3208393, @demon wrote: >>>! In T163743#3208387, @RobH wrote: >> I'd suggest we use the hostname jenkins-j...
[07:56:38] (PS2) ArielGlenn: set up .gitignore and tox.ini so we don't have an empty repo [dumps/import-tools] - https://gerrit.wikimedia.org/r/350121
[07:57:44] PROBLEM - MariaDB Slave Lag: s6 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 889.12 seconds
[07:59:23] ^ expected
[08:01:44] (PS1) Filippo Giunchedi: base: escape $MSG in run-puppet-msg [puppet] - https://gerrit.wikimedia.org/r/350124
[08:03:11] (PS1) Matthias Mullie: Turn $wg3dProcessor into an array of arguments [mediawiki-config] - https://gerrit.wikimedia.org/r/350125
[08:03:29] !log moving all slaves of s2 eqiad under db1054
[08:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:21] (CR) ArielGlenn: [V: 2 C: 2] set up .gitignore and tox.ini so we don't have an empty repo [dumps/import-tools] - https://gerrit.wikimedia.org/r/350121 (owner: ArielGlenn)
[08:07:06] Operations, Release-Engineering-Team, vm-requests, Security-General: New ganeti VM for MW release pipeline work - https://phabricator.wikimedia.org/T163743#3208931 (hashar) We have CI hosts like contint1001 / contint2001.
What about a generic name like: `contint1002.eqiad.wmnet` ? For network t...
[08:08:31] Operations, Traffic, media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3208932 (fgiunchedi) Open>Resolved a:fgiunchedi Resolving as the swift upgrade is complete and varnish bandaids have been r...
[08:15:28] marostegui: jynus: good morning. Do you mind if I deploy a mw hotfix this morning ?
[08:15:43] or are you guys in the middle of some database migration?
[08:15:51] hashar: let us know first, as we are doing some switchovers
[08:15:58] just to make sure we don't do it at the same time
[08:15:58] ah
[08:16:12] I will do during the SWAT window this afternoon so
[08:16:29] ah, then it should be fine :)=
[08:24:38] !log Stop MySQL db1041 (eqiad master) to reclone db1062 from it - T163665
[08:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:46] T163665: Reclone db1062 from db1041 (s7 master) - https://phabricator.wikimedia.org/T163665
[08:34:01] (PS1) Jcrespo: mariadb: Promote db1054 as the new s2 master on eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/350127 (https://phabricator.wikimedia.org/T162133)
[08:34:11] ^marostegui
[08:35:31] checking
[08:37:33] (CR) Marostegui: mariadb: Promote db1054 as the new s2 master on eqiad (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/350127 (https://phabricator.wikimedia.org/T162133) (owner: Jcrespo)
[08:38:57] (CR) Thiemo Mättig (WMDE): [C: -1] "What I see is a very much distorted version of the Wikidata logo, enforced to be a square (but the logo is not square), and also slightly " [mediawiki-config] - https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: Ladsgroup)
[08:39:01] (PS2) Jcrespo: mariadb: Promote db1054 as the new s2 master on eqiad [mediawiki-config] -
https://gerrit.wikimedia.org/r/350127 (https://phabricator.wikimedia.org/T162133)
[08:39:37] (CR) Marostegui: [C: 1] mariadb: Promote db1054 as the new s2 master on eqiad [mediawiki-config] - https://gerrit.wikimedia.org/r/350127 (https://phabricator.wikimedia.org/T162133) (owner: Jcrespo)
[08:42:05] Operations, Prometheus-metrics-monitoring, User-fgiunchedi: Improvements to Ganglia-equivalent Prometheus dashboards - https://phabricator.wikimedia.org/T152791#3208994 (fgiunchedi)
[08:47:44] (PS1) Jcrespo: mariadb: promote db1054 as the new s2 eqiad master [puppet] - https://gerrit.wikimedia.org/r/350130 (https://phabricator.wikimedia.org/T162133)
[08:49:28] (CR) Jcrespo: [C: 2] mariadb: promote db1054 as the new s2 eqiad master [puppet] - https://gerrit.wikimedia.org/r/350130 (https://phabricator.wikimedia.org/T162133) (owner: Jcrespo)
[08:52:39] !log Deploy alter table s4 commonswiki.watchlist directly on db1068 (eqiad master) - T130067
[08:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:47] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067
[08:53:24] (PS1) Muehlenhoff: Make generate-fancycaptcha logrotate config compatible with jessie [puppet] - https://gerrit.wikimedia.org/r/350131 (https://phabricator.wikimedia.org/T163555)
[08:53:49] !log restarting stopping replication on s2-eqiad and restarting db1054
[08:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:29] !log Stop replication on db1088 and db1093 in sync - T130067
[08:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:14] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1008, Errmsg: Error Cant drop database ptwikimedia: database doesnt exist on query. Default database: ptwikimedia.
[Query snipped] [08:58:42] I will fix that jynus ^ [08:58:44] RECOVERY - MariaDB Slave Lag: s6 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 188.02 seconds [08:59:10] yeah, that was expected [08:59:14] RECOVERY - MariaDB Slave SQL: s3 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:59:14] we connected it yesterday [08:59:16] yep [08:59:17] :) [08:59:20] thank you [08:59:35] will it happen on dbstore2001 ? [08:59:49] no, I think it was eqiad only [08:59:55] it didn't happen on dbstore2002, so maybe not [08:59:57] because it has been deleted from there [09:00:11] it just did [09:00:13] i will fix it [09:00:33] done [09:02:06] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/6223/" [puppet] - 10https://gerrit.wikimedia.org/r/350131 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [09:06:23] (03CR) 10Jcrespo: [C: 032] mariadb: Promote db1054 as the new s2 master on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350127 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [09:07:34] (03Merged) 10jenkins-bot: mariadb: Promote db1054 as the new s2 master on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350127 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [09:07:43] (03CR) 10jenkins-bot: mariadb: Promote db1054 as the new s2 master on eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350127 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [09:10:02] I think s2 is now done [09:10:25] (not counting semisync and events, that have to be checked everywhere) [09:10:43] 54 is in 0.29 [09:10:59] !log jynus@naos Synchronized wmf-config/db-eqiad.php: Promote db1054 as the new s2 master on eqiad (duration: 01m 19s) [09:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:36] (03PS1) 10Marostegui: s2.hosts: Move db1054 as the new master [software] - 10https://gerrit.wikimedia.org/r/350136 
(https://phabricator.wikimedia.org/T162133) [09:11:39] jynus ^ feel free to merge that when you want [09:12:13] we probably have to create a dblist for silver [09:12:33] yeah, also prometheus [09:12:34] indeed [09:12:48] (03CR) 10Jcrespo: [C: 032] s2.hosts: Move db1054 as the new master [software] - 10https://gerrit.wikimedia.org/r/350136 (https://phabricator.wikimedia.org/T162133) (owner: 10Marostegui) [09:17:08] (03PS1) 10Muehlenhoff: Make translationnotifications logrotate config compatible with jessie [puppet] - 10https://gerrit.wikimedia.org/r/350138 (https://phabricator.wikimedia.org/T163555) [09:17:25] (03CR) 10Jcrespo: "I am going to merge this based on how "stable" it is, I will be applying slowly to some host on the active datacenter, aiming for a a full" [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) (owner: 10Jcrespo) [09:17:32] (03CR) 10Jcrespo: [C: 032] Kill long running queries with stricter conditions [software] - 10https://gerrit.wikimedia.org/r/346559 (https://phabricator.wikimedia.org/T160984) (owner: 10Jcrespo) [09:20:51] jynus: we've summary for https://phabricator.wikimedia.org/T163344, Please have a look when you've time. [09:24:42] is that code new? [09:26:56] (03PS1) 10Addshore: Enable Cognate Logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350140 [09:28:05] jynus: https://gerrit.wikimedia.org/r/#/c/349214/ - yes. 
[09:28:09] the linked PHP code is not new, the JavaScript changes linked there are [09:28:24] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/6224/" [puppet] - 10https://gerrit.wikimedia.org/r/350138 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [09:29:00] then, apparently, that doesn't work [09:29:50] because inserts were blocked and deadlocks were high [09:30:16] I would ask Aaron for help [09:31:25] (03CR) 10Alexandros Kosiaris: [C: 031] site/icinga: unify einsteinium/tegmen in single node section [puppet] - 10https://gerrit.wikimedia.org/r/350107 (owner: 10Dzahn) [09:33:38] (03PS1) 10Jcrespo: Promote db1054 as s2 eqiad master [puppet] - 10https://gerrit.wikimedia.org/r/350143 (https://phabricator.wikimedia.org/T162133) [09:36:18] (03CR) 10Jcrespo: [C: 032] Promote db1054 as s2 eqiad master [puppet] - 10https://gerrit.wikimedia.org/r/350143 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [09:38:45] (03PS1) 10ArielGlenn: flake8 all the python scripts [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350145 [09:42:03] kart_, Nikerabbit I am not sure if you are waiting for me to tell you to enable cxtranslation? [09:43:04] I will never do that, because I have no say on that - I just disable things when broken [09:44:17] Hi. [09:44:38] jynus: kart_ initially wanted to swat that, hashar suggested yesterday to check that with you [09:44:58] my official answer is "I don't know" [09:45:07] I imagine the concern where "if we reenable it, perhaps watch a little bit how perfs behave would be a good idea" [09:45:12] were [09:45:16] who? 
[09:45:51] because I do not have time for that now [09:46:39] I am doing highly critical scheduled maintenance right now [09:46:49] (03PS1) 10Muehlenhoff: Make wikidata logrotate config compatible with jessie [puppet] - 10https://gerrit.wikimedia.org/r/350149 (https://phabricator.wikimedia.org/T163555) [09:46:54] !log Deploy alter table s2 on watchlist table directly on the master (db1054) - T130067 [09:46:57] that would be us, but that would be limited as we don't have 24h manning [09:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:01] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067 [09:47:21] Nikerabbit, and I do? :-) I suppose I do [09:47:56] when I have to get up at 4am because something is broken, yes, I get up [09:48:39] you were asked to fill in https://wikitech.wikimedia.org/wiki/Incident_documentation/Report_Template [09:48:46] and haven't even touched it [09:49:00] and do you expect me to be happy? 
[09:49:05] I have been working on the task [09:49:11] just added a summary to it [09:49:14] !log Stop replication in sync on db1091 and db1084 for maintenance - T130067 [09:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:04] https://wikitech.wikimedia.org/wiki/Incident_documentation/20170419-ContentTranslation [09:52:11] the task literally says "negotiating whether CX can be enabled before it is fully understood" [09:52:24] (03PS3) 10Ladsgroup: Enable echo notification for wikibase clients in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) [09:52:39] basically my question is: based on the things I found, there is an issue, the cause is still unknown, but it seems to be very rare when not switching datacenters, and we changed the front-end to mitigate the effects, we believe we can re-enable CX if we monitor tendril closely (but not 24/7) while we continue investigating the cause. Do you think this is too risky? [09:53:05] Nikerabbit, you are aware we are going to be switching datacenters in 1 week, right? [09:53:16] yes [09:53:17] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/6225/" [puppet] - 10https://gerrit.wikimedia.org/r/350149 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [09:53:38] if not figured out by then, we can pre-emptively disable CX [09:55:50] fill in the incident report, that is my prerequisite [09:56:26] Nikerabbit: the problem is that both jynus and myself are now 100% focused on some critical maintenance that was scheduled so we wouldn't have much time to monitor it and react to it as fast as we'd normally do for this week [09:57:22] are you running alter on s2? 
[09:57:26] yes [09:57:36] ok, I thought replication had broken [09:57:40] :-) [09:57:47] yes, it broke for db1090, because it has the schema already there from before [09:57:57] i am skipping it and once it is all done will reimport that table :) [09:58:04] no I meant globally [09:58:10] ah hehe [09:58:11] no no [09:58:12] all good [10:03:16] (03CR) 10Hashar: "Why isn't it served by the ResourceLoader? IIRC it could then be embedded in CSS as a data segment and get stored on the client local sto" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: 10Ladsgroup) [10:06:02] (03PS1) 10Jcrespo: Change db1061 to be the s6 master on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/350155 (https://phabricator.wikimedia.org/T162133) [10:08:58] (03CR) 10Marostegui: [C: 031] Change db1061 to be the s6 master on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/350155 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [10:09:09] (03CR) 10Jcrespo: [C: 032] Change db1061 to be the s6 master on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/350155 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [10:10:52] (03CR) 10Volans: base: escape $MSG in run-puppet-msg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350124 (owner: 10Filippo Giunchedi) [10:12:51] !log moving all slaves of s6 eqiad under db1061 [10:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:37] 06Operations, 10ops-eqiad, 10netops, 13Patch-For-Review: Rack and setup new eqiad row D switch stack (EX4300/QFX5100) - https://phabricator.wikimedia.org/T148506#3209149 (10elukey) >>! In T148506#3205842, @ayounsi wrote: > **Days before** > Move kafka1020 to row B T163002 Note about this move: today I wil... 
[10:15:18] !log restarting db1061's mysql process [10:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:20] (03PS2) 10ArielGlenn: flake8 all the python scripts [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350145 [10:19:12] (03PS2) 10Filippo Giunchedi: base: output MSG in run-puppet-agent [puppet] - 10https://gerrit.wikimedia.org/r/350124 [10:20:55] Nikerabbit: so if you fill the incident report with the timeline and the basic information you already have, note in the report you suggest to disable it while switching dc, I'm okay to reenable it at next SWAT [10:21:50] (with the understanding it will be monitored a little bit by your team) [10:25:08] (03CR) 10Ladsgroup: "All Echo icons are in their repository and as svg files: https://github.com/wikimedia/mediawiki-extensions-Echo/tree/master/modules/icons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) (owner: 10Ladsgroup) [10:29:37] 06Operations, 10media-storage, 15User-fgiunchedi: Consider storage policies for swift - https://phabricator.wikimedia.org/T151648#3209219 (10fgiunchedi) [10:34:08] db1022 errors out when doing a gtid slave change [10:34:17] which error? [10:34:35] complains about a binlog purged [10:34:41] :| [10:34:52] which means maybe some edit was done on db1061 [10:34:58] with binlog on [10:35:07] and the slave wants to recreate it [10:36:43] what if we try a reset master on db1061 if it still has no slaves attached to it? 
[10:37:02] no, it has dbstore1001 [10:37:05] ah [10:37:07] I will do it old way [10:37:10] ok [10:37:15] let me know if you need help [10:39:03] !log Stop replication in sync on db1090 and db1076 for maintenance - https://phabricator.wikimedia.org/T130067 [10:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:33] (03PS3) 10ArielGlenn: flake8 all the python scripts [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350145 [10:41:23] (03PS1) 10Jcrespo: mariadb: Depool db1022, promote db1061 as the s6 eqiad master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350164 (https://phabricator.wikimedia.org/T162133) [10:42:20] (03CR) 10Marostegui: mariadb: Depool db1022, promote db1061 as the s6 eqiad master (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350164 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [10:43:21] (03PS2) 10Jcrespo: mariadb: Depool db1022, promote db1061 as the s6 eqiad master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350164 (https://phabricator.wikimedia.org/T162133) [10:43:37] see? that is why I really need your reviews [10:43:48] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1022, promote db1061 as the s6 eqiad master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350164 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [10:44:04] just minor things! 
[10:45:31] !log stopping replication on db1050 [10:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:57] 06Operations, 10ops-eqiad: HP RAID icinga alert on ms-be1021 - https://phabricator.wikimedia.org/T163777#3209252 (10fgiunchedi) [10:50:26] 06Operations, 10ops-eqiad, 15User-fgiunchedi: HP RAID icinga alert on ms-be1021 - https://phabricator.wikimedia.org/T163777#3209264 (10fgiunchedi) [10:51:12] (03CR) 10ArielGlenn: [C: 032] flake8 all the python scripts [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350145 (owner: 10ArielGlenn) [11:01:19] !log switching eqiad s6 master to db1061 [11:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:28] (03CR) 10Volans: Use EtcdConfig (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347537 (https://phabricator.wikimedia.org/T156924) (owner: 10Tim Starling) [11:14:07] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1022, promote db1061 as the s6 eqiad master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350164 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [11:15:11] (03Merged) 10jenkins-bot: mariadb: Depool db1022, promote db1061 as the s6 eqiad master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350164 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [11:15:50] (03PS1) 10Marostegui: s6.host: db1061 is the new master [software] - 10https://gerrit.wikimedia.org/r/350168 (https://phabricator.wikimedia.org/T162133) [11:15:52] (03CR) 10jenkins-bot: mariadb: Depool db1022, promote db1061 as the s6 eqiad master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350164 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [11:26:20] (03CR) 10Jcrespo: [C: 032] s6.host: db1061 is the new master [software] - 10https://gerrit.wikimedia.org/r/350168 (https://phabricator.wikimedia.org/T162133) (owner: 10Marostegui) [11:27:56] !log Deploy alter table s1 on watchlist table directly 
on the master (db1052) - https://phabricator.wikimedia.org/T130067 [11:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:13] isn't it nice to just do that ? [11:28:19] haha [11:28:21] !log jynus@naos Synchronized wmf-config/db-eqiad.php: Depool db1022, promote db1061 as the s6 eqiad master (duration: 01m 17s) [11:28:21] it is [11:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:36] we could do it with gtid_domain_id + parallel, we just don't trust it yet :) [11:28:46] not this one [11:29:01] and not most on revision, page, image [11:29:55] (03PS1) 10Jcrespo: Change db1061 to be the s6 master on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/350171 (https://phabricator.wikimedia.org/T162133) [11:31:51] (03PS2) 10Jcrespo: prometheus-mysqld-exporter: Change db1061 to be the s6 master on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/350171 (https://phabricator.wikimedia.org/T162133) [11:37:39] (03PS1) 10Muehlenhoff: Blacklist macsec kernel module [puppet] - 10https://gerrit.wikimedia.org/r/350172 [11:40:34] PROBLEM - MariaDB Slave Lag: s1 on db1047 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 617.20 seconds [11:40:43] ^ expected I will silence it [11:42:45] (03CR) 10Gehel: [C: 04-1] "minor comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350149 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [11:43:28] (03CR) 10Gehel: [C: 04-1] Make translationnotifications logrotate config compatible with jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350138 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [11:43:54] PROBLEM - MariaDB Slave SQL: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1007, Errmsg: Error Cant create database ptwikimedia: database exists on query. Default database: ptwikimedia. 
[Query snipped] [11:44:03] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/350131 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [11:44:04] ^ expected, will fix it now [11:44:54] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [11:46:58] !log Deploy alter table s5 on watchlist table directly on the master (db1049) - https://phabricator.wikimedia.org/T130067 [11:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:36] 06Operations, 10Traffic: Set up LVS for current AuthDNS - https://phabricator.wikimedia.org/T101525#3209358 (10ayounsi) [11:54:13] (03Abandoned) 10BBlack: build: remove --with-ipv6 (removed upstream) [software/nginx] (wmf-1.11.10) - 10https://gerrit.wikimedia.org/r/348590 (owner: 10BBlack) [11:54:17] (03Abandoned) 10BBlack: Lua module: OpenSSL-1.1 compat fixup [software/nginx] (wmf-1.11.10) - 10https://gerrit.wikimedia.org/r/348588 (owner: 10BBlack) [11:57:37] !log banning elasticsearch row D node in preparation for maintenance [11:57:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:06] (03PS2) 10BBlack: debian patch: main source to nginx-1.11.13 [software/nginx] (wmf-1.11.10) - 10https://gerrit.wikimedia.org/r/348585 [12:08:08] (03PS2) 10BBlack: debian patches: forward-port WMF patches and quilt refresh [software/nginx] (wmf-1.11.10) - 10https://gerrit.wikimedia.org/r/348586 [12:08:10] (03PS2) 10BBlack: control: depend on libssl11-dev [software/nginx] (wmf-1.11.10) - 10https://gerrit.wikimedia.org/r/348587 [12:08:12] (03PS2) 10BBlack: Create nginx-{full,light,extras}-dbg by hand. 
[software/nginx] (wmf-1.11.10) - 10https://gerrit.wikimedia.org/r/348589 [12:08:14] (03PS2) 10BBlack: nginx (1.11.10-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.10) - 10https://gerrit.wikimedia.org/r/348591 [12:08:16] (03PS1) 10BBlack: Add nginx-echo 1.11.x fixup patch [software/nginx] (wmf-1.11.10) - 10https://gerrit.wikimedia.org/r/350177 [12:08:18] (03PS1) 10BBlack: nginx lua module fixups [software/nginx] (wmf-1.11.10) - 10https://gerrit.wikimedia.org/r/350178 [12:08:46] (03CR) 10Muehlenhoff: Make translationnotifications logrotate config compatible with jessie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350138 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [12:11:47] (03PS4) 10Ladsgroup: Enable echo notification for wikibase clients in beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349984 (https://phabricator.wikimedia.org/T142104) [12:11:54] PROBLEM - MariaDB Slave SQL: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1007, Errmsg: Error Cant create database dtywiki: database exists on query. Default database: dtywiki. [Query snipped] [12:12:04] (03PS2) 10Muehlenhoff: Make translationnotifications logrotate config compatible with jessie [puppet] - 10https://gerrit.wikimedia.org/r/350138 (https://phabricator.wikimedia.org/T163555) [12:14:13] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/350138 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [12:14:16] I am fixing dbstore2001 [12:15:36] mmm [12:15:41] where is that coming from? [12:15:52] https://phabricator.wikimedia.org/T161529 hoo are you guys creating dtywiki? 
[12:18:38] ah, it was in dbstore2001 already because of x1… [12:18:54] RECOVERY - MariaDB Slave SQL: s3 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [12:19:02] next time you should try to use IF NOT EXISTS [12:19:11] so we do not break replication if it is there already [12:19:21] (03PS1) 10ArielGlenn: pylint all the things: get rid of camelcase [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350180 [12:20:39] (03PS2) 10Muehlenhoff: Make wikidata logrotate config compatible with jessie [puppet] - 10https://gerrit.wikimedia.org/r/350149 (https://phabricator.wikimedia.org/T163555) [12:24:37] 06Operations, 10Cassandra, 10Mobile-Content-Service, 06Reading-Infrastructure-Team-Backlog, 06Services: mobileapps 500s following reboot of restbase1007 - https://phabricator.wikimedia.org/T138314#3209457 (10NHarateh_WMF) [12:24:39] 06Operations, 10Mobile-Content-Service, 10ORES, 06Reading-Infrastructure-Team-Backlog, and 2 others: Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#3209453 (10NHarateh_WMF) [12:24:45] 06Operations, 10Mobile-Content-Service, 06Parsing-Team, 06Reading-Infrastructure-Team-Backlog, and 4 others: Create functional cluster checks for all services (and have them page!) 
- https://phabricator.wikimedia.org/T134551#3209459 (10NHarateh_WMF) [12:24:49] 06Operations, 10Mobile-Content-Service, 10RESTBase, 06Reading-Infrastructure-Team-Backlog, and 3 others: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#3209461 (10NHarateh_WMF) [12:29:41] (03CR) 10Faidon Liambotis: [C: 032] Blacklist macsec kernel module [puppet] - 10https://gerrit.wikimedia.org/r/350172 (owner: 10Muehlenhoff) [12:29:51] (03PS2) 10Muehlenhoff: Make generate-fancycaptcha logrotate config compatible with jessie [puppet] - 10https://gerrit.wikimedia.org/r/350131 (https://phabricator.wikimedia.org/T163555) [12:35:52] !log Stop replication in sync on db1092 and db1087 for maintenance - https://phabricator.wikimedia.org/T130067 [12:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:42] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/350149 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [12:42:34] 06Operations, 06Reading-Infrastructure-Team, 06Reading-Infrastructure-Team-Backlog, 07Security-General, 06Services (next): Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#3209636 (10NHarateh_WMF) [12:42:51] 06Operations, 06Performance-Team, 06Reading-Infrastructure-Team, 06Reading-Infrastructure-Team-Backlog, and 5 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3209641 (10NHarateh_WMF) [12:47:39] marostegui: where the IF NOT EXISTS? 
[12:47:54] (03PS3) 10Muehlenhoff: Make generate-fancycaptcha logrotate config compatible with jessie [puppet] - 10https://gerrit.wikimedia.org/r/350131 (https://phabricator.wikimedia.org/T163555) [12:48:25] Dereckson: CREATE DATABASE IF NOT EXISTS is normally the right way of doing it to avoid replication issues [12:48:35] same for CREATE TABLE IF NOT EXISTS [12:48:58] sure, but where [12:49:07] Dereckson: what do you mean where? [12:49:26] marostegui: for the dty issue [12:49:53] Dereckson: Ah, the issue is that dbstore2001 is a special host so the database already existed there (because it has some echo tables) so replication broke when it was trying to get the database created there [12:50:14] (03CR) 10Muehlenhoff: [C: 032] Make generate-fancycaptcha logrotate config compatible with jessie [puppet] - 10https://gerrit.wikimedia.org/r/350131 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [12:55:47] jouncebot: refresh [12:55:51] I refreshed my knowledge about deployments. 
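[editor's note] A minimal sketch of the IF NOT EXISTS point made above, not the DBAs' actual tooling. It assumes Python's built-in sqlite3 as a stand-in for MariaDB; SQLite has no CREATE DATABASE, so CREATE TABLE is used instead, and the table name `echo_event` and the helper `replay` are hypothetical. The pre-created table plays the role of the database that already existed on dbstore2001, and the second CREATE plays the role of the replicated statement.

```python
import sqlite3


def replay(use_if_not_exists: bool) -> str:
    """Replay a CREATE statement against a host where the object already exists."""
    conn = sqlite3.connect(":memory:")
    # Pre-existing object (hypothetical name), like the database already on dbstore2001.
    conn.execute("CREATE TABLE echo_event (event_id INTEGER PRIMARY KEY)")
    guard = "IF NOT EXISTS " if use_if_not_exists else ""
    try:
        # The "replicated" statement arriving from the master.
        conn.execute(f"CREATE TABLE {guard}echo_event (event_id INTEGER PRIMARY KEY)")
        return "ok"
    except sqlite3.OperationalError as exc:
        # Without the guard the statement fails, which is what stops the SQL thread.
        return f"broken: {exc}"
    finally:
        conn.close()


print(replay(use_if_not_exists=False))
print(replay(use_if_not_exists=True))
```

With the guard the statement becomes a no-op on hosts that already hold the object, so the same statement can be applied everywhere without special-casing them.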
[12:55:51] jouncebot: next [12:55:52] In 0 hour(s) and 4 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170425T1300) [12:57:14] (03PS3) 10BBlack: nginx (1.11.10-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.10) - 10https://gerrit.wikimedia.org/r/348591 [12:58:30] (03PS2) 10Hashar: Add Draft namespace to zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349888 (https://phabricator.wikimedia.org/T163655) (owner: 10Urbanecm) [12:58:32] (03PS2) 10Hashar: Add NS aliases for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349889 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [12:59:29] (03PS3) 10Muehlenhoff: Make translationnotifications logrotate config compatible with jessie [puppet] - 10https://gerrit.wikimedia.org/r/350138 (https://phabricator.wikimedia.org/T163555) [13:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170425T1300). Please do the needful. [13:00:05] Urbanecm and hashar: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:49] o/ [13:02:32] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349888 (https://phabricator.wikimedia.org/T163655) (owner: 10Urbanecm) [13:03:55] (03Merged) 10jenkins-bot: Add Draft namespace to zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349888 (https://phabricator.wikimedia.org/T163655) (owner: 10Urbanecm) [13:05:18] (03CR) 10Muehlenhoff: [C: 032] Make translationnotifications logrotate config compatible with jessie [puppet] - 10https://gerrit.wikimedia.org/r/350138 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [13:05:39] (03CR) 10jenkins-bot: Add Draft namespace to zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349888 (https://phabricator.wikimedia.org/T163655) (owner: 10Urbanecm) [13:07:18] hashar: I'm here. [13:07:24] Urbanecm: good morning [13:07:32] kart_ and Nikerabbit: I see progress at https://wikitech.wikimedia.org/w/index.php?title=Incident_documentation/20170419-ContentTranslation&action=history, and I see an explicit conclusion about DC switchover and monitoring, that's fine with me, do you want us to deploy https://gerrit.wikimedia.org/r/#/c/349869/ ? [13:07:38] Afternoon for me :). 
[13:07:40] Urbanecm: " Add Draft namespace to zh_classicalwiki" is already on the debug machines :) [13:07:45] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/6226/ says great, deployment-prep labs puppetmaster is fine, merging" [puppet] - 10https://gerrit.wikimedia.org/r/349468 (https://phabricator.wikimedia.org/T156924) (owner: 10Giuseppe Lavagetto) [13:07:48] hashar: ack [13:07:49] (03PS2) 10Alexandros Kosiaris: profile::conftool::master: make the git root dir a parameter [puppet] - 10https://gerrit.wikimedia.org/r/349468 (https://phabricator.wikimedia.org/T156924) (owner: 10Giuseppe Lavagetto) [13:07:56] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] profile::conftool::master: make the git root dir a parameter [puppet] - 10https://gerrit.wikimedia.org/r/349468 (https://phabricator.wikimedia.org/T156924) (owner: 10Giuseppe Lavagetto) [13:08:01] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349889 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:08:34] Urbanecm: and there is only one broken file : id=73504 ns=0 dbk=模板:Protected_logo *** dest title exists and --add-prefix not specified [13:08:35] (03PS3) 10Muehlenhoff: Make wikidata logrotate config compatible with jessie [puppet] - 10https://gerrit.wikimedia.org/r/350149 (https://phabricator.wikimedia.org/T163555) [13:09:08] (03Merged) 10jenkins-bot: Add NS aliases for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349889 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:09:09] hashar: try with --merge [13:09:15] eeek [13:09:16] (03CR) 10jenkins-bot: Add NS aliases for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349889 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:09:27] hashar, I will take back lead of mediawiki-config lead at 15 UTC, or whenever you tell me [13:09:42] Dereckson: already renamed it with: 模板:Protected_logo -> 
模板:Protected_logobroken :( [13:09:45] (03CR) 10Alexandros Kosiaris: [C: 032] Add separated SRV records for etcd to consume for conftool [dns] - 10https://gerrit.wikimedia.org/r/349380 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [13:09:48] (03PS3) 10Alexandros Kosiaris: Add separated SRV records for etcd to consume for conftool [dns] - 10https://gerrit.wikimedia.org/r/349380 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [13:09:51] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add separated SRV records for etcd to consume for conftool [dns] - 10https://gerrit.wikimedia.org/r/349380 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [13:09:52] jynus: okkk will poke whenever I am done [13:10:20] !log zh_classicalwiki : renamed broken page via namespaceDupes.php : id=73504 ns=0 dbk=模板:Protected_logo -> 模板:Protected_logobroken [13:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:47] Urbanecm: I am syncing the Draft namespace change [13:10:52] hashar: ack [13:11:33] hashar: Why I can't use eqiad debug machines for editing? Due to DC switch? 
[13:11:38] Urbanecm: yu [13:12:08] I have added ongoing work to https://wikitech.wikimedia.org/wiki/Deployments#Week_of_April_24th [13:12:11] Urbanecm: apparently they are now mw2017 and mw2099 , and I guess in the browser extension you can still use the old hostname and the requests get routed properly [13:12:20] it is a summary because there are lots of tickets happening at the same time [13:12:20] (03CR) 10Muehlenhoff: [C: 032] Make wikidata logrotate config compatible with jessie [puppet] - 10https://gerrit.wikimedia.org/r/350149 (https://phabricator.wikimedia.org/T163555) (owner: 10Muehlenhoff) [13:12:25] !log hashar@naos Synchronized wmf-config/InitialiseSettings.php: Add Draft namespace to zh_classicalwiki - T163655 (duration: 01m 19s) [13:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:33] T163655: Request for adding a draft namespace to zh-classical.wikipedia.org - https://phabricator.wikimedia.org/T163655 [13:12:37] Yeah, I may view over them but the DB is locked [13:12:54] jynus: awesome. And for more highlight you might consider ops list as well. [13:13:02] (03CR) 10Dereckson: [C: 031] "DBA concern was a lack of documentation in the incident report, concern now fixed. The team asserts the issue has been triggered by the DC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349869 (https://phabricator.wikimedia.org/T163344) (owner: 10KartikMistry) [13:13:20] Urbanecm: now I am doing the "Add NS aliases for zh_classicalwiki" change [13:13:26] hashar: ack [13:13:35] Dereckson: That's not completely true as root cause is still unclear :) [13:13:47] Dereckson: but it will be great if you can add it to SWAT [13:14:07] Dereckson: Should I add patch to SWAT list now? [13:14:11] hashar as far as I know it was announced already [13:14:12] Yes, I noted in my last CR + 1 comment " The team asserts the issue has been triggered by the DC switchover (symptom), with an unknown cause." 
[13:14:26] but I can send another mail [13:15:19] kart_: yes, you can [13:15:23] !log Deploy alter table on silver.watchlist and labtestweb2001.labtestwiki for the watchlist table - T130067 [13:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:31] T130067: Add wl_id to watchlist tables on production dbs - https://phabricator.wikimedia.org/T130067 [13:15:49] 06Operations, 10netops: Implement RPKI (Resource Public Key Infrastructure) - https://phabricator.wikimedia.org/T61115#3209774 (10ayounsi) a:03ayounsi ARIN is also very straightforward (everything can be done online). See this copy of a blog post I wrote in 2013 https://labs.ripe.net/Members/mirjam/mozilla-u... [13:16:06] Urbanecm: and syncing [13:16:08] Dereckson: done [13:16:54] !log hashar@naos Synchronized wmf-config/InitialiseSettings.php: Add NS aliases for zh_classicalwiki - T162547 (duration: 01m 00s) [13:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:02] T162547: Request to change namespaces of zh-classical.wikipedia - https://phabricator.wikimedia.org/T162547 [13:17:07] next is https://gerrit.wikimedia.org/r/#/c/350011/ to get rid of some log spam [13:18:35] (03PS1) 10Urbanecm: Fix namespace Wikipedia_talk for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350198 (https://phabricator.wikimedia.org/T162547) [13:18:40] kart_, Dereckson: has the front-end patch been backported already? [13:18:51] (03PS2) 10ArielGlenn: pylint all the things: get rid of camelcase [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350180 [13:19:01] Nikerabbit: right. Let me pull that too. Good catch. [13:19:10] Dereckson: wait. We need another patch too. 
[13:19:12] (03CR) 10jerkins-bot: [V: 04-1] pylint all the things: get rid of camelcase [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350180 (owner: 10ArielGlenn) [13:19:24] (03PS1) 10Urbanecm: Two namespace aliases for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350199 (https://phabricator.wikimedia.org/T162547) [13:19:59] hashar: May you deploy 350199 and 350198 too? It's for the same task, I've noticed clarification right now. [13:19:59] Nikerabbit: https://gerrit.wikimedia.org/r/#/c/350200/ [13:20:37] Urbanecm: can you add them to the deployment page and give me the url please ? :) [13:20:42] hashar: Okay. [13:20:49] https://gerrit.wikimedia.org/r/350198 [13:20:49] !log hashar@naos Synchronized php-1.29.0-wmf.20/includes: Fix bogus field reference in Category::getCountMessage() callback - T162941 (duration: 01m 14s) [13:20:55] https://gerrit.wikimedia.org/r/350199 [13:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:58] T162941: Undefined property: CategoryTreeCategoryViewer::$mName - https://phabricator.wikimedia.org/T162941 [13:21:31] Dereckson: so, we need to get 350200 first and then enable CX. 
[13:21:35] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350199 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:22:03] Added [13:22:30] thanks [13:22:34] yw [13:22:34] (03Merged) 10jenkins-bot: Two namespace aliases for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350199 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:22:41] !log Deploy alter table on s3 (only etwiki) for tag_summary and change_tag tables - T147166 [13:22:43] (03CR) 10jenkins-bot: Two namespace aliases for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350199 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:52] T147166: Apply change_tag and tag_summary primary key schema change to Wikimedia wikis - https://phabricator.wikimedia.org/T147166 [13:23:02] (03PS2) 10Alexandros Kosiaris: etcd: make our rw clients use the new SRV record [puppet] - 10https://gerrit.wikimedia.org/r/349386 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [13:24:24] (03CR) 10Hashar: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350198 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:24:26] (03CR) 10Hashar: [C: 032] Fix namespace Wikipedia_talk for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350198 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:24:33] (03CR) 10jerkins-bot: [V: 04-1] Fix namespace Wikipedia_talk for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350198 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:24:37] bah [13:24:38] !log hashar@naos Synchronized wmf-config/InitialiseSettings.php: Two namespace aliases for zh_classicalwiki - T162547 (duration: 00m 49s) [13:24:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:46] T162547: Request to change namespaces of zh-classical.wikipedia - https://phabricator.wikimedia.org/T162547 [13:24:53] Urbanecm: gotta rebase https://gerrit.wikimedia.org/r/#/c/350198/ [13:25:15] hashar: I should rebase? [13:26:06] Urbanecm: yeah that conflicts with the change that adds Portal namespace [13:26:11] Urbanecm: i can do it if you want [13:26:34] I'll do it. [13:27:19] (03CR) 10Alexandros Kosiaris: [C: 031] "https://puppet-compiler.wmflabs.org/6228/ says the expected, merging" [puppet] - 10https://gerrit.wikimedia.org/r/349386 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [13:27:20] (03CR) 10Alexandros Kosiaris: [C: 032] etcd: make our rw clients use the new SRV record [puppet] - 10https://gerrit.wikimedia.org/r/349386 (https://phabricator.wikimedia.org/T159687) (owner: 10Giuseppe Lavagetto) [13:28:33] (03PS2) 10Urbanecm: Fix namespace Wikipedia_talk for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350198 (https://phabricator.wikimedia.org/T162547) [13:28:45] hashar: Done in PS2 [13:28:52] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350198 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:30:10] (03Merged) 10jenkins-bot: Fix namespace Wikipedia_talk for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350198 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:30:22] (03CR) 10jenkins-bot: Fix namespace Wikipedia_talk for zh_classicalwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350198 (https://phabricator.wikimedia.org/T162547) (owner: 10Urbanecm) [13:30:23] hashar: you're SWAT'ng, right? 
[13:30:27] 06Operations, 13Patch-For-Review, 15User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3209809 (10akosiaris) [13:31:36] 06Operations, 06Labs: Investigate ceasing new Trusty instance creation in Labs - https://phabricator.wikimedia.org/T161899#3209811 (10Andrew) As soon as we disable Trusty we'll also be violating 'cattle, not pets' for most of our users. It will mean that anytime they need to recreate an instance they will als... [13:31:49] !log hashar@naos Synchronized wmf-config/InitialiseSettings.php: Fix namespace Wikipedia_talk for zh_classicalwiki - T162547 (duration: 00m 48s) [13:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:57] T162547: Request to change namespaces of zh-classical.wikipedia - https://phabricator.wikimedia.org/T162547 [13:33:22] handling https://gerrit.wikimedia.org/r/#/c/350203/ [13:33:27] kart_: yes I am doing the swat [13:33:32] gotta push that mediawiki/core patch [13:35:30] grrlblbl [13:35:41] Error generating thumbnail - Timeout waiting for the lock [13:35:56] !log rebooting einsteinium for update to Linux 4.9 [13:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:45] hashar: thanks. 
We need wmf.20 patch first and then config (CX enablement) [13:37:10] !log hashar@naos Synchronized php-1.29.0-wmf.20/includes/media/TransformationalImageHandler.php: media: Capture stderr when running convert --version - T158649 (duration: 00m 47s) [13:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:19] T158649: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649 [13:37:31] ie https://gerrit.wikimedia.org/r/#/c/350200/ (I've already merged it) for wmf.20, then https://gerrit.wikimedia.org/r/#/c/349869/ [13:37:57] 06Operations, 07Beta-Cluster-reproducible, 05MW-1.29-release (WMF-deploy-2017-04-25_(1.29.0-wmf.21)), 05MW-1.29-release-notes, and 2 others: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3209845 (10h... [13:38:38] 06Operations, 06Labs: Ensure we can survive a loss of labservices1001 - https://phabricator.wikimedia.org/T163402#3209846 (10Andrew) [13:38:56] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3209847 (10Gehel) I summarized the actions taken on https://etherpad.wikimedia.org/p/elastic2020. @Papaul, could you review it and see if I mis... [13:39:02] kart_: ok pushing the wmf patch [13:39:55] kart_: it is on mw2099 / mw2017 [13:40:10] hashar: can't test as CX is disabled. 
[13:40:14] ah yeah [13:40:17] hashar: ie wmf.20 patch [13:40:18] :) [13:40:28] (03PS3) 10Hashar: Re-enable ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349869 (https://phabricator.wikimedia.org/T163344) (owner: 10KartikMistry) [13:40:35] (03PS1) 10Alexandros Kosiaris: Use conf2001 for secondary eqiad LVS's pybal [puppet] - 10https://gerrit.wikimedia.org/r/350204 (https://phabricator.wikimedia.org/T159687) [13:40:39] (03CR) 10Hashar: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349869 (https://phabricator.wikimedia.org/T163344) (owner: 10KartikMistry) [13:40:52] kart_: lets reenable cx on the debug hosts [13:40:53] :] [13:41:00] Sure! [13:44:29] (03Merged) 10jenkins-bot: Re-enable ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349869 (https://phabricator.wikimedia.org/T163344) (owner: 10KartikMistry) [13:44:51] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/6229/ says we got the expected result, merging" [puppet] - 10https://gerrit.wikimedia.org/r/350204 (https://phabricator.wikimedia.org/T159687) (owner: 10Alexandros Kosiaris) [13:44:53] kart_: ok enabled on mw2017 / mw2099 [13:44:55] (03PS2) 10Alexandros Kosiaris: Use conf2001 for secondary eqiad LVS's pybal [puppet] - 10https://gerrit.wikimedia.org/r/350204 (https://phabricator.wikimedia.org/T159687) [13:45:00] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Use conf2001 for secondary eqiad LVS's pybal [puppet] - 10https://gerrit.wikimedia.org/r/350204 (https://phabricator.wikimedia.org/T159687) (owner: 10Alexandros Kosiaris) [13:45:40] Any chance I could sneak a security thing into the end of swat? 
( https://phabricator.wikimedia.org/T163756 - need to touch the PrivateSettings.php symlink and sync it) [13:45:44] testing, hashar [13:45:50] (03CR) 10jenkins-bot: Re-enable ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/349869 (https://phabricator.wikimedia.org/T163344) (owner: 10KartikMistry) [13:48:05] hashar: looks supercool as usual (TM) [13:48:09] hashar: go ahead. [13:48:16] !!!!!!!! [13:49:24] !log hashar@naos Synchronized wmf-config/InitialiseSettings.php: Re-enable ContentTranslation - T163344 (duration: 00m 44s) [13:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:33] T163344: Do a root-cause analysis on CX outage during dc switch and get it back online - https://phabricator.wikimedia.org/T163344 [13:50:13] thanks hashar and Dereckson [13:50:40] fatal error: Argument 1 passed to MediaWiki\Linker\LinkRenderer::makeKnownLink() must implement interface MediaWiki\Linker\LinkTarget, null given in /srv/mediawiki/php-1.29.0-wmf.20/includes/linker/LinkRenderer.php on line 301 [13:50:41] bah [13:50:45] unrelated to cx [13:51:33] (03PS1) 10Jcrespo: mariadb: switch s7 eqiad master from db1041 to db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350205 (https://phabricator.wikimedia.org/T162133) [13:51:46] !log European SWAT complete [13:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:13] 06Operations, 06Labs: Investigate ceasing new Trusty instance creation in Labs - https://phabricator.wikimedia.org/T161899#3209873 (10chasemp) >>! In T161899#3209811, @Andrew wrote: > As soon as we disable Trusty we'll also be violating 'cattle, not pets' for most of our users. It will mean that anytime they... [13:52:16] jynus: I am open for service :] [13:52:26] hah [13:52:37] I think bawolff wanted to do something [13:53:07] jynus / hashar : yeah, is it ok if a sync a file? 
[13:53:31] there is some fatal error that started since swat :( [13:53:31] you have until I am fully ready for https://gerrit.wikimedia.org/r/350205 [13:53:43] !sal [13:53:43] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. [13:53:52] jynus: cool, thanks [13:54:28] (03CR) 10Muehlenhoff: [C: 04-2] "This would break "systemctl mask"; if you disable a unit using "systemctl mask foo.service", it'll create a symlink to /dev/null in /etc/s" [puppet] - 10https://gerrit.wikimedia.org/r/348665 (owner: 10Dzahn) [13:54:46] (03PS3) 10ArielGlenn: pylint all the things: get rid of camelcase [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350180 [13:55:10] (03CR) 10Marostegui: [C: 031] mariadb: switch s7 eqiad master from db1041 to db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350205 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [13:56:16] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3209875 (10Cmjohnson) A disk replacement has been ordered with Dell Create Service Request: Service Tag 753NMD2 Confirmed: Request 947500398 was successfully submitted. [13:56:41] 06Operations, 13Patch-For-Review, 15User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3074977 (10akosiaris) lvs1004, lvs1005, lvs1006 now use conf2001 per the patch above successfully. Proceeding with the rest of the plan [13:57:38] 06Operations, 13Patch-For-Review, 15User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3209881 (10akosiaris) [13:57:57] bawolff: yeah looks good / open [13:58:00] !log bawolff@naos Synchronized wmf-config/PrivateSettings.php: Hopefully cause previous changes to be picked up (duration: 00m 44s) [13:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:13] ok, lets see if this worked [13:59:52] And the answer is no... 
[14:00:59] (03PS3) 10Jcrespo: prometheus-mysqld-exporter: Change db1061 to be the s6 master on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/350171 (https://phabricator.wikimedia.org/T162133) [14:01:00] (03PS1) 10Jcrespo: mariadb: promote db1062 as the new master of s7 eqiad [puppet] - 10https://gerrit.wikimedia.org/r/350209 (https://phabricator.wikimedia.org/T162133) [14:01:26] oh because touch needs a -h option [14:01:29] (03PS2) 10Matthias Mullie: Turn $wg3dProcessor into an array of arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350125 [14:01:31] (03PS1) 10Matthias Mullie: Enable Extension:3d in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350210 [14:01:37] (03CR) 10Marostegui: [C: 031] mariadb: promote db1062 as the new master of s7 eqiad [puppet] - 10https://gerrit.wikimedia.org/r/350209 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [14:01:56] (03CR) 10Jcrespo: [V: 032 C: 032] prometheus-mysqld-exporter: Change db1061 to be the s6 master on eqiad [puppet] - 10https://gerrit.wikimedia.org/r/350171 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [14:02:00] !log bawolff@naos Synchronized wmf-config/PrivateSettings.php: Hopefully cause previous changes to be picked up try2 (duration: 00m 44s) [14:02:00] (03CR) 10Matthias Mullie: [V: 04-1 C: 04-2] "This is prep, do not merge yet!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350210 (owner: 10Matthias Mullie) [14:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:25] It Works!!!! 
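Note on the `touch -h` fix at 14:01:26: by default `touch` follows a symlink and updates the timestamps of the target, while `-h` (`--no-dereference`) updates the link itself. The log does not show the exact commands that were run, so what follows is only a minimal Python sketch of the same distinction. The file names are hypothetical stand-ins created in a temporary directory, and `touch_symlink` is an illustrative helper, not anything from the deployment tooling.

```python
import os
import tempfile
import time


def touch_symlink(link_path):
    """Set the symlink's own timestamps to now, like `touch -h`."""
    os.utime(link_path, None, follow_symlinks=False)


def demo():
    """Return which mtimes changed for a plain touch and a no-dereference touch.

    Each result is a pair (target_changed, link_changed).
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical names, used only for this illustration.
        target = os.path.join(tmp, "real_settings.php")
        link = os.path.join(tmp, "PrivateSettings.php")
        with open(target, "w") as fh:
            fh.write("<?php\n")
        os.symlink(target, link)

        old = time.time() - 3600

        def reset():
            # Push both the target and the link an hour into the past.
            os.utime(target, (old, old))
            os.utime(link, (old, old), follow_symlinks=False)

        def changed():
            return (
                os.stat(target).st_mtime > old + 1,
                os.lstat(link).st_mtime > old + 1,
            )

        # Equivalent of plain `touch link`: the call follows the link,
        # so only the target's mtime moves.
        reset()
        os.utime(link)
        plain = changed()

        # Equivalent of `touch -h link`: only the link's mtime moves.
        reset()
        touch_symlink(link)
        no_dereference = changed()

        return plain, no_dereference


if __name__ == "__main__":
    print(demo())
```

Anything that decides what to act on from the symlink's own modification time would see no change after a plain `touch`, which is consistent with the first attempt at 13:58:00 having no effect and the second at 14:02:00 working.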
[14:02:28] oh thankfully [14:02:28] !log poweroff ms-be1016 for controller swap - T150206 [14:02:34] (03CR) 10Jcrespo: [C: 032] mariadb: promote db1062 as the new master of s7 eqiad [puppet] - 10https://gerrit.wikimedia.org/r/350209 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [14:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:36] T150206: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206 [14:02:50] you don;t know how much time I spent yesterday failing to make this work (thanks tgr for telling me what's wrong) [14:02:54] jynus: I'm done now [14:03:12] (03CR) 10Alexandros Kosiaris: "FWIW, the move to /lib happened in Ia600f969da73ae33bf4476d06079c5d333b4c304 after some considerable discussion on IRC (not present in the" [puppet] - 10https://gerrit.wikimedia.org/r/348665 (owner: 10Dzahn) [14:05:27] 06Operations, 07Beta-Cluster-reproducible, 05MW-1.29-release (WMF-deploy-2017-04-11_(1.29.0-wmf.20)), 05MW-1.29-release-notes, and 2 others: firejail for mediawiki converter leaks to stderr: "Reading profile /etc/firejail/mediawiki-converters.profile" - https://phabricator.wikimedia.org/T158649#3209914 (10h... [14:06:44] 06Operations, 06Labs: Investigate ceasing new Trusty instance creation in Labs - https://phabricator.wikimedia.org/T161899#3209922 (10chasemp) For fwiw's the second is what I intended, a better title here would be `Investigate ceasing self-service new Trusty instance creation in Labs`. That's on me, I thought... 
[14:07:26] !log moving s7 eqiad replicas under db1062 [14:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:44] actually, I have before to [14:08:02] !log restarting mariadb on db1062 [14:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:34] (03PS3) 10Filippo Giunchedi: base: fix run-puppet-agent --enable help [puppet] - 10https://gerrit.wikimedia.org/r/350124 [14:10:47] (03CR) 10Filippo Giunchedi: base: fix run-puppet-agent --enable help (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350124 (owner: 10Filippo Giunchedi) [14:11:55] huh, badpass log seems to have stopped on logstash https://logstash.wikimedia.org/goto/643a0601bc7108f19df562bb4e4b3cc6 [14:13:34] I wonder if that was my change, I was logging additional things to that log [14:13:37] (03CR) 10MarkTraceur: [C: 031] Turn $wg3dProcessor into an array of arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350125 (owner: 10Matthias Mullie) [14:13:50] (03CR) 10MarkTraceur: [C: 031] Turn $wg3dProcessor into an array of arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350125 (owner: 10Matthias Mullie) [14:17:05] !log Stop replication in sync on db1089 and db1083 for maintenance - https://phabricator.wikimedia.org/T130067 [14:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:10] logstash seems to have stopped logging mediawiki events(!) [14:18:50] jynus: could anything you're doing potentially affect logstash? or was it what I just did (but I don't see how what i just did could do this)? [14:19:00] bd808: where does the logs for logstash live? 
[14:19:45] bawolff, logstash doesn't use mysql [14:20:04] and I have not yet deployed any mediawiki change [14:20:08] ok, so that's suggestive its something I did [14:20:46] but I didn't touch logstash config or anything [14:20:50] !log uploaded WMF nginx-1.11.10-1+wmf1 packages to jessie-wikimedia repo [14:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:59] yes, it fits the 14:02 change [14:21:07] based on timestamp [14:21:29] or it could be an overload due to the previous deploys [14:21:43] huh, all the logs on mwlog1001 also stopped [14:21:49] I see lots of test errors [14:21:52] testwiki [14:21:55] before that [14:23:34] (03PS1) 10Alexandros Kosiaris: Lower TTL for etcd client records [dns] - 10https://gerrit.wikimedia.org/r/350212 (https://phabricator.wikimedia.org/T159687) [14:23:44] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:23:53] huh, its logging all mediawiki debug logs for testwiki to logstash [14:24:12] (03PS1) 10Faidon Liambotis: Add add_ip6_mapped to cobalt [puppet] - 10https://gerrit.wikimedia.org/r/350213 [14:24:21] its not supposed to do that, its only supposed to log them to the mwlog1001 [14:24:23] indeed all network traffic for mwlog1001 dropped at ~14 UTC [14:24:44] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [14:24:47] (03CR) 10Faidon Liambotis: [C: 032] Add add_ip6_mapped to cobalt [puppet] - 10https://gerrit.wikimedia.org/r/350213 (owner: 10Faidon Liambotis) [14:24:55] I am going to continue what I am doing because I cannot stop that in the middle [14:25:46] ok [14:26:04] an I am not touching any active part of mediawiki [14:26:12] only passive databases [14:26:34] bawolff, can you revert what you deployed, only as a test? 
[14:26:45] my change would cause PrivateSettings.php to be flushed, if there were old changes in that that were never deployed, it could have caused them to be deployed [14:26:51] jynus: ok, will do [14:27:13] look, better testing something that not doing anything [14:27:45] sometimes it is not your change, but it displays something that was arleady there [14:27:59] !log Logging has seemed to stop after last deploy to private settings :( [14:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:32] ok, deploying the reverted version [14:29:32] I do not think it is working [14:29:44] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:29:44] PROBLEM - Check correctness of the icinga configuration on tegmen is CRITICAL: Icinga configuration contains errors [14:30:25] !log bawolff@naos Synchronized private/PrivateSettings.php: rv change to T163477 to see if it fixes logging (duration: 01m 14s) [14:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:44] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:32:05] It has a small upwards trend [14:32:15] (03PS1) 10Alexandros Kosiaris: Swap etcd client records to point to codfw [dns] - 10https://gerrit.wikimedia.org/r/350214 [14:32:32] so, yes, it was that [14:32:41] I will tell him when he reconnects [14:32:59] yeah, it worked [14:33:21] hey, it worked [14:33:31] (03CR) 10Filippo Giunchedi: [C: 031] "To be merged later today" [puppet] - 10https://gerrit.wikimedia.org/r/349668 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [14:33:43] confirmed, mwlog1001 traffic is back [14:33:50] 99.9% sure it was that [14:33:50] (03PS1) 10Alexandros Kosiaris: Switch conftool etcd records to codfw [dns] - 10https://gerrit.wikimedia.org/r/350216 (https://phabricator.wikimedia.org/T159687) [14:34:03] now what *eaxactly* it was, I do not know [14:34:05] Maybe my 
patch was logging too much data and it was overloading things [14:34:14] unlikely [14:34:27] it would had at least some spike or something [14:34:44] some hidden syntax error or something undetected, maybe? [14:34:59] or something underlaying that is is already bad [14:35:40] I would stick around, but I *really* need to do my deployment [14:36:36] jynus: ok. I think i need to ask someone like bd808 who really understands logstash [14:36:48] yeah, he fixed it for me last time it happened [14:37:29] bawolff: I gave up my access to the prod logstash servers last week. [14:37:37] he he [14:37:40] bawolff: what's messed up? Lots of backscroll [14:37:52] we don't know [14:37:54] bd808: I deployed a patch. All logging of mediawiki stuff stopped [14:38:13] my patch logged a bunch of stuff to the badpass channel, but otherwise did nothing related to logging [14:38:13] that seems not good [14:38:26] patch is revereted now, and logging is fixed [14:38:34] where's the patch? [14:38:59] bd808: https://phabricator.wikimedia.org/T163756#3208738 [14:40:18] first guess is something related to structured log data types for those keys [14:40:22] or on servers at private/Guanaco.php [14:40:38] bd808: does logstash have a log I could look at? [14:41:00] bd808: It seemed to initially present as a spike in log entries at testwiki (as if full debug logging was turned on) [14:41:42] yes. If it's type collision stuff that would be logged on the logstash100[456] servers in their /var/log/elasticsearch logs [14:41:56] 06Operations, 10media-storage, 15User-fgiunchedi: Consider storage policies for swift - https://phabricator.wikimedia.org/T151648#3209998 (10fgiunchedi) WRT minimum swift version, we're running 2.2 and 2.10 is on the cards (https://phabricator.wikimedia.org/T162609) here's the relevant changelog entries betw... 
[14:42:16] if its in logstash parsing itself (unlikely) that would be on logstash100[123] in /var/log/logstash [14:42:41] bd808: Note the spike at https://logstash.wikimedia.org/goto/e5346896803fb9b2d3d8f37b1c611b53 which is super odd [14:43:12] bawolff: ebernhardson is probably your best helper for debugging on the logstash/elasticsearch side of things [14:43:13] 06Operations: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3210002 (10faidon) Thanks for doing all this work @Volans :) >>! In T163196#3206314, @Volans wrote: > - **[1] `cobalt.wikimedia.org`**: > The `ipaddress6_primary` is correct but the... [14:43:24] ok [14:43:38] bawolff: I can try to take a look [14:43:42] 06Operations, 06Labs: Investigate ceasing self-service new Trusty instance creation in Labs - https://phabricator.wikimedia.org/T161899#3210003 (10chasemp) [14:43:49] dcausse: thanks [14:44:03] I know next to nothing for mw logging et al, though traffic to mwlog1001 also dropped, which shouldn't if only logstash was affected (?) [14:44:18] 06Operations, 06Labs: Investigate ceasing self-service new Trusty instance creation in Labs - https://phabricator.wikimedia.org/T161899#3146739 (10chasemp) >>! In T161899#3209922, @chasemp wrote: > fwiw's the second is what I intended, a better title here would be `Investigate ceasing self-service new Trusty i... [14:44:26] godog: that's a good point [14:46:23] The spike in testwiki logs may just be a coincidence, I see similar spikes further back in time [14:46:27] (03PS4) 10ArielGlenn: pylint all the things: get rid of camelcase [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350180 [14:47:03] bawolff: I'm working off the assumption that a logstash problem isn't going to affect other destinations but might not be that [14:48:00] full logging is done when the X-Wikimedia-Debug header is passed with the log switch turned on. 
[14:48:41] If ua (the only field I added that was different) was some sort of reserved field in logstash for the log metadata items, wouldn't that just break the logging of my entries, and not other mediawiki log entries? [14:48:43] is mw2099 in the pool that gets those requests? If so that could explain the debug log spike [14:49:11] yeah mw2099 in an x-wikimedia-debug host [14:49:15] (03PS1) 10Filippo Giunchedi: swift: add ratelimit middleware [puppet] - 10https://gerrit.wikimedia.org/r/350220 (https://phabricator.wikimedia.org/T162793) [14:49:36] that mystery is probably solved then. now to figure out why monolog choked [14:50:53] bawolff: nothing obvious in elastic logs except something related to a "striker" field, but this problem was visible yesterday so most probably unrelated [14:51:40] ua is already used as a field in other log entries [14:51:47] so that can't be reserved [14:51:59] err a request field in the type "striker" [14:52:57] In fact, all the fields in the thing I was logging are used in other log entries [14:53:25] We can't be just fataling all requests or the users would have noticed ;) [14:54:20] monolog has a fail safe that swallows exceptions during logging to keep them from taking down the web request [14:54:47] !log upgrading nginx on cp1008 [14:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:11] bawolff: that's done with https://github.com/Seldaek/monolog/blob/master/src/Monolog/Handler/WhatFailureGroupHandler.php [14:55:18] (03CR) 10Jcrespo: [C: 032] mariadb: switch s7 eqiad master from db1041 to db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350205 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [14:56:25] (03Merged) 10jenkins-bot: mariadb: switch s7 eqiad master from db1041 to db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350205 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [14:56:33] (03CR) 10jenkins-bot: mariadb: switch s7 eqiad 
master from db1041 to db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350205 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [14:57:12] (03PS1) 10Jcrespo: Set db1062 as the last component of s7 [software] - 10https://gerrit.wikimedia.org/r/350221 (https://phabricator.wikimedia.org/T162133) [14:58:48] So umm, we could try redeploying without the logging and see if the other logging still breaks I guess? [14:59:06] !log jynus@naos Synchronized wmf-config/db-eqiad.php: switch s7 eqiad master from db1041 to db1062 (duration: 00m 54s) [14:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:13] (03CR) 10Jcrespo: [C: 032] Set db1062 as the last component of s7 [software] - 10https://gerrit.wikimedia.org/r/350221 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [15:07:44] (03PS1) 10Alexandros Kosiaris: Switch all pybals to using codfw etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/350223 (https://phabricator.wikimedia.org/T159687) [15:07:46] (03PS1) 10Alexandros Kosiaris: Revert "Use conf2001 for secondary eqiad LVS's pybal" [puppet] - 10https://gerrit.wikimedia.org/r/350224 [15:08:50] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3210094 (10Papaul) mw2017 plug into ps1-a3 has jut one PSU and the reading on pss1-a3 is higher than ps2-a3 I will like to power that server dow... 
[15:10:16] (03PS1) 10Alexandros Kosiaris: Increase TTL for etcd client records [dns] - 10https://gerrit.wikimedia.org/r/350225 (https://phabricator.wikimedia.org/T159687) [15:11:13] (03CR) 10Alexandros Kosiaris: [C: 032] Lower TTL for etcd client records [dns] - 10https://gerrit.wikimedia.org/r/350212 (https://phabricator.wikimedia.org/T159687) (owner: 10Alexandros Kosiaris) [15:13:47] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3210125 (10Papaul) @Gehel Thanks everything looks good. [15:14:28] !log filippo@neodymium conftool action : set/pooled=no; selector: name=mw2017.codfw.wmnet [15:14:29] !log Deploy alter table s7 on watchlist table directly on the master (db1062) - https://phabricator.wikimedia.org/T130067 [15:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:30] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3210138 (10Gehel) @RobH: the summary is in https://etherpad.wikimedia.org/p/elastic2020. Let me know if it looks good enough to you and if I ca... 
[15:16:51] (03PS1) 10Jcrespo: Set db1063 as the last server on s7 [software] - 10https://gerrit.wikimedia.org/r/350227 (https://phabricator.wikimedia.org/T162133) [15:18:03] !log start cache_text upgrade to linux 4.9 T162029 [15:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:11] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [15:18:14] (03PS2) 10Jcrespo: Set db1063 as the last server on s5 [software] - 10https://gerrit.wikimedia.org/r/350227 (https://phabricator.wikimedia.org/T162133) [15:21:50] (03PS1) 10Jcrespo: mariadb: promote db1063 as s5 master [puppet] - 10https://gerrit.wikimedia.org/r/350228 (https://phabricator.wikimedia.org/T162133) [15:22:21] (03CR) 10Jcrespo: [C: 04-2] "Not until tomorrow." [software] - 10https://gerrit.wikimedia.org/r/350227 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [15:22:44] (03PS2) 10Alexandros Kosiaris: Switch all pybals to using codfw etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/350223 (https://phabricator.wikimedia.org/T159687) [15:22:53] (03CR) 10Marostegui: [C: 031] Set db1063 as the last server on s5 [software] - 10https://gerrit.wikimedia.org/r/350227 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [15:22:53] 06Operations, 06Release-Engineering-Team, 10vm-requests, 07Security-General: New ganeti VM for MW release pipeline work - https://phabricator.wikimedia.org/T163743#3210152 (10demon) >>! In T163743#3208931, @hashar wrote: > We have CI hosts like contint1001 / contint2001. What about a generic name like: `c... 
[15:22:57] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Switch all pybals to using codfw etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/350223 (https://phabricator.wikimedia.org/T159687) (owner: 10Alexandros Kosiaris) [15:22:59] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Use conf2001 for secondary eqiad LVS's pybal" [puppet] - 10https://gerrit.wikimedia.org/r/350224 (owner: 10Alexandros Kosiaris) [15:23:04] (03PS2) 10Alexandros Kosiaris: Revert "Use conf2001 for secondary eqiad LVS's pybal" [puppet] - 10https://gerrit.wikimedia.org/r/350224 [15:23:09] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "Use conf2001 for secondary eqiad LVS's pybal" [puppet] - 10https://gerrit.wikimedia.org/r/350224 (owner: 10Alexandros Kosiaris) [15:23:21] (03CR) 10Jcrespo: [C: 04-2] "Not until tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/350228 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [15:23:23] (03CR) 10Marostegui: [C: 031] mariadb: promote db1063 as s5 master [puppet] - 10https://gerrit.wikimedia.org/r/350228 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [15:23:37] 06Operations, 06Release-Engineering-Team, 10vm-requests, 07Security-General: New ganeti VM for MW release pipeline work - https://phabricator.wikimedia.org/T163743#3210166 (10demon) I'd be fine with something like mwreleases1001! 
[15:25:59] (03PS1) 10Jcrespo: mariadb: Promote db1063 as the master of s5 eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350230 (https://phabricator.wikimedia.org/T162133) [15:26:49] !log mobrovac@naos Started deploy [changeprop/deploy@e0e3684]: Bring back the concurrency level - T163292 [15:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:58] T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292 [15:26:59] !log mobrovac@naos Finished deploy [changeprop/deploy@e0e3684]: Bring back the concurrency level - T163292 (duration: 00m 10s) [15:27:04] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3210176 (10Papaul) The reading on both PUD's shows ps1-a3 X=9.73 Y= 9.65 Z=12.84 ps2-a3 X=1.96 Y=9.65 z= 1.97 ps1-a3 is pulling more power th... [15:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:44] (03PS1) 10Brian Wolff: Test authmanager restricter in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350231 [15:28:05] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3210178 (10RobH) We should be able to balance it while the other tower 2 isnt being used, it will be more difficult but should be possible. [15:28:26] !log filippo@neodymium conftool action : set/pooled=yes; selector: name=mw2017.codfw.wmnet [15:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:49] (03PS2) 10Alexandros Kosiaris: Swap etcd client records to point to codfw [dns] - 10https://gerrit.wikimedia.org/r/350214 (https://phabricator.wikimedia.org/T159687) [15:30:35] (03CR) 10Jcrespo: [C: 04-2] "Not until tomorrow." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/350230 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [15:32:26] (03CR) 10Marostegui: [C: 031] mariadb: Promote db1063 as the master of s5 eqiad [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350230 (https://phabricator.wikimedia.org/T162133) (owner: 10Jcrespo) [15:33:15] !log restart pybal on lvs[2004-2006].codfw.wmnet,lvs3004.esams.wmnet,lvs4004.ulsfo.wmnet,lvs[1004-1006].wikimedia.org T159687 [15:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:23] T159687: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687 [15:33:36] !log stopping replication on dbstore1001 to change its replication topology [15:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:14] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3210235 (10RobH) I've sent off an email to Dasher, and cc'd both @papaul and @Gehel on the email thread. [15:35:44] !log mobrovac@naos Started deploy [changeprop/deploy@7521b2f]: Bring back the concurrency level - T163292 [15:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:52] T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T163292 [15:36:57] !log mobrovac@naos Finished deploy [changeprop/deploy@7521b2f]: Bring back the concurrency level - T163292 (duration: 01m 13s) [15:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:44] 06Operations, 10netops: Implement RPKI (Resource Public Key Infrastructure) - https://phabricator.wikimedia.org/T61115#3210252 (10Multichill) Hey, a new network engineer. :-) Fun info at https://stat.ripe.net/AS43821#tabId=routing and https://stat.ripe.net/AS14907#tabId=routing . Would love to see some progres... 
[15:40:49] 06Operations, 06Release-Engineering-Team, 10vm-requests, 07Security-General: New ganeti VM for MW release pipeline work - https://phabricator.wikimedia.org/T163743#3210254 (10RobH) >>! In T163743#3208931, @hashar wrote: > We have CI hosts like contint1001 / contint2001. What about a generic name like: `co... [15:41:32] jynus: Would I be able to deploy what I was previously working on to beta cluster (It's a config change, if I understand correctly policy also requires me to deploy the CommonSettings-labs.php file everywhere), or would that interfere with what you're doing [15:42:00] 06Operations, 07HHVM: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795#3210271 (10fgiunchedi) [15:42:00] I have finished all deploys I think I wanted to do for today [15:42:06] bawolff ^ [15:42:27] cool - so I'm good to go as long as I finish before the next deploy window? [15:42:28] I will do more, but tomorrow [15:42:57] (03CR) 10Brian Wolff: [C: 032] Test authmanager restricter in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350231 (owner: 10Brian Wolff) [15:44:05] (03Merged) 10jenkins-bot: Test authmanager restricter in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350231 (owner: 10Brian Wolff) [15:45:29] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3210305 (10Cmjohnson) part has been dispatched Dispatch Reference Number #325751063 Scheduled to arrive: 4/26/2017 [15:45:43] (03CR) 10jenkins-bot: Test authmanager restricter in labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350231 (owner: 10Brian Wolff) [15:46:25] !log Stop replication on db1086 and db1094 in sync - https://phabricator.wikimedia.org/T130067 [15:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:26] (03PS1) 10Fdans: Add AutomatedRequest to schema black list [puppet] - 10https://gerrit.wikimedia.org/r/350235 (https://phabricator.wikimedia.org/T67508) 
[15:47:34] !log restart pybal on lvs2003.codfw.wmnet,lvs3003.esams.wmnet,lvs4003.ulsfo.wmnet,lvs1003.wikimedia.org T159687 [15:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:42] T159687: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687 [15:48:13] !log bawolff@naos Synchronized wmf-config/CommonSettings-labs.php: Test account creation limits on labs (duration: 01m 14s) [15:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:10] !log installing libav security updates [15:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:28] bd808: So, it works totally fine on labs [15:50:37] (03PS8) 10Eevans: Create a Cassandra 3.7 configuration [puppet] - 10https://gerrit.wikimedia.org/r/349668 (https://phabricator.wikimedia.org/T160570) [15:53:44] godog: i'm ready for https://gerrit.wikimedia.org/r/349668 when you are [15:56:53] (03PS1) 10Faidon Liambotis: Fix ipaddress6_primary to ignore deprecated addresses [puppet] - 10https://gerrit.wikimedia.org/r/350238 (https://phabricator.wikimedia.org/T163196) [15:57:31] urandom: kk, I can merge now and do puppet swat patches in 5 min, if sth comes up we can take a look in ~15 min [15:57:46] godog: sure [15:57:55] godog: these are just dev boxes [15:57:55] (03CR) 10jerkins-bot: [V: 04-1] Fix ipaddress6_primary to ignore deprecated addresses [puppet] - 10https://gerrit.wikimedia.org/r/350238 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [15:58:01] :45 [15:58:03] (03PS1) 10Andrew Bogott: Switch labservices1002 to the primary designate/dns server. 
[puppet] - 10https://gerrit.wikimedia.org/r/350239 [15:58:05] (03CR) 10Filippo Giunchedi: [C: 032] Create a Cassandra 3.7 configuration [puppet] - 10https://gerrit.wikimedia.org/r/349668 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [15:58:24] godog: and i'm counting on the need to iterate, so i've silenced all alerts and am prepared for the worst :) [15:58:47] urandom: haha ok, patch is merged [15:58:54] godog: thank you! [15:59:14] yw [15:59:30] !log restart pybal on lvs[2001-2002].codfw.wmnet,lvs[3001-3002].esams.wmnet,lvs[4001-4002].ulsfo.wmnet,lvs[1001-1002].wikimedia.org T159687 [15:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:39] T159687: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687 [15:59:49] RECOVERY - Check correctness of the icinga configuration on tegmen is OK: Icinga configuration is correct [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170425T1600). Please do the needful. [16:00:04] thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [16:00:49] (03PS2) 10Filippo Giunchedi: Scap: update version to 3.5.6-1 [puppet] - 10https://gerrit.wikimedia.org/r/350096 (owner: 10Thcipriani) [16:01:00] thcipriani: I'll merge the scap upgrade patch first [16:01:07] godog: cool thanks! 
[16:01:38] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Scap: update version to 3.5.6-1 [puppet] - 10https://gerrit.wikimedia.org/r/350096 (owner: 10Thcipriani) [16:01:41] robh: Please kill https://upload.wikimedia.org/wikipedia/commons/f/f9/Calema-Nossa_Vez(Portal-Edman-news-Musik).webm [16:01:51] https://commons.wikimedia.org/wiki/File:Calema-Nossa_Vez(Portal-Edman-news-Musik).webm [16:02:06] It should not be visible, it’s deleted, but still ‘works' [16:02:51] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3210367 (10Papaul) I moved the PSU's that are pulling lest power too ps1-a3 and the once pulling more power on ps2-a3 we should be good for no... [16:03:01] (03PS1) 10Dereckson: Remove blog.wikiwix.com from fr and en planet feeds [puppet] - 10https://gerrit.wikimedia.org/r/350240 [16:03:18] May I add this to puppet SWAT? ^ [16:03:40] This is an urgent change to stop to relay spam on fr/en planet [16:03:53] ^ Or anyone with access, really… [16:04:33] For example, fr.planet.wikimedia.org currently starts by acquistare valacyclovir 500 mg in linea [16:04:42] who is doing puppet swat? [16:04:42] Revent: i have no idea how to fix that, but im finding out! [16:04:46] =] [16:04:55] Thanks. [16:04:57] andrewbogott: I am [16:04:58] its been years since i had to manually purge something [16:05:12] godog: can you ping me when you're done? I need to cause a brief CI outage but don't want to mess with you [16:05:12] Dereckson: yeah, please add to deployments page [16:05:15] interesting odd things i dont get to do is why im on clinic duty \o/ [16:05:17] Oh man, I forgot about planet! [16:05:20] andrewbogott: kk, will do [16:05:28] thx [16:05:35] It’s this guy…. https://www.facebook.com/PortalEdmannews/ repeated sockpuppeter. [16:06:08] Revent: looking at the issue, but probably it's just queue delays? [16:06:15] Hopefully. 
[16:06:15] Revent: it looks like caching due to the fact it's gone in swift [16:06:19] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1016 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [16:06:25] Ok... [16:06:32] thcipriani: scap upgraded on naos [16:06:35] bblack: so how long should something live like that when the file is deleted? [16:06:56] godog: cool, lemme do a quick test [16:07:43] purge from varnish? [16:08:06] still visible after activating x-wikimedia-debug though [16:08:25] If it’s helpful, I’m hitting 198.35.26.112 [16:09:08] also visible after adding http request parameters which usually invalidates the varnish cache [16:09:13] !log thcipriani@naos Synchronized README: test new scap version (duration: 01m 03s) [16:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:21] Ok, now 404... [16:09:32] Oh, maybe purgeList did work [16:09:35] ^ godog lgtm! [16:09:37] yeah I didn't fix anything [16:09:39] Unless bblack did something :) [16:09:45] purgelist did it? [16:09:46] (I was following in another channel) [16:09:51] I just got done testing, but it was fixed by the time I got my test going on all the caches [16:09:52] (03PS2) 10Faidon Liambotis: Fix ipaddress6_primary to ignore deprecated addresses [puppet] - 10https://gerrit.wikimedia.org/r/350238 (https://phabricator.wikimedia.org/T163196) [16:09:55] Guess I needed to wait more than 0.5 seconds [16:10:27] thcipriani: neat, I'll look at the logstash one [16:10:35] nah, computers are supposed to be fast, no waiting. 
;] [16:10:38] for future ops, this is what I was checking with (to see which, if any, of the upload frontends were returning 200 vs 404 for the URL): [16:10:41] bblack@neodymium:~$ sudo cumin 'R:class = role::cache::upload' 'curl -sI "https://upload.wikimedia.org/wikipedia/commons/f/f9/Calema-Nossa_Vez(Portal-Edman-news-Musik).webm" --resolve upload.wikimedia.org:443:127.0.0.1|head -1' [16:10:45] godog: thanks, should be fairly innocuous [16:10:53] (03CR) 10jerkins-bot: [V: 04-1] Fix ipaddress6_primary to ignore deprecated addresses [puppet] - 10https://gerrit.wikimedia.org/r/350238 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [16:10:57] which then outputs stuff like: [16:10:59] ===== NODE GROUP ===== [16:11:00] zhuyifei1999_: we're stripping querystrings on upload so that wouldn't invalidate the cache FWIW [16:11:03] (39) cp[2002,2005,2008,2011,2014,2017,2020,2022,2024,2026].codfw.wmnet,cp[1048-1050,1062-1064,1071-1074,1099].eqiad.wmnet,cp[3034-3039,3044-3049].esams.wmnet,cp[4005-4007,4013-4015].ulsfo.wmnet [16:11:05] bblack: Sorry I didn't notice you were investigating [16:11:06] (03PS2) 10Filippo Giunchedi: Scap: canaries should include INFO-level messages [puppet] - 10https://gerrit.wikimedia.org/r/348475 (https://phabricator.wikimedia.org/T162974) (owner: 10Thcipriani) [16:11:09] ----- OUTPUT of 'curl -sI "https:...27.0.0.1|head -1' ----- [16:11:11] And shot it out from under you [16:11:13] HTTP/1.1 404 Not Found [16:11:16] it's ok [16:11:27] godog: hmm is that added recently? 
[16:11:34] my suspicion is that normal deletes go through the normal jobqueue which tends to get backlogged a bit [16:11:46] zhuyifei1999_: a few weeks [16:11:53] ok [16:11:54] (as opposed to an async purge send from MW itself right after the response) [16:12:09] 06Operations, 10ops-codfw, 10DBA, 13Patch-For-Review: pdu phase inbalances: ps1-a3-codfw, ps1-c6-codfw, & ps1-d6-codfw - https://phabricator.wikimedia.org/T163339#3210378 (10RobH) @Papaul: That will need to be swapped back once we fix the bios on the machines. Idelally all PSU1 are pulling from PDU1(tower... [16:12:13] bblack: I'm....not sure that's true [16:12:13] (03CR) 10Filippo Giunchedi: [C: 032] Scap: canaries should include INFO-level messages [puppet] - 10https://gerrit.wikimedia.org/r/348475 (https://phabricator.wikimedia.org/T162974) (owner: 10Thcipriani) [16:12:23] Facebook’s report mechanism for such is crap. [16:12:27] Deletes should be pretty much instantaneous, unless you're using like Nuke (which does batches) [16:12:30] RainbowSprinkles: me either, I've kinda given up temporarily on understanding where PURGEs actually come from [16:12:33] godog: thanks, added, I think we need to prune files from cache directory to force feed rebuild, ie /var/cache/planet/fr/ and /var/cache/planet/en/ [16:13:20] godog: by default, venus only *adds* new entry, and rely on cache for older ones [16:14:18] (03CR) 10Filippo Giunchedi: [C: 04-1] "see en_config.erb comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/350240 (owner: 10Dereckson) [16:14:25] Dereckson: kk, see ^ [16:16:58] yes for the mispaste, fixing [16:17:18] Dereckson: also can you point me at current spam on planet for verification? thanks! [16:17:40] the top post of https://fr.planet.wikimedia.org/ [16:17:55] !log otto@naos Started deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) [16:18:00] (03PS2) 10Andrew Bogott: Switch labservices1002 to the primary designate/dns server. 
[puppet] - 10https://gerrit.wikimedia.org/r/350239 [16:18:02] !log otto@naos Finished deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) (duration: 00m 06s) [16:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:07] godog: oh, there was an update: http://blog.wikiwix.com/ isn't dns resolved [16:19:18] and last planet refresh seems to have cleaned it [16:20:21] I still see it, second entry [16:20:36] * Dereckson updates the commit message too so [16:21:25] (03CR) 10Alexandros Kosiaris: [C: 032] Swap etcd client records to point to codfw [dns] - 10https://gerrit.wikimedia.org/r/350214 (https://phabricator.wikimedia.org/T159687) (owner: 10Alexandros Kosiaris) [16:21:45] sr.planet hasn't had a single blog post since '08 [16:21:59] Actually, only 1 post ever [16:23:30] (03PS2) 10Dereckson: planet: Remove blog.wikiwix.com from fr and en feeds [puppet] - 10https://gerrit.wikimedia.org/r/350240 [16:24:29] PROBLEM - NTP on ms-be1016 is CRITICAL: NTP CRITICAL: Offset unknown [16:26:29] (03PS3) 10Filippo Giunchedi: planet: Remove blog.wikiwix.com from fr and en feeds [puppet] - 10https://gerrit.wikimedia.org/r/350240 (owner: 10Dereckson) [16:28:15] (03CR) 10Filippo Giunchedi: [C: 032] planet: Remove blog.wikiwix.com from fr and en feeds [puppet] - 10https://gerrit.wikimedia.org/r/350240 (owner: 10Dereckson) [16:28:57] Dereckson: merged, I'll take a look at the cache [16:29:26] !log otto@naos Started deploy [eventlogging/eventbus@e7da0cc]: enable wildcard topic config [16:29:30] !log otto@naos Finished deploy [eventlogging/eventbus@e7da0cc]: enable wildcard topic config (duration: 00m 04s) [16:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:23] (03CR) 10Volans: [C: 031] "LGTM, style comments inline" (032 comments) 
[puppet] - 10https://gerrit.wikimedia.org/r/350238 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [16:31:22] (03PS1) 10Chad: Planet: Delete sr.planet [puppet] - 10https://gerrit.wikimedia.org/r/350242 [16:31:44] ottomata: not sure if it is the same repo you are deploying as this https://gerrit.wikimedia.org/r/#/c/347917/ but it might need that review to be merged too [16:31:49] godog: so there is cache directory, with files like ,p= [16:31:53] (03PS1) 10Chad: Drop sr.planet from dns, it's moribund [dns] - 10https://gerrit.wikimedia.org/r/350243 [16:32:06] Oh! [16:32:13] ok i just merged that [16:32:20] same diff, will abandon that one [16:33:02] (03PS1) 10Chad: Planet: Remove planetsun from fellow planet listing [puppet] - 10https://gerrit.wikimedia.org/r/350246 [16:33:11] (03PS3) 10Faidon Liambotis: Fix ipaddress6_primary to ignore deprecated addresses [puppet] - 10https://gerrit.wikimedia.org/r/350238 (https://phabricator.wikimedia.org/T163196) [16:33:51] godog: done? [16:34:21] (03PS2) 10Chad: Planet: Remove Fedora People / planetsun from fellow planet listing [puppet] - 10https://gerrit.wikimedia.org/r/350246 [16:34:34] andrewbogott: in practice yes, all merged [16:34:44] ok, thanks! [16:35:11] The planets are dying! [16:35:28] db2070 seems overloaded [16:35:30] Dereckson: indeed, ok to remove all wikiwix.com caches? [16:35:38] !log otto@naos Started deploy [eventlogging/eventbus@e7da0cc]: enable wildcard topic config [16:35:39] (03PS3) 10Andrew Bogott: Switch labservices1002 to the primary designate/dns server. 
[puppet] - 10https://gerrit.wikimedia.org/r/350239 [16:35:42] godog: yes, I think perhaps remove blog.wikiwix.com,p=* is enough [16:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:48] lots of deletes [16:36:31] !log otto@naos Finished deploy [eventlogging/eventbus@e7da0cc]: enable wildcard topic config (duration: 00m 53s) [16:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:43] something is not right [16:37:49] (03CR) 10Andrew Bogott: [C: 032] Switch labservices1002 to the primary designate/dns server. [puppet] - 10https://gerrit.wikimedia.org/r/350239 (owner: 10Andrew Bogott) [16:38:18] !log stopping nova-api for labservices switchover [16:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:36] s1 is overloaded [16:39:46] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/350238 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [16:39:57] godog: then, to force rebuild, we need to run planet.py manually: /usr/bin/planet /usr/share/planet-venus/wikimedia//config.ini [16:40:09] PROBLEM - nova-api http on labnet1001 is CRITICAL: connect to address 10.64.20.13 and port 8774: Connection refused [16:40:12] one for fr, one for en [16:40:19] PROBLEM - nova-api process on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-api [16:40:19] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [16:40:25] (03CR) 10ArielGlenn: [C: 032] pylint all the things: get rid of camelcase [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350180 (owner: 10ArielGlenn) [16:40:39] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [16:40:55] !log akosiaris@puppetmaster1001 conftool 
action : set/pooled=no; selector: name=mw2256.codfw.wmnet,service=apache2 [16:40:59] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [16:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:21] !log akosiaris@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2256.codfw.wmnet,service=apache2 [16:41:21] https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=1&fullscreen&orgId=1&from=now-1h&to=now [16:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:49] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:41:51] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw2255.codfw.wmnet,service=apache2 [16:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:59] something is wrong since 16:31 [16:42:12] (03PS1) 10ArielGlenn: more junk to .gitignore [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350248 [16:42:28] (03PS1) 10Eevans: Assign a hints directory (Cassandra >= 3.0) [puppet] - 10https://gerrit.wikimedia.org/r/350249 (https://phabricator.wikimedia.org/T160570) [16:42:34] jynus: ottomata's change to eventbus perhaps ? 
[16:42:42] Dereckson: ok [16:42:44] (03CR) 10ArielGlenn: [C: 032] more junk to .gitignore [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350248 (owner: 10ArielGlenn) [16:42:56] !log flush wikiwix cache from planet1001 and rebuild files [16:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:40] jynus: I've had to restart confd across the fleet a few mins ago but I can't see how that would impage slave slag [16:43:43] lag* [16:43:52] impact slave slag* [16:44:00] !log otto@naos Started deploy [eventlogging/eventbus@e7da0cc]: enable wildcard topic config [16:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:20] !log otto@naos Finished deploy [eventlogging/eventbus@e7da0cc]: enable wildcard topic config (duration: 00m 20s) [16:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:02] Hmm puppet failing with [16:45:03] d to determine $::labsproject at /etc/puppet/manifests/realm.pp:41 on node puppet-paladox3.git.eqiad.wmflabs [16:45:42] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2255.codfw.wmnet,service=apache2 [16:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:00] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [16:46:13] I think it passed, but we have 14 minutes of bad writes [16:46:16] 06Operations, 13Patch-For-Review, 15User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3210603 (10akosiaris) [16:46:22] (03PS1) 10Andrew Bogott: Revert "Switch labservices1002 to the primary designate/dns server." 
[puppet] - 10https://gerrit.wikimedia.org/r/350250 [16:46:37] (03CR) 10Eevans: [C: 031] "PC output here: http://puppet-compiler.wmflabs.org/6230" [puppet] - 10https://gerrit.wikimedia.org/r/350249 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [16:46:46] (03CR) 10Andrew Bogott: [V: 032 C: 032] Revert "Switch labservices1002 to the primary designate/dns server." [puppet] - 10https://gerrit.wikimedia.org/r/350250 (owner: 10Andrew Bogott) [16:47:22] godog: https://gerrit.wikimedia.org/r/350249 <-- that would seem to be it for the time being [16:47:24] not sure how i managed that [16:47:36] godog: (if you have the time) [16:48:09] RECOVERY - nova-api http on labnet1001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.074 second response time [16:48:19] RECOVERY - nova-api process on labnet1001 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/nova-api [16:48:19] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [16:48:39] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [16:48:43] 06Operations, 13Patch-For-Review, 15User-Joe: etcd switchover/enhancements - https://phabricator.wikimedia.org/T159687#3210623 (10akosiaris) I 've restarted confd across the fleet after merging the DNS change above in order for it to be picked up by the daemons (5mins had passed and I saw no difference in th... [16:48:44] (03Abandoned) 10Muehlenhoff: role::analytics_cluster::hadoop::standby: Enable base::firewall in the role [puppet] - 10https://gerrit.wikimedia.org/r/341292 (owner: 10Muehlenhoff) [16:48:58] urandom: yep I'll take a look as soon as I'm done with planet [16:49:16] godog: sounds good! 
[16:50:54] !log labservices failover aborted due to cryptic routing/firewall issue [16:50:59] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:09] (03PS4) 10Faidon Liambotis: Fix ipaddress6_primary to ignore deprecated addresses [puppet] - 10https://gerrit.wikimedia.org/r/350238 (https://phabricator.wikimedia.org/T163196) [16:52:34] (03CR) 10Faidon Liambotis: [C: 032] Fix ipaddress6_primary to ignore deprecated addresses [puppet] - 10https://gerrit.wikimedia.org/r/350238 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [16:53:15] !log otto@naos Started deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) [16:53:15] (03PS1) 10Andrew Bogott: Switch labservices1002 to the primary designate/dns server. [puppet] - 10https://gerrit.wikimedia.org/r/350251 (https://phabricator.wikimedia.org/T163402) [16:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:21] Dereckson: LGTM when asking for fr.planet.wikimedia.org/?foo=barz to bypass varnish, the main page will eventually expire too [16:53:22] !log otto@naos Finished deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) (duration: 00m 07s) [16:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:36] !log otto@naos Started deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) [16:53:41] !log flush wikiwix cache from planet2001 and rebuild files [16:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:46] (03CR) 10Andrew Bogott: "This depends on various things (e.g. labcontrol1001) being able to access port 9001 on labcontrol1002. 
That's not currently possible, for" [puppet] - 10https://gerrit.wikimedia.org/r/350251 (https://phabricator.wikimedia.org/T163402) (owner: 10Andrew Bogott) [16:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:01] !log otto@naos Finished deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) (duration: 00m 25s) [16:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:23] (03PS2) 10Filippo Giunchedi: Assign a hints directory (Cassandra >= 3.0) [puppet] - 10https://gerrit.wikimedia.org/r/350249 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170425T1700). [17:00:57] godog: thanks [17:01:24] (03CR) 10Filippo Giunchedi: [C: 032] Assign a hints directory (Cassandra >= 3.0) [puppet] - 10https://gerrit.wikimedia.org/r/350249 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [17:01:46] urandom: ^ merged [17:01:48] godog: thanks! [17:03:53] No ORES today. 
[17:04:19] 06Operations, 13Patch-For-Review: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3210656 (10faidon) [17:08:42] !log otto@naos Started deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) [17:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:54] !log arlolra@naos Started deploy [parsoid/deploy@719d7bd]: Updating Parsoid to 55b90511 [17:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:01] !log otto@naos Finished deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) (duration: 02m 18s) [17:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:19] PROBLEM - eventlogging-service-eventbus endpoints health on kafka1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.11, port=8085): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [17:11:29] PROBLEM - Check systemd state on kafka1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:11:39] PROBLEM - Check that eventlogging-service-eventbus is running on kafka1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args /srv/deployment/eventlogging/eventbus/bin/eventlogging-service @/etc/eventlogging.d/services/eventbus [17:12:28] me ^ [17:12:33] it is also depooled at the moment, and in eqiad [17:13:29] RECOVERY - Check systemd state on kafka1001 is OK: OK - running: The system is fully operational [17:13:39] RECOVERY - Check that eventlogging-service-eventbus is running on kafka1001 is OK: PROCS OK: 9 processes with command name python, args /srv/deployment/eventlogging/eventbus/bin/eventlogging-service @/etc/eventlogging.d/services/eventbus [17:14:19] RECOVERY - eventlogging-service-eventbus endpoints health on kafka1001 is OK: All endpoints are healthy [17:15:19] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1016 is OK: OK ferm input default policy is set [17:17:47] !log otto@naos Started deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) [17:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:55] !log otto@naos Finished deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) (duration: 00m 07s) [17:17:56] !log arlolra@naos Finished deploy [parsoid/deploy@719d7bd]: Updating Parsoid to 55b90511 (duration: 08m 02s) [17:17:59] !log otto@naos Started deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) [17:18:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:07] !log otto@naos Finished deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) (duration: 00m 08s) [17:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:43] !log otto@naos Started deploy 
[eventlogging/eventbus@e7da0cc]: (no justification provided) [17:18:49] !log otto@naos Finished deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) (duration: 00m 05s) [17:18:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:33] !log otto@naos Started deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) [17:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:40] !log otto@naos Finished deploy [eventlogging/eventbus@e7da0cc]: (no justification provided) (duration: 00m 07s) [17:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:53] !log rebooting ruthenium for update to Linux 4.9 [17:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:10] RainbowSprinkles: http://web.archive.org/web/20080516165217/http://sr.planet.wikimedia.org/ <- was more active [17:22:52] 4 posts, 2 to celebrate Planet [17:25:32] !log Updated Parsoid to 55b90511 (T153885, T163330, T89262, T154709, T162919, T161306) [17:25:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:47] T162919: Disable fostering lint error for category and other transparent tags - https://phabricator.wikimedia.org/T162919 [17:25:48] T153885: Parsoid doesn't handle templated template names yet - https://phabricator.wikimedia.org/T153885 [17:25:48] T89262: Read thumb sizes from siteinfo - https://phabricator.wikimedia.org/T89262 [17:25:48] T163330: Error in logs: Cannot read property '0' of undefined - https://phabricator.wikimedia.org/T163330 [17:25:49] T154709: Parsoid does not emit different HTML when the page=# property is set on paged media (PDFs/DjVus/TIFFs) - https://phabricator.wikimedia.org/T154709 [17:25:49] T161306: Investigate P-wrapping oddity that introduces long horizontal no-wrap lines on many navboxes on shwiki - 
https://phabricator.wikimedia.org/T161306 [17:26:48] (03PS2) 10Muehlenhoff: Add symlinks for Debian-packaged Bouncycastle Jars [puppet] - 10https://gerrit.wikimedia.org/r/348762 (https://phabricator.wikimedia.org/T163185) [17:30:50] (03PS2) 10Faidon Liambotis: Replace $::main_ipaddress by the new ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) [17:30:52] (03PS2) 10Faidon Liambotis: Switch add_ip6_mapped to use interface_primary [puppet] - 10https://gerrit.wikimedia.org/r/345568 (https://phabricator.wikimedia.org/T163196) [17:30:54] (03PS1) 10Faidon Liambotis: Rename ipaddress_primary to ipaddress (same for 6) [puppet] - 10https://gerrit.wikimedia.org/r/350254 (https://phabricator.wikimedia.org/T163196) [17:33:24] (03CR) 10Muehlenhoff: [C: 032] Add symlinks for Debian-packaged Bouncycastle Jars [puppet] - 10https://gerrit.wikimedia.org/r/348762 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [17:35:50] !log gerrit: Quick reboot to pick up new bouncycastle library [17:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:39] !log running test schema change on etwiki on eqiad (depooled) T17441 [17:36:44] ah of course that is exactly when I try to push from a newish repo [17:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:47] T17441: Some tables lack unique or primary keys, may allow confusing duplicate data - https://phabricator.wikimedia.org/T17441 [17:36:52] and am sure I have screwed something up, heh [17:37:25] Ok, everything's back [17:37:58] 06Operations, 13Patch-For-Review: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3210803 (10faidon) [17:38:07] (03PS1) 10ArielGlenn: Add option to skip specified namespaces [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350255 (https://phabricator.wikimedia.org/T68661) [17:38:16] so it is, 
thanks! [17:41:29] 06Operations, 13Patch-For-Review: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3210814 (10faidon) All of the afore-mentioned issues should be fixed with the latest patches above. I've also tested Facter's precedence rules (they work!) and s... [17:53:59] PROBLEM - puppet last run on analytics1068 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[cdh::hadoop::directory /user/spark/share/lib] [17:54:30] RECOVERY - NTP on ms-be1016 is OK: NTP OK: Offset 0.0001567900181 secs [17:56:16] hm [17:56:19] weird, looking at ^^ [17:57:59] RECOVERY - puppet last run on analytics1068 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:05:45] 06Operations, 13Patch-For-Review: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3210889 (10Volans) From the audit I got the same results of the tables in T163196#3206314 except the following ones, and all looks good now for the `ipaddress6_p... [18:06:24] 06Operations, 13Patch-For-Review: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3210891 (10Volans) [18:06:39] 06Operations, 13Patch-For-Review: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3189426 (10Volans) [18:17:34] 06Operations, 06Labs, 10hardware-requests: eqiad: (1) hardware access request for dedicated labmon1002 - https://phabricator.wikimedia.org/T161750#3210919 (10RobH) @chasemp: You list 32 cores, but labmon1001 has dual 8 core CPUs, for a total of 16 actual cores. It then has hyperthreading enabled, and shows... [18:24:57] 06Operations: Production Shell access denied - https://phabricator.wikimedia.org/T163568#3210981 (10Capt_Swing) Thank you, @MoritzMuehlenhoff. 
I've pasted my new rsa public key as instructed: P5328 [18:27:58] (03PS3) 10Dzahn: site/icinga: unify einsteinium/tegmen in single node section [puppet] - 10https://gerrit.wikimedia.org/r/350107 [18:31:07] (03PS4) 10Dzahn: site/icinga: unify einsteinium/tegmen in single node section [puppet] - 10https://gerrit.wikimedia.org/r/350107 [18:33:07] !log Deployment Train: Branching mediawiki wmf/1.29.0-wmf.21 from master refs T161733 [18:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:18] T161733: MW-1.29.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T161733 [18:33:30] RECOVERY - MariaDB Slave Lag: s1 on db1047 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:35:48] (03CR) 10Dzahn: [C: 032] "thanks for reviews. i added the comment which server is eqiad/codfw, per comments from Moritz" [puppet] - 10https://gerrit.wikimedia.org/r/350107 (owner: 10Dzahn) [18:37:57] (03PS2) 10Andrew Bogott: Switch labservices1002 to the primary designate/dns server. [puppet] - 10https://gerrit.wikimedia.org/r/350251 (https://phabricator.wikimedia.org/T163402) [18:38:10] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.456 second response time [18:38:52] ^madhu a consequence of the io load on the master^? [18:38:54] !log disabling nova-api for another try at labservices failover [18:38:55] madhuvishy: ^ [18:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:15] (03CR) 10Andrew Bogott: [C: 032] Switch labservices1002 to the primary designate/dns server. 
[puppet] - 10https://gerrit.wikimedia.org/r/350251 (https://phabricator.wikimedia.org/T163402) (owner: 10Andrew Bogott) [18:41:30] (03CR) 10Dzahn: "alright, i see your point how it would break things, but i would still argue that systemd does it against Filesystem Hierarchy Standard. " [puppet] - 10https://gerrit.wikimedia.org/r/348665 (owner: 10Dzahn) [18:41:39] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [18:41:43] (03Abandoned) 10Dzahn: base::service_unit: add symlink from /etc into /var for systemd units [puppet] - 10https://gerrit.wikimedia.org/r/348665 (owner: 10Dzahn) [18:41:59] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [18:42:09] PROBLEM - nova-api http on labnet1001 is CRITICAL: connect to address 10.64.20.13 and port 8774: Connection refused [18:42:19] PROBLEM - nova-api process on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-api [18:42:19] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [18:42:59] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [18:43:09] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [18:43:09] RECOVERY - nova-api http on labnet1001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.075 second response time [18:43:19] RECOVERY - nova-api process on labnet1001 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/nova-api [18:43:19] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [18:44:42] ^ this is more or less expected i believe as andrewbogott is doing some maint [18:45:09] some of them are, at least [18:46:09] PROBLEM - nova-api http on labnet1001 is CRITICAL: connect to address 10.64.20.13 and port 8774: Connection refused [18:46:19] PROBLEM - nova-api process on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-api [18:46:59] PROBLEM - designate-api http on labservices1001 is CRITICAL: connect to address 208.80.155.117 and port 9001: Connection refused [18:47:05] PROBLEM - designate-central process on labservices1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-central [18:47:06] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [18:47:25] PROBLEM - designate-api process on labservices1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-api [18:47:32] PROBLEM - designate-mdns process on labservices1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-mdns [18:47:41] getting pages, all under control? 
[18:47:48] PROBLEM - designate-sink process on labservices1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-sink [18:47:54] volans: yes I think but andrewbogott knows better [18:47:55] PROBLEM - designate-pool-manager process on labservices1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-pool-manager [18:47:55] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [18:47:56] I'll silence [18:48:03] labs pages coming in [18:48:24] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.702 second response time [18:48:36] <_joe_> nothing needs to be done I gather [18:48:50] andrewbogott: I'm putting labservices1001 in downtime, _joe_ well I think andrewbogott is already doing it [18:49:07] ack [18:49:08] things should be clearing up now [18:49:32] sorry for the noise — I had a typo in one of my commands that delayed everything [18:49:34] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [18:49:49] ^ madhuvishy can you see why that's failing still? 
[18:50:01] chasemp: yeah looking [18:50:08] !log downtime labservices1001 as we fail away from it and puppet staleness on labservices1002 [18:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:24] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [18:50:24] RECOVERY - nova-api http on labnet1001 is OK: HTTP OK: HTTP/1.1 200 OK - 499 bytes in 0.074 second response time [18:50:34] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [18:50:36] ^ madhuvishy that could be why if nova-api was down till now... [18:51:20] chasemp: it is currently fine, poking logs [18:51:24] kk [18:52:27] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [18:52:37] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:53:07] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [18:54:07] RECOVERY - nova-api process on labnet1001 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/nova-api [18:55:18] !log restart nova-fullstack on labnet1001 [18:55:21] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3211117 (10RobH) a:05RobH>03chasemp We don't do 4 CPU options. So we can toss in dual Intel CPUs with more cores, but we don't have anything that has that many ac... [18:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:54] lovely... [18:56:10] somehow my new branch got named wmf/ew [18:56:23] which is somehow appropriate [18:57:50] Eww! 
[18:58:19] (03PS2) 10ArielGlenn: Add option to skip specified namespaces [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350255 (https://phabricator.wikimedia.org/T68661) [18:59:51] (03PS1) 10Dzahn: admins: replace SSH key for jmorgan [puppet] - 10https://gerrit.wikimedia.org/r/350263 (https://phabricator.wikimedia.org/T163568) [19:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170425T1900). Please do the needful. [19:01:46] (03CR) 10ArielGlenn: [C: 032] Add option to skip specified namespaces [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350255 (https://phabricator.wikimedia.org/T68661) (owner: 10ArielGlenn) [19:02:22] (03PS2) 10Dzahn: admins: replace SSH key for jmorgan [puppet] - 10https://gerrit.wikimedia.org/r/350263 (https://phabricator.wikimedia.org/T163568) [19:06:24] (03CR) 10Dzahn: [C: 032] "key as provided by jmorgan in phab, phab user is linked to MW (WMF) user. IRC cloak also matches." [puppet] - 10https://gerrit.wikimedia.org/r/350263 (https://phabricator.wikimedia.org/T163568) (owner: 10Dzahn) [19:08:04] 06Operations, 13Patch-For-Review: Production Shell access denied (update SSH key for jmorgan) - https://phabricator.wikimedia.org/T163568#3211158 (10Dzahn) [19:09:35] 06Operations, 13Patch-For-Review: Production Shell access denied (update SSH key for jmorgan) - https://phabricator.wikimedia.org/T163568#3202018 (10Dzahn) @Capt_Swing Your key has been replaced on stat1003 and bast1001 (and other bastions puppet will do it soon). It should work now again. 
[19:16:43] bawolff: :D exactly [19:16:52] * twentyafterfour managed to rename the branches [19:17:53] (03PS1) 10Madhuvishy: openstack: Temporarily disable self-service instance creation [puppet] - 10https://gerrit.wikimedia.org/r/350264 [19:20:16] (03CR) 10Rush: [C: 031] openstack: Temporarily disable self-service instance creation [puppet] - 10https://gerrit.wikimedia.org/r/350264 (owner: 10Madhuvishy) [19:20:23] (03PS1) 10ArielGlenn: toss extra whitespace [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350265 [19:21:33] 06Operations, 13Patch-For-Review: Production Shell access denied (update SSH key for jmorgan) - https://phabricator.wikimedia.org/T163568#3211271 (10Dzahn) 05Open>03Resolved a:03Dzahn We talked on IRC. It works again. [19:22:40] (03CR) 10Madhuvishy: [C: 032] openstack: Temporarily disable self-service instance creation [puppet] - 10https://gerrit.wikimedia.org/r/350264 (owner: 10Madhuvishy) [19:31:38] (03PS1) 10Madhuvishy: openstack: Also disable instance creation on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350266 [19:31:59] (03CR) 10ArielGlenn: [C: 032] toss extra whitespace [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350265 (owner: 10ArielGlenn) [19:32:03] (03PS1) 10Madhuvishy: Revert "openstack: Temporarily disable self-service instance creation" [puppet] - 10https://gerrit.wikimedia.org/r/350267 [19:33:09] (03CR) 10Madhuvishy: [V: 032 C: 032] Revert "openstack: Temporarily disable self-service instance creation" [puppet] - 10https://gerrit.wikimedia.org/r/350267 (owner: 10Madhuvishy) [19:37:04] (03PS2) 10Madhuvishy: openstack: Temporarily disable self service instance creation and deletion on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350266 [19:38:38] (03CR) 10Dzahn: [C: 032] repeat hostname for each record where missing in server list [dns] - 10https://gerrit.wikimedia.org/r/350104 (owner: 10Dzahn) [19:38:46] (03PS2) 10Dzahn: repeat hostname for each record where missing in server list [dns] - 
10https://gerrit.wikimedia.org/r/350104 [19:40:03] (03PS1) 10ArielGlenn: remove the xmlfileutils from the ariel branch, leave pointer [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/350269 [19:41:08] (03CR) 10ArielGlenn: [C: 032] remove the xmlfileutils from the ariel branch, leave pointer [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/350269 (owner: 10ArielGlenn) [19:42:11] (03PS1) 10Niharika29: Turn off LoginNotify notifications for succesful logins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350270 (https://phabricator.wikimedia.org/T163816) [19:42:41] (03PS1) 10Cmjohnson: Adding production dns for db1106 [dns] - 10https://gerrit.wikimedia.org/r/350271 [19:44:00] (03PS2) 10Cmjohnson: Adding production dns for db1106 [dns] - 10https://gerrit.wikimedia.org/r/350271 [19:44:16] 06Operations, 10hardware-requests: codfw: (3) hardware access request for labtest - https://phabricator.wikimedia.org/T154664#3211370 (10RobH) [19:44:52] (03CR) 10Cmjohnson: [C: 032] Adding production dns for db1106 [dns] - 10https://gerrit.wikimedia.org/r/350271 (owner: 10Cmjohnson) [19:46:11] (03CR) 10Niharika29: [C: 032] "Gonna +2 this. Addresses a UBN in beta cluster." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350270 (https://phabricator.wikimedia.org/T163816) (owner: 10Niharika29) [19:47:26] (03Merged) 10jenkins-bot: Turn off LoginNotify notifications for succesful logins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350270 (https://phabricator.wikimedia.org/T163816) (owner: 10Niharika29) [19:47:35] (03CR) 10jenkins-bot: Turn off LoginNotify notifications for succesful logins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/350270 (https://phabricator.wikimedia.org/T163816) (owner: 10Niharika29) [19:47:54] Niharika: Are you planning to deploy that to production too? [19:48:10] RainbowSprinkles: Nope. [19:48:24] RainbowSprinkles: If you mean the patch, that is. [19:48:35] Then you should not have merged it yet. 
[19:48:36] * RainbowSprinkles sighs [19:48:53] Oh boy. Did I earn a sticker? [19:49:12] I thought Labs stuff could be +2d and it'll show up on Labs. [19:49:13] Best practice says that beta-only changes also get sync'd to production :) [19:49:19] So people aren't confused later [19:49:51] (03PS3) 10Madhuvishy: openstack: Temporarily disable self service instance creation and deletion on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350266 [19:50:00] RainbowSprinkles: Even if it is synced to prod, it won't change anything because that patch only modifies CommonSettings-labs. [19:50:20] I know [19:50:28] But it's best practice to sync it anyway [19:50:33] (03PS1) 10Cmjohnson: Adding mac address for db1106 [puppet] - 10https://gerrit.wikimedia.org/r/350274 [19:50:34] I'm doing it now [19:50:37] RainbowSprinkles: If you think that's better, I can sync it to prod as part of the swat later. [19:50:42] Okay. Thanks! [19:50:54] !log demon@naos Synchronized wmf-config/CommonSettings-labs.php: no-op, beta change (duration: 01m 58s) [19:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:07] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [19:51:28] RainbowSprinkles: For future, I should schedule Labs stuff for SWAT to be safe? [19:51:33] Yes [19:51:39] Or jfdi ;-) [19:51:48] (03CR) 10Cmjohnson: [C: 032] Adding mac address for db1106 [puppet] - 10https://gerrit.wikimedia.org/r/350274 (owner: 10Cmjohnson) [19:51:52] :) Gotcha. 
[19:55:41] (03PS4) 10Madhuvishy: openstack: Temporarily disable self service instance creation and deletion on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350266 [19:56:07] Niharika: I tried to do the same thing earlier today :) [19:57:38] (03CR) 10Madhuvishy: [C: 032] openstack: Temporarily disable self service instance creation and deletion on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350266 (owner: 10Madhuvishy) [19:57:44] (03PS5) 10Madhuvishy: openstack: Temporarily disable self service instance creation and deletion on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350266 [19:57:56] (03CR) 10Madhuvishy: [V: 032 C: 032] openstack: Temporarily disable self service instance creation and deletion on horizon [puppet] - 10https://gerrit.wikimedia.org/r/350266 (owner: 10Madhuvishy) [20:00:26] !log Labs instance creation and deletion on horizon temporarily disabled via https://gerrit.wikimedia.org/r/350266 [20:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:07] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [20:10:33] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labcontrol1003/1004 - https://phabricator.wikimedia.org/T158207#3211444 (10RobH) [20:12:24] 06Operations, 10ops-eqiad, 10netops: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3211448 (10Cmjohnson) @ottomata I would like to do this first thing in the morning (0830) 04/26 before the racks are shutdown. I will update this task with the switc... [20:16:07] 06Operations, 06Labs, 10hardware-requests: Eqiad: (2) hardware access request for labnet1003/1004 - https://phabricator.wikimedia.org/T158204#3211463 (10RobH) I chatted with Chase about this, and the dual 12 core for a total presented (in /etc/cpuinfo) will show 48, since 12 (cores per cpu) * 2 (cpus) * 2 (e... 
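A quick sketch of the core-count arithmetic in the T158204 comment just above. The comment is cut off after the second "* 2", so reading that last factor as threads per core (hyperthreading) is an assumption here, and the function and parameter names are made up for illustration.

```python
def logical_cpus(cores_per_cpu: int, cpus: int, threads_per_core: int) -> int:
    """Processors the OS would list, under the assumed three factors."""
    return cores_per_cpu * cpus * threads_per_core


# Figures from the comment: 12 cores per CPU and 2 CPUs. The final factor
# of 2 is ASSUMED to be hyperthreading; the log truncates before saying so.
print(logical_cpus(12, 2, 2))  # 48
```

With the last factor set to 1 the same function gives the physical core count, for example 16 for a dual 8-core host.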
[20:16:51] 06Operations, 10ops-eqiad, 15User-fgiunchedi: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#3211469 (10Cmjohnson) p:05High>03Triage Received the new controller card, installed it and the server would not boot to the logical drives. I booted into the raid bios and see that... [20:20:29] (03PS1) 10Cmjohnson: Removing dhcp entry for decom server ms1003 T157975 [puppet] - 10https://gerrit.wikimedia.org/r/350276 [20:23:51] (03PS1) 10Cmjohnson: Removing dns entries for ms1003, server is decom'd and unracked T157975 [dns] - 10https://gerrit.wikimedia.org/r/350278 [20:24:02] (03PS2) 10Cmjohnson: Removing dns entries for ms1003, server is decom'd and unracked T157975 [dns] - 10https://gerrit.wikimedia.org/r/350278 [20:24:19] (03PS2) 10Cmjohnson: Removing dhcp entry for decom server ms1003 T157975 [puppet] - 10https://gerrit.wikimedia.org/r/350276 [20:24:25] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for ms1003, server is decom'd and unracked T157975 [dns] - 10https://gerrit.wikimedia.org/r/350278 (owner: 10Cmjohnson) [20:24:43] 06Operations, 06Labs: Investigate ceasing self-service new Trusty instance creation in Labs - https://phabricator.wikimedia.org/T161899#3211512 (10Andrew) Another thing I'd suggest is that we get Stretch available to users before we start pushing them off Trusty. Jessie isn't in support for much longer than T... [20:26:05] (03CR) 10Cmjohnson: [C: 032] Removing dhcp entry for decom server ms1003 T157975 [puppet] - 10https://gerrit.wikimedia.org/r/350276 (owner: 10Cmjohnson) [20:27:05] 06Operations, 10ops-eqiad, 13Patch-For-Review: decommission ms1003 - https://phabricator.wikimedia.org/T157975#3211517 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson removed from racktables....resolved. [20:27:58] 06Operations, 10ops-eqiad: Degraded RAID on ocg1001 - https://phabricator.wikimedia.org/T161158#3211522 (10Cmjohnson) I will replace the disk again. The disk I used was a "used" disk but was wiped. 
[20:29:06] 06Operations, 10hardware-requests: eqiad: (2) hardware access request for dedicated Labs puppetmasters - https://phabricator.wikimedia.org/T147053#2679484 (10RobH) I'll create a #procuement sub-task for quotation and ordering. If these are indeed puppetmasters, we may want to model them off our production pup... [20:36:37] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 26 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:37:37] (03PS1) 10ArielGlenn: tabs to spaces, blame mutante. [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350282 [20:39:03] 06Operations, 06Labs: During labservices1001 failover fqdn changed from foo.project.eqiad.wmflabs to foo.eqiad.wmflabs - https://phabricator.wikimedia.org/T163823#3211561 (10chasemp) [20:41:37] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 16 probes of 282 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [20:43:12] 06Operations, 06Labs: During labservices1001 failover fqdn changed from foo.project.eqiad.wmflabs to foo.eqiad.wmflabs - https://phabricator.wikimedia.org/T163823#3211590 (10chasemp) I see a few requested certs for the foo.eqiad.wmflabs pattern on the Tools puppet master: ```root@tools-puppetmaster-02:~# pupp... [20:43:43] 06Operations, 06Labs: Investigate ceasing self-service new Trusty instance creation in Labs - https://phabricator.wikimedia.org/T161899#3211603 (10Paladox) +1 to stretch. [20:44:30] 06Operations, 10hardware-requests, 13Patch-For-Review, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3211604 (10Cmjohnson) @robh ms-fe1001 and 1002 switch ports have been allocated to other servers. ms-ff1003 and 1004 were still labeled but I have since removed... 
[20:44:52] 06Operations, 10hardware-requests, 13Patch-For-Review, 15User-fgiunchedi: Decommission ms-fe100[1-4] - https://phabricator.wikimedia.org/T160986#3211618 (10Cmjohnson) [20:49:46] 06Operations, 06Labs: During labservices1001 failover fqdn changed from foo.project.eqiad.wmflabs to foo.eqiad.wmflabs - https://phabricator.wikimedia.org/T163823#3211653 (10chasemp) [20:50:49] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3211669 (10Cmjohnson) Incoming ticket opened with equinix Order Number 1-101081657220 [20:59:23] 06Operations, 10ops-eqiad, 10netops: Spread eqiad analytics Kafka nodes to multiple racks ans rows - https://phabricator.wikimedia.org/T163002#3211701 (10Ottomata) Hm, we got a problem! These Kafka nodes are in the Analytics VLAN networks, AND have IPv6 configured. There is no IPv6 VLAN setup in Row B. I'... [21:05:02] (03CR) 10ArielGlenn: [C: 032] tabs to spaces, blame mutante. [dumps/import-tools] - 10https://gerrit.wikimedia.org/r/350282 (owner: 10ArielGlenn) [21:07:28] !log twentyafterfour@naos Started scap: sync 1.29.0-wmf.21 to testwikis (pre-group0) refs T161733 [21:07:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:37] T161733: MW-1.29.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T161733 [21:09:24] !log twentyafterfour@naos scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_3498979833" --threads=30 --lang en --quiet' returned non-zero exit status 1 (duration: 01m 56s) [21:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:13:38] (03PS3) 10Volans: Replace $::main_ipaddress by the new ipaddress fact [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [21:13:46] wtf [21:16:52] (03CR) 10Volans: "@paravoid: I've updated this CR temporary without 
removing the comment on nrpe_local.cfg.erb, because the full puppet compiler run was sho" [puppet] - 10https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: 10Faidon Liambotis) [21:23:03] !log twentyafterfour@naos Started scap: sync 1.29.0-wmf.21 to testwikis (pre-group0) refs T161733 (attempt #2) [21:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:12] T161733: MW-1.29.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T161733 [21:23:58] !log twentyafterfour@naos scap failed: CalledProcessError Command 'cp -r "/tmp/scap_l10n_2414756836"/* "/srv/mediawiki-staging/php-1.29.0-wmf.21/cache/l10n"' returned non-zero exit status 1 (duration: 00m 54s) [21:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:25] tto: ping [21:30:07] !log twentyafterfour@naos Started scap: sync 1.29.0-wmf.21 to testwikis (pre-group0) refs T161733 (attempt #3) [21:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:15] T161733: MW-1.29.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T161733 [21:32:39] matanya: What's up? [21:32:59] tto: First i wanted to thank you for the expiring rights [21:33:08] saved me days and weeks [21:33:15] no worries! 
[21:33:28] tto: wanted to ask if you plan to do the same for global rights
[21:33:30] (PS11) Dzahn: mediawiki::maintenance: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/342777
[21:33:36] i.e ipbe
[21:33:53] !log twentyafterfour@naos scap failed: CalledProcessError Command 'cp -r "/tmp/scap_l10n_930292683"/* "/srv/mediawiki-staging/php-1.29.0-wmf.21/cache/l10n"' returned non-zero exit status 1 (duration: 03m 46s)
[21:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:32] matanya, I plan to get to it eventually, unless someone else does it first
[21:34:52] The community wishlist doesn't actually mention global groups, so I consider that wishlist entry done
[21:35:04] But I can see the need for global groups to expire as well
[21:35:17] There's a Phab task for it that you might want to subscribe to, can't find the number now
[21:35:57] tto: i you find, i'd love to watch, would save me the last bit of time :) many thanks again
[21:36:04] *if you find
[21:36:41] matanya, it's linked from the stewards' noticeboard on Meta
[21:36:44] (CR) Dzahn: "needed manual rebase because of icinga change in site.pp i merged earlier" [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[21:36:48] thanks tto
[21:38:15] !log twentyafterfour@naos Started scap: sync 1.29.0-wmf.21 to testwikis (pre-group0) refs T161733 (attempt #4)
[21:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:38:24] T161733: MW-1.29.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T161733
[21:38:32] (CR) BBlack: [C: 2] debian patch: main source to nginx-1.11.13 [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348585 (owner: BBlack)
[21:38:36] (CR) BBlack: [C: 2] debian patches: forward-port WMF patches and quilt refresh [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348586 (owner: BBlack)
[21:38:39] (CR) BBlack: [C: 2] Add nginx-echo 1.11.x fixup patch [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/350177 (owner: BBlack)
[21:38:41] (CR) BBlack: [C: 2] nginx lua module fixups [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/350178 (owner: BBlack)
[21:38:45] (CR) BBlack: [C: 2] control: depend on libssl11-dev [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348587 (owner: BBlack)
[21:38:48] (CR) BBlack: [C: 2] Create nginx-{full,light,extras}-dbg by hand. [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348589 (owner: BBlack)
[21:38:51] (CR) BBlack: [C: 2] nginx (1.11.10-1+wmf1) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.10) - https://gerrit.wikimedia.org/r/348591 (owner: BBlack)
[21:41:54] !log twentyafterfour@naos scap failed: CalledProcessError Command 'cp -r "/tmp/scap_l10n_66989801"/* "/srv/mediawiki-staging/php-1.29.0-wmf.21/cache/l10n"' returned non-zero exit status 1 (duration: 03m 38s)
[21:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:16] WTF!!!
[21:42:18] cp: target ‘/srv/mediawiki-staging/php-1.29.0-wmf.21/cache/l10n’ is not a directory
[21:42:27] bs! it is
[21:42:44] gr then it isn't
[21:43:20] 3.6M -rw-rw-r-- 1 l10nupdate wikidev 3.6M Apr 25 21:39 l10n
[21:43:20] [naos:/srv/mediawiki-staging/php-1.29.0-wmf.21/cache] $ file l10n
[21:43:20] l10n: cannot open `l10n' (No such file or directory)
[21:43:25] !log twentyafterfour@naos Started scap: sync 1.29.0-wmf.21 to testwikis (pre-group0) refs T161733 (attempt #5)
[21:43:25] wut?
[21:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:33] I don't get it
[21:43:33] T161733: MW-1.29.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T161733
[21:43:38] that's weird
[21:44:01] mutante: I removed and remade it as you were looking
[21:44:08] sorry
[21:44:10] and now it's a normal dir
[21:44:15] twentyafterfour: ok :)
[21:44:38] was this only on naos or on all 3 deploy servers
[21:44:54] drwxr-sr-x 2 l10nupdate wikidev 4096 Apr 25 21:44 l10n
[21:45:07] mutante: I'm only looking on naos
[21:45:35] scap keeps failing for various reasons because the branch creation script went haywire and I've been fixing all the fallout for the past couple of hours :(
[21:45:39] * twentyafterfour should have started fresh
[21:45:46] instead of trying to fix it
[21:46:12] sunk investment fallacy
[21:46:17] (PS1) Tjones: Enable BM25 for Chinese wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/350312 (https://phabricator.wikimedia.org/T163829)
[21:46:27] PROBLEM - designate-mdns process on labtestservices2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/designate-mdns
[21:46:31] twentyafterfour: ugh :/ ok. so on tin and mira /srv/mediawiki-staging/php-1.29.0-wmf.21/cache exists but not that file in there
[21:46:43] which is probably normal then
[21:48:26] phew! finally made it past localistation cache
[21:48:32] it's syncing now
[21:48:34] thanks mutante
[21:48:45] :)
[21:59:53] Operations, Labs: During labservices1001 failover fqdn changed from foo.project.eqiad.wmflabs to foo.eqiad.wmflabs - https://phabricator.wikimedia.org/T163823#3212177 (Andrew) Just turning off various dns services (including mdns) does not reproduce this issue. The change is probably in the puppet 'fqdn...
[22:02:07] Operations, hardware-requests: eqiad: (2) hardware access request for dedicated Labs puppetmasters - https://phabricator.wikimedia.org/T147053#3212182 (RobH) After reviewing the other labs hardware requests currently open, this one happens to be identical to the requirements for T154706 and T161764, whic...
[22:02:24] (PS1) Madhuvishy: Revert "openstack: Temporarily disable self service instance creation and deletion on horizon" [puppet] - https://gerrit.wikimedia.org/r/350315
[22:02:31] (PS2) Madhuvishy: Revert "openstack: Temporarily disable self service instance creation and deletion on horizon" [puppet] - https://gerrit.wikimedia.org/r/350315
[22:02:58] !log causing an intentional outage of labs-ns0 and labs-recursor0 to make sure we're properly girded for tomorrow's switch replacement.
[22:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:17] !log twentyafterfour@naos Finished scap: sync 1.29.0-wmf.21 to testwikis (pre-group0) refs T161733 (attempt #5) (duration: 21m 52s)
[22:05:20] (CR) Madhuvishy: [C: 2] Revert "openstack: Temporarily disable self service instance creation and deletion on horizon" [puppet] - https://gerrit.wikimedia.org/r/350315 (owner: Madhuvishy)
[22:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:05:36] T161733: MW-1.29.0-wmf.21 deployment blockers - https://phabricator.wikimedia.org/T161733
[22:05:57] PROBLEM - Recursive DNS on 208.80.155.118 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[22:06:27] PROBLEM - Check for gridmaster host resolution UDP on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[22:07:52] ACKNOWLEDGEMENT - Check for gridmaster host resolution UDP on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call andrew bogott this a test, on purpose
[22:08:18] !log Reenabled labs instance creation and deletion on horizon
[22:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:25] (PS1) 20after4: group0 wikis to 1.29.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/350317
[22:10:27] (CR) 20after4: [C: 2] group0 wikis to 1.29.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/350317 (owner: 20after4)
[22:11:55] (Merged) jenkins-bot: group0 wikis to 1.29.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/350317 (owner: 20after4)
[22:12:07] (CR) jenkins-bot: group0 wikis to 1.29.0-wmf.21 [mediawiki-config] - https://gerrit.wikimedia.org/r/350317 (owner: 20after4)
[22:13:06] (CR) Dzahn: "i'll take the +1 from PS3 and the linked compiler output that shows no-op on terbium and go ahead. then i'll follow-up with a change disab" [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[22:13:12] !log twentyafterfour@naos rebuilt wikiversions.php and synchronized wikiversions files: group0 wikis to 1.29.0-wmf.21
[22:13:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:22] (PS12) Dzahn: mediawiki::maintenance: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/342777
[22:13:35] success!
[22:17:52] twentyafterfour: Don't run `scap clean --keep-static` on wmf.19 just yet. I wanna land a few more fixups :)
[22:18:18] (CR) Dzahn: [C: 2] mediawiki::maintenance: convert to profile/role [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[22:22:10] !log mediawiki maintenance servers: making wasat identical to terbium. wasat is currently the active server running crons. no change there at all. on terbium where crons are inactive, some log files were removed
[22:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:08] !log re-enabling dns on labservices1001
[22:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:34] !log mediawiki maintenance servers: last log entry was _before_ merging https://gerrit.wikimedia.org/r/#/c/342777/ and making a change
[22:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:48] RECOVERY - Recursive DNS on 208.80.155.118 is OK: DNS OK: 0.052 seconds response time. www.wikipedia.org returns 208.80.153.224
[22:25:17] RECOVERY - Check for gridmaster host resolution UDP on labs-ns0.wikimedia.org is OK: DNS OK - 0.047 seconds response time (tools-grid-master.tools.eqiad.wmflabs. 60 IN A 10.68.20.158)
[22:25:20] that's labs-recursor.. ok.. i get it then
[22:25:26] was suprised for a moment
[22:27:07] (CR) Dzahn: "no-op on terbium. wasat is getting the additional resources from openldap::management. maintenance crons were running on wasat and are com" [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[22:33:20] RainbowSprinkles: ok
[22:33:27] RECOVERY - designate-mdns process on labtestservices2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/designate-mdns
[22:34:42] (Abandoned) Chad: Scap clean: Also delete empty directories [mediawiki-config] - https://gerrit.wikimedia.org/r/347057 (owner: Chad)
[22:37:23] twentyafterfour: Basically, I want to land https://gerrit.wikimedia.org/r/#/c/347633/ first, but `find` doesn't behave nicely when you combine -delete with -prune
[22:37:33] (because -prune implies -depth??)
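Editor's note on the `find` puzzle above: in GNU find it is -delete that turns on -depth, not -prune, and -prune has no effect once -depth is in force, so the two cannot usefully be combined in one expression. The underlying conflict is about traversal order: pruning has to be decided top-down, before descending, while removing directories has to happen bottom-up, after their contents are gone. Below is a minimal Python sketch of doing both in two passes. The function name and the directory names `static` and `cache` are hypothetical illustrations, not taken from scap or from the Gerrit change mentioned in the log.

```python
import os
import tempfile


def delete_except_pruned(root, pruned_names):
    """Delete all files under root, never entering directories named in pruned_names."""
    removed = []
    visited = []
    # Pass 1, top-down: pruning only works if we decide before descending.
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in pruned_names]
        visited.append(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            os.remove(path)
            removed.append(path)
    # Pass 2, bottom-up: reversed top-down order visits children before parents,
    # so a directory is only checked once its subdirectories have been handled.
    for dirpath in reversed(visited):
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
    return sorted(removed)


def demo():
    """Build a throwaway tree, prune 'static', and report what was removed and what is left."""
    with tempfile.TemporaryDirectory() as root:
        for rel in ("static/keep.txt", "cache/old.txt", "cache/sub/older.txt"):
            path = os.path.join(root, *rel.split("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as fh:
                fh.write("x")
        removed = delete_except_pruned(root, {"static"})
        removed_rel = [os.path.relpath(p, root) for p in removed]
        remaining = sorted(os.listdir(root))
        return removed_rel, remaining


if __name__ == "__main__":
    print(demo())
```

With find itself, the usual workarounds are to exclude paths with a test such as `! -path` instead of -prune, or to drop -delete and hand the matches to `-exec rm`; which of those fits the change under discussion is not something the log settles.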
[22:37:39] * RainbowSprinkles puzzles again
[22:42:09] (PS1) Dzahn: openldap::mgmt: turn cross-validate-accounts into template [puppet] - https://gerrit.wikimedia.org/r/350325
[22:47:08] (CR) jerkins-bot: [V: -1] openldap::mgmt: turn cross-validate-accounts into template [puppet] - https://gerrit.wikimedia.org/r/350325 (owner: Dzahn)
[22:50:47] (PS4) Volans: Replace $::main_ipaddress by the new ipaddress fact [puppet] - https://gerrit.wikimedia.org/r/345569 (https://phabricator.wikimedia.org/T163196) (owner: Faidon Liambotis)
[22:52:51] (PS2) Dzahn: openldap::mgmt: turn cross-validate-accounts into template [puppet] - https://gerrit.wikimedia.org/r/350325
[22:57:31] (CR) Dzahn: [C: 2] "templates/wikimedia.org:ldap-labs.eqiad 1H IN CNAME seaborgium" [puppet] - https://gerrit.wikimedia.org/r/350325 (owner: Dzahn)
[23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170425T2300). Please do the needful.
[23:02:54] (CR) Dzahn: "no-op on terbium. and on wasat:" [puppet] - https://gerrit.wikimedia.org/r/350325 (owner: Dzahn)
[23:12:22] (PS1) Dzahn: openldap::mgmt: only run account-validation script in $::mw_primary [puppet] - https://gerrit.wikimedia.org/r/350327
[23:26:08] (CR) Dzahn: [C: 2] "i confirmed running the script works fine on wasat and has same result as running it on terbium" [puppet] - https://gerrit.wikimedia.org/r/350327 (owner: Dzahn)
[23:27:07] Operations, Multimedia, media-storage, User-fgiunchedi: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3212444 (Revent) @fgiunchedi https://commons.wikimedia.org/wiki/File:Autonomous_bus_trials_South...
[23:32:37] (PS1) Dzahn: admins: quiddity in WMF group but missing in LDAP user list [puppet] - https://gerrit.wikimedia.org/r/350328
[23:42:53] (CR) Dzahn: "cron was removed on terbium and is active on wasat" [puppet] - https://gerrit.wikimedia.org/r/350327 (owner: Dzahn)
[23:44:50] (CR) Dzahn: "follow-up 1: https://gerrit.wikimedia.org/r/#/c/350325/" [puppet] - https://gerrit.wikimedia.org/r/342777 (owner: Dzahn)
[23:57:05] Operations, Patch-For-Review: ircecho - /etc/default/ircecho puppet issue - https://phabricator.wikimedia.org/T163476#3212596 (Dzahn) Open>Invalid