[00:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180308T0000). [00:00:05] tgr, Smalyshev, and bd808: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:04:13] here [00:05:16] o/ [00:06:28] I can SWAT [00:08:07] (03PS2) 10Thcipriani: Enable loginOnly mode for local auth provider on group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416631 (https://phabricator.wikimedia.org/T57420) (owner: 10Gergő Tisza) [00:08:22] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416631 (https://phabricator.wikimedia.org/T57420) (owner: 10Gergő Tisza) [00:09:34] (03Merged) 10jenkins-bot: Enable loginOnly mode for local auth provider on group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416631 (https://phabricator.wikimedia.org/T57420) (owner: 10Gergő Tisza) [00:10:19] (03CR) 10jenkins-bot: Enable loginOnly mode for local auth provider on group 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416631 (https://phabricator.wikimedia.org/T57420) (owner: 10Gergő Tisza) [00:10:43] tgr: ^ is live on mwdebug1002, check please [00:12:05] thcipriani: the site is not broken, which is all I can easily check [00:13:40] tgr: fair enough, going live :) [00:13:52] thx! 
[00:16:45] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:416631|Enable loginOnly mode for local auth provider on group 2]] T57420 (duration: 01m 16s) [00:16:50] tgr: ^ live now [00:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:17:01] T57420: Remove local wiki password hash when CentralAuth has attached account - https://phabricator.wikimedia.org/T57420 [00:17:25] (03PS4) 10Thcipriani: Add configuration for CirrusSearch to instantly index new Wikidata items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413899 (https://phabricator.wikimedia.org/T183053) (owner: 10Smalyshev) [00:17:45] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413899 (https://phabricator.wikimedia.org/T183053) (owner: 10Smalyshev) [00:19:38] (03Merged) 10jenkins-bot: Add configuration for CirrusSearch to instantly index new Wikidata items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413899 (https://phabricator.wikimedia.org/T183053) (owner: 10Smalyshev) [00:19:53] (03CR) 10jenkins-bot: Add configuration for CirrusSearch to instantly index new Wikidata items [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413899 (https://phabricator.wikimedia.org/T183053) (owner: 10Smalyshev) [00:21:42] SMalyshev: ^ is live on mwdebug, check please [00:21:49] thcipriani: checking [00:23:17] seems to be working ok [00:23:32] at least nothing is broken :) [00:23:49] on test.wikidata, can't test on real wikidata since the train is not there yet [00:24:00] so I think it's good to go [00:24:01] heh, ok, going live [00:25:42] wait, wikidata? isn't that group1? train should be there now. 
[00:26:03] FWIW https://www.wikidata.org/wiki/Special:Version [00:26:08] anyway, going live [00:28:12] !log thcipriani@tin Synchronized wmf-config/Wikibase-production.php: SWAT: [[gerrit:413899|Add configuration for CirrusSearch to instantly index new Wikidata items]] T183053 (duration: 01m 15s) [00:28:18] ^ SMalyshev live now [00:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:27] T183053: New Wikidata items appear in search with a delay - https://phabricator.wikimedia.org/T183053 [00:28:28] thanks! [00:28:38] yw :) [00:29:02] thcipriani: ah, you are right. I was confused, thought wikidata not deployed yet [00:29:40] just a few hours ago, train running on schedule this week (so far anyway) [00:30:14] (03PS2) 10Thcipriani: wikitech: use FQDNs for m5 cluster members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417173 (owner: 10BryanDavis) [00:30:32] bd808: I think I saw you right before swat started, still around right? [00:30:56] thcipriani: yup [00:31:07] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417173 (owner: 10BryanDavis) [00:31:15] cool [00:32:00] is there a way to test ^ with mwdebug? other than: ensure no explosions? [00:32:34] (03Merged) 10jenkins-bot: wikitech: use FQDNs for m5 cluster members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417173 (owner: 10BryanDavis) [00:32:50] (03CR) 10jenkins-bot: wikitech: use FQDNs for m5 cluster members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417173 (owner: 10BryanDavis) [00:32:52] thcipriani: I have a thing rigged up on mwdebug1002 that may give some clue [00:33:05] but the big one is no explosions :) [00:33:35] k, it's on mwdebug1002, now [00:34:24] the fqdn is coming through :) [00:35:03] and other wikis seem to still work. 
[00:35:08] k, awesome, going live [00:35:09] I think it's good [00:37:34] !log thcipriani@tin Synchronized wmf-config/db-eqiad.php: SWAT: [[gerrit:417173|wikitech: use FQDNs for m5 cluster members]] (duration: 01m 16s) [00:37:38] ^ bd808 live now [00:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:18] thcipriani: wikitech still works so we didn't mess up badly :) [00:45:41] setting a high bar [00:45:43] :) [01:00:04] twentyafterfour: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Phabricator update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180308T0100). [01:00:05] No GERRIT patches in the queue for this window AFAICS. [01:02:28] (03PS10) 10Dzahn: icinga: script to send custom SMS to Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) [01:02:58] (03CR) 10jerkins-bot: [V: 04-1] icinga: script to send custom SMS to Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) (owner: 10Dzahn) [01:13:29] !log preparing for phabricator update 2018-03-07/1 [01:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:17:42] !log phabricator update completed [01:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:05] (03PS11) 10Dzahn: icinga: script to send custom SMS to Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) [01:23:05] (03PS12) 10Dzahn: icinga: script to send custom SMS to Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) [01:23:57] (03CR) 10Dzahn: [C: 032] icinga: script to send custom SMS to Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) (owner: 10Dzahn) [01:24:13] (03PS13) 10Dzahn: icinga: script to send 
custom SMS to Icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) [01:24:24] (03CR) 10Dzahn: [C: 032] "pep8 happy now too" [puppet] - 10https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) (owner: 10Dzahn) [01:32:59] (03PS1) 10Dzahn: icinga: let non-root users schedule downtime/send sms [puppet] - 10https://gerrit.wikimedia.org/r/417182 [01:33:54] (03PS2) 10Dzahn: icinga: let non-root users schedule downtime/send sms [puppet] - 10https://gerrit.wikimedia.org/r/417182 [01:36:21] (03Abandoned) 10Dzahn: icinga: let non-root users schedule downtime/send sms [puppet] - 10https://gerrit.wikimedia.org/r/417182 (owner: 10Dzahn) [01:37:33] (03PS1) 10Chad: Gerrit: Make project deletion less destructive [puppet] - 10https://gerrit.wikimedia.org/r/417183 [01:48:46] (03CR) 10Smalyshev: wdqs: configure the new internal cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [01:51:07] (03PS1) 10Gergő Tisza: Add throttle exception for ptwiki editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417185 (https://phabricator.wikimedia.org/T189161) [01:53:02] anyone around for a quick sanity check on https://gerrit.wikimedia.org/r/417185 ? [01:59:30] (03CR) 10Greg Grossmeier: [C: 031] Add throttle exception for ptwiki editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417185 (https://phabricator.wikimedia.org/T189161) (owner: 10Gergő Tisza) [02:04:45] (03PS1) 10Dzahn: icinga-sms: improve help text [puppet] - 10https://gerrit.wikimedia.org/r/417186 [02:05:12] thx greg-g! I'll deploy it now, the editathon will have started by the next SWAT [02:05:26] twentyafterfour: done with the deploy? [02:05:58] tgr: yes done with phabricator deploy a while ago [02:08:05] tgr: thank you, I should have told you mukunda was done. 
[02:08:13] (03CR) 10Dzahn: [C: 032] icinga-sms: improve help text [puppet] - 10https://gerrit.wikimedia.org/r/417186 (owner: 10Dzahn) [02:08:54] (03CR) 10Gergő Tisza: [C: 032] Add throttle exception for ptwiki editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417185 (https://phabricator.wikimedia.org/T189161) (owner: 10Gergő Tisza) [02:10:08] (03Merged) 10jenkins-bot: Add throttle exception for ptwiki editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417185 (https://phabricator.wikimedia.org/T189161) (owner: 10Gergő Tisza) [02:10:22] (03CR) 10jenkins-bot: Add throttle exception for ptwiki editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417185 (https://phabricator.wikimedia.org/T189161) (owner: 10Gergő Tisza) [02:11:55] (03PS7) 10Aaron Schulz: [WIP] Add dynomite module and dynomite_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/415789 [02:12:56] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add dynomite module and dynomite_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/415789 (owner: 10Aaron Schulz) [02:15:06] !log tgr@tin Synchronized wmf-config/throttle.php: T189161 Temporarely remove account creation limit for event on Portuguese Wikipedia on March 08, 2018 (duration: 01m 10s) [02:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:15:24] T189161: Temporarely remove account creation limit for event on Portuguese Wikipedia on March 08, 2018 - https://phabricator.wikimedia.org/T189161 [02:16:08] (03PS8) 10Aaron Schulz: [WIP] Add dynomite module and dynomite_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/415789 [02:16:43] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add dynomite module and dynomite_wancache profile [puppet] - 10https://gerrit.wikimedia.org/r/415789 (owner: 10Aaron Schulz) [02:24:32] 10Operations, 10ops-eqsin, 10netops: return faulty MX104 to Juniper - https://phabricator.wikimedia.org/T189060#4034060 (10Papaul) [02:30:15] !log l10nupdate@tin scap 
sync-l10n completed (1.31.0-wmf.23) (duration: 07m 37s) [02:30:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:38:46] 10Operations, 10ops-eqsin, 10netops: return faulty MX104 to Juniper - https://phabricator.wikimedia.org/T189060#4034064 (10Papaul) 05Open>03Resolved {F14673832} [02:44:17] 10Operations, 10ops-codfw, 10hardware-requests, 10User-Elukey: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4034066 (10Papaul) a:05Papaul>03Joe @joe this needs to be assigned first to someone with root access to do the first 2 steps when complete, assign to me. Thanks [02:45:25] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#4034068 (10Papaul) @MoritzMuehlenhoff will do [03:20:17] (03CR) 10Krinkle: [WIP] Add dynomite module and dynomite_wancache profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415789 (owner: 10Aaron Schulz) [03:27:33] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 754.87 seconds [03:28:06] 10Operations, 10Incident-20150423-Commons, 10RESTBase, 10Traffic, and 4 others: RFC: Re-evaluate varnish-level request-restart behavior on 5xx - https://phabricator.wikimedia.org/T97206#4034097 (10Krinkle) [04:04:22] (03CR) 10Krinkle: [C: 04-1] NavigtationTiming: Enable oversampling for Singapore (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [04:04:41] (03CR) 10Krinkle: [C: 031] NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [04:23:52] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 241.73 seconds [04:27:45] !log Running whisper-mass-resize for ResourceLoader.* metrics on graphite1001 and graphite2001 (T179622) [04:28:03] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:28:04] T179622: Update our Graphite metrics for current retention rules - https://phabricator.wikimedia.org/T179622 [04:43:31] (03PS2) 10Andrew Bogott: openstack: Refactor /root/novaenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/417169 (owner: 10BryanDavis) [04:44:16] (03CR) 10Andrew Bogott: [C: 032] openstack: Refactor /root/novaenv.sh [puppet] - 10https://gerrit.wikimedia.org/r/417169 (owner: 10BryanDavis) [04:49:08] (03PS2) 10Andrew Bogott: openstack: Promote DnsManager to mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/417170 (owner: 10BryanDavis) [05:16:06] 10Operations, 10DNS, 10Mail, 10Traffic: SPF for Greenhouse - https://phabricator.wikimedia.org/T189065#4034192 (10tstarling) How about I change the task title so that it can stay open? Because the real problem here is that outbound email is broken, I don't care whether SPF or a subdomain is used to fix it.... [05:16:29] 10Operations, 10DNS, 10Mail, 10Traffic: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4034194 (10tstarling) [05:24:25] (03PS1) 10Revi: Disable upload for non-admins on kowikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417189 (https://phabricator.wikimedia.org/T189021) [07:01:19] (03CR) 10Elukey: "super great thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/417005 (https://phabricator.wikimedia.org/T188294) (owner: 10Ottomata) [07:01:26] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4034267 (10Marostegui) 05Open>03Resolved a:03Cmjohnson Thanks Chris, it looks good now: ``` root@db1064:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual... 
[07:03:36] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417196 [07:03:51] (03PS1) 10Marostegui: Revert "db1064: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/417197 [07:04:01] (03CR) 10Elukey: [C: 031] Switch mjolnir kafka broker to jumbo-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/417038 (https://phabricator.wikimedia.org/T188408) (owner: 10DCausse) [07:04:03] (03PS2) 10Marostegui: Revert "db1064: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/417197 [07:04:55] (03CR) 10Marostegui: [C: 032] Revert "db1064: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/417197 (owner: 10Marostegui) [07:10:25] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417196 [07:11:46] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417196 (owner: 10Marostegui) [07:13:18] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417196 (owner: 10Marostegui) [07:13:32] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417196 (owner: 10Marostegui) [07:15:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Revert: Depool db1064, it is not performing well with 2 failed disks - T188685 (duration: 01m 31s) [07:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:32] T188685: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685 [07:17:41] (03PS1) 10Marostegui: db-eqiad.php: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417198 (https://phabricator.wikimedia.org/T187530) [07:19:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417198 
(https://phabricator.wikimedia.org/T187530) (owner: 10Marostegui) [07:20:49] (03Merged) 10jenkins-bot: db-eqiad.php: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417198 (https://phabricator.wikimedia.org/T187530) (owner: 10Marostegui) [07:21:04] (03CR) 10jenkins-bot: db-eqiad.php: Depool es1019 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417198 (https://phabricator.wikimedia.org/T187530) (owner: 10Marostegui) [07:22:12] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: es1019 ipmi and mgmt unresponsive - https://phabricator.wikimedia.org/T187530#4034275 (10Marostegui) @Cmjohnson I saw your ping yesterday but I was already out for the day. I have now depooled es1019, so let me know when you are around today so I... [07:22:35] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool es1019 for maintenance - T187530 (duration: 01m 16s) [07:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:52] T187530: es1019 ipmi and mgmt unresponsive - https://phabricator.wikimedia.org/T187530 [07:24:03] !log reboot kafka2002 (eventbus codfw) for kernel updates [07:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:02] (03CR) 10Nikerabbit: ContentTranslation: Set cookieDomain to null for Production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416973 (owner: 10KartikMistry) [07:36:00] (03PS1) 10Urbanecm: Add hi.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/417199 (https://phabricator.wikimedia.org/T188366) [07:38:06] (03PS1) 10Urbanecm: Add hi.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/417200 (https://phabricator.wikimedia.org/T188366) [07:42:30] (03PS3) 10Urbanecm: Add gor to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/416929 (https://phabricator.wikimedia.org/T189109) [07:42:40] (03PS4) 10Urbanecm: Add gor to langs.tmpl [dns] - 10https://gerrit.wikimedia.org/r/416929 
(https://phabricator.wikimedia.org/T189109) [07:44:44] !log reboot kafka2003 (eventbus codfw) for kernel updates [07:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:15] (03PS1) 10Urbanecm: Initial configuration for hiwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417201 (https://phabricator.wikimedia.org/T188366) [07:53:55] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [07:54:46] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [07:56:02] elukey: due to reboot? ^ [07:57:04] kart_: o/ - nope shouldn't be related, happened also yesterday morning iirc [07:57:25] yup [07:57:58] it's multiple hosts this time: cp3032, cp3030, cp3043 [07:58:02] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-3h&to=now [07:59:28] (03PS2) 10KartikMistry: ContentTranslation: Set cookieDomain to null for Production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416973 [08:00:05] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html{/title}{/revision} (Transform wikitext to html) timed out before a response was received [08:00:26] !log cp3032: varnish-be-restart T189085 [08:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:41] T189085: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085 [08:00:56] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [08:03:37] https://en.wikisource.org/wiki/Special:RecentChanges?hidebots=1&hidecategorization=1&hideWikibase=1&limit=50&days=7&urlversion=2 [08:03:47] was slow to load and eventually loaded without styles [08:04:02] Followup to slow performance yesterda 
[08:04:25] <_joe_> ok that page is indeed quite slow to load [08:04:39] <_joe_> ema: ^^ [08:04:55] <_joe_> ShakespeareFan00: I think that page does abuse of the database though [08:05:30] https://en.wikisource.org/wiki/Page:A_general_history_for_colleges_and_high_schools_(Myers,_1890).djvu/593 [08:05:32] is another [08:06:22] <_joe_> ShakespeareFan00: use pages that are not the 600th page of a djvu or a particularly complex rc page to test for connectivity issues though [08:06:48] _joe_: https://en.wikisource.org/wiki/Main_Page simple enough? [08:06:52] That's taking a while [08:07:02] <_joe_> yes [08:07:20] Commons doesn't seem to be affected [08:07:20] <_joe_> that takes 0.1 s for me to render [08:08:01] And like yesterday it seems to be intermittent [08:09:15] !log cp3040: varnish-be-restart T189085 [08:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:31] T189085: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085 [08:10:23] We had this spike of errors and I saw: Wikimedia\Rdbms\LoadBalancer::pickReaderIndex: all replica DBs lagged. Switch to read-only mode [08:10:48] <_joe_> marostegui: what wiki/section? [08:10:55] enwiki [08:11:05] but it was just a spike (I don't actually see any replicas lagged) [08:11:19] <_joe_> uhm the problems were reported for wikisource [08:11:59] <_joe_> and indeed the rc pages now load like a breeze [08:12:30] <_joe_> so i guess ema's restarts sealed the deal [08:12:39] <_joe_> ShakespeareFan00: still experiencing slowness? 
[08:12:58] status update: playing whack-a-mole [08:13:10] (carefully) [08:13:41] there's still a bunch of backends piling up connections which I'm following, restarting those causing 503s first [08:14:36] (03CR) 10Gehel: "puppet compiler is happy: https://puppet-compiler.wmflabs.org/compiler02/10349/relforge1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/417038 (https://phabricator.wikimedia.org/T188408) (owner: 10DCausse) [08:19:02] !log reboot analytics1001 (Hadoop master) for kernel upgrade (temp failover to analytics1002) [08:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:39] ok cp3041 seems to have recovered on its own [08:22:22] the two hosts to watch now are cp3043 and cp3041: varnish.service has been running for ~5 days on those, while all other text-esams varnish backends have been restarted in the last ~3 days tops and should thus be stable [08:23:23] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1520494767293&to=1520497367887&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All [08:25:12] (03PS1) 10Gehel: wdqs: use the raid10-gpt-srv-lvm-ext4 partman config for new wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/417202 (https://phabricator.wikimedia.org/T187766) [08:28:21] (03PS1) 10Marostegui: db-codfw.php: Depool db2046,db2053,db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417203 [08:28:43] !log Stop MySQL on db2046, db2053 and db2060 for kernel upgrade [08:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:37] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2046,db2053,db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417203 (owner: 10Marostegui) [08:30:49] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2046,db2053,db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417203 (owner: 10Marostegui) [08:31:03] (03CR) 10jenkins-bot: db-codfw.php: Depool 
db2046,db2053,db2060 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417203 (owner: 10Marostegui) [08:31:13] !log reboot analytics1002 (Hadoop master standby) for kernel upgrades [08:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:32] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2046, db2053 and db2060 for kernel upgrade (duration: 01m 17s) [08:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:14] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2046,db2053,db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417205 [08:50:13] !log rebooting analytics1003 (Hadoop Hive, Oozie, etc..) for kernel updates [08:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:57] (03CR) 10Marostegui: [C: 032] Revert "db-codfw.php: Depool db2046,db2053,db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417205 (owner: 10Marostegui) [08:54:11] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2046,db2053,db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417205 (owner: 10Marostegui) [08:57:50] (03PS2) 10Marostegui: site.pp: Make db1073 master [puppet] - 10https://gerrit.wikimedia.org/r/416650 (https://phabricator.wikimedia.org/T183469) [08:58:00] (03PS2) 10Marostegui: dbproxy1005: Make db1073 master instead of db1009 [puppet] - 10https://gerrit.wikimedia.org/r/416658 (https://phabricator.wikimedia.org/T183469) [08:58:02] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2046,db2053,db2060" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417205 (owner: 10Marostegui) [08:58:03] !log reset RAC on bast1001, serial console was stuck [08:58:07] (03PS2) 10Marostegui: wmnet: Promote db1073 to become m5 master [dns] - 10https://gerrit.wikimedia.org/r/416680 (https://phabricator.wikimedia.org/T183469) [08:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:33] !log 
marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2046, db2053 and db2060 after kernel upgrade (duration: 01m 15s) [08:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:54] !log restart varnish backend on cp3041 (failed fetches) [08:59:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:41] (03CR) 10Jcrespo: "!?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417173 (owner: 10BryanDavis) [09:04:03] jynus: ^ what's that? :| [09:06:44] I don't know [09:07:20] but it has been done on eqiad only, and it may break the proxy or the master failover [09:07:38] Yeah, once it is in place, it will definitely break it [09:08:08] (03CR) 10TerraCodes: [C: 031] "Ah, thanks for the suggestion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [09:08:29] (03Abandoned) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) (owner: 10TerraCodes) [09:08:45] !log rebooting bast1001 for kernel security update [09:08:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:29] (03PS1) 10Jcrespo: Revert "wikitech: use FQDNs for m5 cluster members" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 [09:10:43] (03CR) 10Marostegui: [C: 031] Revert "wikitech: use FQDNs for m5 cluster members" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 (owner: 10Jcrespo) [09:10:50] elukey: thanks :) [09:11:00] <3 [09:11:42] (03PS2) 10Gehel: Switch mjolnir kafka broker to jumbo-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/417038 (https://phabricator.wikimedia.org/T188408) (owner: 10DCausse) [09:12:07] (03CR) 10Jcrespo: "Thinking that the host name (actually, an alias ip) and the dns name have something to do, it is something that will break the proxies. 
If" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 (owner: 10Jcrespo) [09:12:25] (03CR) 10Gehel: [C: 032] Switch mjolnir kafka broker to jumbo-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/417038 (https://phabricator.wikimedia.org/T188408) (owner: 10DCausse) [09:12:51] !log cp3043: varnish-be-restart T189085 [09:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:07] T189085: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085 [09:15:19] (03PS1) 10Elukey: Add analytics1071 to role::analytics_cluster::hadoop::worker [puppet] - 10https://gerrit.wikimedia.org/r/417212 (https://phabricator.wikimedia.org/T188294) [09:16:46] (03PS2) 10Elukey: Add analytics1071 to role::analytics_cluster::hadoop::worker [puppet] - 10https://gerrit.wikimedia.org/r/417212 (https://phabricator.wikimedia.org/T188294) [09:21:01] (03PS3) 10Giuseppe Lavagetto: codfw: decommission mw2017-2099 [puppet] - 10https://gerrit.wikimedia.org/r/416968 (https://phabricator.wikimedia.org/T187467) [09:21:03] (03PS1) 10Giuseppe Lavagetto: mediawiki: add a decommission script [puppet] - 10https://gerrit.wikimedia.org/r/417213 [09:21:45] 10Operations, 10ops-codfw, 10Patch-For-Review: Decommission mw2017 and mw2099 - https://phabricator.wikimedia.org/T187467#4034410 (10Joe) [09:21:49] 10Operations, 10vm-requests, 10Patch-For-Review, 10User-Joe: Create 2 VMs in codfw for mwdebug20001 and 2002 - https://phabricator.wikimedia.org/T187468#4034409 (10Joe) 05Open>03Resolved [09:22:57] !log cp-eqsin: reboot for retpoline kernel updates T188092 [09:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:44] gehel: thanks! [09:25:12] elukey: all thanks go to dcausse ! I just did the merge := [09:26:34] gehel: Kafka Analytics is slowly draining in these days :) https://grafana.wikimedia.org/dashboard/db/kafka?refresh=5m&orgId=1 [09:26:59] Wow, that's a big drop! 
[09:27:54] that was moving webrequest text to Jumbo [09:29:35] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [09:29:45] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [09:32:15] (03PS2) 10Giuseppe Lavagetto: mediawiki: add a decommission script [puppet] - 10https://gerrit.wikimedia.org/r/417213 [09:32:17] (03PS4) 10Giuseppe Lavagetto: codfw: decommission mw2017-2099 [puppet] - 10https://gerrit.wikimedia.org/r/416968 (https://phabricator.wikimedia.org/T187467) [09:34:43] (03CR) 10Elukey: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/417213 (owner: 10Giuseppe Lavagetto) [09:34:56] (03PS1) 10TerraCodes: Start $wmfRealm to $wmgRealm [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417215 [09:36:32] (03PS1) 10Jcrespo: mariadb: Depool partially db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417216 [09:37:59] (03CR) 10Jcrespo: [C: 032] mariadb: Depool partially db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417216 (owner: 10Jcrespo) [09:39:36] (03Merged) 10jenkins-bot: mariadb: Depool partially db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417216 (owner: 10Jcrespo) [09:39:50] (03CR) 10jenkins-bot: mariadb: Depool partially db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417216 (owner: 10Jcrespo) [09:40:43] !log rebooting neodymium for kernel security update [09:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:43] (03PS1) 10Marostegui: s4.hosts: Add db2090 [software] - 10https://gerrit.wikimedia.org/r/417218 (https://phabricator.wikimedia.org/T170662) [09:43:44] (03CR) 10Marostegui: [C: 032] s4.hosts: Add db2090 [software] - 10https://gerrit.wikimedia.org/r/417218 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [09:44:19] I just added these 
entries to the next SWAT and the SWAT for after midnight (European time): https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1784846&oldid=1784827 [09:44:26] (03Merged) 10jenkins-bot: s4.hosts: Add db2090 [software] - 10https://gerrit.wikimedia.org/r/417218 (https://phabricator.wikimedia.org/T170662) (owner: 10Marostegui) [09:44:31] Do I need to be present here in IRC for these? They're pretty straightforward [09:44:39] !log rearming keyholder on neodymium after reboot [09:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:42] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 partially (duration: 01m 16s) [09:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:05] Jhs: it's good to be here, yes. Even changes that we all think are trivial/easy sometimes go awry. [09:48:05] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: add a decommission script [puppet] - 10https://gerrit.wikimedia.org/r/417213 (owner: 10Giuseppe Lavagetto) [09:48:11] (03CR) 10Jayprakash12345: Add hi.wikimedia.org (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/417199 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [09:48:12] Logstash Error rate for mw1276.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD [09:48:33] either errors started or was something already ongoing [09:48:45] (03CR) 10Elukey: [C: 031] wdqs: use the raid10-gpt-srv-lvm-ext4 partman config for new wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/417202 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [09:49:50] (03CR) 10Jayprakash12345: [C: 031] Add hi.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/417200 (https://phabricator.wikimedia.org/T188366) (owner: 10Urbanecm) [09:50:37] I don't see changes on errors on mediawiki, though [09:52:18] (03CR) 10Giuseppe Lavagetto: [C: 032] codfw: decommission mw2017-2099 [puppet] - 10https://gerrit.wikimedia.org/r/416968 
(https://phabricator.wikimedia.org/T187467) (owner: 10Giuseppe Lavagetto) [09:53:30] (03PS1) 10Phedenskog: Icinga: Add WebPageReplay Grafana performance alerts [puppet] - 10https://gerrit.wikimedia.org/r/417221 (https://phabricator.wikimedia.org/T188988) [09:53:59] (03CR) 10ArielGlenn: [C: 031] "Feel free to merge; I didn't because puppet is disabled on labstore1007 and I'd like to watch the output on all impacted hosts." [puppet] - 10https://gerrit.wikimedia.org/r/416980 (https://phabricator.wikimedia.org/T188727) (owner: 10Madhuvishy) [09:54:04] (03CR) 10ArielGlenn: [C: 031] "Feel free to merge; I didn't because puppet is disabled on labstore1007 and I'd like to watch the output on all impacted hosts." [puppet] - 10https://gerrit.wikimedia.org/r/416983 (https://phabricator.wikimedia.org/T188727) (owner: 10Madhuvishy) [09:54:11] (03CR) 10ArielGlenn: [C: 031] "Feel free to merge; I didn't because puppet is disabled on labstore1007 and I'd like to watch the output on all impacted hosts." [puppet] - 10https://gerrit.wikimedia.org/r/416984 (https://phabricator.wikimedia.org/T188727) (owner: 10Madhuvishy) [09:54:52] (03PS2) 10Phedenskog: Icinga: Add WebPageReplay Grafana performance alerts [puppet] - 10https://gerrit.wikimedia.org/r/417221 (https://phabricator.wikimedia.org/T188988) [09:56:31] <_joe_> !log decommissioning mw2017-2099 T187467 [09:56:45] !log restaring mjolnir-kafka-daemon.service on relforge1001 to switch to kafka jumbo [09:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:47] T187467: Decommission mw2017 and mw2099 - https://phabricator.wikimedia.org/T187467 [09:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:48] !log restaring mjolnir-kafka-daemon.service on relforge1002 to switch to kafka jumbo [09:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:58] dcausse: \o/ [09:59:03] elukey: o/ [10:01:45] PROBLEM - Nginx local proxy to apache 
on mw2099 is CRITICAL: connect to address 10.192.16.72 and port 443: Connection refused [10:01:46] PROBLEM - nutcracker port on mw2099 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [10:01:56] PROBLEM - HHVM rendering on mw2099 is CRITICAL: connect to address 10.192.16.72 and port 80: Connection refused [10:02:05] bye bye mws [10:02:15] PROBLEM - nutcracker process on mw2099 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (nutcracker), command name nutcracker [10:02:25] PROBLEM - Apache HTTP on mw2099 is CRITICAL: connect to address 10.192.16.72 and port 80: Connection refused [10:02:27] 10Operations, 10ops-codfw, 10Patch-For-Review: Decommission mw2017 and mw2099 - https://phabricator.wikimedia.org/T187467#4034543 (10Joe) [10:02:59] elukey: afaik we should be no longer using kafka 'eqiad' for search. The only last bit is mw (https://gerrit.wikimedia.org/r/#/c/417006/) iirc [10:03:15] PROBLEM - HHVM processes on mw2099 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [10:03:36] PROBLEM - Apache HTTP on mw2017 is CRITICAL: connect to address 10.192.0.56 and port 80: Connection refused [10:03:45] PROBLEM - nutcracker process on mw2017 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (nutcracker), command name nutcracker [10:03:55] PROBLEM - Check systemd state on mw2099 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:04:05] PROBLEM - HHVM rendering on mw2017 is CRITICAL: connect to address 10.192.0.56 and port 80: Connection refused [10:04:25] PROBLEM - nutcracker port on mw2017 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused [10:04:26] PROBLEM - Nginx local proxy to apache on mw2017 is CRITICAL: connect to address 10.192.0.56 and port 443: Connection refused [10:05:15] PROBLEM - Check systemd state on mw2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:05:23] moritzm: you ^ or could it be us? [10:05:45] PROBLEM - HHVM processes on mw2017 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [10:06:30] <_joe_> it's me [10:06:37] <_joe_> but that should *not* be happening [10:06:41] <_joe_> damn puppetdb lag [10:06:57] apergos, ok. (Y) [10:08:04] <_joe_> ok, the alerts are now gone [10:08:07] <_joe_> sorry :/ [10:08:36] dcausse: super thanks [10:13:49] !log conduct IO stresstests on ganeti1005 (sca1004 VM) with cache=none KVM flag on [10:13:56] !log conduct IO stresstests on ganeti1005 (sca1004 VM) with cache=none KVM flag on T181121 [10:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:20] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121 [10:17:25] (03PS1) 10Jcrespo: mariadb: Pool db1114 with main load, remove it from api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417224 [10:17:41] (03CR) 10Marostegui: [C: 031] mariadb: Pool db1114 with main load, remove it from api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417224 (owner: 10Jcrespo) [10:19:14] (03CR) 10Jcrespo: [C: 032] mariadb: Pool db1114 with main load, remove it from api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417224 (owner: 10Jcrespo) [10:19:43] (03PS1) 10Jcrespo: mariadb: Depool db1104 fully [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417225 [10:20:08] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4034571 (10Gehel) [10:20:28] (03Merged) 10jenkins-bot: mariadb: Pool db1114 with main load, remove it from api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417224 (owner: 10Jcrespo) [10:20:43] (03CR) 10jenkins-bot: mariadb: Pool db1114 with main load, remove it from api [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/417224 (owner: 10Jcrespo) [10:20:54] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, a couple nitpicks." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [10:22:51] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Change db1114 load (duration: 01m 16s) [10:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:31] (03CR) 10Giuseppe Lavagetto: [C: 031] Icinga: add EtcdConfig sync check on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [10:28:54] (03PS1) 10Alexandros Kosiaris: scap::target: Install git-lfs [puppet] - 10https://gerrit.wikimedia.org/r/417226 (https://phabricator.wikimedia.org/T180628) [10:29:32] (03CR) 10jerkins-bot: [V: 04-1] scap::target: Install git-lfs [puppet] - 10https://gerrit.wikimedia.org/r/417226 (https://phabricator.wikimedia.org/T180628) (owner: 10Alexandros Kosiaris) [10:32:59] dcausse: https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=mjolnir&from=now-3h&to=now [10:33:03] all good :) [10:33:31] the analytics one stopped afaics [10:33:48] nice! [10:33:48] <_joe_> akosiaris: we have git-lfs support now? *great* [10:34:04] <_joe_> we can finally maybe skip the damn git-fat-via-archiva mess [10:34:24] offsets will remain at 0 until we use it so everything looks good to me [10:35:02] dcausse: that one is the lag, so it should stay zero if everything goes fine :) [10:35:12] ah gotcha :) [10:35:19] I think we should also have metrics about the offset, lemme check, might be good to add them too [10:36:47] (03CR) 10Volans: "Ack fixed." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [10:36:56] (03PS9) 10Volans: Icinga: add sync check for MW config on etcd [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) [10:36:58] (03PS10) 10Volans: Icinga: add EtcdConfig sync check on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) [10:37:16] _joe_: we just got it (like yesterday). We haven't even run the basic tests so don't get your hopes up too much [10:37:29] (03PS1) 10Vgutierrez: Ensure per-service-MED is an integer [debs/pybal] - 10https://gerrit.wikimedia.org/r/417228 (https://phabricator.wikimedia.org/T165764) [10:37:30] _joe_: ^^^ if you're ok I'd merge the first one [10:38:00] <_joe_> volans: go [10:38:09] <_joe_> :) [10:39:35] (03CR) 10Volans: [C: 032] Icinga: add sync check for MW config on etcd [puppet] - 10https://gerrit.wikimedia.org/r/413355 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [10:39:47] dcausse: https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag?orgId=1&from=now-3h&to=now&panelId=2&fullscreen&var-datasource=eqiad%20prometheus%2Fops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=mjolnir [10:39:55] I've temporarily disabled puppet on einsteinium, testing first on tegmen [10:40:13] elukey: thanks! [10:40:35] it dropped for analytics too [10:40:35] all good [10:40:50] akosiaris: FYI I've changed https://gerrit.wikimedia.org/r/c/413355/9/modules/nrpe/manifests/monitor_systemd_unit_state.pp [10:42:51] (03PS2) 10Vgutierrez: Ensure per-service-MED is an integer [debs/pybal] - 10https://gerrit.wikimedia.org/r/417228 (https://phabricator.wikimedia.org/T165764) [10:43:00] !log installing libvpx security updates [10:43:03] volans: em.. ok ? 
I did not even remember we had that [10:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:20] the script support that, so I just added it to the puppet side of it [10:43:40] :) [10:44:22] the comment is not very helpful [10:44:36] In addition, if is specified, the checks [10:44:36] 39 returns Ok iff the unit was started no more than [10:44:36] 40 seconds ago (and this information is only [10:44:36] 41 valid when a timer exists for the unit) [10:44:44] what timer ? [10:44:50] a systemd timer [10:44:59] not systemd.timer ? [10:44:59] ah [10:45:02] hmmm [10:45:13] if you have foo.service and foo.timer [10:45:16] and we have that ? [10:45:18] that controls the service [10:45:24] I've just add one :D [10:45:54] ah you default to the empty string [10:46:02] ok that's more like it [10:46:08] yeah to avoid to duplicate code [10:46:19] a bit ugly I didn't like it very much [10:46:52] (03PS2) 10Jforrester: 2017 wikitext editor: Enable by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413653 (https://phabricator.wikimedia.org/T188028) [10:47:14] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4034668 (10elukey) I checked the HDFS datanode logs and everything looks good on analytics1070. The only "weird" log is for Yarn, name... 
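A minimal sketch of the freshness check discussed above at 10:44 (OK only if the unit last started within a given number of seconds). This is an illustration under stated assumptions, not the actual monitor_systemd_unit_state script: the function name `check_unit_freshness`, the parameter `max_age_seconds`, and the message wording are hypothetical, and the exit codes follow the usual Nagios convention (0 = OK, 2 = CRITICAL).

```python
import time
from typing import Optional, Tuple

# Nagios-style exit codes (assumed convention for this sketch).
OK, CRITICAL = 0, 2


def check_unit_freshness(
    last_start_epoch: Optional[float],
    max_age_seconds: float,
    now: Optional[float] = None,
) -> Tuple[int, str]:
    """Return OK iff the unit started no more than max_age_seconds ago.

    last_start_epoch is a hypothetical input: the unit's last start time
    as a Unix timestamp, or None if no start time is available.
    """
    if now is None:
        now = time.time()
    if last_start_epoch is None:
        # Without a recorded start time the check cannot prove freshness.
        return CRITICAL, "CRITICAL: no start time recorded for the unit"
    age = now - last_start_epoch
    if age <= max_age_seconds:
        return OK, f"OK: last started {age:.0f}s ago"
    return CRITICAL, (
        f"CRITICAL: last started {age:.0f}s ago "
        f"(limit {max_age_seconds:.0f}s)"
    )
```

Treating a missing start time as CRITICAL is a design choice of this sketch; it keeps a unit that never ran from silently passing the check.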
[10:49:13] (03CR) 10Vgutierrez: [V: 032] "After applying this change in pybal-test2001, per-service-MED is working:" [debs/pybal] - 10https://gerrit.wikimedia.org/r/417228 (https://phabricator.wikimedia.org/T165764) (owner: 10Vgutierrez) [10:53:28] (03CR) 10Muehlenhoff: [C: 031] adding dynkm/oliver keyes to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/416996 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [10:56:27] (03PS3) 10ArielGlenn: cheap image dump script that might be ok for wikitech [dumps] - 10https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) [10:56:44] (03CR) 10jerkins-bot: [V: 04-1] cheap image dump script that might be ok for wikitech [dumps] - 10https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) (owner: 10ArielGlenn) [10:57:25] (03PS4) 10ArielGlenn: cheap image dump script that might be ok for wikitech [dumps] - 10https://gerrit.wikimedia.org/r/417009 (https://phabricator.wikimedia.org/T188915) [10:57:58] (03CR) 10Muehlenhoff: "The change itself is fine, but it should be noted that Oliver previously already had access (under the username "ironholds"), so if he e.g" [puppet] - 10https://gerrit.wikimedia.org/r/416993 (https://phabricator.wikimedia.org/T188945) (owner: 10RobH) [11:01:37] (03PS1) 10Muehlenhoff: Give Roan the privileges to restart maps-related services [puppet] - 10https://gerrit.wikimedia.org/r/417230 (https://phabricator.wikimedia.org/T189153) [11:02:03] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: Give Roan Kattouw the rights to deploy maps and restart maps-related services - https://phabricator.wikimedia.org/T189153#4034709 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:03:15] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4034711 (10MoritzMuehlenhoff) p:05Triage>03Normal [11:05:27] (03CR) 10Elukey: [C: 032] 
"https://puppet-compiler.wmflabs.org/compiler02/10355/analytics1071.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/417212 (https://phabricator.wikimedia.org/T188294) (owner: 10Elukey) [11:05:29] (03PS3) 10Elukey: Add analytics1071 to role::analytics_cluster::hadoop::worker [puppet] - 10https://gerrit.wikimedia.org/r/417212 (https://phabricator.wikimedia.org/T188294) [11:05:43] (03CR) 10Volans: "-1 for me, see inline." (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/400615 (https://phabricator.wikimedia.org/T82937) (owner: 10Dzahn) [11:07:18] (03CR) 10Alexandros Kosiaris: [C: 031] mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [11:09:21] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [11:10:21] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy [11:12:49] (03PS1) 10Muehlenhoff: Temporarily remove mwdebug2001 from debug proxy aliases [puppet] - 10https://gerrit.wikimedia.org/r/417231 [11:13:31] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [11:17:32] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [11:32:17] !log installing isc-dhcp security updates [11:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:13] (03PS11) 10Volans: Icinga: add EtcdConfig sync check on MW hosts [puppet] - 
10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) [11:33:15] (03PS1) 10Volans: Icinga: use systemd::service for etcd_mw_config [puppet] - 10https://gerrit.wikimedia.org/r/417232 (https://phabricator.wikimedia.org/T182597) [11:33:29] (03CR) 10Ema: [C: 031] Ensure per-service-MED is an integer [debs/pybal] - 10https://gerrit.wikimedia.org/r/417228 (https://phabricator.wikimedia.org/T165764) (owner: 10Vgutierrez) [11:34:28] (03PS1) 10Gilles: Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) [11:35:00] (03CR) 10jerkins-bot: [V: 04-1] Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [11:35:45] (03PS2) 10Gilles: Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) [11:36:10] (03CR) 10jerkins-bot: [V: 04-1] Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [11:38:26] (03CR) 10Volans: "Compiler: https://puppet-compiler.wmflabs.org/compiler02/10356/tegmen.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/417232 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [11:38:31] _joe_: ^^^ [11:38:50] (03CR) 10Giuseppe Lavagetto: [C: 032] Icinga: use systemd::service for etcd_mw_config [puppet] - 10https://gerrit.wikimedia.org/r/417232 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [11:39:12] wow, are you puppet-merging it too? 
[11:39:54] <_joe_> yes [11:40:06] <_joe_> also running puppet on tegmen [11:40:11] ok [11:40:12] :D [11:40:40] I'm still not sure if we need the service for the unit itself, it must be enabled, but not started [11:40:52] <_joe_> volans: let's see [11:41:04] <_joe_> we might need to enable it [11:41:21] and I bet our puppetization doesn't contemplate this case [11:41:26] <_joe_> ok, it doesn't work [11:41:36] <_joe_> the puppet 'service' resource is broken [11:41:42] <_joe_> I hate them. [11:41:44] :( [11:41:50] error? [11:42:05] <_joe_> Execution of '/usr/sbin/service update-etcd-mw-config-lastindex.timer start' [11:42:12] <_joe_> /o\ [11:42:44] <_joe_> we *might* need to explicitly set provider => 'systemd' [11:42:51] No such file or directory? [11:42:54] <_joe_> lemme try that [11:43:08] <_joe_> volans: yes, it's trying to start update-etcd-mw-config-lastindex.timer.service [11:43:17] ah yeah [11:43:31] I need to pass $unit_type = 'timer' [11:43:44] didn't see that, I thought it was the same abstraction as ::unit [11:43:50] _joe_: ^^ [11:43:53] <_joe_> no, that's not enough [11:44:28] yeah, I guess you need to explicitly tell it to use systemd semantics [11:44:41] ouch [11:44:50] <_joe_> the problem is it's executing /usr/sbin/service [11:45:17] <_joe_> volans: lemme test if there's a quick fix [11:45:22] ack [11:46:19] <_joe_> puppet apply -e 'service{ "update-etcd-mw-config-lastindex.timer": ensure => "running", provider => "systemd" }' works [11:47:51] <_joe_> but yeah, we also need to activate the underlying service [11:48:20] <_joe_> volans: uhm, lemme try to fix this quickly, then we can make a more general abstraction [11:48:22] now technically if puppet also starts it, it's no problem, it will run once [11:48:25] and that's ok [11:48:53] so for the service we can probably use the systemd::service current abstraction [11:49:09] <_joe_> let me fix this for now [11:55:35] (03PS1) 10Elukey: role::eventlogging:analytics: include zmq config only on eventlog1001 
[puppet] - 10https://gerrit.wikimedia.org/r/417240 (https://phabricator.wikimedia.org/T114199) [11:57:19] (03CR) 10TerraCodes: [C: 031] Disable upload for non-admins on kowikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417189 (https://phabricator.wikimedia.org/T189021) (owner: 10Revi) [11:57:45] (03PS1) 10Giuseppe Lavagetto: icinga::monitor: fix mw_etcd_config [puppet] - 10https://gerrit.wikimedia.org/r/417241 [11:58:24] (03CR) 10MarcoAurelio: [C: 031] Disable upload for non-admins on kowikiversity (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417189 (https://phabricator.wikimedia.org/T189021) (owner: 10Revi) [11:58:43] <_joe_> volans: ^^ [11:59:26] (03PS2) 10Elukey: role::eventlogging:analytics: include zmq config only on eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/417240 (https://phabricator.wikimedia.org/T114199) [11:59:40] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/417241 (owner: 10Giuseppe Lavagetto) [11:59:44] (03CR) 10MarcoAurelio: Initial configuration for romdwikimedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/412902 (https://phabricator.wikimedia.org/T187184) (owner: 10Urbanecm) [11:59:48] _joe_: thanks, seems reasonable to me, let's try it [12:01:33] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10359/eventlog1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/417240 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [12:01:35] (03PS11) 10Jcrespo: mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) [12:01:45] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1104 fully [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417225 (owner: 10Jcrespo) [12:02:07] (03CR) 10Giuseppe Lavagetto: [C: 032] icinga::monitor: fix mw_etcd_config [puppet] - 10https://gerrit.wikimedia.org/r/417241 (owner: 10Giuseppe 
Lavagetto) [12:02:14] (03PS2) 10Giuseppe Lavagetto: icinga::monitor: fix mw_etcd_config [puppet] - 10https://gerrit.wikimedia.org/r/417241 [12:03:01] (03Merged) 10jenkins-bot: mariadb: Depool db1104 fully [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417225 (owner: 10Jcrespo) [12:03:15] (03CR) 10jenkins-bot: mariadb: Depool db1104 fully [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417225 (owner: 10Jcrespo) [12:06:20] <_joe_> volans: it works, it seems [12:06:22] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 fully (duration: 01m 16s) [12:06:24] (03PS3) 10Phedenskog: Icinga: Add WebPageReplay Grafana performance alerts [puppet] - 10https://gerrit.wikimedia.org/r/417221 (https://phabricator.wikimedia.org/T188988) [12:06:34] _joe_: yeah the files are updated [12:06:34] (03PS12) 10Jcrespo: mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) [12:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:03] (03PS1) 10Elukey: Apply role::eventlogging::analytics to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/417242 (https://phabricator.wikimedia.org/T114199) [12:07:12] I don't see the unit in systemctl's output and seems correct to me [12:07:19] <_joe_> yes [12:07:29] (03CR) 10Jcrespo: [C: 032] mariadb-backups: Allow backup consolidation and recovery [puppet] - 10https://gerrit.wikimedia.org/r/416353 (https://phabricator.wikimedia.org/T184696) (owner: 10Jcrespo) [12:07:33] <_joe_> it's type=oneshot, so it's not even "exited" [12:07:48] <_joe_> nice [12:07:59] I'll run puppet on einsteinium then [12:08:08] <_joe_> please do, it should do the same [12:08:32] <_joe_> in the meantime, please merge the other patch too, while I work on generalizing a systemd::timer class [12:08:57] sure will do [12:09:05] thanks for taking care of this part [12:10:05] elukey: FYI running puppet on einsteinium a lot of 
analytics checks got enabled [12:10:44] <_joe_> brb [12:10:53] volans: gooooood [12:11:05] we are deploying new hadoop nodes [12:12:23] (03PS12) 10Volans: Icinga: add EtcdConfig sync check on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) [12:17:40] (03PS2) 10Elukey: Apply role::eventlogging::analytics to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/417242 (https://phabricator.wikimedia.org/T114199) [12:20:58] akosiaris: I might need to change the systemd state unit script, ExecMainStartTimestamp is not returned on a timer [12:21:38] :( [12:23:09] *on a oneshot unit managed by a timer to be precise [12:23:30] (03PS3) 10Gilles: Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) [12:24:00] (03CR) 10jerkins-bot: [V: 04-1] Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:25:26] and I have a bigger problem... 
I don't see any time that reports the last execution time [12:28:48] (03PS4) 10Gilles: Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) [12:29:20] (03CR) 10jerkins-bot: [V: 04-1] Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:30:08] (03PS5) 10Gilles: Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) [12:30:09] akosiaris: and magically now it works, don't ask me why [12:30:18] lol [12:30:37] I'll keep an eye on it, don't trust it at all [12:30:49] not your script, the systemd+timer+puppet [12:31:14] (03CR) 10jerkins-bot: [V: 04-1] Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:31:56] (03CR) 10Volans: [C: 032] Icinga: add EtcdConfig sync check on MW hosts [puppet] - 10https://gerrit.wikimedia.org/r/413356 (https://phabricator.wikimedia.org/T182597) (owner: 10Volans) [12:33:12] (03PS6) 10Gilles: Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) [12:33:19] let's see if I'm able to make some spam-noise or not :D [12:34:02] (03PS7) 10Gilles: Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) [12:38:22] (03CR) 10Gilles: "https://puppet-compiler.wmflabs.org/compiler02/10360/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:38:55] 10Operations, 10Analytics, 10DBA, 10EventBus, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4034945 (10jcrespo) [12:39:13] 10Operations, 10Analytics, 
10DBA, 10EventBus, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4034956 (10jcrespo) p:05Triage>03High [12:47:42] !log oblivian@puppetmaster1001 conftool action : edit; selector: scope=codfw,name=ReadOnly [12:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:06] !log oblivian@puppetmaster1001 conftool action : edit; selector: scope=codfw,name=ReadOnly [12:50:06] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3922494 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['eventlog1002.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage... [12:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:30] (03PS2) 10Muehlenhoff: Temporarily remove mwdebug2001 from debug proxy aliases [puppet] - 10https://gerrit.wikimedia.org/r/417231 [13:00:01] (03CR) 10Vgutierrez: [V: 032 C: 032] Ensure per-service-MED is an integer [debs/pybal] - 10https://gerrit.wikimedia.org/r/417228 (https://phabricator.wikimedia.org/T165764) (owner: 10Vgutierrez) [13:00:49] (03Merged) 10jenkins-bot: Ensure per-service-MED is an integer [debs/pybal] - 10https://gerrit.wikimedia.org/r/417228 (https://phabricator.wikimedia.org/T165764) (owner: 10Vgutierrez) [13:03:14] (03PS1) 10Vgutierrez: Ensure per-service-MED is an integer [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/417249 (https://phabricator.wikimedia.org/T165764) [13:11:10] (03CR) 10Vgutierrez: [C: 032] Ensure per-service-MED is an integer [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/417249 (https://phabricator.wikimedia.org/T165764) (owner: 10Vgutierrez) [13:11:17] (03CR) 1020after4: [C: 031] scap::target: Install git-lfs [puppet] - 10https://gerrit.wikimedia.org/r/417226 (https://phabricator.wikimedia.org/T180628) (owner: 10Alexandros Kosiaris) 
[13:11:39] (03Merged) 10jenkins-bot: Ensure per-service-MED is an integer [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/417249 (https://phabricator.wikimedia.org/T165764) (owner: 10Vgutierrez) [13:17:28] (03PS2) 10Gehel: wdqs: cleanup reference to wdqs class as default parameter of wdqs::gui [puppet] - 10https://gerrit.wikimedia.org/r/416941 [13:18:18] (03CR) 10Gehel: [C: 032] wdqs: cleanup reference to wdqs class as default parameter of wdqs::gui [puppet] - 10https://gerrit.wikimedia.org/r/416941 (owner: 10Gehel) [13:20:16] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#4035006 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['eventlog1002.eqiad.wmnet'] ``` and were **ALL** successful. [13:21:53] (03PS1) 10Vgutierrez: Release 1.15.2: Ensure per-service-MED is an integer [debs/pybal] - 10https://gerrit.wikimedia.org/r/417252 [13:22:14] (03PS6) 10ArielGlenn: Split out retrieving globals and use a more machine-readable format [dumps] - 10https://gerrit.wikimedia.org/r/348002 (https://phabricator.wikimedia.org/T185116) (owner: 10Awight) [13:22:50] (03PS10) 10Gehel: wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) [13:23:01] (03PS11) 10Gehel: wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) [13:23:09] (03CR) 10ArielGlenn: [C: 032] Split out retrieving globals and use a more machine-readable format [dumps] - 10https://gerrit.wikimedia.org/r/348002 (https://phabricator.wikimedia.org/T185116) (owner: 10Awight) [13:24:00] (03PS3) 10Elukey: Apply role::eventlogging::analytics to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/417242 (https://phabricator.wikimedia.org/T114199) [13:25:31] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - 
https://phabricator.wikimedia.org/T189005#4035008 (10Marostegui) [13:26:18] (03PS5) 10Giuseppe Lavagetto: Enable use of EtcdConfig everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) [13:26:20] (03CR) 10Giuseppe Lavagetto: Enable use of EtcdConfig everywhere. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [13:26:27] (03CR) 10Gehel: [C: 032] wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [13:28:03] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [13:28:06] (03CR) 10Elukey: [C: 032] Apply role::eventlogging::analytics to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/417242 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [13:28:14] (03PS4) 10Elukey: Apply role::eventlogging::analytics to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/417242 (https://phabricator.wikimedia.org/T114199) [13:29:40] !log cp-ulsfo: reboot for retpoline kernel updates T188092 [13:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:11] (03PS1) 10Marostegui: db1009.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/417254 [13:32:26] (03PS1) 10ArielGlenn: minor cleanup for utils.py [dumps] - 10https://gerrit.wikimedia.org/r/417255 [13:33:16] (03CR) 10ArielGlenn: [C: 032] minor cleanup for utils.py [dumps] - 10https://gerrit.wikimedia.org/r/417255 (owner: 10ArielGlenn) [13:34:04] (03PS2) 10Marostegui: db1009.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/417254 (https://phabricator.wikimedia.org/T189005) [13:35:13] !log ariel@tin Started deploy [dumps/dumps@f26c114]: fix prefetch stubs; 
retrieval globals more robustly [13:35:16] !log ariel@tin Finished deploy [dumps/dumps@f26c114]: fix prefetch stubs; retrieval globals more robustly (duration: 00m 03s) [13:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:29] (03CR) 10Marostegui: [C: 032] db1009.yaml: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/417254 (https://phabricator.wikimedia.org/T189005) (owner: 10Marostegui) [13:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:18] (03CR) 10Muehlenhoff: [C: 032] Temporarily remove mwdebug2001 from debug proxy aliases [puppet] - 10https://gerrit.wikimedia.org/r/417231 (owner: 10Muehlenhoff) [13:36:25] (03PS3) 10Muehlenhoff: Temporarily remove mwdebug2001 from debug proxy aliases [puppet] - 10https://gerrit.wikimedia.org/r/417231 [13:36:56] (03CR) 10Muehlenhoff: [V: 032 C: 032] Temporarily remove mwdebug2001 from debug proxy aliases [puppet] - 10https://gerrit.wikimedia.org/r/417231 (owner: 10Muehlenhoff) [13:40:15] !log eventlogging analytics migrated from eventlog1001 to eventlog1002 [13:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:01] jouncebot: next [13:41:03] In 0 hour(s) and 18 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180308T1400) [13:43:59] !log sbisson@tin Started deploy [kartotherian/deploy@42b3280]: Deploying kartotherian with updated dependencies and zoom lovel 19 to test servers [13:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:36] !log depooling mwdebug2001, the host will temporarily be using an HHVM build linked against libicu57 to perform some tests [13:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:46] Krinkle: as FYI about eventlogging migrated to eventlog1002: I left the zmq forwarder active on eventlog1001 [13:46:52] will remove it once you 
guys will be ready [13:50:02] (03PS2) 10Gehel: wdqs: use the raid10-gpt-srv-lvm-ext4 partman config for new wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/417202 (https://phabricator.wikimedia.org/T187766) [13:50:06] (03PS1) 10MarcoAurelio: Create the 'rollbacker' group at ar.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417257 (https://phabricator.wikimedia.org/T189206) [13:51:47] (03CR) 10Gehel: [C: 031] "LGTM, but pending validation in weekly SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/417230 (https://phabricator.wikimedia.org/T189153) (owner: 10Muehlenhoff) [13:52:30] (03PS2) 10Gehel: Add consumer ID to Updater launch string [puppet] - 10https://gerrit.wikimedia.org/r/417115 (https://phabricator.wikimedia.org/T188716) (owner: 10Smalyshev) [13:52:31] !log sbisson@tin Finished deploy [kartotherian/deploy@42b3280]: Deploying kartotherian with updated dependencies and zoom lovel 19 to test servers (duration: 08m 31s) [13:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:17] (03CR) 10Elukey: "Hi everybody, as FYI with https://gerrit.wikimedia.org/r/#/c/417242/ I moved all the eventlogging daemons except zmq-forwarder to eventlog" [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [13:54:02] (03CR) 10Gehel: [C: 032] Add consumer ID to Updater launch string [puppet] - 10https://gerrit.wikimedia.org/r/417115 (https://phabricator.wikimedia.org/T188716) (owner: 10Smalyshev) [13:56:04] !log restart wdqs-updater on wdqs1005 to validate new config option - T188716 [13:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:20] T188716: wdqs-updater kafka poller should use an explicit consumer group - https://phabricator.wikimedia.org/T188716 [13:56:20] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#4035038 (10elukey) [13:56:29] 10Operations, 
10Analytics, 10hardware-requests: EQIAD: (1) hardware request for eventlog1001 replacement - eventlog1002. - https://phabricator.wikimedia.org/T184551#4035040 (10elukey) [13:56:36] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3922494 (10elukey) 05Open>03Resolved [13:58:51] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:59:39] ^ that's me, seems that icinga-downtime failed... [14:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180308T1400). [14:00:05] Jhs, James_F, and Hauskatze: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:15] Present. [14:00:18] (03PS1) 10Gehel: Revert "Add consumer ID to Updater launch string" [puppet] - 10https://gerrit.wikimedia.org/r/417259 (https://phabricator.wikimedia.org/T188716) [14:00:50] I can SWAT today [14:00:53] Heya. [14:01:09] (03CR) 10Gehel: [C: 032] Revert "Add consumer ID to Updater launch string" [puppet] - 10https://gerrit.wikimedia.org/r/417259 (https://phabricator.wikimedia.org/T188716) (owner: 10Gehel) [14:01:14] James_F: want to deploy your changes, or should i? [14:01:31] 10Operations, 10Wikimedia-Site-requests, 10media-storage: outdated DjVu file page thumbnail in cache - https://phabricator.wikimedia.org/T186153#4035061 (10Ankry) As resolving this problem takes over a month and the incorect thumbnail disorganizes volunteers work in Polish Wikisource concerning the "Old Sure... [14:01:33] Please do. 
[14:03:00] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational [14:04:21] Jhs: around for SWAT? [14:04:47] Jhs: is there a task associated with the scripts? [14:05:14] Hauskatze: I'll deploy your change first, since there is only one, then James_F [14:05:38] CI seems to be busy, this will be fun [14:05:54] zeljkof: Hvala vam puno [14:06:00] Hauskatze: :D [14:06:09] Hauskatze: nema na čemu [14:06:39] Hauskatze: for future reference, there is no need for both `Fixes T189206` and `Bug: T189206` [14:06:40] T189206: Creation of Rollbacker group on ar.wikinews - https://phabricator.wikimedia.org/T189206 [14:06:57] `Bug: T189206` is the convention [14:07:00] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035074 (10Marostegui) [14:07:20] zeljkof: I know, just lazy. if we add 'fixes T...' phabricator autocloses the task associated :) [14:07:35] Hauskatze: oh, did not even know that [14:08:18] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417257 (https://phabricator.wikimedia.org/T189206) (owner: 10MarcoAurelio) [14:09:33] (03Merged) 10jenkins-bot: Create the 'rollbacker' group at ar.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417257 (https://phabricator.wikimedia.org/T189206) (owner: 10MarcoAurelio) [14:10:05] (03CR) 10jenkins-bot: Create the 'rollbacker' group at ar.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417257 (https://phabricator.wikimedia.org/T189206) (owner: 10MarcoAurelio) [14:10:11] (03CR) 10Vgutierrez: [C: 032] "Package properly generated and tested on pybal-test2001" [debs/pybal] - 10https://gerrit.wikimedia.org/r/417252 (owner: 10Vgutierrez) [14:10:42] (03PS3) 10Zfilipin: 2017 wikitext editor: Enable by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413653 (https://phabricator.wikimedia.org/T188028) (owner: 
10Jforrester) [14:10:48] (03Merged) 10jenkins-bot: Release 1.15.2: Ensure per-service-MED is an integer [debs/pybal] - 10https://gerrit.wikimedia.org/r/417252 (owner: 10Vgutierrez) [14:11:24] zeljkof: https://phabricator.wikimedia.org/T189206#4035081 :) [14:11:33] (03CR) 10Zfilipin: [C: 031] 2017 wikitext editor: Enable by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413653 (https://phabricator.wikimedia.org/T188028) (owner: 10Jforrester) [14:12:17] did the train go yet today ? [14:12:26] Hauskatze: the commit is at mwdebug1002, please test and let me know if I can deploy [14:12:43] thedj: this is all I know https://tools.wmflabs.org/versions/ [14:12:45] testing [14:13:05] thedj: group 0 and 1 at wmf.24, group 2 at wmf.23 [14:13:32] zeljkof: looks good to me, you can deploy [14:13:50] Hauskatze: deploying [14:14:00] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [14:14:29] large spike [14:15:16] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:417257|Create the rollbacker group at ar.wikinews (T189206)]] (duration: 01m 16s) [14:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:30] T189206: Creation of Rollbacker group on ar.wikinews - https://phabricator.wikimedia.org/T189206 [14:15:36] Hauskatze: deployed, please check and thanks for deploying with #releng! 
;) [14:15:41] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:15:49] ktnx [14:15:49] <_joe_> on misc [14:15:53] James_F: reviewing your patches, please stand by for testing [14:15:54] graphite-labs ? [14:15:57] yes, 502s [14:16:01] <_joe_> yes [14:16:11] Sure. [14:16:19] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413653 (https://phabricator.wikimedia.org/T188028) (owner: 10Jforrester) [14:16:46] (03PS3) 10Gehel: wdqs: use the raid10-gpt-srv-lvm-ext4 partman config for new wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/417202 (https://phabricator.wikimedia.org/T187766) [14:17:34] (03CR) 10Gehel: [C: 032] wdqs: use the raid10-gpt-srv-lvm-ext4 partman config for new wdqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/417202 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [14:17:39] James_F: I'll merge the config change first, then VE changes. [14:17:52] Sure [14:17:53] . 
[14:18:11] (03Merged) 10jenkins-bot: 2017 wikitext editor: Enable by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413653 (https://phabricator.wikimedia.org/T188028) (owner: 10Jforrester) [14:19:00] (03PS1) 10Vgutierrez: Release 1.15.2: Ensure per-service-MED is an integer [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/417264 [14:20:34] (03CR) 10Vgutierrez: [C: 032] Release 1.15.2: Ensure per-service-MED is an integer [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/417264 (owner: 10Vgutierrez) [14:20:58] James_F: config change (413653) is at mwdebug1002, please test and let me know if I can deploy it [14:21:00] (03Merged) 10jenkins-bot: Release 1.15.2: Ensure per-service-MED is an integer [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/417264 (owner: 10Vgutierrez) [14:21:12] (03PS1) 10Andrew Bogott: wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T168470) [14:21:44] (03PS1) 10BBlack: eqsin: set the real IPs in geo-dns config [dns] - 10https://gerrit.wikimedia.org/r/417266 (https://phabricator.wikimedia.org/T156027) [14:21:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:21:54] (03PS2) 10Andrew Bogott: wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) [14:22:04] (03PS1) 10BBlack: eqsin: configure public endpoints monitoring [puppet] - 10https://gerrit.wikimedia.org/r/417267 (https://phabricator.wikimedia.org/T156027) [14:23:01] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [14:23:01] (03CR) 10jerkins-bot: [V: 04-1] wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) (owner: 10Andrew Bogott) [14:23:03] andrewbogott: are we still good for 15:30UTC? (so in an hour) [14:23:52] (03PS3) 10Andrew Bogott: wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) [14:24:00] (03CR) 10BBlack: [C: 032] eqsin: set the real IPs in geo-dns config [dns] - 10https://gerrit.wikimedia.org/r/417266 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [14:24:27] zeljkof: Hmm. Works ish, but seems a little broken. Let's sync for now and I'll do a follow-up later today. [14:24:39] James_F: ok, deploying [14:25:32] marostegui: yep! [14:26:01] andrewbogott: excellent - thanks! [14:26:22] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:413653|2017 wikitext editor: Enable by default on officewiki (T188028)]] (duration: 01m 16s) [14:26:34] !log uploaded pybal_1.15.2_all.deb to apt.wikimedia.org jessie-wikimedia [14:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:39] T188028: Deploy the 2017 wikitext mode as the default editing environment for all users at office.wiki - https://phabricator.wikimedia.org/T188028 [14:26:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:00] James_F: 413653 deployed, still waiting for CI for VE changes [14:27:08] Thanks. 
[14:27:54] !log mobrovac@tin Started deploy [cpjobqueue/deploy@4fa1cf0]: Lower the refreshLinks concurrency to 175 - T185052 [14:27:55] James_F: both VE commits are merged, they will be at mwdebug in a few minutes [14:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:08] T185052: Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052 [14:28:26] !log mobrovac@tin Finished deploy [cpjobqueue/deploy@4fa1cf0]: Lower the refreshLinks concurrency to 175 - T185052 (duration: 00m 33s) [14:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:06] James_F: both VE commits are at mwdebug1002 [14:31:12] Thanks. [14:31:29] let me know if you want them deployed separately or together [14:31:48] (03CR) 10BBlack: [C: 032] eqsin: configure public endpoints monitoring [puppet] - 10https://gerrit.wikimedia.org/r/417267 (https://phabricator.wikimedia.org/T156027) (owner: 10BBlack) [14:32:45] zeljkof: Looks great. Deploy away. [14:33:09] (03CR) 10Ottomata: [C: 031] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/417038 (https://phabricator.wikimedia.org/T188408) (owner: 10DCausse) [14:34:04] zeljkof: (Both of them together is fine.) 
[14:35:09] (03CR) 10jenkins-bot: 2017 wikitext editor: Enable by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/413653 (https://phabricator.wikimedia.org/T188028) (owner: 10Jforrester) [14:35:36] James_F: already started separate deployment :) [14:35:42] !log zfilipin@tin Synchronized php-1.31.0-wmf.24/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.ArticleTarget.js: SWAT: [[gerrit:417023|Follow-up I5357a909: Fix logic for autosave from edited state (T189071)]] (duration: 01m 16s) [14:35:44] it should take a minute or two longer [14:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:58] T189071: "Changes recovered" should not be shown if an edit was recovered containing no changes - https://phabricator.wikimedia.org/T189071 [14:36:40] Excellent, thank you zeljkof. [14:37:04] PROBLEM - Host ripe-atlas-eqsin IPv6 is DOWN: CRITICAL - Destination Unreachable (2001:df2:e500:201:103:102:166:20) [14:37:24] !log zfilipin@tin Synchronized php-1.31.0-wmf.24/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.Target.js: SWAT: [[gerrit:417177|Blacklist Web of Trust junk from being added to pages (T189148)]] (duration: 01m 15s) [14:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:38] T189148: "donut-container"
added in every - https://phabricator.wikimedia.org/T189148 [14:37:46] James_F: all commits deployed, please check and thanks for deploying with #releng! ;) [14:37:56] zeljkof: My compliments again. :-) [14:37:57] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4035154 (10Ottomata) +1 elukey! But, isn't https://gerrit.wikimedia.org/r/#/c/395923/ already merged? [14:38:19] !log Stop mysql on es1019 - T187530 [14:38:22] <_joe_> is it my turn then? :) [14:38:29] ACKNOWLEDGEMENT - Host ripe-atlas-eqsin IPv6 is DOWN: CRITICAL - Destination Unreachable (2001:df2:e500:201:103:102:166:20) Brandon Black T179042 [14:38:30] Jhs: your script will not be run if you are not around; also, please leave a note in the calendar why the script should be run, is there a phabricator task? [14:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:34] T187530: es1019 ipmi and mgmt unresponsive - https://phabricator.wikimedia.org/T187530 [14:38:37] !log EU SWAT finished [14:38:49] <_joe_> zeljkof: I had a patch in SWAT... [14:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:56] _joe_: your turn, as far as I am concerned [14:39:05] _joe_: oh, I don't see it in the calendar [14:39:10] <_joe_> zeljkof: ah [14:39:14] * zeljkof is refreshing the page [14:39:15] <_joe_> zeljkof: "Sorry! We could not process your edit due to a loss of session data. " [14:39:20] <_joe_> and I didn't see it [14:39:31] _joe_: ok, makes sense now :D [14:39:40] want to deploy it yourself? or should I? [14:39:53] <_joe_> I can do it [14:40:00] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4035162 (10elukey) >>! In T188294#4035154, @Ottomata wrote: > +1 elukey! 
But, isn't https://gerrit.wikimedia.org/r/#/c/395923/ alread... [14:40:08] _joe_: swat is all yours :) [14:40:36] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4035163 (10Ottomata) AH hm ok! [14:41:31] 10Operations, 10ops-eqsin, 10Traffic, 10netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4035164 (10BBlack) Ping monitoring for this anchor merged in with: https://gerrit.wikimedia.org/r/#/c/417267/1/modules/netops/manifests/monitoring.pp What we're missing in configuratio... [14:41:52] <_joe_> volans: I changed my mind regarding https://gerrit.wikimedia.org/r/#/c/416483 [14:41:57] <_joe_> sending a new version [14:42:12] ack [14:43:14] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157#4035174 (10BBlack) Reminder: after hardware level is fixed and the host is installed, we'll need to uncomment its entry in `hieradata/common/cache/upload.yaml` before it will successfully puppetize and join... [14:44:26] (03PS6) 10Giuseppe Lavagetto: Enable use of EtcdConfig everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) [14:44:33] <_joe_> volans: ^^ [14:44:42] <_joe_> this allows us to sync a single file [14:44:59] <_joe_> and not to have to do two syncs [14:45:15] 10Operations, 10Traffic: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#4035182 (10BBlack) a:03ayounsi We're still missing rancid definitions in puppet's `modules/rancid/files/core/router.db`, ping @ayounsi [14:45:53] fair enough [14:46:09] (03CR) 10Volans: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [14:47:14] (03CR) 10Giuseppe Lavagetto: [C: 032] Enable use of EtcdConfig everywhere. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [14:48:30] (03Merged) 10jenkins-bot: Enable use of EtcdConfig everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [14:49:09] 10Operations, 10Traffic: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4035206 (10BBlack) [14:49:12] 10Operations, 10Traffic, 10Patch-For-Review: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#4035199 (10BBlack) 05Open>03Resolved a:03BBlack With the last merges above, all the known issues that actually belong here are resolved other than 3 cases from the previous lis... [14:50:22] <_joe_> volans: syncing now [14:50:28] * volans run away [14:50:40] how's is looking on mwdebug? [14:51:06] <_joe_> testing right now [14:51:18] (03CR) 10jenkins-bot: Enable use of EtcdConfig everywhere. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416483 (https://phabricator.wikimedia.org/T182597) (owner: 10Giuseppe Lavagetto) [14:52:59] <_joe_> uhm, scap is actually still running [14:54:03] <_joe_> oh scap is running cdb-rebuild there, ofc [14:54:54] btw the current puppettization is restarting the update unit at every puppet run :( [14:54:57] will need some love [14:55:03] <_joe_> volans: whatever [14:55:30] <_joe_> now syncing to the cluster [14:56:30] <_joe_> zeljkof: is it normal for scap to spend 33 seconds in cache_git_info? 
[14:56:57] !log oblivian@tin Synchronized wmf-config/CommonSettings.php: Use EtcdConfig everywhere (duration: 01m 15s) [14:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:45] _joe_: I think there is a task for that :) [14:58:58] I remember somebody discussing it recently [15:00:08] !log Change topology in m5, db2037 to become a slave of db1073 - T189005 [15:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:25] T189005: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005 [15:00:52] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035219 (10Marostegui) [15:01:15] (03PS3) 10Marostegui: site.pp: Make db1073 master [puppet] - 10https://gerrit.wikimedia.org/r/416650 (https://phabricator.wikimedia.org/T183469) [15:01:21] !log Disable puppet on db1073 - T189005 [15:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:05] (03CR) 10Marostegui: [C: 032] site.pp: Make db1073 master [puppet] - 10https://gerrit.wikimedia.org/r/416650 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [15:03:37] (03PS1) 10Ottomata: [WIP] Apply geocode and deuplicate transform function for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) [15:04:11] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Apply geocode and deuplicate transform function for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) (owner: 10Ottomata) [15:04:20] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035238 (10Marostegui) [15:05:21] (03PS2) 10Ottomata: [WIP] Apply geocode and deuplicate transform function for refine jobs [puppet] - 
10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) [15:07:13] (03CR) 10BryanDavis: "> Thinking that the host name (actually, an alias ip) and the dns" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 (owner: 10Jcrespo) [15:10:58] marostegui: what will be the new IP for m5-master? [15:11:28] andrewbogott: https://gerrit.wikimedia.org/r/#/c/416680/ [15:13:27] (03PS1) 10Andrew Bogott: wikitech: change the IP address for m5-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417290 (https://phabricator.wikimedia.org/T189005) [15:13:32] marostegui: ^ [15:14:02] marostegui: can the new host serve traffic already or only after the switch-over? [15:14:11] (03CR) 10Jcrespo: "> By convention we use the host portion of the FQDN for this otherwise opaque label." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 (owner: 10Jcrespo) [15:15:11] andrewbogott: only read traffic, but I would suggest we wait till db1009 is down, to avoid possible issues (just in case) [15:15:31] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035288 (10Andrew) [15:15:37] ok. I added that patch to your checklist [15:16:11] check this: https://gerrit.wikimedia.org/r/#/c/417211/ there is a discussion going on [15:16:30] (which I agree with Jaime) [15:17:19] (03CR) 10Jcrespo: "> not super relevant until we provision MediaWiki hosts for labswiki in codfw." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 (owner: 10Jcrespo) [15:17:50] !log silencing nova and other openstack alerts in anticipation of service interruptions for https://phabricator.wikimedia.org/T189005 [15:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:27] (03CR) 10Marostegui: [C: 031] "> > By convention we use the host portion of the FQDN for this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 (owner: 10Jcrespo) [15:22:29] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035305 (10Andrew) [15:23:48] (03PS3) 10Ottomata: [WIP] Apply geocode, deduplicate and monitoring for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) [15:24:01] marostegui: CI will be off during this switch-over, so you'll want to make sure that https://gerrit.wikimedia.org/r/#/c/416680/ is merge-ready before we start [15:24:20] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Apply geocode, deduplicate and monitoring for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) (owner: 10Ottomata) [15:24:34] andrewbogott: ah right :) [15:24:41] (03PS3) 10Marostegui: wmnet: Promote db1073 to become m5 master [dns] - 10https://gerrit.wikimedia.org/r/416680 (https://phabricator.wikimedia.org/T183469) [15:25:00] (03PS3) 10Marostegui: dbproxy1005: Make db1073 master instead of db1009 [puppet] - 10https://gerrit.wikimedia.org/r/416658 (https://phabricator.wikimedia.org/T183469) [15:25:23] (03PS4) 10Ottomata: [WIP] Apply geocode, deduplicate and monitoring for refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/417287 (https://phabricator.wikimedia.org/T186833) [15:26:06] marostegui: let me know when you're ready for me to switch it off. 
Should be done at roughly the last possible opportunity :) [15:26:08] (03CR) 10Marostegui: [C: 032] dbproxy1005: Make db1073 master instead of db1009 [puppet] - 10https://gerrit.wikimedia.org/r/416658 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [15:26:18] andrewbogott: yeah, in 4 minutes :) [15:26:45] (03PS3) 10Ottomata: Revert "Revert "Point Mediawiki Monolog at Kafka jumbo in production"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417006 (https://phabricator.wikimedia.org/T188136) [15:26:47] andrewbogott: Going to do the "noop" changes now [15:27:31] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035323 (10Marostegui) [15:27:53] (03CR) 10BryanDavis: "My change was to make the opaque label work as `sql` expected as a hostname for a `mysql -h LABEL ...` command when the bare host was not " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 (owner: 10Jcrespo) [15:28:12] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4027038 (10Marostegui) [15:29:01] !log disabling puppet on labnodepool1001 [15:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:20] !log merging and then deploying mediawiki-config to point monolog avro kafka producer at new kafka jumbo cluster: https://phabricator.wikimedia.org/T188136 [15:29:25] (03CR) 10Ottomata: [C: 032] Revert "Revert "Point Mediawiki Monolog at Kafka jumbo in production"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417006 (https://phabricator.wikimedia.org/T188136) (owner: 10Ottomata) [15:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:38] andrewbogott: let's go for it? 
[15:29:42] (03CR) 10jenkins-bot: Revert "Revert "Point Mediawiki Monolog at Kafka jumbo in production"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417006 (https://phabricator.wikimedia.org/T188136) (owner: 10Ottomata) [15:29:48] andrewbogott: let me know when I can set read_only on m5 master [15:30:10] marostegui: have at. It'll break things but I'm as ready as I'm going to be [15:30:23] ok [15:30:26] going for it then [15:30:29] 10Operations, 10Analytics, 10DBA, 10EventBus, and 5 others: High (2-3x) write and connection load on enwiki databases - https://phabricator.wikimedia.org/T189204#4035332 (10mobrovac) After [lowering the concurrency](https://gerrit.wikimedia.org/r/#/c/417270/) the number of new connections slightly dropped,... [15:30:32] (03CR) 10Jcrespo: "m5-master is not the hostname of db1009. Nor it is m5-master.eqiad.wmnet. if mysql trys to connct to it, it should and eventually will fai" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 (owner: 10Jcrespo) [15:30:40] !log Set m5 master db1009 read only for the failover - T189005 [15:30:51] done [15:31:06] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035335 (10Andrew) [15:31:37] marostegui: Failed to log message to wiki. Somebody should check the error logs. 
[15:31:38] T189005: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005 [15:32:21] 10Operations, 10MediaWiki-Configuration, 10MW-1.31-release-notes (WMF-deploy-2018-02-27 (1.31.0-wmf.23)), 10Patch-For-Review, 10discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597#4035336 (10Joe) [15:32:25] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035337 (10Marostegui) [15:32:39] 10Operations, 10MediaWiki-Configuration, 10MW-1.31-release-notes (WMF-deploy-2018-02-27 (1.31.0-wmf.23)), 10Patch-For-Review, 10discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597#3828259 (10Joe) [15:32:42] 10Operations, 10MediaWiki-Configuration, 10monitoring, 10Patch-For-Review: EtcdConfig: add Icinga check - https://phabricator.wikimedia.org/T188922#4035338 (10Joe) 05Open>03Resolved [15:32:49] !log otto@tin Synchronized wmf-config/ProductionServices.php: Point Mediawiki Monolog at new Kafka jumbo-eqiad cluster: T188136 (duration: 01m 16s) [15:33:14] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4027038 (10Marostegui) [15:33:18] 10Operations, 10MediaWiki-Configuration, 10MW-1.31-release-notes (WMF-deploy-2018-02-27 (1.31.0-wmf.23)), 10Patch-For-Review, 10discovery-system: Use EtcdConfig in production to allow automation of a datacenter switch - https://phabricator.wikimedia.org/T182597#3828259 (10Joe) The main part of the goal i... [15:33:41] (03CR) 10Marostegui: [C: 032] wmnet: Promote db1073 to become m5 master [dns] - 10https://gerrit.wikimedia.org/r/416680 (https://phabricator.wikimedia.org/T183469) (owner: 10Marostegui) [15:33:59] <_joe_> wikitech being down is part of your migration I guess? 
[15:34:02] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [15:34:05] _joe_: yep [15:34:14] andrewbogott: changing dns now, new master still on read_only [15:34:20] <_joe_> SO I CAN'T WRITE DOCUMENTATION, TOO BAD :D [15:34:32] PROBLEM - Check systemd state on ruthenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:34:33] _joe_: yes, briefly [15:34:43] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating [15:34:43] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035345 (10Marostegui) [15:34:46] andrewbogott: you want to merge: https://gerrit.wikimedia.org/r/417290 ? [15:34:53] everyone forgot that that setup included a hard-coded ip :( [15:34:54] andrewbogott: ^ nova-api consumers throwing alerts, I'll try to catch icinga before alerts [15:34:58] marostegui: sure, is it ready? [15:35:00] chasemp: thanks [15:35:05] yeah, the dns change has been deployed [15:35:06] (03CR) 10Andrew Bogott: [C: 032] wikitech: change the IP address for m5-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417290 (https://phabricator.wikimedia.org/T189005) (owner: 10Andrew Bogott) [15:35:09] (03PS1) 10Ottomata: Remove no longer needed camus mediawiki-analytics job [puppet] - 10https://gerrit.wikimedia.org/r/417294 (https://phabricator.wikimedia.org/T188136) [15:35:09] I guess it is now spreading [15:35:13] master still on read_only [15:35:32] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [15:35:39] <_joe_> marostegui: do you need to purge the recursors? 
I think I can help maybe [15:35:43] PROBLEM - Check systemd state on labstore1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:35:51] * andrewbogott hopes that mediawiki-config merging doesn't need nodepool [15:36:13] marostegui: I don't think you should wait for that wikitech patch to move on with other things [15:36:17] it's not coupled to anything else [15:36:26] ok - let's go writable then [15:36:29] the other server has mysql down [15:36:29] <_joe_> andrewbogott: you can force-merge mediawiki-config if needed [15:36:35] (03Merged) 10jenkins-bot: wikitech: change the IP address for m5-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417290 (https://phabricator.wikimedia.org/T189005) (owner: 10Andrew Bogott) [15:36:38] so if something tries to write, it will fail anyways [15:36:46] (03CR) 10Ottomata: [C: 032] Remove no longer needed camus mediawiki-analytics job [puppet] - 10https://gerrit.wikimedia.org/r/417294 (https://phabricator.wikimedia.org/T188136) (owner: 10Ottomata) [15:36:58] _joe_: yep. I honestly don't know if this needs nodepool or if it's happy.
Will force merge in a moment [15:37:13] andrewbogott: the master is now writable [15:37:14] !log downtime nfs-exportd labstore1004/5 as part of nova db maint [15:37:16] ah, nope, it merged fine [15:37:43] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active [15:37:58] marostegui: great, nova seems to be happy again [15:38:02] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1005 is OK: OK - nfs-exportd is active [15:38:10] !log restarting nodepool [15:38:12] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035352 (10Marostegui) [15:38:13] andrewbogott: yeah, I see connections there already [15:38:43] (03CR) 10jenkins-bot: wikitech: change the IP address for m5-master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417290 (https://phabricator.wikimedia.org/T189005) (owner: 10Andrew Bogott) [15:39:00] !log andrew@tin Synchronized wmf-config/db-eqiad.php: new m5 IP (duration: 01m 15s) [15:39:13] wikitech back up [15:39:14] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035354 (10Andrew) [15:39:15] trying to write now [15:39:25] (03CR) 10BryanDavis: "> m5-master is not the hostname of db1009.
Nor it is" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 (owner: 10Jcrespo) [15:39:35] (03PS1) 10Rush: wip: openstack: nova bootstrapping with mitaka and debian [puppet] - 10https://gerrit.wikimedia.org/r/417296 (https://phabricator.wikimedia.org/T188266) [15:39:36] I can edit it fine [15:40:04] !log Power off es1019 - T187530 [15:40:27] ^ checking if that works [15:40:29] !log andrew@tin Synchronized wmf-config/db-codfw.php: new m5 IP (duration: 01m 15s) [15:41:30] I can edit my own page in wikitech [15:41:33] RECOVERY - Check systemd state on ruthenium is OK: OK - running: The system is fully operational [15:41:52] !log Power off es1019 - T187530 [15:42:07] !log stopping nodepool again because something isn't quite right [15:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:08] T187530: es1019 ipmi and mgmt unresponsive - https://phabricator.wikimedia.org/T187530 [15:42:15] ok, that works [15:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:03] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:45:15] ^ is that because of the m5 failover? [15:45:19] marostegui: I think we lost a grant on the new db service [15:45:20] maybe [15:45:31] a service on labnet1001.wikimedia.org can't connect [15:45:33] andrewbogott: which one? [15:45:35] ok, checking [15:45:54] you've got an user for me? [15:46:03] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:46:30] user should just be 'nova' [15:46:48] and the db is also 'nova' [15:47:16] that user has privileges for "%" [15:47:27] which error is it getting? 
could it be trying to connect to the old host? [15:47:31] which has mysql off? [15:47:54] it should be connecting to m5-master.eqiad.wmnet [15:48:11] yeah, I mean, did the dns updated on that host? [15:48:17] looks like it did [15:48:20] andrewbogott: this hangs for me labnet1001:~# telnet db1073.eqiad.wmnet 3306 [15:48:21] andrewbogott: did you restart it? I bet the pools cache dns [15:48:23] did it work for you? [15:48:42] chasemp: oh, you're right, I was using the api port [15:48:44] so, firewall! [15:48:50] marostegui: ^ [15:48:51] aha! [15:48:53] maybe [15:49:04] let's see iptables [15:49:19] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#4035364 (10Cmjohnson) [15:49:28] 10Operations, 10ops-eqiad, 10monitoring, 10Patch-For-Review: es1019 ipmi and mgmt unresponsive - https://phabricator.wikimedia.org/T187530#4035361 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson Reset the server, drained all power, removed power cables, held in power button for 10 secs. Restored every... [15:50:03] andrewbogott: what's the task for this? [15:50:15] https://phabricator.wikimedia.org/T189005 [15:51:46] The rules are the same on db1009 and db1073 [15:51:49] (iptables rules) [15:51:57] ok, I think I'm wrong about everything and it probably is routing [15:52:07] yeah hang on [15:52:10] sorry chasemp, I confused my tests, you should probably not trust anything I said [15:53:03] It's working now [15:53:06] was that you, chase? 
[15:53:08] yes [15:53:08] \o/ [15:53:36] marostegui: fyi there is an ACL that allows these hosts to talk to internal things and it had the IP for db1009 fixed exactly [15:53:37] ok, you would've gotten there a lot faster without me :) [15:53:48] I extended that ACL to also include 10.64.16.79 [15:53:50] chasemp: ta [15:54:00] !log restarting nodepool again [15:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:21] marostegui: those are the only two valid IPs right? [15:54:36] 10.64.16.79 and 10.64.0.13 [15:54:56] chasemp: correct, the slave is db2037, but we are not using a proxy to fail it over, but I would include it just in case [15:55:33] ok, I can do that [15:55:35] 0.13 will be gone in a few days as the old host will be decommissioned, but let's leave it for now [15:55:52] PROBLEM - haproxy failover on dbproxy1005 is CRITICAL: CRITICAL check_failover servers up 2 down 1 [15:56:01] going to check that [15:56:28] !log restarting nova-fullstack on labnet1001 [15:56:41] marostegui: db2037.codfw.wmnet has address 10.192.32.8 should be allowed?
(just confirming) [15:56:49] db1073 got a nice spike of connections…that is why the proxy was complaining [15:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:59] chasemp: correct [15:57:06] kk, doing [15:57:24] chasemp: shouldn't be used, but just in case, let's have it there to avoid headaches if we ever have to failover to codfw for some reasons [15:57:39] * chasemp nods [15:57:41] tx marostegui [15:57:57] RECOVERY - Check systemd state on labstore1004 is OK: OK - running: The system is fully operational [15:57:58] RECOVERY - haproxy failover on dbproxy1005 is OK: OK check_failover servers up 2 down 0 [15:58:12] chasemp: I will ping you once we decomm db1009 (10.64.0.13) to clean that one up - will be next week probably [15:58:24] marostegui: great, that'll work [15:59:19] chasemp: I will make a note on the decommissioning task [15:59:55] How are things looking from your side andrewbogott? [16:00:05] marostegui: I'm waiting for one more test to complete [16:00:11] but I haven't found anything else wrong so far [16:00:33] I am seeing lots of sleeping connections from nova [16:00:44] I don't think that's unusual [16:00:56] might be mitigated by https://gerrit.wikimedia.org/r/#/c/415619/ [16:01:04] but I was holding off on that because… only so many changes at a time :) [16:01:09] yeah [16:01:42] marostegui: a full-stack nova run just completed, so I think we're all good [16:01:45] There are 255 connections, it is not changing [16:01:49] at least, everything that I can think to look at [16:02:00] marostegui: ok, let me restart some things in case it's upset from the switch over still [16:02:01] andrewbogott: I have also tested wikitech and that looks good [16:02:07] ok [16:02:14] did that do anything? 
[16:02:16] yes [16:02:19] decreased them [16:02:19] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035401 (10chasemp) In real time it was realized the ACL from labs-hosts VLAN was blocking access to the new m5 backing DB. > commit comment "T189005 nova... [16:02:27] To 188, which looks better [16:02:34] marostegui: I bet it climbs right back up to 255 in a minute or two [16:02:43] andrewbogott: I haven't seen any puppet recoveries yet [16:02:45] yeah, it is going up [16:03:27] 10Operations, 10ops-eqiad, 10Analytics-Kanban: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4035402 (10Cmjohnson) @elukey Did a check again today, the errors have not come back. Do you want to put back in production? [16:03:35] let's forget it about it now, it might be normal and mitigated by the patch you sent above [16:03:53] chasemp: tools-worker-1028 (which we should pool, btw) just got a clean puppet run [16:03:58] ok [16:04:13] I'll kick off a toolforge run then via clush [16:06:27] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1004 is OK: OK - maintain-dbusers is active [16:07:11] marostegui: Everything on my end is fine, all that I know of that's left is chasemp finalizing the router change (if that's not done already). 
[16:07:22] it is [16:07:25] But probably would be good if you hang around for another hour or two :) [16:07:30] andrewbogott: Same here - I will create the follow up task, which is decommissioning db1009 [16:07:42] andrewbogott: yep, I will be here [16:07:42] I made a note on that task w/ the commits but I didn't post the ACL as I wasn't sure on a public task if that was wise [16:07:57] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035404 (10Andrew) [16:08:02] andrewbogott: puppet run via clush in tools is ongoing [16:08:06] no surprises yet [16:08:07] great [16:08:16] cool [16:10:29] (03PS3) 10Andrew Bogott: openstack: Promote DnsManager to mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/417170 (owner: 10BryanDavis) [16:11:09] (03CR) 10Andrew Bogott: [C: 032] openstack: Promote DnsManager to mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/417170 (owner: 10BryanDavis) [16:14:27] (03PS1) 10Marostegui: site.pp: db1009 is not a master anymore [puppet] - 10https://gerrit.wikimedia.org/r/417300 (https://phabricator.wikimedia.org/T189216) [16:14:29] !log wdqs1004 down for systemboard replacement [16:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:14] (03CR) 10Jcrespo: "Let's abandon this- there is cleary an issue here, but it is not with the commit, the commit is part of a larger issue with the overal mod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 (owner: 10Jcrespo) [16:15:18] (03Abandoned) 10Jcrespo: Revert "wikitech: use FQDNs for m5 cluster members" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417211 (owner: 10Jcrespo) [16:17:01] (03CR) 10Marostegui: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10363/" [puppet] - 10https://gerrit.wikimedia.org/r/417300 (https://phabricator.wikimedia.org/T189216) (owner: 10Marostegui) [16:18:30] (03PS2) 
10Muehlenhoff: Add a thirdparty/php72 component for use by Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/415856 [16:19:08] (03PS2) 10Muehlenhoff: Add repository configuration for thirdparty/php71 [puppet] - 10https://gerrit.wikimedia.org/r/415857 [16:20:17] PROBLEM - Host wdqs1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:21:30] ^ downtime expired, issues on wdqs1004 are known... [16:22:09] Oh, this is the management interface... that's new... [16:22:27] (03CR) 10Muehlenhoff: "This is initially limited for use by Phabricator, we can revisit a wider adoption of 7.2 separately at a later point." [puppet] - 10https://gerrit.wikimedia.org/r/415856 (owner: 10Muehlenhoff) [16:22:28] ACKNOWLEDGEMENT - Host wdqs1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100% Chris Johnson This is down for system board replacement [16:23:28] cmjohnson: Oh, you're working on wdqs1004! Great! [16:24:58] (03PS1) 10BryanDavis: openstack: Update wikireplica_dns authn config [puppet] - 10https://gerrit.wikimedia.org/r/417302 [16:25:18] (03CR) 10Andrew Bogott: [C: 031] "This is better!" 
[puppet] - 10https://gerrit.wikimedia.org/r/416991 (owner: 10BryanDavis) [16:25:19] (03CR) 10Elukey: [C: 031] Add a thirdparty/php72 component for use by Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/415856 (owner: 10Muehlenhoff) [16:25:29] 10Operations, 10ops-codfw, 10hardware-requests, 10User-Elukey: decommission mw2097-mw2134 - https://phabricator.wikimedia.org/T189111#4035507 (10RobH) [16:25:46] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417303 [16:26:02] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417303 [16:27:00] (03PS3) 10Muehlenhoff: Add repository configuration for thirdparty/php72 [puppet] - 10https://gerrit.wikimedia.org/r/415857 [16:27:31] (03PS2) 10Andrew Bogott: openstack: Update wikireplica_dns authn config [puppet] - 10https://gerrit.wikimedia.org/r/417302 (owner: 10BryanDavis) [16:28:11] (03CR) 10Andrew Bogott: [C: 032] openstack: Update wikireplica_dns authn config [puppet] - 10https://gerrit.wikimedia.org/r/417302 (owner: 10BryanDavis) [16:30:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595#4035526 (10Cmjohnson) The error did follow the DIMM, The server is out of warranty but I will see if I can snag a similar DIMM from a decommissioned server. 
[16:32:35] !log Running wikireplica_dns from labcontrol1001 [16:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:39] (03PS3) 10Andrew Bogott: openstack: Refactor dns-floating-ip-updater.py script [puppet] - 10https://gerrit.wikimedia.org/r/416991 (owner: 10BryanDavis) [16:34:17] PROBLEM - Host elastic1021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:34:54] (03CR) 10Andrew Bogott: [C: 032] openstack: Refactor dns-floating-ip-updater.py script [puppet] - 10https://gerrit.wikimedia.org/r/416991 (owner: 10BryanDavis) [16:35:07] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (watching): Choose a server for the chromium-render service - https://phabricator.wikimedia.org/T187821#4035548 (10Jdlrobson) 05Open>03stalled [16:35:10] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 4 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4035549 (10Jdlrobson) [16:36:42] 10Operations, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Memory test failure on elastic1021 - https://phabricator.wikimedia.org/T188595#4035554 (10Cmjohnson) I do not have anything that size as a spare. @robh and @faidon will need to comment on buying a new DIMM. Current DIMM is Samsung 16G... 
[16:39:35] (03CR) 10Paladox: [C: 031] Gerrit: Make project deletion less destructive [puppet] - 10https://gerrit.wikimedia.org/r/417183 (owner: 10Chad) [16:41:57] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417306 [16:42:09] (03Abandoned) 10Marostegui: Revert "db-eqiad.php: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417303 (owner: 10Marostegui) [16:43:19] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417306 (owner: 10Marostegui) [16:44:33] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417306 (owner: 10Marostegui) [16:44:47] (03PS12) 10Ottomata: Refactor cache::kafka::eventlogging into profile [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [16:45:29] (03CR) 10jerkins-bot: [V: 04-1] Refactor cache::kafka::eventlogging into profile [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [16:46:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool es1019 with less weight after HW maintenance (duration: 01m 15s) [16:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:40] (03PS13) 10Ottomata: Refactor cache::kafka::eventlogging into profile [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [16:47:56] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417306 (owner: 10Marostegui) [16:49:22] (03PS2) 10Rush: openstack: nova bootstrapping with mitaka and debian [puppet] - 10https://gerrit.wikimedia.org/r/417296 (https://phabricator.wikimedia.org/T188266) [16:49:25] (03CR) 10Herron: [C: 032] add puppetdb role to puppetdb[12]001 servers [puppet] - 
10https://gerrit.wikimedia.org/r/409995 (https://phabricator.wikimedia.org/T185499) (owner: 10Herron) [16:49:32] (03PS6) 10Herron: add puppetdb role to puppetdb[12]001 servers [puppet] - 10https://gerrit.wikimedia.org/r/409995 (https://phabricator.wikimedia.org/T185499) [16:49:45] (03PS3) 10Rush: openstack: nova bootstrapping with mitaka and debian [puppet] - 10https://gerrit.wikimedia.org/r/417296 (https://phabricator.wikimedia.org/T188266) [16:50:03] (03CR) 10Anomie: [C: 031] "The rest look good to me." [puppet] - 10https://gerrit.wikimedia.org/r/417024 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:50:26] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova bootstrapping with mitaka and debian [puppet] - 10https://gerrit.wikimedia.org/r/417296 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [16:50:52] (03PS3) 10Bstorm: wiki replicas: Small fix in maintain-views changes for comment table [puppet] - 10https://gerrit.wikimedia.org/r/417024 (https://phabricator.wikimedia.org/T181650) [16:53:10] (03PS4) 10Rush: openstack: nova bootstrapping with mitaka and debian [puppet] - 10https://gerrit.wikimedia.org/r/417296 (https://phabricator.wikimedia.org/T188266) [16:56:43] (03CR) 10Bstorm: [C: 032] wiki replicas: Small fix in maintain-views changes for comment table [puppet] - 10https://gerrit.wikimedia.org/r/417024 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:57:14] (03PS1) 10Ottomata: Remove force_protocol_version for cache webrequest varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/417308 (https://phabricator.wikimedia.org/T185136) [16:57:27] RECOVERY - Host wdqs1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 3.73 ms [16:57:46] (03CR) 10jerkins-bot: [V: 04-1] Remove force_protocol_version for cache webrequest varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/417308 (https://phabricator.wikimedia.org/T185136) (owner: 10Ottomata) [16:57:48] (03CR) 10Ottomata: "Tested in beta, works fine." 
[puppet] - 10https://gerrit.wikimedia.org/r/417308 (https://phabricator.wikimedia.org/T185136) (owner: 10Ottomata) [16:58:19] (03PS2) 10Ottomata: Remove force_protocol_version for cache webrequest varnishkafka [puppet] - 10https://gerrit.wikimedia.org/r/417308 (https://phabricator.wikimedia.org/T185136) [16:58:31] (03PS1) 10Andrew Bogott: mediawiki scap: add labweb1001 and 1002 targets [puppet] - 10https://gerrit.wikimedia.org/r/417309 (https://phabricator.wikimedia.org/T168470) [17:00:05] godog, moritzm, and _joe_: How many deployers does it take to do Puppet SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180308T1700). [17:00:05] Krinkle: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:15] Im here, ready to roll :) [17:00:22] (03PS14) 10Ottomata: Refactor cache::kafka::eventlogging into profile [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [17:01:35] (03PS5) 10Rush: openstack: nova bootstrapping with mitaka and debian [puppet] - 10https://gerrit.wikimedia.org/r/417296 (https://phabricator.wikimedia.org/T188266) [17:01:37] (03PS4) 10Andrew Bogott: wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) [17:01:53] (03PS6) 10Rush: openstack: nova bootstrapping with mitaka and debian [puppet] - 10https://gerrit.wikimedia.org/r/417296 (https://phabricator.wikimedia.org/T188266) [17:03:48] andrewbogott: there db1009:/srv/andrewwork/labswikidump.sql don't know if you want to keep it or not, but next week if all goes fine, this host will be sent to DC Ops for final decommissioning [17:04:19] marostegui: no need to save, I'll delete it now [17:04:26] ok! 
thanks [17:04:32] 10Operations, 10DBA, 10cloud-services-team, 10Patch-For-Review: Failover m5 master from db1009 to db1073 - https://phabricator.wikimedia.org/T189005#4035640 (10Marostegui) [17:04:35] (03CR) 10Herron: [C: 032] puppetdb: add puppetdb4 apt component when puppetdb_major_version == 4 [puppet] - 10https://gerrit.wikimedia.org/r/417054 (https://phabricator.wikimedia.org/T185502) (owner: 10Herron) [17:04:49] (03PS3) 10Herron: puppetdb: add puppetdb4 apt component when puppetdb_major_version == 4 [puppet] - 10https://gerrit.wikimedia.org/r/417054 (https://phabricator.wikimedia.org/T185502) [17:04:55] (03PS2) 10Andrew Bogott: mediawiki scap: add labweb1001 and 1002 targets [puppet] - 10https://gerrit.wikimedia.org/r/417309 (https://phabricator.wikimedia.org/T168470) [17:05:24] !log stop and reboot db1114 for kernel regression [17:05:33] ... testing [17:05:39] not to create one :-) [17:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:18] haha [17:06:18] (03CR) 10Andrew Bogott: [C: 032] wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) (owner: 10Andrew Bogott) [17:06:45] (03CR) 10jerkins-bot: [V: 04-1] wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) (owner: 10Andrew Bogott) [17:08:34] _joe_: godog: ping :) [17:10:50] (03PS5) 10Andrew Bogott: wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) [17:10:52] (03CR) 10Rush: [C: 032] openstack: nova bootstrapping with mitaka and debian [puppet] - 10https://gerrit.wikimedia.org/r/417296 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:10:57] (03PS7) 10Rush: openstack: nova bootstrapping with mitaka 
and debian [puppet] - 10https://gerrit.wikimedia.org/r/417296 (https://phabricator.wikimedia.org/T188266) [17:11:24] (03CR) 10Andrew Bogott: wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) (owner: 10Andrew Bogott) [17:12:57] (03PS6) 10Andrew Bogott: wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) [17:13:20] (03PS3) 10Andrew Bogott: mediawiki scap: add labweb1001 and 1002 targets [puppet] - 10https://gerrit.wikimedia.org/r/417309 (https://phabricator.wikimedia.org/T168470) [17:13:58] (03PS15) 10Ottomata: Refactor cache::kafka::eventlogging into profile [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [17:14:42] (03CR) 10Andrew Bogott: [C: 032] wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) (owner: 10Andrew Bogott) [17:14:48] (03CR) 10Andrew Bogott: [C: 032] mediawiki scap: add labweb1001 and 1002 targets [puppet] - 10https://gerrit.wikimedia.org/r/417309 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [17:15:13] <_joe_> Krinkle: sorry, meeting running late [17:15:17] <_joe_> :/ [17:15:27] <_joe_> still here? 
let's go with puppet swat [17:16:04] (03Merged) 10jenkins-bot: wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) (owner: 10Andrew Bogott) [17:16:37] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00 consumer/mysql-eventbus consumer/client-side-events-log consumer/all-events-log processor/client-side-11 processor/client-side-10 processor/client-side-09 processor/client-side-08 processor/client-side-07 processor/client-side-06 processor/client-side-05 processor/client-side-04 proce [17:16:37] processor/client-side-02 processor/client-side-01 processor/client-side-00 [17:16:44] (03PS4) 10Giuseppe Lavagetto: mediawiki: Enable auto_prepend_file on canary_appserver [puppet] - 10https://gerrit.wikimedia.org/r/416637 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [17:17:02] _joe_: Yep, here. [17:17:04] Ready for testing. [17:17:10] _joe_: Can we start with an mwdebug? 
[17:17:15] For manual verification one more time :) [17:17:39] <_joe_> yes, my usual procedure for mildly dangerous changes is to disable puppet on the affected hosts and verify on one [17:17:50] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: Enable auto_prepend_file on canary_appserver [puppet] - 10https://gerrit.wikimedia.org/r/416637 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [17:18:04] (03CR) 10jenkins-bot: wikitech: on labweb, make mediawiki aware that it's behind varnishes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417265 (https://phabricator.wikimedia.org/T189168) (owner: 10Andrew Bogott) [17:18:13] _joe_: cool [17:18:40] <_joe_> cumin makes that super-easy :) [17:19:32] !log andrew@tin Synchronized wmf-config/wikitech.php: wikitech varnish updates (duration: 01m 15s) [17:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:18] <_joe_> Krinkle: running on mwdebug1002 [17:21:22] thx [17:21:32] <_joe_> takes some time to kill hhvm [17:21:59] the eventlogging alerts are a false positive [17:22:11] there are some upstart-related nagios check that I just discovered [17:22:12] <_joe_> Krinkle: you can test on mwdebug1002 [17:22:17] Yep, doing [17:23:07] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:23:41] <_joe_> uhm [17:23:48] <_joe_> that's... not true, icinga [17:23:57] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 76196 bytes in 0.134 second response time [17:24:25] gehel: mainboard has been swapped for wdqs1004...the mac address changed so it will need to be reimaged [17:25:09] <_joe_> Krinkle: I did run some of my smoke tests [17:26:18] <_joe_> I'll re-enable puppet on the other machines, I guess the effects on TC will be noticeable in the metrics in a couple days [17:27:03] _joe_: Hm.. I'm not seeing what I'm expecting to see from this change [17:27:12] _joe_: canary applies to mwdebug1002, yes? 
[17:27:16] <_joe_> Krinkle: what did you expect? [17:27:19] <_joe_> Krinkle: indeed [17:27:41] <_joe_> +hhvm.auto_prepend_file = /srv/mediawiki/wmf-config/PhpAutoPrepend.php [17:27:53] <_joe_> this is what gets added to the config [17:27:59] (03PS1) 10Rush: openstack: rabbit codify nova user [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) [17:28:06] _joe_: nono, it's not an hhvm-prefix setting [17:28:16] That would be the problem :/ [17:28:23] _joe_: I'm expecting that when I use X-Wikimedia-Debug to profile something, the code in profiler.php will now execute before anything else (even before /w/index.php and as such before multiversion) [17:28:33] (03CR) 10jerkins-bot: [V: 04-1] openstack: rabbit codify nova user [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:28:47] But I see it's still executing late. [17:28:54] * Krinkle re-reviews patch [17:29:05] zeljkof, sorry i forgot the swat window earlier. Did you run the namespaceDupes script? [17:29:15] _joe_: Ha, I didn't do the same in canary as in beta [17:29:21] Do you see the difference too? [17:29:23] Jhs: no, I did not know why they need to run [17:29:43] Jhs: there is a swat window soon, please move it to the new window [17:29:49] (03CR) 10Krinkle: mediawiki: Enable auto_prepend_file on canary_appserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/416637 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [17:29:57] zeljkof, ok. 
(Y) [17:30:00] <_joe_> Krinkle: that's why I was asking before :P [17:30:38] <_joe_> ok let's fix this [17:30:47] (03PS1) 10Cmjohnson: dhcp mac address change wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/417311 (https://phabricator.wikimedia.org/T188045) [17:30:58] (03PS1) 10Krinkle: mediawiki: Actually enable auto_prepend_file on canary_appserver [puppet] - 10https://gerrit.wikimedia.org/r/417312 (https://phabricator.wikimedia.org/T180183) [17:31:02] (03PS2) 10Rush: openstack: rabbit codify nova user [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) [17:31:11] (03CR) 10Krinkle: mediawiki: Enable auto_prepend_file on canary_appserver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/416637 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [17:31:15] _joe_: ^ [17:31:34] (03CR) 10Cmjohnson: [C: 032] dhcp mac address change wdqs1004 [puppet] - 10https://gerrit.wikimedia.org/r/417311 (https://phabricator.wikimedia.org/T188045) (owner: 10Cmjohnson) [17:31:44] (03CR) 10jerkins-bot: [V: 04-1] openstack: rabbit codify nova user [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [17:31:46] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: Actually enable auto_prepend_file on canary_appserver [puppet] - 10https://gerrit.wikimedia.org/r/417312 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [17:31:54] (03PS2) 10Giuseppe Lavagetto: mediawiki: Actually enable auto_prepend_file on canary_appserver [puppet] - 10https://gerrit.wikimedia.org/r/417312 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [17:31:58] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki: Actually enable auto_prepend_file on canary_appserver [puppet] - 10https://gerrit.wikimedia.org/r/417312 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [17:32:41] <_joe_> cmjohnson: merging your change too [17:32:49] thanks _ [17:33:40] <_joe_> 
Krinkle: running puppet on mwdebug1002 [17:34:25] _joe_: Does this config change require/cause a restart? [17:34:33] <_joe_> yes [17:34:51] (03PS3) 10Rush: openstack: rabbit codify nova user [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) [17:34:52] Hm.. that's unfortunate. So for non-canary, we'll need a slow rolling restart. [17:34:55] Anyway, testing now :) [17:35:06] <_joe_> Krinkle: that's again pretty easy [17:35:10] Eh.. getting 503 [17:35:17] <_joe_> yeah wait a sec [17:35:22] <_joe_> it's restarting hhvm, [17:35:24] and... back now [17:35:37] _joe_: Ah right, for mwdebug we don't have depool/repool? [17:35:43] <_joe_> nope [17:35:55] <_joe_> they're targeted directly [17:36:02] <_joe_> pybal doesn't protect you [17:36:32] <_joe_> and since they're debug servers, it's ok [17:37:13] Yeah [17:37:44] 10Operations, 10ops-eqiad, 10Discovery, 10Wikidata, and 3 others: wdqs1004 broken - https://phabricator.wikimedia.org/T188045#4035718 (10Cmjohnson) a:05Cmjohnson>03RobH The system board has been swapped, updated the bios and idrac. Also, update the mac address in puppet. assigning to @robh for reinstall [17:37:53] I suppose there isn't really much difference between the in-between director knowing the node is down and sending you an error (given you ask for a specific node) and the director not knowing and sending you the same error when trying. [17:38:10] Initial test looks good now, very nice [17:38:17] Just doing a couple more tests [17:38:48] <_joe_> well you get the 503 from nginx on hafnium, indeed [17:38:58] <_joe_> then mediated by varnish :P [17:39:03] hafnium !? [17:39:14] <_joe_> hafnium is the eqiad debug proxy [17:39:19] Really! [17:39:25] I had no idea the debug-director proxy ran there [17:39:29] That's python, right? 
[17:39:33] <_joe_> maybe I remember wrong [17:39:35] I didn't know it was nginx [17:39:40] <_joe_> no, it's an nginx config [17:39:51] <_joe_> a few lines of nginx config actually [17:40:06] hassaleh|hassium [17:40:08] Not hafnium [17:40:14] <_joe_> hah [17:40:14] hafnium is where webperf/navtiming runs [17:40:21] <_joe_> got confused, sorry [17:40:23] And there is only 1 role there, and I thought I knew what it was :P [17:40:28] <_joe_> hassium then :P [17:40:39] Would not have been that surprising, though. [17:41:05] _joe_: OK. Let's roll forward to the other canaries? [17:41:07] is it expected that "Ops Clinic Duty" is blank in the topic (just wondering) :) [17:41:22] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4035724 (10Marostegui) [17:41:33] <_joe_> Krinkle: puppet is enabled there, it will run within ~ 30 minutes [17:41:35] (03PS1) 10Ottomata: Set scheduler_maximum_allocation_mb globally, vary nodemanager_resource_memory_mb [puppet] - 10https://gerrit.wikimedia.org/r/417313 (https://phabricator.wikimedia.org/T188294) [17:41:38] _joe_: perfect [17:42:13] (03CR) 10jerkins-bot: [V: 04-1] Set scheduler_maximum_allocation_mb globally, vary nodemanager_resource_memory_mb [puppet] - 10https://gerrit.wikimedia.org/r/417313 (https://phabricator.wikimedia.org/T188294) (owner: 10Ottomata) [17:42:34] <_joe_> Krinkle: requests to mwdebug will now typically have some time spent fetching config from etcd [17:42:43] <_joe_> because they receive few requests [17:42:53] <_joe_> so every time the cache is stale [17:43:46] (03PS2) 10Ottomata: Global scheduler_maximum_allocation_mb, vary nodemanager_resource_memory_mb [puppet] - 10https://gerrit.wikimedia.org/r/417313 (https://phabricator.wikimedia.org/T188294) [17:48:03] (03CR) 10Ottomata: "Gr8 https://puppet-compiler.wmflabs.org/compiler02/10371/" [puppet] - 10https://gerrit.wikimedia.org/r/417313 
(https://phabricator.wikimedia.org/T188294) (owner: 10Ottomata) [17:48:05] (03CR) 10Ottomata: [C: 032] Global scheduler_maximum_allocation_mb, vary nodemanager_resource_memory_mb [puppet] - 10https://gerrit.wikimedia.org/r/417313 (https://phabricator.wikimedia.org/T188294) (owner: 10Ottomata) [17:51:01] _joe_: Yeah, noticed that. [17:58:51] 10Puppet, 10cloud-services-team (Kanban): role::puppet::self referenced in puppet_ssldir.rb - https://phabricator.wikimedia.org/T187622#4035753 (10Andrew) bump [18:08:24] (03PS1) 10Elukey: eventlogging: fix the nagios check when using systemd [puppet] - 10https://gerrit.wikimedia.org/r/417317 (https://phabricator.wikimedia.org/T114199) [18:09:37] (03PS16) 10Ottomata: Refactor cache::kafka::eventlogging into profile [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [18:11:11] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10372/eventlog1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/417317 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [18:12:14] (03CR) 10Ottomata: "No op!" [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [18:12:53] (03CR) 10Ottomata: [C: 032] Refactor cache::kafka::eventlogging into profile [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [18:13:03] (03PS17) 10Ottomata: Refactor cache::kafka::eventlogging into profile [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [18:13:04] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are runnning. 
[18:15:17] (03PS18) 10Ottomata: Refactor cache::kafka::eventlogging into profile [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [18:16:09] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4035832 (10RobH) Ok, just to confirm everything: > wdqs200[12] have a RAID controller (LSI Logic / Symbios Logic MegaRAID SAS-3 3108) but we are not using it. (I do re... [18:17:09] (03PS19) 10Ottomata: Refactor cache::kafka::eventlogging into profile [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) [18:18:22] (03CR) 10Ottomata: [C: 032] Refactor cache::kafka::eventlogging into profile [puppet] - 10https://gerrit.wikimedia.org/r/403067 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [18:24:29] (03PS1) 10Ottomata: Point eventlogging varnishkafka at Kafka jumbo-eqiad with TLS [puppet] - 10https://gerrit.wikimedia.org/r/417319 (https://phabricator.wikimedia.org/T183297) [18:27:50] (03PS1) 10Ottomata: Set up temporary secondary EventLogging camus-analytics job [puppet] - 10https://gerrit.wikimedia.org/r/417321 (https://phabricator.wikimedia.org/T183297) [18:28:26] (03PS1) 10Elukey: statistics::rsync::eventlogging: change rsync target to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/417322 (https://phabricator.wikimedia.org/T114199) [18:30:45] (03PS2) 10Ottomata: Set up temporary secondary EventLogging camus-analytics job [puppet] - 10https://gerrit.wikimedia.org/r/417321 (https://phabricator.wikimedia.org/T183297) [18:31:07] (03CR) 10Ottomata: [C: 031] statistics::rsync::eventlogging: change rsync target to eventlog1002 [puppet] - 10https://gerrit.wikimedia.org/r/417322 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [18:31:17] (03PS2) 10Odder: Update logos for Banyumasan and Urdu Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417020 
(https://phabricator.wikimedia.org/T189155) [18:32:06] (03CR) 10Ottomata: [C: 032] Set up temporary secondary EventLogging camus-analytics job [puppet] - 10https://gerrit.wikimedia.org/r/417321 (https://phabricator.wikimedia.org/T183297) (owner: 10Ottomata) [18:36:39] (03PS1) 10BryanDavis: labswiki: Replace 'm5-master' CNAME with backing db name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417324 [18:37:00] !log bsitzmann@tin Started deploy [mobileapps/deploy@d6819a0]: Update mobileapps to afb0167 [18:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:36] (03CR) 10BryanDavis: [C: 04-1] "I want to hear what if any objections Jaime and Manuel have to this before it is merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417324 (owner: 10BryanDavis) [18:43:14] !log bsitzmann@tin Finished deploy [mobileapps/deploy@d6819a0]: Update mobileapps to afb0167 (duration: 06m 14s) [18:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:18] (03PS6) 10Ottomata: Point eventlogging analytics and webperf processes at Kafka jumbo [puppet] - 10https://gerrit.wikimedia.org/r/404773 (https://phabricator.wikimedia.org/T183297) [18:48:22] (03CR) 10Andrew Bogott: [C: 04-1] labswiki: Replace 'm5-master' CNAME with backing db name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417324 (owner: 10BryanDavis) [18:53:06] (03CR) 10BryanDavis: [C: 04-1] labswiki: Replace 'm5-master' CNAME with backing db name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417324 (owner: 10BryanDavis) [19:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180308T1900). 
[19:00:04] James_F, Jhs, marlier, and twkozlowski: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:46] Jouncebot is a bit snarky, eh? [19:01:50] marlier: it has an attitude. originally it was very subservient and some folks didn't like that :) [19:02:02] * Jhs is present [19:02:24] Some people don't like the snarky one. :P [19:02:41] I can SWAT. [19:02:54] (03PS3) 10Muehlenhoff: Add a thirdparty/php72 component for use by Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/415856 [19:04:01] (03CR) 10Muehlenhoff: [C: 032] Add a thirdparty/php72 component for use by Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/415856 (owner: 10Muehlenhoff) [19:04:09] Do we have a James_F? [19:04:32] Always. [19:04:51] Perfect. [19:05:23] (03PS4) 10Rush: openstack: rabbit codify nova user [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) [19:06:25] Jhs: Ran the scripts. They turned up nothing. [19:06:57] (03PS7) 10Niharika29: NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [19:07:11] Niharika, thanks! that's good [19:07:52] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [19:08:46] I'm wondering if making jouncebot link to https://media.giphy.com/media/ojkd463uBS716/giphy.gif for SWAT deploys would be acceptable. 
:P [19:09:08] (03Merged) 10jenkins-bot: NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [19:09:21] (03CR) 10jenkins-bot: NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [19:09:33] (03PS1) 10MaxSem: Enable ping from edit summary in prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417329 (https://phabricator.wikimedia.org/T188469) [19:09:55] 10Operations, 10Ops-Access-Requests, 10Maps-Sprint, 10Patch-For-Review: Give Roan Kattouw the rights to deploy maps and restart maps-related services - https://phabricator.wikimedia.org/T189153#4036105 (10RobH) So this will need review in next Monday's operations meeting, since it increases sudo rights for... [19:10:46] Niharika: https://gerrit.wikimedia.org/r/#/c/417330/ [19:11:23] (anyone) For https://gerrit.wikimedia.org/r/#/c/415618/, I will sync IS before CS, does that sound right? [19:11:42] (03PS1) 10Imarlier: wmf-config: enable Singapore oversample as default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417331 (https://phabricator.wikimedia.org/T188652) [19:12:25] yeah, looks so [19:12:51] (03PS1) 10Rush: openstack: nova seed for db setup [puppet] - 10https://gerrit.wikimedia.org/r/417333 (https://phabricator.wikimedia.org/T188266) [19:13:07] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova seed for db setup [puppet] - 10https://gerrit.wikimedia.org/r/417333 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [19:13:21] marlier: https://gerrit.wikimedia.org/r/#/c/415618/ is up for testing, if you can? 
[19:14:40] (03PS5) 10Rush: openstack: rabbit codify nova user [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) [19:15:19] (03PS6) 10Rush: openstack: rabbit codify nova user [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) [19:15:56] marlier: Ping. [19:16:13] Niharika: er, up where? [19:16:21] marlier: mwdebug1002. [19:19:54] Niharika: looks good, thanks [19:20:12] marlier: Syncing. [19:20:44] PROBLEM - High lag on wdqs2001 is CRITICAL: CRITICAL - WDQS_Lag is 3608 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:20:54] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL - WDQS_Lag is 3620 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:21:15] PROBLEM - High lag on wdqs1005 is CRITICAL: CRITICAL - WDQS_Lag is 3643 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:21:20] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: NavigtationTiming: Enable oversampling for Singapore T188652 (duration: 01m 16s) [19:21:27] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#4036145 (10RobH) So I've emailed @MeganHernandez_WMF on 2018-02-23 & 2018-03-07, but have not... 
[19:21:34] PROBLEM - High lag on wdqs2003 is CRITICAL: CRITICAL - WDQS_Lag is 3658 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:21:35] PROBLEM - High lag on wdqs2002 is CRITICAL: CRITICAL - WDQS_Lag is 3659 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:38] T188652: Enable oversampling for Singapore - https://phabricator.wikimedia.org/T188652 [19:21:46] (03PS7) 10Rush: openstack: rabbit codify nova user [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) [19:22:10] !log sbisson@tin Started deploy [kartotherian/deploy@a839a16]: Deploying kartotherian with updated dependencies and zoom lovel 19 to maps-test* [19:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:04] !log niharika29@tin Synchronized wmf-config/CommonSettings.php: NavigtationTiming: Enable oversampling for Singapore T188652 (duration: 01m 15s) [19:23:08] (03PS3) 10Niharika29: Update logos for Banyumasan and Urdu Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417020 (https://phabricator.wikimedia.org/T189155) (owner: 10Odder) [19:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:10] Do we have a two Odd [19:24:25] twk [19:24:35] Maybe not. [19:24:51] !log sbisson@tin Finished deploy [kartotherian/deploy@a839a16]: Deploying kartotherian with updated dependencies and zoom lovel 19 to maps-test* (duration: 02m 40s) [19:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:18] "19:24:23 1 of 1 canary targets failed, exceeding limit" <- what does that mean? [19:25:49] James_F: Your patch is on mwdebug1002. [19:26:55] Niharika: Yup, works great. [19:28:03] James_F: Ack. 
[19:29:13] !log niharika29@tin Synchronized php-1.31.0-wmf.24/extensions/VisualEditor/: Hooks: Don't register beta features if they're enabled for all https://gerrit.wikimedia.org/r/#/c/417277/ (duration: 01m 14s) [19:29:16] James_F: ^Done. [19:29:22] Thanks! [19:29:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:04] !log Running `cleanupUsersWithNoId.php --table recentchanges --prefix wikidata --force` on wikidata client wikis for T181731. This shouldn't create any local SUL accounts. [19:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:22] T181731: Run maintenance/cleanupUsersWithNoId.php on all wikis - https://phabricator.wikimedia.org/T181731 [19:35:02] (03PS1) 10Smalyshev: Temporarily disable Kafka poller due to Kafka problems [puppet] - 10https://gerrit.wikimedia.org/r/417341 [19:35:58] (03CR) 10Gehel: [C: 032] Temporarily disable Kafka poller due to Kafka problems [puppet] - 10https://gerrit.wikimedia.org/r/417341 (owner: 10Smalyshev) [19:37:44] RECOVERY - High lag on wdqs2002 is OK: OK - WDQS_Lag is 29 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:38:04] RECOVERY - High lag on wdqs1003 is OK: OK - WDQS_Lag is 49 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:38:34] RECOVERY - High lag on wdqs1005 is OK: OK - WDQS_Lag is 74 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:38:37] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/10377/" [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [19:38:44] RECOVERY - High lag on wdqs2003 is OK: OK - WDQS_Lag is 90 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:38:55] RECOVERY - High lag on wdqs2001 is OK: OK - WDQS_Lag is 101 
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:42:05] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:43:08] (03PS1) 10RobH: changing all wdqs nodes to use similar partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/417346 (https://phabricator.wikimedia.org/T189192) [19:44:34] (03CR) 10RobH: [C: 032] changing all wdqs nodes to use similar partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/417346 (https://phabricator.wikimedia.org/T189192) (owner: 10RobH) [19:44:40] (03PS2) 10RobH: changing all wdqs nodes to use similar partman recipes [puppet] - 10https://gerrit.wikimedia.org/r/417346 (https://phabricator.wikimedia.org/T189192) [19:45:05] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL - WDQS_Lag is 4606 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [19:46:55] * Niharika kicks zuul [19:52:04] !log niharika29@tin Synchronized php-1.31.0-wmf.24/extensions/Echo/: https://gerrit.wikimedia.org/r/#/c/417330/ and https://gerrit.wikimedia.org/r/#/c/417340/ (duration: 01m 21s) [19:52:08] MaxSem: ^ All deployed. [19:52:14] SWAT over. One no-show. 
[19:52:19] weee [19:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:48] (03PS1) 10Herron: WIP: add careers.wikimedia.org spf/dkim/mx records [dns] - 10https://gerrit.wikimedia.org/r/417350 (https://phabricator.wikimedia.org/T189065) [19:56:07] (03PS1) 10RobH: typo in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/417351 [19:56:17] (03PS2) 10RobH: typo in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/417351 [19:56:27] (03CR) 10RobH: [C: 032] typo in netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/417351 (owner: 10RobH) [19:57:12] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4036318 (10Niedzielski) This seems to always happen on deployment-cache-text04: ``` Request from 73.252.38.252 via deployment-cache-text04 deployment-cache-text04, Varnish X... [19:57:34] (03PS1) 10Cmjohnson: Adding dns entries wdqs1006-8 [dns] - 10https://gerrit.wikimedia.org/r/417352 (https://phabricator.wikimedia.org/T188432) [20:00:05] thcipriani: (Dis)respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180308T2000). Please do the needful. [20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:00:09] 10Operations, 10Beta-Cluster-Infrastructure: Beta cluster Obama page often responds with 503 - https://phabricator.wikimedia.org/T188913#4036348 (10Niedzielski) https://en.m.wikipedia.beta.wmflabs.org/wiki/Main_Page seems to always work while the Obama article fails: ``` Request from 73.252.38.252 via deploym... [20:00:17] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4036349 (10RobH) I had a typo in that patchset, but I followed it up with a fix (just neglected to link the bug in the fix's commit message.) 
[20:01:06] (03CR) 10Andrew Bogott: openstack: rabbit codify nova user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/417310 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [20:01:22] 10Operations, 10DNS, 10Mail, 10Traffic, 10Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4029896 (10herron) Here's a patch to get the ball rolling on a subdomain for this. It's WIP since we will need an admin on the greenhouse account to supply the... [20:02:17] * thcipriani trains [20:09:27] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:09:46] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:12:09] (03PS1) 10Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - 10https://gerrit.wikimedia.org/r/417357 [20:12:42] (03CR) 10jerkins-bot: [V: 04-1] wiki replicas: script index creation for easier maintenance [puppet] - 10https://gerrit.wikimedia.org/r/417357 (owner: 10Bstorm) [20:17:55] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:18:30] (03PS1) 10Dzahn: netboot: temp remove deploy hosts for debugging [puppet] - 10https://gerrit.wikimedia.org/r/417360 (https://phabricator.wikimedia.org/T175288) [20:18:35] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less 
than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:19:09] (03PS2) 10Bstorm: wiki replicas: script index creation for easier maintenance [puppet] - 10https://gerrit.wikimedia.org/r/417357 [20:20:09] (03PS1) 10Thcipriani: All wikis to 1.31.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417362 [20:20:48] (03PS2) 10Dzahn: netboot: temp remove deploy hosts for debugging [puppet] - 10https://gerrit.wikimedia.org/r/417360 (https://phabricator.wikimedia.org/T175288) [20:22:07] (03CR) 10Dzahn: [C: 032] "to let us boot into installer but get manual partitioning step to ensure install issues are not related to partman" [puppet] - 10https://gerrit.wikimedia.org/r/417360 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [20:22:31] (03CR) 10Thcipriani: [C: 032] All wikis to 1.31.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417362 (owner: 10Thcipriani) [20:23:44] (03Merged) 10jenkins-bot: All wikis to 1.31.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417362 (owner: 10Thcipriani) [20:26:04] !log thcipriani@tin Synchronized php: Ensure symlink for 1.31.0-wmf.24 is up-to-date (duration: 01m 15s) [20:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:06] (03CR) 10jenkins-bot: All wikis to 1.31.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/417362 (owner: 10Thcipriani) [20:30:07] !log thcipriani@tin rebuilt and synchronized wikiversions files: All wikis to php-1.31.0-wmf.24 [20:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:01] (03PS1) 10Ottomata: Conditionally include MirrorMaker instances [puppet] - 10https://gerrit.wikimedia.org/r/417364 [20:34:46] !log set compression chunk length to 32, page_summary tables - T189057 [20:35:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:02] T189057: Understand (and if possible, improve) performance of new storage strategy - https://phabricator.wikimedia.org/T189057 [20:37:00] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler02/10378/" [puppet] - 10https://gerrit.wikimedia.org/r/417364 (owner: 10Ottomata) [20:37:01] (03CR) 10Ottomata: [C: 032] Conditionally include MirrorMaker instances [puppet] - 10https://gerrit.wikimedia.org/r/417364 (owner: 10Ottomata) [20:39:22] (03PS1) 10Herron: add puppetdb4 support to puppetmaster::puppetdb::client [puppet] - 10https://gerrit.wikimedia.org/r/417370 (https://phabricator.wikimedia.org/T177253) [20:40:09] (03CR) 10jerkins-bot: [V: 04-1] add puppetdb4 support to puppetmaster::puppetdb::client [puppet] - 10https://gerrit.wikimedia.org/r/417370 (https://phabricator.wikimedia.org/T177253) (owner: 10Herron) [20:40:53] !log set compression chunk length to 32, mobile tables - T189057 [20:41:07] (03PS2) 10Herron: add puppetdb4 support to puppetmaster::puppetdb::client [puppet] - 10https://gerrit.wikimedia.org/r/417370 (https://phabricator.wikimedia.org/T177253) [20:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:11] T189057: Understand (and if possible, improve) performance of new storage strategy - https://phabricator.wikimedia.org/T189057 [20:42:16] PROBLEM - Check systemd state on kafka1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:42:25] PROBLEM - Check systemd state on kafka1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:42:35] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties [20:42:56] ^ this is me [20:42:56] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties [20:43:05] its ok ah shoulda run puppet elsewhere [20:43:06] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad on kafka1014 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_jumbo-eqiad/producer\.properties [20:43:10] i'm removing these instances from these nodes [20:43:15] PROBLEM - Check systemd state on kafka1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:45:05] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:45:06] PROBLEM - Kafka MirrorMaker main-eqiad_to_analytics on kafka1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args kafka.tools.MirrorMaker.+/etc/kafka/mirror/main-eqiad_to_analytics/producer\.properties [20:45:26] PROBLEM - Check systemd state on kafka1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:45:36] PROBLEM - Check systemd state on kafka1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[20:45:37] RECOVERY - High lag on wdqs1003 is OK: OK - WDQS_Lag is 1192 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:45:37] PROBLEM - Check systemd state on kafka1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:46:40] I've removed those monitors ^ via puppet [20:55:37] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/10379/" [puppet] - 10https://gerrit.wikimedia.org/r/417370 (https://phabricator.wikimedia.org/T177253) (owner: 10Herron) [21:04:33] !log sbisson@tin Started deploy [kartotherian/deploy@6dcacbc]: Deploying kartotherian with updated dependencies and zoom level 19 to maps-test* (take 3) [21:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:02] (03PS1) 10Herron: change rhodium to use puppetdb4 server puppetdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/417376 (https://phabricator.wikimedia.org/T188544) [21:06:09] (03PS1) 10Ottomata: Having trouble with too many tcp connections! This is also breaking mirrormaker?!?! [puppet] - 10https://gerrit.wikimedia.org/r/417377 [21:06:23] (03PS2) 10Ottomata: Having trouble with too many tcp connections! This is also breaking mirrormaker?!?! [puppet] - 10https://gerrit.wikimedia.org/r/417377 [21:06:50] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational [21:06:55] (03CR) 10jerkins-bot: [V: 04-1] Having trouble with too many tcp connections! This is also breaking mirrormaker?!?! [puppet] - 10https://gerrit.wikimedia.org/r/417377 (owner: 10Ottomata) [21:08:10] (03PS3) 10Ottomata: Having trouble with too many tcp connections! 
[puppet] - https://gerrit.wikimedia.org/r/417377 [21:08:30] (PS1) BBlack: eqsin: asset tag DNS entries for server hosts [dns] - https://gerrit.wikimedia.org/r/417378 (https://phabricator.wikimedia.org/T181554) [21:08:47] (CR) Herron: [C: 2] add puppetdb4 support to puppetmaster::puppetdb::client [puppet] - https://gerrit.wikimedia.org/r/417370 (https://phabricator.wikimedia.org/T177253) (owner: Herron) [21:08:54] (PS3) Herron: add puppetdb4 support to puppetmaster::puppetdb::client [puppet] - https://gerrit.wikimedia.org/r/417370 (https://phabricator.wikimedia.org/T177253) [21:09:05] (PS1) Ottomata: Having trouble with too many tcp connections! [mediawiki-config] - https://gerrit.wikimedia.org/r/417379 [21:09:15] !log sbisson@tin Finished deploy [kartotherian/deploy@6dcacbc]: Deploying kartotherian with updated dependencies and zoom level 19 to maps-test* (take 3) (duration: 04m 42s) [21:09:16] (CR) Ottomata: [C: 2] Having trouble with too many tcp connections! 
[puppet] - https://gerrit.wikimedia.org/r/417377 (owner: Ottomata) [21:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:12] (CR) BBlack: [C: 2] eqsin: asset tag DNS entries for server hosts [dns] - https://gerrit.wikimedia.org/r/417378 (https://phabricator.wikimedia.org/T181554) (owner: BBlack) [21:11:45] (PS4) Herron: add puppetdb4 support to puppetmaster::puppetdb::client [puppet] - https://gerrit.wikimedia.org/r/417370 (https://phabricator.wikimedia.org/T177253) [21:12:27] Operations, ops-eqsin, Traffic, Patch-For-Review: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#4036493 (BBlack) [21:12:40] Operations, ops-eqsin, netops: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#4036496 (BBlack) [21:12:45] Operations, ops-eqsin, Traffic, Patch-For-Review: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3793949 (BBlack) Open>Resolved a:BBlack [21:12:55] (CR) Ottomata: [C: 2] Having trouble with too many tcp connections! [mediawiki-config] - https://gerrit.wikimedia.org/r/417379 (owner: Ottomata) [21:13:11] (CR) jenkins-bot: Having trouble with too many tcp connections! 
[mediawiki-config] - https://gerrit.wikimedia.org/r/417379 (owner: Ottomata) [21:13:14] Operations, ops-eqsin, Traffic, Patch-For-Review: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#4036499 (BBlack) [21:13:17] Operations, ops-eqsin, Traffic, Patch-For-Review: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#3793949 (BBlack) [21:13:23] Operations, ops-eqsin, Traffic, Patch-For-Review: rack/setup/install dns500[12] - https://phabricator.wikimedia.org/T181556#3793986 (BBlack) Open>Resolved a:BBlack [21:13:55] Operations, ops-eqsin, Traffic, Patch-For-Review: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#4036504 (BBlack) [21:14:07] Operations, ops-eqsin, Traffic, Patch-For-Review: rack/setup/install lvs500[123] - https://phabricator.wikimedia.org/T182171#3815107 (BBlack) Open>Resolved a:BBlack [21:15:18] !log otto@tin Synchronized wmf-config/ProductionServices.php: Revert: point monolog avro producer back at Kafka analytics. Too many TCP connections? 
T188136 (duration: 00m 58s) [21:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:36] T188136: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136 [21:15:45] (PS5) Herron: puppet_compiler: add support for puppetdb4 and local postgresql [puppet] - https://gerrit.wikimedia.org/r/413881 (https://phabricator.wikimedia.org/T187258) [21:16:03] Operations, ops-eqsin, Traffic, Patch-For-Review: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#4036509 (BBlack) [21:16:22] (CR) jerkins-bot: [V: -1] puppet_compiler: add support for puppetdb4 and local postgresql [puppet] - https://gerrit.wikimedia.org/r/413881 (https://phabricator.wikimedia.org/T187258) (owner: Herron) [21:16:40] Operations, ops-eqsin, Traffic, Patch-For-Review: rack/setup/install cp50(0[1-9]|1[0-2]) - https://phabricator.wikimedia.org/T181557#3794022 (BBlack) Open>Resolved a:BBlack (other than DOA cp5006, tracked separately for repair in T187157 [21:17:16] Operations, Traffic: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#4036517 (BBlack) [21:17:19] Operations, Traffic: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#4036515 (BBlack) Open>Resolved a:BBlack [21:26:16] Operations, Traffic: Server hardware installation for Asia Cache DC - https://phabricator.wikimedia.org/T156032#4036575 (BBlack) [21:26:19] Operations, Traffic: Server hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#4036573 (BBlack) Open>Resolved a:BBlack [21:26:36] Operations, Traffic: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4036582 (BBlack) [21:26:39] Operations, Traffic: Server hardware installation for Asia Cache DC - https://phabricator.wikimedia.org/T156032#2962057 (BBlack) 
Open>Resolved a:BBlack [21:27:28] Operations, ops-eqsin, Traffic, Patch-For-Review: rack/setup/install bast5001 - https://phabricator.wikimedia.org/T181554#4036593 (BBlack) [21:27:31] Operations, ops-eqsin, netops: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#4036592 (BBlack) [21:28:37] Operations, Traffic: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4036599 (BBlack) [21:28:40] Operations, ops-eqsin, netops: setup and deploy eqsin network infrastructure - https://phabricator.wikimedia.org/T181558#3794067 (BBlack) [21:31:24] Operations, Traffic: WP Zero workarounds for eqsin - https://phabricator.wikimedia.org/T189250#4036605 (BBlack) p:Triage>Normal [21:32:41] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler02/10381/" [puppet] - https://gerrit.wikimedia.org/r/417376 (https://phabricator.wikimedia.org/T188544) (owner: Herron) [21:33:35] (PS2) Herron: change rhodium to use puppetdb4 server puppetdb1001 [puppet] - https://gerrit.wikimedia.org/r/417376 (https://phabricator.wikimedia.org/T188544) [21:34:25] (CR) Herron: [C: 2] change rhodium to use puppetdb4 server puppetdb1001 [puppet] - https://gerrit.wikimedia.org/r/417376 (https://phabricator.wikimedia.org/T188544) (owner: Herron) [21:39:30] RECOVERY - MariaDB Slave Lag: s4 on db2090 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [21:41:10] !log set compression chunk length to 32, parsoid tables (group "others") - T189057 [21:41:20] Operations, Traffic: Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4036653 (BBlack) p:Triage>Normal [21:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:26] T189057: Understand (and if possible, improve) performance of new storage strategy - https://phabricator.wikimedia.org/T189057 [21:53:11] (PS5) 
Herron: WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) [21:53:46] (CR) jerkins-bot: [V: -1] WIP: puppetdbquery: upgrade to 3.0.1 [puppet] - https://gerrit.wikimedia.org/r/410050 (https://phabricator.wikimedia.org/T187259) (owner: Herron) [21:54:02] we'd had a mysterious baseline increase in the (relatively-small) amount of failed fetches from varnish->appservers since Feb 26, and circa 20:30 UTC today it suddenly vanished, fixing itself (~84 mins ago). [21:54:09] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-14d&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cache_type=text&var-server=All [21:54:30] (PS1) Dzahn: deploy1001: test with jessie installer [puppet] - https://gerrit.wikimedia.org/r/417454 (https://phabricator.wikimedia.org/T175288) [21:55:30] anyone know of something that got "fixed" then which might explain? I see in the right time ballpark both "All wikis to 1.31.0-wmf.24" and the T189057 stuff related to RB storage perf tuning, but it could be neither of those. 
[21:55:31] T189057: Understand (and if possible, improve) performance of new storage strategy - https://phabricator.wikimedia.org/T189057 [21:57:24] (PS2) Dzahn: deploy1001: test with jessie installer [puppet] - https://gerrit.wikimedia.org/r/417454 (https://phabricator.wikimedia.org/T175288) [22:00:45] (PS2) Rush: openstack: nova seed for db setup [puppet] - https://gerrit.wikimedia.org/r/417333 (https://phabricator.wikimedia.org/T188266) [22:00:47] (CR) Dzahn: [C: 2] deploy1001: test with jessie installer [puppet] - https://gerrit.wikimedia.org/r/417454 (https://phabricator.wikimedia.org/T175288) (owner: Dzahn) [22:02:18] (PS3) Rush: openstack: nova seed for db setup [puppet] - https://gerrit.wikimedia.org/r/417333 (https://phabricator.wikimedia.org/T188266) [22:02:52] (CR) Rush: [C: 2] openstack: nova seed for db setup [puppet] - https://gerrit.wikimedia.org/r/417333 (https://phabricator.wikimedia.org/T188266) (owner: Rush) [22:06:00] (PS1) Rush: wip openstack: nova-compute jessie mitaka setup [puppet] - https://gerrit.wikimedia.org/r/417455 (https://phabricator.wikimedia.org/T188266) [22:07:52] !log guess what? 
trying T187516 again [22:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:47] !log reedy@tin Synchronized php-1.31.0-wmf.24/includes/specials/pagers/BlockListPager.php: T189251 (duration: 00m 59s) [22:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:03] T189251: Notice: Undefined property: stdClass::$ipb_by_text in /srv/mediawiki/php-1.31.0-wmf.24/includes/specials/pagers/BlockListPager.php - https://phabricator.wikimedia.org/T189251 [22:20:05] Operations, Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#4036787 (ayounsi) [22:20:21] Operations, Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#2962044 (ayounsi) [22:26:13] (PS1) Ayounsi: Rancid: Add eqsin devices [puppet] - https://gerrit.wikimedia.org/r/417458 [22:26:25] (PS1) Herron: add puppetdb-termini pacakge require in puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/417459 (https://phabricator.wikimedia.org/T177253) [22:27:03] (CR) jerkins-bot: [V: -1] add puppetdb-termini pacakge require in puppetmaster module [puppet] - https://gerrit.wikimedia.org/r/417459 (https://phabricator.wikimedia.org/T177253) (owner: Herron) [22:31:54] !log set compression chunk length to 32, parsoid tables (group "commons") - T189057 [22:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:10] T189057: Understand (and if possible, improve) performance of new storage strategy - https://phabricator.wikimedia.org/T189057 [22:36:24] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler02/10383/" [puppet] - https://gerrit.wikimedia.org/r/417459 (https://phabricator.wikimedia.org/T177253) (owner: Herron) [22:37:37] Operations, ops-eqsin, Traffic, netops: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4036874 (ayounsi) The IPv6 issue seems to 
be on the Atlas, most likely not configured yet. From the router interface I can't ping its global IP: `ping 2001:df2:e500:201:103:102:166:2... [22:38:51] (CR) Ayounsi: [C: 2] Rancid: Add eqsin devices [puppet] - https://gerrit.wikimedia.org/r/417458 (owner: Ayounsi) [22:46:59] (PS2) Rush: openstack: nova-compute jessie mitaka setup [puppet] - https://gerrit.wikimedia.org/r/417455 (https://phabricator.wikimedia.org/T188266) [22:47:38] (PS9) Paladox: Phabricator: Support php 7.1 under stretch [puppet] - https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [22:47:45] (PS3) Rush: openstack: nova-compute jessie mitaka setup [puppet] - https://gerrit.wikimedia.org/r/417455 (https://phabricator.wikimedia.org/T188266) [22:50:01] Operations, ops-eqiad, Patch-For-Review, Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4036925 (Dzahn) I have tried with manual partitioning, i have tried with jessie instead of stretch, i have trie... 
[23:02:25] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Level3 link esams-eqiad down, techs onsite [23:02:25] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi Level3 link esams-eqiad down, techs onsite [23:10:13] !log set compression chunk length to 32, parsoid tables (group "wikipedia") - T189057 [23:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:29] T189057: Understand (and if possible, improve) performance of new storage strategy - https://phabricator.wikimedia.org/T189057 [23:17:28] (PS1) Dzahn: netboot: temp remove bast1002 for debugging [puppet] - https://gerrit.wikimedia.org/r/417463 (https://phabricator.wikimedia.org/T186623) [23:19:14] (PS2) Dzahn: netboot: temp remove bast1002 for debugging [puppet] - https://gerrit.wikimedia.org/r/417463 (https://phabricator.wikimedia.org/T186623) [23:19:21] (CR) Dzahn: [C: 2] netboot: temp remove bast1002 for debugging [puppet] - https://gerrit.wikimedia.org/r/417463 (https://phabricator.wikimedia.org/T186623) (owner: Dzahn) [23:23:23] (PS10) Paladox: Phabricator: Support php 7.2 under stretch [puppet] - https://gerrit.wikimedia.org/r/410245 (https://phabricator.wikimedia.org/T182832) [23:34:39] Operations, Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#4037082 (Dzahn) I booted it without partman recipe to get into manual partitioning. And, as opposed to deploy1001, this time it worked for some reason. It went past the partitioning and is now instal... 
[23:35:07] (PS1) Dzahn: Revert "netboot: temp remove bast1002 for debugging" [puppet] - https://gerrit.wikimedia.org/r/417466 [23:36:08] (CR) Dzahn: [C: 2] Revert "netboot: temp remove bast1002 for debugging" [puppet] - https://gerrit.wikimedia.org/r/417466 (owner: Dzahn) [23:36:37] (PS4) Madhuvishy: dumps: Rename and move distribution hiera rsync config [puppet] - https://gerrit.wikimedia.org/r/416980 (https://phabricator.wikimedia.org/T188727) [23:36:56] (PS1) Dzahn: Revert "deploy1001: test with jessie installer" [puppet] - https://gerrit.wikimedia.org/r/417467 [23:36:59] (PS1) Dzahn: Revert "netboot: temp remove deploy hosts for debugging" [puppet] - https://gerrit.wikimedia.org/r/417468 [23:37:35] (CR) Madhuvishy: [C: 2] dumps: Rename and move distribution hiera rsync config [puppet] - https://gerrit.wikimedia.org/r/416980 (https://phabricator.wikimedia.org/T188727) (owner: Madhuvishy) [23:38:22] (CR) Dzahn: [C: 2] Revert "deploy1001: test with jessie installer" [puppet] - https://gerrit.wikimedia.org/r/417467 (owner: Dzahn) [23:38:29] (PS2) Dzahn: Revert "deploy1001: test with jessie installer" [puppet] - https://gerrit.wikimedia.org/r/417467 [23:39:00] uhhh mutante did you merge my patch maybe? :) [23:39:58] (CR) Dzahn: [C: 2] Revert "netboot: temp remove deploy hosts for debugging" [puppet] - https://gerrit.wikimedia.org/r/417468 (owner: Dzahn) [23:40:10] (PS2) Dzahn: Revert "netboot: temp remove deploy hosts for debugging" [puppet] - https://gerrit.wikimedia.org/r/417468 [23:40:14] madhuvishy: yes, i did [23:40:29] mutante: okay cool thanks! 
[23:40:30] i got the "multiple" screen [23:40:33] yw [23:40:56] yeah I went and found no patches so was just checking :) [23:41:28] i figured just do it :) [23:49:19] (PS1) Dzahn: fix erroneous records for bast1002 [dns] - https://gerrit.wikimedia.org/r/417469 (https://phabricator.wikimedia.org/T186623) [23:49:30] (CR) jerkins-bot: [V: -1] fix erroneous records for bast1002 [dns] - https://gerrit.wikimedia.org/r/417469 (https://phabricator.wikimedia.org/T186623) (owner: Dzahn) [23:50:09] (PS2) Dzahn: fix erroneous records for bast1002 [dns] - https://gerrit.wikimedia.org/r/417469 (https://phabricator.wikimedia.org/T186623) [23:51:06] (CR) Dzahn: [C: 2] fix erroneous records for bast1002 [dns] - https://gerrit.wikimedia.org/r/417469 (https://phabricator.wikimedia.org/T186623) (owner: Dzahn)