[00:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Evening SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191205T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:15] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:01:55] NO SWAT.
[00:03:07] brennen: you get a t-shirt :)
[00:03:22] i was really not hoping for today to be a t-shirt day
[00:03:38] sorry man, I know it's not fun :/
[00:03:40] * brennen drafting a train status mail and a phab ticket currently
[00:03:48] On behalf of those of us with the shirt already, sorry.
[00:04:05] (and calling the train blocked for the day, i think.)
[00:04:32] Very much so, yes.
[00:04:45] yup
[00:08:08] brennen: welcome to the club ;)
[00:08:45] if you think that's fun you can join the phabricator breakers club though we don't yet have a t-shirt
[00:10:24] twentyafterfour: It's more like a suit of armour for you hallowed few, right?
;-)
[00:12:13] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:12:13] yeah
[00:12:45] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[00:13:14] still? why are these not gone?
[00:13:28] (also why am I not gone but that's a different matter)
[00:13:59] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:14:39] apergos: that one's been flapping all afternoon
[00:14:40] What's the threshold set to? Values right now are ~400ms which feels elevated.
[00:16:21] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[00:18:07] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.06667 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[00:18:19] mail sent.
[00:20:33] Thank you, brennen!
[00:20:53] thanks for the assistance all afternoon!
[00:30:05] my model of this is now that a similar incident was probably beginning with the earlier attempted deploy, and parsoid logspam was in fact something of a red herring.
[00:33:23] brennen: that's what I was thinking as well
[00:33:29] and that's why logspam is bad
[00:33:50] so we are now preparing to take phabricator offline for a second round of server maintenance
[00:34:01] should be fairly short this time, I hope
[00:36:24] well.. this one yes.. but then we have to rsync from scratch
[00:38:34] (PS2) Dzahn: Revert "switch discovery record for phabricator to 1001 for ATS" [dns] - https://gerrit.wikimedia.org/r/554589
[00:38:57] (CR) Dzahn: [C: +2] Revert "switch discovery record for phabricator to 1001 for ATS" [dns] - https://gerrit.wikimedia.org/r/554589 (owner: Dzahn)
[00:39:46] (PS2) Dzahn: phabricator: switch prod server to phab1003, enables dumps and ferm holes [puppet] - https://gerrit.wikimedia.org/r/554644 (https://phabricator.wikimedia.org/T238956)
[00:40:04] (CR) Dzahn: [C: +2] phabricator: switch prod server to phab1003, enables dumps and ferm holes [puppet] - https://gerrit.wikimedia.org/r/554644 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn)
[00:42:43] (PS2) Dzahn: Revert "varnish: switch phabricator backend to phab1001" [puppet] - https://gerrit.wikimedia.org/r/554590
[00:42:50] (CR) Dzahn: [C: +2] Revert "varnish: switch phabricator backend to phab1001" [puppet] - https://gerrit.wikimedia.org/r/554590 (owner: Dzahn)
[00:43:02] !log cwhite@cumin1001 dbctl commit (dc=all): 'Depool db1062 T239874', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20191205-004256-cwhite.json
[00:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:50] (PS2) Dzahn: Revert "phabricator: switch mail destination to phab1001" [puppet] - https://gerrit.wikimedia.org/r/554591
[00:44:53] running puppet on cp-eqiad
[00:45:25] merged DNS and puppet switch at the same time
[00:45:36] (CR) Dzahn: [C: +2] Revert "phabricator: switch mail destination to phab1001" [puppet] - https://gerrit.wikimedia.org/r/554591 (owner: Dzahn)
[00:46:04] running puppet on mx1001
[00:46:17] (PS2) Dzahn: Revert "dumps/phabricator: switch dumps host from phab1003 to phab1001" [puppet] - https://gerrit.wikimedia.org/r/554592
[00:47:43] (CR) Dzahn: [C: +2] Revert "dumps/phabricator: switch dumps host from phab1003 to phab1001" [puppet] - https://gerrit.wikimedia.org/r/554592 (owner: Dzahn)
[00:48:39] running puppet on dumps
[00:49:11] ferm rules on phab1001 were removed and added on phab1003
[00:49:43] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=phab1001-vcs.eqiad.wmnet
[00:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:52] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=phab1001-vcs.eqiad.wmnet
[00:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:29] twentyafterfour: forgot one change at this point .. conftool data.. coming up
[00:51:22] or we can just leave it during this time
[00:51:33] since it has an issue anyways and we will go back
[00:51:40] let me start the rsync instead to save time
[00:51:56] if it looks good to you in the UI
[00:52:37] waits another few minutes for that DNS change to propagate
[00:53:03] Operations, DC-Ops, hardware-requests: Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (thcipriani)
[01:00:04] twentyafterfour: How many deployers does it take to do Phabricator update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191205T0100).
[01:00:30] jouncebot: 9?
[01:02:13] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Danny Horn - https://phabricator.wikimedia.org/T239881 (DannyS712)
[01:02:23] jouncebot: no, we'll reinstall phab1001 now
[01:03:15] !log phabricator switched back to phab1003 - reimaging phab1001 now
[01:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:04:49] !log telling phab1001 to boot into BIOS next time it boots via mgmt console (https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN30#Reboot_and_boot_into_BIOS_then_console)
[01:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:32] !log phab1001 - powercycling
[01:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:41] !log phab1001 - System BIOS Settings > SATA Settings > Embedded SATA: switch from ATA to AHCI mode (T238956)
[01:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:07:46] T238956: switch prod Phabricator from phab1003 to phab1001 - https://phabricator.wikimedia.org/T238956
[01:09:05] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Danny Horn - https://phabricator.wikimedia.org/T239881 (Aklapper) @DannyH: Hi, see https://phabricator.wikimedia.org/tag/ldap-access-requests/ : Do you already have shell access? Where do you expect to be able to login (icinga, graphite,...
[01:10:24] as things are stable, i am going to go eat some tacos. good luck with phabricator, mutante & twentyafterfour.
[01:12:03] !log phab1001 - enabling Write Cache in BIOS
[01:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:07] (PS1) Cwhite: hiera: update ores statsd exporter mappings [puppet] - https://gerrit.wikimedia.org/r/554656 (https://phabricator.wikimedia.org/T233448)
[01:19:11] !log phab1001 back, still in legacy ide mode
[01:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:31] aka wtf lol bbq
[01:19:42] (CR) jerkins-bot: [V: -1] hiera: update ores statsd exporter mappings [puppet] - https://gerrit.wikimedia.org/r/554656 (https://phabricator.wikimedia.org/T233448) (owner: Cwhite)
[01:20:52] (PS2) Cwhite: hiera: update ores statsd exporter mappings [puppet] - https://gerrit.wikimedia.org/r/554656 (https://phabricator.wikimedia.org/T233448)
[01:21:40] !log phab1001 - rebooting to BIOS once more - "The settings were saved successfully."
[01:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:22:42] Initializing Serial ATA devices...
[01:23:29] and..it's booting from PXE
[01:23:50] that already gets it the Debian installer
[01:24:01] but i'll redo it with the reimage script anyways
[01:25:32] (now the bot usually says that i did that)
[01:25:57] hmm
[01:26:18] dzahn@cumin1001:~$ sudo -i wmf-auto-reimage-host -p T238956 phab1001.eqiad.wmnet
[01:26:18] T238956: switch prod Phabricator from phab1003 to phab1001 - https://phabricator.wikimedia.org/T238956
[01:27:29] the installer is running.. watching it
[01:27:49] oh look, wikibugs talked in the other channel, just not here
[01:29:05] formatted the drives..
no problem
[01:29:24] :)
[01:37:46] (PS1) Dzahn: Revert "Revert "dumps/phabricator: switch dumps host from phab1003 to phab1001"" [puppet] - https://gerrit.wikimedia.org/r/554657
[01:37:52] (PS1) Dzahn: Revert "Revert "phabricator: switch mail destination to phab1001"" [puppet] - https://gerrit.wikimedia.org/r/554658
[01:37:58] (PS1) Dzahn: Revert "Revert "varnish: switch phabricator backend to phab1001"" [puppet] - https://gerrit.wikimedia.org/r/554659
[01:38:05] (PS1) Dzahn: Revert "phabricator: switch prod server to phab1003, enables dumps and ferm holes" [puppet] - https://gerrit.wikimedia.org/r/554660
[01:38:25] (PS1) Dzahn: Revert "Revert "switch discovery record for phabricator to 1001 for ATS"" [dns] - https://gerrit.wikimedia.org/r/554661
[01:39:23] Installing the base system .. 77%
[01:56:27] awww..man.. it went into reinstall loop
[01:56:39] lovely
[01:56:56] so the reimage script said "still waiting for reboot"
[01:57:03] and then fails
[01:57:15] and on console i see it is in PXE again and installer starting from scratch
[01:58:11] wtf
[01:58:15] when doing it manually you would have to give it the special command via DRAC to "boot into PXE but only ONCE"
[01:58:57] maybe we should ask for a new server or just keep phab1003 ;)
[01:59:11] sounds like this one has issues
[01:59:40] phab1003 is a little low on disk space but it's a much nicer server
[01:59:57] performance was really good after switching
[02:01:04] That’s because of php-fpm twentyafterfour :)
[02:01:31] paladox: I think the hardware might have helped too
[02:01:52] Oh
[02:02:01] Also php7 too :P
[02:02:12] i just updated the wikitech pages for these a bit
[02:04:42] twentyafterfour: yea..
we can ask to keep phab1003 but give back phab1001 instead of keeping both
[02:05:20] but also checking the install ..hmm
[02:09:01] told it via mgmt interface / racadm that first boot device is HDD (while install is going on)
[02:09:24] paladox: yea, all of that together but i wonder how much is each part
[02:16:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:18:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:27:02] !log phab1001 - fixed boot order in BIOS to boot only from HDD, back at login
[02:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:25] !log phab1001 - signed new puppet cert - initial puppet run in progress
[02:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:04:43] PROBLEM - Check whether microcode mitigations for CPU vulnerabilities are applied on phab1001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {flush_l1d, ssbd, md_clear} https://wikitech.wikimedia.org/wiki/Microcode
[03:07:04] !log phab1001 - now using AHCI mode after reinstall, performance much better.
rsyncing /srv/repos from phab1003 again
[03:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:15:03] ACKNOWLEDGEMENT - Check whether microcode mitigations for CPU vulnerabilities are applied on phab1001 is CRITICAL: CRITICAL - Server is missing the following CPU flags: {flush_l1d, ssbd, md_clear} daniel_zahn docs say a reboot should fix it https://wikitech.wikimedia.org/wiki/Microcode
[03:29:29] !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@UNKNOWN]: deploy release/2019-08-22/1 to phab1001
[03:29:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:29:50] !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@UNKNOWN]: deploy release/2019-08-22/1 to phab1001 (duration: 00m 22s)
[03:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:30:47] !log twentyafterfour@deploy1001 Started deploy [phabricator/deployment@UNKNOWN]: deploy release/2019-08-22/1 to phab1001
[03:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:32:23] !log twentyafterfour@deploy1001 Finished deploy [phabricator/deployment@UNKNOWN]: deploy release/2019-08-22/1 to phab1001 (duration: 01m 36s)
[03:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:37:04] !log leaving phabricator on phab1003 for tonight while phab1001 raid syncs, will pick it up tomorrow to decide where to go from here
[03:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:54:31] Operations, ops-eqiad: (No Need By Date Provided) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (Andrew) *bump*
[04:42:39] (PS10) Andrew Bogott: Neutron l3_agent: remove external_network_bridge config option [puppet] - https://gerrit.wikimedia.org/r/553883
[04:43:42] (CR) Andrew Bogott: [C: +2] Neutron l3_agent: remove external_network_bridge
config option [puppet] - https://gerrit.wikimedia.org/r/553883 (owner: Andrew Bogott)
[05:06:58] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:07:08] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:10:32] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:15:52] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:02] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:12] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:28:30] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:57:57] !log
marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3317 for upgrade', diff saved to https://phabricator.wikimedia.org/P9811 and previous config saved to /var/cache/conftool/dbconfig/20191205-055756-marostegui.json
[05:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:30] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[06:03:30] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:04:24] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:11:20] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:11:28] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:13:38] Operations, ops-eqiad, Analytics: Degraded RAID on dbstore1003 - https://phabricator.wikimedia.org/T239217 (Marostegui) It looks like disk #4: ` root@dbstore1003:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level...
[06:14:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3317, db1101:3318', diff saved to https://phabricator.wikimedia.org/P9812 and previous config saved to /var/cache/conftool/dbconfig/20191205-061453-marostegui.json
[06:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:16] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:29:24] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3317, db1101:3318', diff saved to https://phabricator.wikimedia.org/P9813 and previous config saved to /var/cache/conftool/dbconfig/20191205-063103-marostegui.json
[06:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:16] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:24] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:39:30] (PS1) Andrew Bogott: nova cloudvirt pool: update cloudvirt1024 comment to reflect hardware fixes [puppet] - https://gerrit.wikimedia.org/r/554671 (https://phabricator.wikimedia.org/T230289)
[06:41:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:41:01] (CR) Andrew Bogott: [C: +2] nova cloudvirt pool: update cloudvirt1024 comment to reflect hardware fixes [puppet] - https://gerrit.wikimedia.org/r/554671 (https://phabricator.wikimedia.org/T230289) (owner: Andrew Bogott)
[06:43:38] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:46] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:46:22] (PS1) Marostegui: mariadb: Decommission db2065 [puppet] - https://gerrit.wikimedia.org/r/554672 (https://phabricator.wikimedia.org/T239046)
[06:47:20] (PS1) Marostegui: wmnet: Remove production DNS for db2065 [dns] - https://gerrit.wikimedia.org/r/554673 (https://phabricator.wikimedia.org/T239046)
[06:48:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:48:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1101:3317, db1101:3318', diff saved to https://phabricator.wikimedia.org/P9814 and previous config saved to /var/cache/conftool/dbconfig/20191205-064845-marostegui.json
[06:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[06:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[06:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:43] (CR) Marostegui: [C: +2] mariadb: Decommission db2065 [puppet] - https://gerrit.wikimedia.org/r/554672 (https://phabricator.wikimedia.org/T239046) (owner: Marostegui)
[06:52:16] (CR) Marostegui: [C: +2] wmnet: Remove production DNS for db2065 [dns] - https://gerrit.wikimedia.org/r/554673 (https://phabricator.wikimedia.org/T239046) (owner: Marostegui)
[06:54:06] Operations, ops-codfw, DC-Ops, decommission: decommission db2065.codfw.wmnet - https://phabricator.wikimedia.org/T239046 (Marostegui) a:Marostegui→Papaul
[06:54:15] Operations, ops-codfw, DC-Ops, decommission: decommission db2065.codfw.wmnet - https://phabricator.wikimedia.org/T239046 (Marostegui) Host ready for @Papaul
[06:55:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1101:3317, db1101:3318', diff saved to https://phabricator.wikimedia.org/P9815 and previous config saved to /var/cache/conftool/dbconfig/20191205-065536-marostegui.json
[06:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:08] (CR) Marostegui: [C: +2] Revert "mediawiki: Stop
rebuildItermTerms temporary" [puppet] - https://gerrit.wikimedia.org/r/554583 (owner: Ladsgroup)
[07:06:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1099:3311, db1099:3318 for upgrade', diff saved to https://phabricator.wikimedia.org/P9816 and previous config saved to /var/cache/conftool/dbconfig/20191205-070631-marostegui.json
[07:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:40] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[07:14:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3311, db1099:3318', diff saved to https://phabricator.wikimedia.org/P9817 and previous config saved to /var/cache/conftool/dbconfig/20191205-071445-marostegui.json
[07:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:08] <_joe_> !log umounting /proc,/sys,/dev from /var/cache/pbuilder/build/cow.6815 on boron to allow reaping it away
[07:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:09] RECOVERY - Check systemd state on boron is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3311, db1099:3318', diff saved to https://phabricator.wikimedia.org/P9818 and previous config saved to /var/cache/conftool/dbconfig/20191205-072314-marostegui.json
[07:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:00] <_joe_> !log manually running package_builder_Clean_up_build_directory.service on boron
[07:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:13] ACKNOWLEDGEMENT - Check the last execution of package_builder_Clean_up_build_directory on boron is CRITICAL: CRITICAL: Status of the
systemd unit package_builder_Clean_up_build_directory Giuseppe Lavagetto process ran, will recover in 18 hours. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:27:19] RECOVERY - Check the last execution of package_builder_Clean_up_build_directory on boron is OK: OK: Status of the systemd unit package_builder_Clean_up_build_directory https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:28:39] RECOVERY - DPKG on kubestagetcd1001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:28:54] _joe_ nice fix, I tried yesterday but failed, there is a task about it!
[07:29:10] <_joe_> elukey: I just unmounted the bind mounts
[07:29:26] <_joe_> btw if we succeeded in removing the /dev directory, boron would need a hard reboot
[07:29:28] <_joe_> :P
[07:29:35] <_joe_> elukey: what's the task?
[07:29:52] <_joe_> !log ran apt-get install manually on kubestagetcd1001 to fix broken packages
[07:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:16] _joe_ yes unmounting /dev/ would be interesting :D
[07:30:18] https://phabricator.wikimedia.org/T237713
[07:32:07] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:32:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1099:3311, db1099:3318', diff saved to https://phabricator.wikimedia.org/P9819 and previous config saved to /var/cache/conftool/dbconfig/20191205-073209-marostegui.json
[07:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:13] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:35] PROBLEM - Router
interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:55] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 54, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:39:33] ok so cr2-eqiad <-> cr2-eqord is a Telia link under maintenance
[07:40:47] (CR) Giuseppe Lavagetto: [C: +2] wmflib: add inject_secret [puppet] - https://gerrit.wikimedia.org/r/553473 (owner: Giuseppe Lavagetto)
[07:41:06] <_joe_> there are two maintenances ongoing
[07:41:25] Telia also have another maintenance ongoing between cr1-eqiad and cr1-codfw
[07:41:26] <_joe_> the other one is cr1-eqiad - cr1-codfw I guessed
[07:41:41] yes but it seems not a scheduled maintenance
[07:42:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1099:3311, db1099:3318', diff saved to https://phabricator.wikimedia.org/P9820 and previous config saved to /var/cache/conftool/dbconfig/20191205-074200-marostegui.json
[07:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:36] out of 4 transport links between eqiad and codfw we have only 2 now
[07:42:57] the path through eqdfw (2 links in reality)
[07:43:07] and the Zayo transport
[07:43:58] IIRC it happened not long ago as well, netops told me that it is an acceptable situation if one of the links that is down is under maintenance and restored in a reasonable amount of time
[07:44:47] Interestingly, the physical link between cr1 eqiad/codfw seems up from the routers
[07:44:55] but the ospf session is down
[07:46:20] Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[07:51:24] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/19802/ as expected it's a noop right now" [puppet] - https://gerrit.wikimedia.org/r/553474 (owner: Giuseppe Lavagetto)
[08:00:11] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 56, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:01:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:03:52] !log remove logstash_cleanup_indices_apifeatureusage-search.svc.codfw.wmnet and logstash_cleanup_indices_apifeatureusage-search.svc.eqiad.wmnet from logstash1025,logstash1024,logstash1023,logstash2024,logstash2025 to reduce cronspam - T234854
[08:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:58] T234854: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854
[08:04:19] PROBLEM - BFD status on cr2-eqord is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:04:49] (PS1) Marostegui: instances.yaml: Remove db1062 from config [puppet] - https://gerrit.wikimedia.org/r/554678 (https://phabricator.wikimedia.org/T239188)
[08:05:09] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:06:02] (CR) Marostegui: [C: +2] instances.yaml: Remove db1062 from config [puppet] - https://gerrit.wikimedia.org/r/554678 (https://phabricator.wikimedia.org/T239188) (owner: Marostegui)
[08:08:12] Operations, Wikimedia-Logstash: Upgrade ELK Stack - https://phabricator.wikimedia.org/T234854 (elukey) Hello :) From cronspam I can see two errors that happen daily: ` Cron[logstash_cleanup_indices_logstash] elasticsearch.exceptions.ElasticsearchException: Unable to create client
connection to Elastic... [08:08:52] (PS2) Volans: zone_validator: better detection of mgmt ORIGINs [dns] - https://gerrit.wikimedia.org/r/554078 [08:09:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1062 from etcd T239188', diff saved to https://phabricator.wikimedia.org/P9821 and previous config saved to /var/cache/conftool/dbconfig/20191205-080909-marostegui.json [08:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:16] T239188: Decommission db1062.eqiad.wmnet - https://phabricator.wikimedia.org/T239188 [08:09:57] !log Upgrade pc2007, pc2008, pc2009, pc2010 [08:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:21] marostegui: do you miss deploying 100 times a day? [08:10:26] :D [08:10:33] elukey: I still do it, but without having to commit :) [08:12:35] (PS2) Muehlenhoff: Setup apt pinning for puppet 5 / facter 3 on stretch/jessie [puppet] - https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) [08:13:23] BFD status on cr2-eqord/eqiad seems to be flapping due to [08:13:24] Local diag: AdminDown Remote diag: None Reason: Received Upstream Destroy Session. [08:16:42] so I guess that the maintenance is not over yet, even if the interfaces are up [08:18:56] TIL show bfd session on juniper [08:25:55] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) [08:27:15] (CR) Volans: [C: +2] "Merging as this is needed in order to merge the other CRs with frack mgmt records." 
[dns] - https://gerrit.wikimedia.org/r/554078 (owner: Volans) [08:27:33] (PS2) Volans: frack: add missing asset tag management records [dns] - https://gerrit.wikimedia.org/r/554079 (https://phabricator.wikimedia.org/T239597) [08:28:13] (Abandoned) Andrew Bogott: wikitech: Don't load OpenStackManager [mediawiki-config] - https://gerrit.wikimedia.org/r/432702 (https://phabricator.wikimedia.org/T161553) (owner: Andrew Bogott) [08:30:03] (Abandoned) Andrew Bogott: wikitech: remove OpenStackManager private settings [puppet] - https://gerrit.wikimedia.org/r/432703 (https://phabricator.wikimedia.org/T161553) (owner: Andrew Bogott) [08:30:37] (CR) Ema: [C: +2] ATS: allow to toggle request coalescing [puppet] - https://gerrit.wikimedia.org/r/554558 (https://phabricator.wikimedia.org/T238494) (owner: Ema) [08:30:42] (Abandoned) Andrew Bogott: labtestwikitech: use eqiad db host (db1073) even from codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/429841 (https://phabricator.wikimedia.org/T192339) (owner: Andrew Bogott) [08:31:25] (Abandoned) Andrew Bogott: Allocate LVS service IPs for labs dns recursor [dns] - https://gerrit.wikimedia.org/r/284824 (https://phabricator.wikimedia.org/T119660) (owner: Andrew Bogott) [08:31:34] (Abandoned) Andrew Bogott: Add an lvs service ip (labs-ns.wikimedia.org) for the labs dns recursors [puppet] - https://gerrit.wikimedia.org/r/284829 (https://phabricator.wikimedia.org/T119660) (owner: Andrew Bogott) [08:34:19] (PS2) Muehlenhoff: Fix apt pinning for VP9-enabled ffmpeg build [puppet] - https://gerrit.wikimedia.org/r/554550 (https://phabricator.wikimedia.org/T239831) [08:35:36] (CR) Andrew Bogott: "I'm removing myself from a ton of these patches because they appear abandoned... 
feel free to re-add me if you want to bring them back to " [puppet] - https://gerrit.wikimedia.org/r/268279 (owner: Tim Landscheidt) [08:37:26] Operations, ops-eqiad, User-fgiunchedi: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (fgiunchedi) a:fgiunchedi→Cmjohnson Hosts are fully in service now! >>! In T237438#5692635, @fgiunchedi wrote: > @Cmjohnson @Jclark-ctr I'm not b... [08:37:52] (CR) Andrew Bogott: "Is this still useful/meaningful?" [puppet] - https://gerrit.wikimedia.org/r/406484 (https://phabricator.wikimedia.org/T180377) (owner: Paladox) [08:38:38] !log Upgrade db2078 [08:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:01] Operations, ops-eqiad, User-fgiunchedi: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (fgiunchedi) [08:39:57] (CR) Andrew Bogott: [C: +1] "If/when this is merged we need to update https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas to reflect the chang" [dns] - https://gerrit.wikimedia.org/r/534573 (https://phabricator.wikimedia.org/T231520) (owner: Marostegui) [08:42:02] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) a:jbond→Marostegui [08:45:33] (CR) Muehlenhoff: [C: +2] Fix apt pinning for VP9-enabled ffmpeg build [puppet] - https://gerrit.wikimedia.org/r/554550 (https://phabricator.wikimedia.org/T239831) (owner: Muehlenhoff) [08:59:21] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/554549 (https://phabricator.wikimedia.org/T239832) (owner: Muehlenhoff) [09:03:36] Operations, Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (elukey) >>! 
In T212697#5175571, @elukey wrote: > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=928927 > > Once this gets resolved and a new version of uwsgi is... [09:07:21] !log Upgrade db2094 and db2095 [09:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:46] Operations, Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (MoritzMuehlenhoff) I'll reach out to the maintainer for a Buster backport. [09:10:00] Operations, Performance-Team (Radar): unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet - https://phabricator.wikimedia.org/T239862 (akosiaris) p:Triage→Low [09:10:47] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) [09:18:06] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb deamons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) [09:41:55] (PS1) Ema: ATS: disable text@esams origin server request coalescing [puppet] - https://gerrit.wikimedia.org/r/554808 (https://phabricator.wikimedia.org/T238494) [09:42:47] (PS4) Effie Mouzeli: php::admin include APCu fragmentation percentage metrics [puppet] - https://gerrit.wikimedia.org/r/554507 [09:43:57] (PS5) Effie Mouzeli: php::admin include APCu fragmentation percentage metrics [puppet] - https://gerrit.wikimedia.org/r/554507 [09:44:09] (PS1) Muehlenhoff: Extend pinning for ffmpeg with libav libraries [puppet] - https://gerrit.wikimedia.org/r/554809 [09:54:17] !log text@esams: disable ats-be origin server request coalescing T238494 [09:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:23] T238494: 15% response start regression starting around 2019-11-11 - https://phabricator.wikimedia.org/T238494 [10:06:24] (CR) Muehlenhoff: 
[C: +2] Add CAS authentication to debmonitor [software/debmonitor] - https://gerrit.wikimedia.org/r/554273 (owner: Muehlenhoff) [10:09:32] (PS1) Muehlenhoff: Also extend package declaration for ffmpeg with libav packages [puppet] - https://gerrit.wikimedia.org/r/554811 [10:09:44] (CR) DCausse: [C: +1] search: decommission elastic10[18-31].eqiad.wmnet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/554517 (https://phabricator.wikimedia.org/T239821) (owner: Gehel) [10:13:44] !log oblivian@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [10:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:02] (PS2) Muehlenhoff: Also extend package declaration for ffmpeg with libav packages [puppet] - https://gerrit.wikimedia.org/r/554811 [10:16:39] (CR) Muehlenhoff: [C: +2] Also extend package declaration for ffmpeg with libav packages [puppet] - https://gerrit.wikimedia.org/r/554811 (owner: Muehlenhoff) [10:16:58] (PS1) TechneSiyam: I am TechneSiyam.Doing this for GCI Task. MY Full Name: MD Abu Siyam [mediawiki-config] - https://gerrit.wikimedia.org/r/554812 [10:20:32] (PS3) Muehlenhoff: Add image tracking support [software/debmonitor] - https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) [10:21:26] Operations, SRE-swift-storage, serviceops, Patch-For-Review, and 2 others: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (fgiunchedi) I've investigated a bit the scope and impact of this issue, namely by joining the transactions ID... 
[10:25:36] (PS6) Effie Mouzeli: php::admin include APCu fragmentation percentage metrics [puppet] - https://gerrit.wikimedia.org/r/554507 [10:26:23] Operations, netops: BGP peering sessions with corp partially down in ulsfo - https://phabricator.wikimedia.org/T239893 (elukey) [10:26:36] !log reimage mw2260.codfw.wmnet [10:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:39] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2260.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201912051027_jiji_93252.log`. [10:27:55] (CR) Hashar: "recheck debian-glue host migrated to Buster (update pristine-tar and fix autopkgtest) T224943" [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/553780 (https://phabricator.wikimedia.org/T239482) (owner: Hashar) [10:29:19] (CR) Majavah: [C: -1] "Could you make the commit message follow the guidelines described in https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines?" [mediawiki-config] - https://gerrit.wikimedia.org/r/554812 (owner: TechneSiyam) [10:31:19] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:34:36] (CR) Giuseppe Lavagetto: [C: +1] php::admin include APCu fragmentation percentage metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/554507 (owner: Effie Mouzeli) [10:35:30] !log oblivian@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'blubberoid' for release 'production' . 
[10:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:55] (PS7) Effie Mouzeli: php::admin include APCu fragmentation percentage metrics [puppet] - https://gerrit.wikimedia.org/r/554507 [10:35:59] Operations, Traffic, observability: 'LVS connections' graph on Load Balancers dashboard takes a rate of a gauge - https://phabricator.wikimedia.org/T236700 (fgiunchedi) Open→Resolved a:fgiunchedi Fixed now and 'load balancers' dashboard adjusted [10:36:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:37:41] Operations, Traffic: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432 (fgiunchedi) >>! In T174432#3565169, @ema wrote: >>>! In T174432#3562830, @BBlack wrote: >> Are the non-icmp graphs somehow LVS-specific? > > Yes, the metrics are: node_ipvs_backend... [10:38:58] !log oblivian@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . [10:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:25] Looking at logs it seems plwikibooks started to have a really bad lua [10:39:39] https://logstash.wikimedia.org/goto/4b5e753eb63d514ad4eb43168a2b50b9 [10:41:02] I'm going to deploy something that makes some noise in logstash sorry [10:41:18] it's to debug an important bug [10:41:59] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:44:40] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [10:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:12] (CR) Effie Mouzeli: "OK https://puppet-compiler.wmflabs.org/compiler1002/19804/" [puppet] - https://gerrit.wikimedia.org/r/554507 (owner: Effie Mouzeli) [10:46:16] (CR) Effie Mouzeli: [C: +2] php::admin include APCu fragmentation percentage metrics [puppet] - https://gerrit.wikimedia.org/r/554507 (owner: Effie Mouzeli) [10:47:04] Operations, netops: Facebook BGP peering links down in ulsfo - https://phabricator.wikimedia.org/T239896 (elukey) [10:48:44] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:51] The deployment is postponed to the next two hours [11:02:38] Operations: wmf-auto-reimage errors: failure to downtime (w/ no rename), pytho gc whine - https://phabricator.wikimedia.org/T239897 (ArielGlenn) p:Triage→Normal [11:05:55] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2260.codfw.wmnet'] ` and were **ALL** successful. 
[11:28:57] (PS1) Muehlenhoff: Add exec for apt-get update for ffmpeg component/new job runner installs [puppet] - https://gerrit.wikimedia.org/r/554822 [11:30:21] (CR) Jbond: [C: +2] backup::host: use fqdn_rand_string for password generation [puppet] - https://gerrit.wikimedia.org/r/547569 (https://phabricator.wikimedia.org/T221083) (owner: Jbond) [11:40:08] (CR) Effie Mouzeli: "Please ping me when to merge this :)" [puppet] - https://gerrit.wikimedia.org/r/554599 (owner: Krinkle) [11:41:18] (PS8) Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [11:41:20] (PS1) Jbond: CI - black: update python3 files with black [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) [11:41:22] (PS1) Jbond: CI - black: run black over python2 files [puppet] - https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) [11:43:51] (CR) Jbond: "First pass at the black reformating. sorry :(" [puppet] - https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) (owner: Jbond) [11:44:30] (CR) Jbond: "First pass at the black reformatting. 
sorry :(" [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) (owner: Jbond) [11:44:44] (CR) jerkins-bot: [V: -1] CI - black: update python3 files with black [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) (owner: Jbond) [11:44:48] Operations, DBA, Wikimedia-Incident: Disallow 'weight: 0' for MW db config in dbctl - https://phabricator.wikimedia.org/T239901 (Krinkle) [11:45:13] Operations: wmf-auto-reimage errors: failure to downtime (w/ no rename), pytho gc whine - https://phabricator.wikimedia.org/T239897 (Volans) For the first one the downtime cookbook failed to run puppet on the Icinga active host to get the definitions of the reimaged hosts to downtime. Given how much puppet... [11:45:14] (CR) jerkins-bot: [V: -1] CI - black: run black over python2 files [puppet] - https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) (owner: Jbond) [11:45:22] (CR) Krinkle: [C: -1] "I'm blocked on Beta being fixed so that we can test it there first since it's a fairly major change." 
[puppet] - https://gerrit.wikimedia.org/r/554599 (owner: Krinkle) [11:45:26] (CR) jerkins-bot: [V: -1] CI - taskgen: add black tests for python2 and python3 files [puppet] - https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: Jbond) [11:47:53] Operations, DBA, Wikimedia-Incident: Disallow 'weight: 0' for MW db config in dbctl - https://phabricator.wikimedia.org/T239901 (Krinkle) [11:48:50] well thats embarissing seems flake8 dosn't like black [11:49:20] (PS2) Muehlenhoff: Add exec for apt-get update for ffmpeg component/new job runner installs [puppet] - https://gerrit.wikimedia.org/r/554822 [11:49:27] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:50:02] jbond42: it's compatible AFAIK with a couple of caveats, see their readme [11:50:53] volans: reading now [11:52:59] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191205T1200). [12:00:04] No GERRIT patches in the queue for this window AFAICS. 
[12:00:21] (PS2) Jbond: CI - black: update python3 files with black [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) [12:00:23] (PS2) Jbond: CI - black: run black over python2 files [puppet] - https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) [12:00:25] (PS9) Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [12:00:27] (PS1) Jbond: ci - flake8: add ignore E203 in preperation for black transformation [puppet] - https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T221083) [12:00:29] (CR) Urbanecm: [C: -1] "Please name the files PROJECTCODE-1.5x.png and PROJECTCODE-2x.png, as instructed in https://codein.withgoogle.com/tasks/5189998823866368/." [mediawiki-config] - https://gerrit.wikimedia.org/r/554812 (owner: TechneSiyam) [12:00:39] (CR) Urbanecm: [C: -1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/554812 (owner: TechneSiyam) [12:03:31] (CR) jerkins-bot: [V: -1] CI - black: update python3 files with black [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) (owner: Jbond) [12:04:12] (CR) jerkins-bot: [V: -1] CI - black: run black over python2 files [puppet] - https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) (owner: Jbond) [12:06:42] (CR) jerkins-bot: [V: -1] CI - taskgen: add black tests for python2 and python3 files [puppet] - https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: Jbond) [12:07:03] (CR) Effie Mouzeli: [C: +1] "Ok let's try it!" 
[puppet] - https://gerrit.wikimedia.org/r/554822 (owner: Muehlenhoff) [12:07:50] (PS3) Muehlenhoff: Switch ldap-corp.codfw.wikimedia.org to ldap-corp2001 [dns] - https://gerrit.wikimedia.org/r/553323 (https://phabricator.wikimedia.org/T224557) [12:09:44] (PS1) Andrew Bogott: Openstack: move eqiad1 to version 'ocata' [puppet] - https://gerrit.wikimedia.org/r/554829 (https://phabricator.wikimedia.org/T237749) [12:11:59] (PS1) Urbanecm: Fix namespace name [mediawiki-config] - https://gerrit.wikimedia.org/r/554830 (https://phabricator.wikimedia.org/T239547) [12:13:18] (CR) Urbanecm: [C: +2] Fix namespace name [mediawiki-config] - https://gerrit.wikimedia.org/r/554830 (https://phabricator.wikimedia.org/T239547) (owner: Urbanecm) [12:14:12] (Merged) jenkins-bot: Fix namespace name [mediawiki-config] - https://gerrit.wikimedia.org/r/554830 (https://phabricator.wikimedia.org/T239547) (owner: Urbanecm) [12:16:34] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 6c9d168: Fix namespace name - napwikisource (T239547) (duration: 01m 02s) [12:16:35] (CR) Effie Mouzeli: [C: +2] Add exec for apt-get update for ffmpeg component/new job runner installs [puppet] - https://gerrit.wikimedia.org/r/554822 (owner: Muehlenhoff) [12:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:41] T239547: New Translation namespace for nap.wikisource - https://phabricator.wikimedia.org/T239547 [12:16:54] (PS3) Effie Mouzeli: Add exec for apt-get update for ffmpeg component/new job runner installs [puppet] - https://gerrit.wikimedia.org/r/554822 (owner: Muehlenhoff) [12:17:23] (CR) Muehlenhoff: [C: +2] Switch ldap-corp.codfw.wikimedia.org to ldap-corp2001 [dns] - https://gerrit.wikimedia.org/r/553323 (https://phabricator.wikimedia.org/T224557) (owner: Muehlenhoff) [12:19:50] Operations, serviceops: VP9-enabled ffmpeg doesn't get installed after reimage of mw job 
runner/video scaler - https://phabricator.wikimedia.org/T239831 (jijiki) @brion It looks like all our videoscalers were lacking VP9 codec support. What do you think we should do ? [12:21:13] (PS1) Giuseppe Lavagetto: blubberoid: break tls fucntionality into an helper [deployment-charts] - https://gerrit.wikimedia.org/r/554832 [12:21:16] (PS1) Giuseppe Lavagetto: scaffold: import the blubberoid tls helpers in scaffold [deployment-charts] - https://gerrit.wikimedia.org/r/554833 [12:21:17] (PS1) Giuseppe Lavagetto: eventgate: convert to use the common tls templates [deployment-charts] - https://gerrit.wikimedia.org/r/554834 [12:21:29] (CR) jerkins-bot: [V: -1] eventgate: convert to use the common tls templates [deployment-charts] - https://gerrit.wikimedia.org/r/554834 (owner: Giuseppe Lavagetto) [12:21:56] !log Reimage mw2261.codfw.wmnet [12:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:18] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2261.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201912051222_jiji_116344.log`. [12:24:11] (CR) Andrew Bogott: "An upgrade window for this is scheduled for December 12th." 
[puppet] - https://gerrit.wikimedia.org/r/554829 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott) [12:26:48] (PS2) Jbond: ci - flake8: update flake8 rules to be compatible with black [puppet] - https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T221083) [12:26:53] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:27:18] (PS1) Krinkle: profiler: Remove arclamp.php (was for HHVM) [mediawiki-config] - https://gerrit.wikimedia.org/r/554836 (https://phabricator.wikimedia.org/T235142) [12:29:10] (PS3) Jbond: CI - black: update python3 files with black [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) [12:29:20] (PS3) Jbond: CI - black: run black over python2 files [puppet] - https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) [12:29:30] (PS10) Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [12:31:23] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:32:12] (CR) jerkins-bot: [V: -1] CI - black: update python3 files with black [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) (owner: Jbond) [12:32:52] (CR) jerkins-bot: [V: -1] CI - black: run black over python2 files [puppet] - https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) (owner: Jbond) [12:33:51] (CR) jerkins-bot: [V: -1] CI - taskgen: add black tests for python2 and python3 files [puppet] - https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: Jbond) 
[12:39:19] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [12:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:08] (PS1) Jcrespo: prometheus-bacula-exporter: Ignore jobs without a start time [puppet] - https://gerrit.wikimedia.org/r/554840 (https://phabricator.wikimedia.org/T234900) [12:41:23] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:39] (CR) Jcrespo: [C: +2] prometheus-bacula-exporter: Ignore jobs without a start time [puppet] - https://gerrit.wikimedia.org/r/554840 (https://phabricator.wikimedia.org/T234900) (owner: Jcrespo) [12:48:25] (PS3) Krinkle: webperf: switch xhgui_host from tungsten to xhgui1001 [puppet] - https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) (owner: Dzahn) [12:49:24] (PS2) Andrew Bogott: Openstack: move eqiad1 to version 'ocata' [puppet] - https://gerrit.wikimedia.org/r/554829 (https://phabricator.wikimedia.org/T237749) [12:49:26] (PS1) Andrew Bogott: Openstack codfw1dev: everything is ocata now [puppet] - https://gerrit.wikimedia.org/r/554842 (https://phabricator.wikimedia.org/T237749) [12:50:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:53:19] jynus: ^ \o/ \o/ [12:54:19] o wow [12:54:25] I didn't know that caused an alert [12:54:38] I thought it was ok to have a job in testing [12:54:56] if I knew it caused issues I would have delayed productionization [12:57:47] (CR) Andrew Bogott: "puppet compiler agrees that this is the no-op it should be: https://puppet-compiler.wmflabs.org/compiler1001/19806/" [puppet] - https://gerrit.wikimedia.org/r/554842 (https://phabricator.wikimedia.org/T237749) (owner: Andrew Bogott) [12:58:40] (PS1) Filippo Giunchedi: prometheus: drop Logstash 7 unnamed plugins [puppet] - https://gerrit.wikimedia.org/r/554843 [13:00:24] jynus: no issues, the alert was recently introduced is there exactly to catch cases like this [13:01:14] <_joe_> !log restarted mtail on mw1239 [13:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:30] <_joe_> godog: mtail seems stuck on a few appservers [13:01:51] <_joe_> mw1240 is still up with mtail in this "stuck" state if you want to take a look [13:02:04] <_joe_> and it can't stop cleanly it seems [13:02:46] _joe_: sure I'll take a quick look [13:03:04] but yeah mw1239 and mw2261 are down too, I'm looking at https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:03:27] <_joe_> 1239 should be up soon [13:03:36] Operations, serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2261.codfw.wmnet'] ` and were **ALL** successful. 
[13:03:37] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:05:07] (PS1) Phamhi: wmcs: make cloudmetrics1002 the primary instead of labmon1001 [puppet] - https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) [13:06:08] (PS1) TechneSiyam: This is for GCI Task of TechneSiyam. [mediawiki-config] - https://gerrit.wikimedia.org/r/554845 [13:07:25] (PS3) Jbond: ci - flake8: update flake8 rules to be compatible with black [puppet] - https://gerrit.wikimedia.org/r/554827 (https://phabricator.wikimedia.org/T221083) [13:07:46] (PS4) Jbond: CI - black: update python3 files with black [puppet] - https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) [13:07:55] (PS4) Jbond: CI - black: run black over python2 files [puppet] - https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) [13:08:07] (PS11) Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [13:09:14] !log bounce mtail on mw1240 [13:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:20] (CR) Arturo Borrero Gonzalez: wmcs: make cloudmetrics1002 the primary instead of labmon1001 (5 comments) [puppet] - https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi) [13:13:30] (CR) jerkins-bot: [V: -1] CI - taskgen: add black tests for python2 and python3 files [puppet] - https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) (owner: Jbond) [13:17:47] (PS1) KartikMistry: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/554849 
(https://phabricator.wikimedia.org/T236080) [13:21:58] (CR) jerkins-bot: [V: -1] apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/554849 (https://phabricator.wikimedia.org/T236080) (owner: KartikMistry) [13:22:25] (PS1) TechneSiyam: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/554851 [13:22:46] (PS1) Muehlenhoff: Switch ldap-corp.eqiad.wikimedia.org to ldap-corp1001 [dns] - https://gerrit.wikimedia.org/r/554852 (https://phabricator.wikimedia.org/T224557) [13:23:31] Operations, Maps: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (Gehel) A few more comments inline. >>! In T239728#5711882, @Mathew.onipe wrote: > > Steps/Processes: > At each DC (starting from eqiad) > [] Downtime t... [13:25:55] (PS1) Arturo Borrero Gonzalez: wmcs: use hiera for labmon/cloudmetrics instead of harcoded values [puppet] - https://gerrit.wikimedia.org/r/554853 (https://phabricator.wikimedia.org/T224585) [13:28:34] (CR) Filippo Giunchedi: [C: +2] logstash: bump max fields to 4096 [puppet] - https://gerrit.wikimedia.org/r/554303 (https://phabricator.wikimedia.org/T180051) (owner: Filippo Giunchedi) [13:35:36] Operations, Cassandra, RESTBase-Cassandra, Core Platform Team Legacy (Later), and 3 others: Configure a threshold for earlier notification of /srv/cassandra/instance-data - https://phabricator.wikimedia.org/T191659 (fgiunchedi) [13:48:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:51:13] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/554849 (https://phabricator.wikimedia.org/T236080) (owner: 10KartikMistry) [13:51:19] (03PS5) 10Jbond: CI - black: update python3 files with black [puppet] - 10https://gerrit.wikimedia.org/r/554825 (https://phabricator.wikimedia.org/T221083) [13:52:15] (03PS2) 10Phamhi: wmcs: make cloudmetrics1002 the primary instead of labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/554844 (https://phabricator.wikimedia.org/T224585) [13:54:14] (03PS5) 10Jbond: CI - black: run black over python2 files [puppet] - 10https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) [13:55:28] (03Abandoned) 10Urbanecm: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554851 (owner: 10TechneSiyam) [13:57:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch ldap-corp.eqiad.wikimedia.org to ldap-corp1001 [dns] - 10https://gerrit.wikimedia.org/r/554852 (https://phabricator.wikimedia.org/T224557) (owner: 10Muehlenhoff) [13:58:50] (03PS2) 10Arturo Borrero Gonzalez: wmcs: use hiera for labmon/cloudmetrics instead of harcoded values [puppet] - 10https://gerrit.wikimedia.org/r/554853 (https://phabricator.wikimedia.org/T224585) [14:01:11] (03PS3) 10Arturo Borrero Gonzalez: wmcs: use hiera for labmon/cloudmetrics instead of harcoded values [puppet] - 10https://gerrit.wikimedia.org/r/554853 (https://phabricator.wikimedia.org/T224585) [14:04:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:04:32] (03PS1) 10Mathew.onipe: maps: disable replication cron 
[puppet] - 10https://gerrit.wikimedia.org/r/554860 (https://phabricator.wikimedia.org/T239728) [14:12:06] (03CR) 10Mathew.onipe: "PCC output is Ok: https://puppet-compiler.wmflabs.org/compiler1001/19807/maps1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/554860 (https://phabricator.wikimedia.org/T239728) (owner: 10Mathew.onipe) [14:14:27] (03PS4) 10Arturo Borrero Gonzalez: wmcs: use hiera for labmon/cloudmetrics instead of harcoded values [puppet] - 10https://gerrit.wikimedia.org/r/554853 (https://phabricator.wikimedia.org/T224585) [14:14:42] (03PS6) 10Filippo Giunchedi: install_server: standard recipe and raid1/raid10 [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) [14:14:44] (03PS6) 10Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) [14:14:50] (03PS3) 10Filippo Giunchedi: WIP role: extend centrallog's /srv if needed [puppet] - 10https://gerrit.wikimedia.org/r/554044 (https://phabricator.wikimedia.org/T156955) [14:20:08] !reimage mw2260 [14:20:33] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2260.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201912051420_jiji_139305.log`. [14:20:43] wow yeah we should totally have an IRC bot for that [14:20:49] imagine how much time we would save sshing around [14:21:43] a bot for which, sorry? [14:21:55] reimaging hosts! 
[14:22:03] heh [14:22:13] hashtag chatops [14:22:48] :) [14:25:49] I assume that was supposed to be a !log, so repeating for SAL [14:25:52] !log 14:20:08 reimage mw2260 [14:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:08] thank you Lucas :D [14:26:14] np :) [14:27:44] (03PS1) 10Ema: ATS: add varnish-fe <-> ats-be TTFB histogram [puppet] - 10https://gerrit.wikimedia.org/r/554863 (https://phabricator.wikimedia.org/T238494) [14:28:25] PROBLEM - SSH on backup1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:28:55] (03Abandoned) 10Ema: ATS: add trafficserver_backend_client_requests_total to mtail [puppet] - 10https://gerrit.wikimedia.org/r/554570 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [14:29:31] (03CR) 10jerkins-bot: [V: 04-1] ATS: add varnish-fe <-> ats-be TTFB histogram [puppet] - 10https://gerrit.wikimedia.org/r/554863 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [14:29:44] (03PS1) 10Lucas Werkmeister (WMDE): noc: Sort clusters naturally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554865 [14:29:58] jenkins breaking my heart as always [14:32:19] (03CR) 10Gehel: [C: 03+2] maps: disable replication cron [puppet] - 10https://gerrit.wikimedia.org/r/554860 (https://phabricator.wikimedia.org/T239728) (owner: 10Mathew.onipe) [14:32:36] (03PS6) 10Jbond: CI - black: run black over python2 files [puppet] - 10https://gerrit.wikimedia.org/r/554826 (https://phabricator.wikimedia.org/T221083) [14:33:00] (03PS2) 10Ema: ATS: add varnish-fe <-> ats-be TTFB histogram [puppet] - 10https://gerrit.wikimedia.org/r/554863 (https://phabricator.wikimedia.org/T238494) [14:33:39] (03PS1) 10BBlack: anycast healthchecker should be able to bind as well [puppet] - 10https://gerrit.wikimedia.org/r/554866 [14:33:41] (03PS1) 10BBlack: anycast: bind to real service [puppet] - 10https://gerrit.wikimedia.org/r/554867 [14:33:43] (03PS1) 10BBlack: bird: 
10s delay after route withdraw [puppet] - 10https://gerrit.wikimedia.org/r/554868 [14:33:45] (03PS1) 10BBlack: dnsrecursor: param threads and lce for prod [puppet] - 10https://gerrit.wikimedia.org/r/554869 [14:34:02] (03PS12) 10Jbond: CI - taskgen: add black tests for python2 and python3 files [puppet] - 10https://gerrit.wikimedia.org/r/553487 (https://phabricator.wikimedia.org/T239334) [14:35:33] (03PS3) 10Ema: ATS: add varnish-fe <-> ats-be TTFB histogram [puppet] - 10https://gerrit.wikimedia.org/r/554863 (https://phabricator.wikimedia.org/T238494) [14:37:19] (03PS1) 10CDanis: noc/db: clarify loads output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554870 [14:37:33] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [14:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:52] (03CR) 10Filippo Giunchedi: ATS: add varnish-fe <-> ats-be TTFB histogram (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554863 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [14:39:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:32] jouncebot: next [14:45:32] In 2 hour(s) and 14 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191205T1700) [14:45:36] (03PS2) 10Giuseppe Lavagetto: eventgate: convert to use the common tls templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/554834 [14:45:38] (03CR) 10Ladsgroup: [C: 03+1] noc: Sort clusters naturally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554865 (owner: 10Lucas Werkmeister (WMDE)) [14:45:49] (03CR) 10jerkins-bot: [V: 04-1] eventgate: convert to use the common tls templates [deployment-charts] - 10https://gerrit.wikimedia.org/r/554834 (owner: 10Giuseppe Lavagetto) [14:47:17] PROBLEM - Bird Internet Routing Daemon on dns4001 is CRITICAL: PROCS CRITICAL: 0 processes with 
command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:47:21] Amir1: Lucas_WMDE: oh nice, thanks for that patch. I was about to merge and push another change to db.php to make its output slightly nicer; I'll push both [14:47:33] ^ bird alert is ok, it's me experimenting with improvements [14:47:39] cdanis: thanks! [14:47:55] (03CR) 10CDanis: [C: 03+2] noc: Sort clusters naturally [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554865 (owner: 10Lucas Werkmeister (WMDE)) [14:48:01] (03CR) 10CDanis: [C: 03+2] noc/db: clarify loads output [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554870 (owner: 10CDanis) [14:48:08] cdanis: cool, thanks! [14:48:25] RECOVERY - Bird Internet Routing Daemon on dns4001 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:48:44] and +1 for your loads change, I had been wondering about exactly the thing that it clarifies :) [14:49:34] (03PS2) 10BBlack: anycast healthchecker should be able to bind [puppet] - 10https://gerrit.wikimedia.org/r/554866 [14:49:35] (03PS2) 10BBlack: anycast: bind to real service [puppet] - 10https://gerrit.wikimedia.org/r/554867 [14:49:37] (03PS2) 10BBlack: bird: 10s delay after route withdraw [puppet] - 10https://gerrit.wikimedia.org/r/554868 [14:49:39] (03PS2) 10BBlack: dnsrecursor: param threads and lce for prod [puppet] - 10https://gerrit.wikimedia.org/r/554869 [14:50:03] Lucas_WMDE: clearly associative arrays are ordered entities, right? 
;) [14:50:41] !log cdanis@deploy1001 Synchronized docroot/noc/db.php: c0fe7c410 noc/db.php: clarify loads output (duration: 01m 01s) [14:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:04] oops, I commented wrong, sigh [14:51:55] !log disable tilerator on maps100[1-3].eqiad.wmnet - T239728 [14:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:00] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 [14:52:30] !log disable puppet on maps100[1-3].eqiad.wmnet - T239728 [14:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:49] !log cdanis@deploy1001 Synchronized src/Noc/WmfClusters.php: c0fe7c410 clarify loads output (earlier push was 7963fdcd2 sort clusters naturally) (duration: 00m 59s) [14:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:10] ok, all done [14:53:25] (03PS4) 10Ema: ATS: add varnish-fe <-> ats-be TTFB histogram [puppet] - 10https://gerrit.wikimedia.org/r/554863 (https://phabricator.wikimedia.org/T238494) [14:55:12] (03Abandoned) 10Jbond: ci - taskgen: test black ci on individual files [puppet] - 10https://gerrit.wikimedia.org/r/553488 (https://phabricator.wikimedia.org/T239334) (owner: 10Jbond) [14:56:20] (03CR) 10Filippo Giunchedi: [C: 03+1] ATS: add varnish-fe <-> ats-be TTFB histogram [puppet] - 10https://gerrit.wikimedia.org/r/554863 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [14:56:33] (03CR) 10Jcrespo: "https://jynus.com/gif/gerrit_plus_1.gifv" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554865 (owner: 10Lucas Werkmeister (WMDE)) [15:06:00] (03CR) 10Ema: [C: 03+2] ATS: add varnish-fe <-> ats-be TTFB histogram [puppet] - 10https://gerrit.wikimedia.org/r/554863 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [15:06:59] (03PS3) 10BBlack: anycast healthchecker should be 
able to bind [puppet] - 10https://gerrit.wikimedia.org/r/554866 [15:07:01] (03PS3) 10BBlack: anycast: bind to real service [puppet] - 10https://gerrit.wikimedia.org/r/554867 [15:07:02] (03PS3) 10BBlack: bird: 10s delay after route withdraw [puppet] - 10https://gerrit.wikimedia.org/r/554868 [15:07:04] (03PS3) 10BBlack: dnsrecursor: param threads and lce for prod [puppet] - 10https://gerrit.wikimedia.org/r/554869 [15:10:47] (03PS4) 10BBlack: anycast healthchecker should be able to bind [puppet] - 10https://gerrit.wikimedia.org/r/554866 (https://phabricator.wikimedia.org/T98006) [15:10:51] (03PS4) 10BBlack: anycast: bind to real service [puppet] - 10https://gerrit.wikimedia.org/r/554867 (https://phabricator.wikimedia.org/T98006) [15:10:54] (03PS4) 10BBlack: bird: 10s delay after route withdraw [puppet] - 10https://gerrit.wikimedia.org/r/554868 (https://phabricator.wikimedia.org/T98006) [15:10:57] (03PS4) 10BBlack: dnsrecursor: param threads and lce for prod [puppet] - 10https://gerrit.wikimedia.org/r/554869 (https://phabricator.wikimedia.org/T239667) [15:13:44] (03PS1) 10TechneSiyam: This is for GCI - TechneSiyam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554889 [15:14:01] (03CR) 10BBlack: [C: 03+2] anycast healthchecker should be able to bind [puppet] - 10https://gerrit.wikimedia.org/r/554866 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:14:08] (03CR) 10BBlack: [C: 03+2] anycast: bind to real service [puppet] - 10https://gerrit.wikimedia.org/r/554867 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:14:11] (03CR) 10BBlack: [C: 03+2] bird: 10s delay after route withdraw [puppet] - 10https://gerrit.wikimedia.org/r/554868 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:16:14] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 [15:16:14] !log run osm-import on maps1004 - T239728 [15:16:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:21] T239728: 
Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 [15:16:56] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5167 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:17:31] !log anomie@deploy1001 Started scap: Backporting fix for T239428 [15:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:37] T239428: API edit on page with non-resolvable redirect and redirect=1 creates page with invalid title - https://phabricator.wikimedia.org/T239428 [15:18:38] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 [15:24:18] (03PS4) 10Hashar: contint: role for CI package_builder instances [puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) [15:25:38] 10Operations, 10serviceops: VP9-enabled ffmpeg doesn't get installed after reimage of mw job runner/video scaler - https://phabricator.wikimedia.org/T239831 (10brion) Ideally: fix the installs ASAP :) If can't be done: disable `$wgFFmpegVP9RowMT` and that should hopefully work if it reverted to a version that... [15:25:46] (03CR) 10Hashar: "cherry picked, ready to go." 
[puppet] - 10https://gerrit.wikimedia.org/r/554642 (https://phabricator.wikimedia.org/T224943) (owner: 10Hashar) [15:26:04] jouncebot: next [15:26:04] In 1 hour(s) and 33 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191205T1700) [15:27:48] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0625 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [15:31:31] (03CR) 10BBlack: [C: 03+2] dnsrecursor: param threads and lce for prod [puppet] - 10https://gerrit.wikimedia.org/r/554869 (https://phabricator.wikimedia.org/T239667) (owner: 10BBlack) [15:38:01] (03PS1) 10Brion VIBBER: Temporary disable of $wgFFmpegVP9RowMT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554894 (https://phabricator.wikimedia.org/T239831) [15:41:45] (03PS2) 10Alexandros Kosiaris: blubberoid: Harmonize eqiad limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/553469 [15:42:16] (03CR) 10Alexandros Kosiaris: [C: 03+2] blubberoid: Harmonize eqiad limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/553469 (owner: 10Alexandros Kosiaris) [15:42:30] (03Merged) 10jenkins-bot: blubberoid: Harmonize eqiad limits/requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/553469 (owner: 10Alexandros Kosiaris) [15:43:07] !log upgrading the reimaged video scalers back to the row-mt enabled ffmpeg T239831 [15:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:12] T239831: VP9-enabled ffmpeg doesn't get installed after reimage of mw job runner/video scaler - https://phabricator.wikimedia.org/T239831 [15:43:39] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'blubberoid' for release 'production' . 
[15:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:53] !log restart backup1001, overloaded T234900 [15:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:58] T234900: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 [15:47:43] (03PS1) 10Jcrespo: Revert "bacula: Schedule hourly copies of production backups to the offsite pool" [puppet] - 10https://gerrit.wikimedia.org/r/554896 [15:48:03] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Revert "bacula: Schedule hourly copies of production backups to the offsite pool" [puppet] - 10https://gerrit.wikimedia.org/r/554896 (owner: 10Jcrespo) [15:50:51] !log anomie@deploy1001 Finished scap: Backporting fix for T239428 (duration: 33m 20s) [15:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:57] T239428: API edit on page with non-resolvable redirect and redirect=1 creates page with invalid title - https://phabricator.wikimedia.org/T239428 [15:52:36] (03PS1) 10BBlack: cron_splay: support semihourly execution [puppet] - 10https://gerrit.wikimedia.org/r/554898 [15:54:01] (03CR) 10jerkins-bot: [V: 04-1] cron_splay: support semihourly execution [puppet] - 10https://gerrit.wikimedia.org/r/554898 (owner: 10BBlack) [15:54:43] RECOVERY - SSH on backup1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:57:52] (03PS2) 10BBlack: cron_splay: support semihourly execution [puppet] - 10https://gerrit.wikimedia.org/r/554898 [16:00:05] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:02:29] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Comments inline about the implementation, but the idea is sound" (0311 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/554832 (owner: 10Giuseppe Lavagetto) [16:03:48] (03PS3) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 [16:04:15] PROBLEM - Host ms-fe2007 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:24] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2260.codfw.wmnet'] ` and were **ALL** successful. [16:05:47] (03CR) 10jerkins-bot: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 (owner: 10CDanis) [16:07:06] (03PS2) 10Papaul: DNS: Add mgmt and production DNS for frdb2002 [dns] - 10https://gerrit.wikimedia.org/r/554640 [16:09:37] (03PS4) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 [16:11:06] (03PS5) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 [16:11:19] (03PS1) 10Muehlenhoff: make ffmpeg installation depend on apt-get update exec [puppet] - 10https://gerrit.wikimedia.org/r/554904 [16:12:06] (03PS1) 10Herron: lvs: add entries for logstash-next and kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/554905 (https://phabricator.wikimedia.org/T234854) [16:12:12] (03PS1) 10Herron: dns: add kibana-next and logstash-next service addresses [dns] - 10https://gerrit.wikimedia.org/r/554906 (https://phabricator.wikimedia.org/T234854) [16:12:45] (03CR) 10jerkins-bot: [V: 04-1] dns: add kibana-next and logstash-next service addresses [dns] - 10https://gerrit.wikimedia.org/r/554906 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [16:13:00] (03CR) 10jerkins-bot: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 (owner: 10CDanis) [16:14:40] (03PS6) 
10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 [16:16:33] (03CR) 10jerkins-bot: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 (owner: 10CDanis) [16:17:38] (03PS7) 10Filippo Giunchedi: install_server: standard recipe and raid1/raid10 [puppet] - 10https://gerrit.wikimedia.org/r/553363 (https://phabricator.wikimedia.org/T156955) [16:17:40] (03PS7) 10Filippo Giunchedi: install_server: apply standard partman recipes, take #1 [puppet] - 10https://gerrit.wikimedia.org/r/553483 (https://phabricator.wikimedia.org/T156955) [16:17:42] (03PS4) 10Filippo Giunchedi: WIP role: extend centrallog's /srv if needed [puppet] - 10https://gerrit.wikimedia.org/r/554044 (https://phabricator.wikimedia.org/T156955) [16:18:16] (03PS7) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 [16:19:41] (03CR) 10Herron: [C: 03+1] prometheus: drop Logstash 7 unnamed plugins [puppet] - 10https://gerrit.wikimedia.org/r/554843 (owner: 10Filippo Giunchedi) [16:20:11] !log elukey@cr2-eqord> clear bfd session 208.80.154.208 [16:20:11] ^ [16:20:13] (03CR) 10jerkins-bot: [V: 04-1] WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 (owner: 10CDanis) [16:20:14] argh [16:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:18] will remove it :) [16:20:20] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:20:38] !log execute clear bfd session address 208.80.154.208 on cr2-eqord [16:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:21] (03PS2) 10Herron: lvs: add entries for logstash-next and kibana-next [puppet] - 10https://gerrit.wikimedia.org/r/554905 (https://phabricator.wikimedia.org/T234854) [16:23:33] (03CR) 10Muehlenhoff: "No longer necessary, the correct ffmpeg has been deployed by Effie on all reimaged servers (and we're in the process of correcting new ins" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/554894 (https://phabricator.wikimedia.org/T239831) (owner: 10Brion VIBBER) [16:23:52] Woohoo! [16:25:07] (03PS8) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/554890 [16:26:07] (03Abandoned) 10Brion VIBBER: Temporary disable of $wgFFmpegVP9RowMT [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554894 (https://phabricator.wikimedia.org/T239831) (owner: 10Brion VIBBER) [16:28:34] (03PS9) 10CDanis: Install prometheus-atlas-exporter on netmon hosts [puppet] - 10https://gerrit.wikimedia.org/r/554890 [16:29:03] (03CR) 10jerkins-bot: [V: 04-1] Install prometheus-atlas-exporter on netmon hosts [puppet] - 10https://gerrit.wikimedia.org/r/554890 (owner: 10CDanis) [16:29:41] I really wish jerkins wouldn't post here for changes marked WIP in gerrit >.< [16:30:01] 11:28:59 fatal: remote error: access denied or repository not exported: /operations/puppet [16:30:02] truly living up to its name it is :-) [16:30:04] oh good [16:30:10] (03PS1) 10EBernhardson: [airflow] correct path to discovery analytics repo [puppet] - 10https://gerrit.wikimedia.org/r/554909 [16:30:13] (03CR) 10CDanis: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/554890 (owner: 10CDanis) [16:30:16] cdanis: File a bug or a patch? 
;) [16:30:21] (03PS1) 10Eevans: Update session serialization (Kask) to PHP w/ HMAC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) [16:31:20] Reedy: https://phabricator.wikimedia.org/T239928 just for you [16:31:58] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:32:04] (03CR) 10CDanis: "PCC looks reasonable: https://puppet-compiler.wmflabs.org/compiler1002/19820/" [puppet] - 10https://gerrit.wikimedia.org/r/554890 (owner: 10CDanis) [16:32:43] !log execute clear bfd session address fe80::7a4f:9b00:d4e:8004 on cr1-eqiad [16:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:10] !log execute clear bfd session address fe80::5e5e:ab00:d3d:85ce on cr3-knams [16:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:55] (03PS2) 10Elukey: airflow: correct path to discovery analytics repo [puppet] - 10https://gerrit.wikimedia.org/r/554909 (owner: 10EBernhardson) [16:38:22] 10Operations, 10serviceops, 10Patch-For-Review: VP9-enabled ffmpeg doesn't get installed after reimage of mw job runner/video scaler - https://phabricator.wikimedia.org/T239831 (10brion) Ok, I'll run: ` foreachwiki extensions/TimedMediaHandler/maintenance/requeueTranscodes.php --error --throttle ` to catch... [16:38:40] (03CR) 10Elukey: [C: 03+2] airflow: correct path to discovery analytics repo [puppet] - 10https://gerrit.wikimedia.org/r/554909 (owner: 10EBernhardson) [16:38:48] RECOVERY - BFD status on cr2-eqord is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:40:42] !log running `requeueTranscodes.php --error --throttle` on mwmaint1002 to clean up T239831-related broken video transcodes. will raise usage on video scalers for a while. 
[16:40:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:48] T239831: VP9-enabled ffmpeg doesn't get installed after reimage of mw job runner/video scaler - https://phabricator.wikimedia.org/T239831 [16:42:58] (03CR) 10BPirkle: Update session serialization (Kask) to PHP w/ HMAC (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) (owner: 10Eevans) [16:46:19] ACKNOWLEDGEMENT - Host ms-fe2007 is DOWN: PING CRITICAL - Packet loss = 100% Filippo Giunchedi https://phabricator.wikimedia.org/T239805 [16:46:21] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10jijiki) @Jclark-ctr Do we have an estimate when we are able to have those servers racked? [16:47:53] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@87b25f2]: initial airflow dags/plugins [16:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:47:59] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@87b25f2]: initial airflow dags/plugins (duration: 00m 06s) [16:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:30] PROBLEM - mediawiki-installation DSH group on mw2261 is CRITICAL: Host mw2261 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:00:04] godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191205T1700). Please do the needful. [17:00:05] No GERRIT patches in the queue for this window AFAICS. 
[17:07:16] ^ I am aware of mw2261 [17:12:35] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554905 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [17:13:15] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: drop Logstash 7 unnamed plugins [puppet] - 10https://gerrit.wikimedia.org/r/554843 (owner: 10Filippo Giunchedi) [17:16:57] (03CR) 10Effie Mouzeli: "OK https://puppet-compiler.wmflabs.org/compiler1002/19821/" [puppet] - 10https://gerrit.wikimedia.org/r/554904 (owner: 10Muehlenhoff) [17:17:08] (03CR) 10Effie Mouzeli: [C: 03+2] make ffmpeg installation depend on apt-get update exec [puppet] - 10https://gerrit.wikimedia.org/r/554904 (owner: 10Muehlenhoff) [17:18:20] (03CR) 10Bearloga: statistics::discovery: move cron to timer and add kerberos support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554528 (owner: 10Elukey) [17:18:52] !log reimage mw2260, yes again [17:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:16] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` ['mw2260.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201912051719_jiji_174984.log`. 
[17:22:24] RECOVERY - mediawiki-installation DSH group on mw2261 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:23:10] !log ✔️ cdanis@install1002.wikimedia.org ~ 🕧☕ sudo -E reprepro -C main include stretch-wikimedia prometheus-atlas-exporter_1.0+git20191204.ffafab7-1_amd64.changes [17:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:17] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10wiki_willy) Hi @jijiki - we received these on Sept 25, but just recently received the racking instructions from your team on Nov 25, so there might be... [17:25:35] 10Operations, 10DBA, 10Wikimedia-Incident: Disallow 'weight: 0' for MW db config in dbctl - https://phabricator.wikimedia.org/T239901 (10Marostegui) I am not sure if I want to fully disallow weight 0 for replicas, there are some cases where we might actually want that, Cross posting from: T239900 ` We used... [17:27:56] (03CR) 10Elukey: statistics::discovery: move cron to timer and add kerberos support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554528 (owner: 10Elukey) [17:28:34] (03PS3) 10Elukey: statistics::discovery: move cron to timer and add kerberos support [puppet] - 10https://gerrit.wikimedia.org/r/554528 [17:29:09] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10jijiki) @wiki_willy given we are responsible for this delay, we would like to check with your team and tell us what is the earliest we can do, and we c... [17:30:37] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10wiki_willy) Thanks for confirming @jijiki - I'll let @Jclark-ctr provide an ETA on them, when he gets in a bit later today. 
Thanks, Willy [17:30:55] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@c29a758]: deploy repo to search-airflow dsh group [17:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:08] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@c29a758]: deploy repo to search-airflow dsh group (duration: 00m 13s) [17:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:18] ops wizards: what's the thoughts on swatting a new hook into the linter extension today? [17:33:38] ops wizards: we'd like to get a head start on moving Parsoid linting over from JS to PHP [17:33:46] (03CR) 10Jforrester: [C: 03+1] "Happy to deploy myself unless you want to?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554836 (https://phabricator.wikimedia.org/T235142) (owner: 10Krinkle) [17:33:48] (03PS2) 10Zoranzoki21: Add *.archives.go.jp to the wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/553875 (https://phabricator.wikimedia.org/T238476) [17:34:26] but given the issues with wmf.8 it doesn't seem clear whether we should swat to wmf.5 or wmf.8 -- or else just wait until monday and hope the train's been restarted by then and things will be clearer. [17:34:38] (03CR) 10Bearloga: [C: 03+1] "Nice, thank you so much! I love that kerberos::exect has those logging parameters. LGTM! I don't have +2 so that's for you or Andrew to do" [puppet] - 10https://gerrit.wikimedia.org/r/554528 (owner: 10Elukey) [17:34:59] James_F, Krinkle: ^ poking you as two specific members of the ops wizards group [17:35:08] cscott: I'm not up to speed on all related things, but I thought part of the reason for the train delays was parsoid, so using more of it seems not so great? 
[17:35:28] parsoid appeared to be a red herring in this particular train crash [17:35:39] (03CR) 10Herron: "should be unblocked now, mw logs are flowing again in logstash-beta" [puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [17:36:04] bblack: we're really trying to retire parsoid/js asap, especially considering we lose mobrovac on the 13th. [17:36:11] AIUI a (yet-undiagnosed?) sharp increase in load put on s7 (which includes centralauth) by .8 was responsible [17:36:18] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime [17:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:49] i concur that parsoid errors were a red herring here, but the volume of logspam it generates was a considerable factor in obscuring the underlying issue. [17:37:00] acknowledged [17:37:36] fwiw the swat patch doesn't immediately enable parsoid/php linting; it just puts the hook in place that would allow us to turn that on before the next train. [17:37:49] (03PS3) 10Tchanders: Enable CheckUser's Special:Investigate page on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) [17:38:27] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:45] and then it still doesn't get turned on until monday? [17:39:10] or that just means the next decision, to turn it on, no longer needs to block on swat/train? :) [17:40:10] bblack: yes, that the next decision no longer needs to block on swat/train. [17:40:22] cscott: Back-porting the hook so it can be used early next week sounds sane. [17:40:35] yeah, i've got no personal objection as long as we don't add any more noise to the mix before this train can grind its way to the finish line. [17:40:58] AIUI this will abate noise about fatals from the langconv stuff, right? 
[17:41:00] on a technical level, do we swat to wmf.5 or wmf.8 or both or neither? [17:41:01] (03PS1) 10EBernhardson: [airflow] Include deployment venv on path [puppet] - 10https://gerrit.wikimedia.org/r/554920 [17:41:22] ok :) [17:41:25] Are you planning on using it on Monday? We might not have got the train out by then. :-( [17:41:28] James_F: no, i'm still working on the langconv patch. that can be done w/ a parsoid deploy which is independent of the swat/train. [17:42:15] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10Jclark-ctr) @jijiki If no surprises i could have them Racked by Dec 20th [17:42:16] James_F: yeah, that's the thing. my personal feeling is maybe we might as well just wait to swat until monday, and hope the train was resolved by then so it was clear which branch to swat to [17:42:46] Yes, I mean this will abate it when the new Parsoid code goes out. [17:43:09] Just dump it into wmf.5 as well for safety. [17:43:36] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10jijiki) That would be lovely, thank you! [17:43:59] there are two separate issues, two things which we're still using parsoid/js for. one is for linting, because we're missing the needed hook. the other is for languageconverter, for faults of our own. [17:44:19] Oh, OK. [17:45:02] (briefly, we mirrored traffic from parsoid/js to parsoid/php for weeks so we could iron the bugs out, but we hadn't been mirroring the html2html traffic for language converter. but now it's being mirrored, at a 6% of traffic level, so we can do the same bug-elimination work on that code) [17:45:54] the most obvious log spam right now is language converter emitting invalid UTF8, which is crashing PHP and gets silently converted to UCS2 replacement characters in JS.
[17:46:07] 10Operations, 10ops-eqiad, 10serviceops: (Need By Dec 20) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10wiki_willy) [17:46:44] Is this something we should catch on the ingestion and prevent processing/saving (in both Parser and Parsoid)? [17:49:38] no, it's invalid UTF8 as a result of the conversion, not as the input. which means that the conversion table is wrong somehow, but I haven't figured out exactly how yet. (Unless the original legacy language converter tables have invalid UTF8, which isn't entirely impossible.) [17:50:55] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [17:54:47] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2260.codfw.wmnet'] ` and were **ALL** successful. [17:55:10] 10Operations, 10DBA, 10Wikimedia-Incident: Disallow 'weight: 0' for MW db config in dbctl - https://phabricator.wikimedia.org/T239901 (10colewhite) p:05Triage→03Normal [17:55:32] 10Operations, 10netops: Facebook BGP peering links down in ulsfo - https://phabricator.wikimedia.org/T239896 (10colewhite) p:05Triage→03Normal [17:58:14] 10Operations, 10netops: BGP peering sessions with corp partially down in ulsfo - https://phabricator.wikimedia.org/T239893 (10colewhite) p:05Triage→03Normal [17:58:34] 10Operations, 10DC-Ops, 10hardware-requests: Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (10colewhite) p:05Triage→03Normal [17:58:57] 10Operations, 10Patch-For-Review: Fix installation of Puppet 5/Facter 3 on new stretch installs/reimages - https://phabricator.wikimedia.org/T239832 (10colewhite) p:05Triage→03Normal [17:59:30] 10Operations, 10ops-codfw: ms-fe2007 NIC failure - https://phabricator.wikimedia.org/T239805 (10colewhite) p:05Triage→03Normal [18:00:04] 
cscott, arlolra, subbu, halfak, and accraze: How many deployers does it take to do Services – Graphoid / Parsoid / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191205T1800). [18:01:36] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Danny Horn - https://phabricator.wikimedia.org/T239881 (10colewhite) p:05Triage→03Normal [18:02:53] (03PS1) 10Cwhite: admin: add Danny Horn to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/554924 (https://phabricator.wikimedia.org/T239881) [18:04:02] (03PS2) 10Eevans: Update session serialization (Kask) to PHP w/ HMAC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) [18:04:13] We're doing an ORES deployment [18:04:16] kevinbazira, ^ [18:04:53] (03CR) 10Eevans: "> Patch Set 1:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) (owner: 10Eevans) [18:08:26] !log kevinbazira@deploy1001 Started deploy [ores/deploy@6dd1fef]: T238839 [18:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:32] T238839: ORES deploy -- Late November, 2019 - https://phabricator.wikimedia.org/T238839 [18:10:16] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1089 - https://phabricator.wikimedia.org/T239365 (10mforns) p:05Triage→03High [18:13:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Data-Services, and 2 others: Decommission labstore100[123] and their disk shelves - https://phabricator.wikimedia.org/T187456 (10faidon) [18:13:56] (03CR) 10Volans: [C: 04-1] "Thanks a lot for all the work and to give me a test instance to test it! A bunch of comments inline." 
(0327 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [18:14:40] 10Operations, 10ops-esams, 10Patch-For-Review: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10Dzahn) 05Stalled→03Open [18:23:08] testing a plugin please ignore https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/ [18:23:28] (03PS1) 10Cwhite: admin: add rxy to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/554926 (https://phabricator.wikimedia.org/T239494) [18:25:20] (03CR) 10Anomie: Update session serialization (Kask) to PHP w/ HMAC (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) (owner: 10Eevans) [18:25:46] !log kevinbazira@deploy1001 Finished deploy [ores/deploy@6dd1fef]: T238839 (duration: 17m 20s) [18:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:52] T238839: ORES deploy -- Late November, 2019 - https://phabricator.wikimedia.org/T238839 [18:26:49] 10Operations, 10DC-Ops, 10hardware-requests: Replacement hardware for buster/stretch upgrade of contint1001 and contint2001 - https://phabricator.wikimedia.org/T239880 (10Dzahn) I'd be the one to take these from "role::spare"-state to serving the puppet roles. We'll see what breaks on buster and whether we c... [18:29:33] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) Just a thought.. If the 2 machines for contint are granted (T239880) first and are taken from available spare pool.. then possibly one of them could work as temp Gerrit test machine a... [18:31:45] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10Dzahn) a:05Dzahn→03None I'm unassigning from me for now because i'll be on a vacation and don't want this to be blocked on me for no reason. 
If anyone else can push this forward meanwhil... [18:33:14] (03CR) 10Gehel: [C: 03+2] [airflow] Include deployment venv on path [puppet] - 10https://gerrit.wikimedia.org/r/554920 (owner: 10EBernhardson) [18:35:45] Victory! ORES deploy looks good. Thanks for your help halfak! [18:40:20] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2259.codfw.wmnet [18:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:37] 10Operations, 10serviceops: VP9-enabled ffmpeg doesn't get installed after reimage of mw job runner/video scaler - https://phabricator.wikimedia.org/T239831 (10jijiki) @brion thank you! You can mark this as resolved if there is nothing else to be done [18:43:45] 10Operations, 10ops-codfw, 10DC-Ops, 10serviceops: mw2259 down and mgmt does not exist? - https://phabricator.wikimedia.org/T239758 (10Dzahn) Thank you very much @Papaul. I can confirm the server is reachable again and everything looked fine in Icinga. Also the mgmt interface is in Icinga again. I jus... [18:44:20] 10Operations, 10serviceops: Reimage all mediawiki servers - https://phabricator.wikimedia.org/T239054 (10Dzahn) [18:47:24] 10Operations, 10Gerrit, 10vm-requests: Gerrit VM to test data migration - https://phabricator.wikimedia.org/T239151 (10thcipriani) >>! In T239151#5716392, @Dzahn wrote: > Just a thought.. If the 2 machines for contint are granted (T239880) first and are taken from available spare pool.. then possibly one of... [18:53:03] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1002: SMART/disk error - https://phabricator.wikimedia.org/T230088 (10Jclark-ctr) @mathew.onip Disk arrived [19:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191205T1900). 
[19:00:04] urandom, cscott, arlolra, and Zoranzoki21: A patch you scheduled for Morning SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:01:14] I can SWAT today! [19:03:06] cscott: around? [19:03:18] yup [19:04:10] cool! +2'ed your backports, will let you know once they're ready to be tested [19:04:37] I moved mine until later [19:04:51] Pretty sure they're not testable ;) [19:04:56] I guess jouncebot didn't get the memo (in time) [19:06:19] urandom: yes, saw that [19:07:38] Reedy: at least, "nothing broke when the code is run" is also a test :-). [19:07:56] A hook with no subscribers is a no-op [19:08:01] linting will confirm it is valid code [19:08:16] (03CR) 10Dzahn: [C: 03+1] admin: add rxy to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/554926 (https://phabricator.wikimedia.org/T239494) (owner: 10Cwhite) [19:08:38] (03CR) 10Dzahn: [C: 03+1] admin: add Danny Horn to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/554924 (https://phabricator.wikimedia.org/T239881) (owner: 10Cwhite) [19:09:41] (03CR) 10Cwhite: [C: 03+2] admin: add rxy to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/554926 (https://phabricator.wikimedia.org/T239494) (owner: 10Cwhite) [19:09:54] (03CR) 10Cwhite: [C: 03+2] admin: add Danny Horn to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/554924 (https://phabricator.wikimedia.org/T239881) (owner: 10Cwhite) [19:10:04] (03PS2) 10Cwhite: admin: add Danny Horn to ldap-only-users [puppet] - 10https://gerrit.wikimedia.org/r/554924 (https://phabricator.wikimedia.org/T239881) [19:13:48] (03Abandoned) 10Jforrester: I am TechneSiyam.Doing this for GCI Task. MY Full Name: MD Abu Siyam [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554812 (owner: 10TechneSiyam) [19:14:01] (03Abandoned) 10Jforrester: This is for GCI Task of TechneSiyam.
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/554845 (owner: 10TechneSiyam) [19:14:11] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Linter: SWAT: 839c383: Implement ParserLogLinterData hook (T238456) (duration: 01m 02s) [19:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:17] T238456: Missing implementation to post Parsoid/PHP lints to production database - https://phabricator.wikimedia.org/T238456 [19:14:24] RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:14:29] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/554889 (owner: 10TechneSiyam) [19:15:37] !log urbanecm@deploy1001 scap failed: average error rate on 3/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [19:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:23] that...doesn't look like a no op [19:16:39] arlolra: ^ [19:16:45] `Error from line 46 of /srv/mediawiki/php-1.35.0-wmf.5/extensions/Linter/includes/ApiRecordLint.php: Call to undefined method MediaWiki\Linter\Hooks::onParserLogLinterData()`, clearly related, reverting... [19:17:00] Urbanecm: thanks! [19:18:50] (although that function should be defined in that same commit, hm.) [19:19:08] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Linter/: SWAT: b376528: Revert "Implement ParserLogLinterData hook" (duration: 01m 01s) [19:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:13] Why are you bypassing CI? [19:20:04] Reedy: because it was a revert [19:20:07] And? [19:21:37] onimisionipe: o/ [19:21:38] you around? 
[19:21:39] (03PS4) 10Dzahn: airflow: add a local mariadb server [puppet] - 10https://gerrit.wikimedia.org/r/554215 (https://phabricator.wikimedia.org/T236180) [19:21:48] (03PS1) 10Cwhite: admin: use dannyh uid as it matches ldap [puppet] - 10https://gerrit.wikimedia.org/r/554929 (https://phabricator.wikimedia.org/T239881) [19:22:13] Reedy: and revert needs to be done ASAP? [19:22:19] No it doesn't [19:22:26] It's not been deployed fully [19:22:45] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) - events-.wikimedia.org/v1/events - -events.wikimedia.org/v1/events - intake-.wikim... [19:22:51] true, but it reached some of the prod servers, didn't it [19:23:04] Where in the docs does it say to do C+2, V+2 and submit? [19:23:20] (03CR) 10Cwhite: [C: 03+2] admin: use dannyh uid as it matches ldap [puppet] - 10https://gerrit.wikimedia.org/r/554929 (https://phabricator.wikimedia.org/T239881) (owner: 10Cwhite) [19:23:50] You can do a local git revert and deploy that [19:23:51] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): cloudelastic1002: SMART/disk error - https://phabricator.wikimedia.org/T230088 (10elukey) @Mathew.onipe @Gehel @aborrero can you sync with @Jclark-ctr when you have a minute to swap the disk? [19:24:00] You don't need to bypass CI in gerrit [19:24:32] PROBLEM - Nginx local proxy to apache on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:24:36] (once this is done, we could use some help figuring out how we can get this impossible error message. presumably swat procedures clear the opcache appropriately?) [19:25:06] It indeed sounds a lot like that [19:25:23] I'm wondering if with multiple dependent files... There were some hits before they were all synced? [19:25:23] They're meant to. 
[19:25:24] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10Ottomata) - logging-sink & analytics-sink (or sink-*) ? [19:25:36] 10Operations, 10ops-eqiad, 10Analytics: Degraded RAID on an-worker1089 - https://phabricator.wikimedia.org/T239365 (10elukey) 05Open→03Resolved Disk swapped, raid status ok. Thanks! [19:25:37] Ah, that'd potentially break things yes. [19:25:47] Sorry, didn't look at the patch before encouraging a SWAT of it. [19:25:55] I mean, it should've been fine.. [19:26:00] PROBLEM - Apache HTTP on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:26:07] If the new function was deployed, and then the usages [19:26:47] well, parsoid/js uses the mediawiki api to do lint logging [19:26:58] Well, the uses are split between the hooks definition in extension.json and the uses in ApiRecordLint.php. [19:27:12] so this refactored the actual doing-work bits out into a separate method, and named it a hook, so that parsoid/php could call the doing-work bits directly [19:27:22] It's at the margin of deployable code. [19:27:32] Deploy includes/Hooks.php first [19:27:37] Then any of the other two files after [19:27:39] Would've sufficed [19:27:41] but parsoid/js would still be invoking this indirectly via the mediawiki api. [19:27:55] But if you deploy the use of the hook before declaring it, it'll break, right? [19:27:58] RECOVERY - Nginx local proxy to apache on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.508 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:28:09] heh, just started the sec revert (for .8) [19:28:15] should I retry with syncing files individually? [19:28:17] (03CR) 10Krinkle: "Awesome I'll cherry-pick and beta test this later today or tomorrow." 
[puppet] - 10https://gerrit.wikimedia.org/r/554599 (owner: 10Krinkle) [19:28:22] So includes/Hooks.php then extension.json then includes/ApiRecordLint.php? [19:28:25] Can't guarantee what order sync-dir will sync them [19:28:35] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Linter: SWAT: e0a2059: Revert "Implement ParserLogLinterData hook" (duration: 01m 01s) [19:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:40] James_F: That's what I'd think indeed [19:28:43] Oh, wow, did you use sync-dir? [19:28:57] Yeah, sync-dir is an exceptionally bad idea to use for live code paths like this. [19:28:57] [19:14:11] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Linter: SWAT: 839c383: Implement ParserLogLinterData hook (T238456) (duration: 01m 02s) [19:28:58] T238456: Missing implementation to post Parsoid/PHP lints to production database - https://phabricator.wikimedia.org/T238456 [19:29:00] PROBLEM - PHP7 rendering on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:29:04] calling ApiRecordLint won't invoke the hook, it calls Hooks::onParserWhatever directly [19:29:13] so the order of extensions.json should be irrelevant [19:29:20] cscott: "Hooks::onParserLogLinterData()". [19:29:27] nothing will actually invoke it "as a hook" until the parsoid/php code-to-be-deployed-later [19:29:34] It's nothing to do with it being a hook [19:29:37] Oh, ignore me, you're not registering the hook, just using it. [19:29:43] It's literally the order the files were pushed to the server [19:29:52] James_F: yes [19:29:54] Six parallel unrelated conversations at once will do that to you. 
[19:29:57] Reedy: what james said [19:30:06] agreed that Hooks.php needed to be deployed first [19:30:19] but after that the order of the other two files shouldn't matter [19:30:23] Indeed [19:30:33] Urbanecm: Please be careful with these kinds of changes. [19:30:34] [19:27:32] Deploy includes/Hooks.php first [19:30:34] [19:27:37] Then any of the other two files after [19:30:34] ;) [19:30:46] RECOVERY - Confd template for /var/lib/gdnsd/discovery-kibana.state on authdns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [19:30:49] James_F: noted :) [19:30:52] RECOVERY - Confd template for /var/lib/gdnsd/discovery-kibana.state on authdns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [19:30:56] i don't have a problem with James' order either, it's also safe. [19:30:57] (Or feel free to make me do the work rather than rely on you being psychic. ;-)) [19:31:12] sorry, i had forgotten this detail about swats, i would have flagged it if I'd remembered [19:31:21] so...going to retry then [19:31:46] (03CR) 10Dzahn: "done! thanks for the reviews, John. But maybe this is not needed anymore now. we'll see :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554215 (https://phabricator.wikimedia.org/T236180) (owner: 10Dzahn) [19:32:56] RECOVERY - Apache HTTP on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:32:57] yeah, i just re-read everything to be sure. but doing things in that order seems like it should work. [19:33:29] and the good? 
news is that this only affected api requests to log lints, so we may have lost a few lint errors, but no user-visible outage [19:34:32] !log ns1.wikimedia.org: re-route authdns traffic from authdns2001 (to be reimaged) -> dns2001 temporarily - T239667 [19:34:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:38] T239667: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 [19:35:58] RECOVERY - PHP7 rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 75814 bytes in 0.354 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:35:59] !log Icinga: delete all downtimes for mw2259. Scheduling Icinga downtimes is tricky business. If you add some for hardware failure and they are too short you cause Icinga spam, if they are too long and the dcops operator is amazingly fast like Papaul then your server is back in production but not monitored and you have to click a million times in the web UI to remove them to avoid that. [19:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:05] T239758 [19:36:05] T239758: mw2259 down and mgmt does not exist? - https://phabricator.wikimedia.org/T239758 [19:36:23] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['authdns2001.wikimedia.org'] ` The log can be found in `/var/log/wmf-... 
[19:36:52] syncing includes/Hooks.php for .5 now [19:37:42] PROBLEM - Apache HTTP on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:37:43] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Linter/includes/Hooks.php: SWAT: 7b7f326: Implement ParserLogLinterData hook (1/3, T238456) (duration: 01m 09s) [19:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:50] T238456: Missing implementation to post Parsoid/PHP lints to production database - https://phabricator.wikimedia.org/T238456 [19:38:04] PROBLEM - PHP7 rendering on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:38:13] now extension.json [19:38:41] ^ let me check mw1239 please [19:38:49] we have a history [19:39:12] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Linter/extension.json: SWAT: 7b7f326: Implement ParserLogLinterData hook (2/3, T238456) (duration: 01m 05s) [19:39:15] effie: should I wait?
[19:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:20] RECOVERY - Apache HTTP on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:39:55] Urbanecm: oh no, it is not a blocker for you [19:40:02] ok [19:40:09] sorry for the confusion [19:40:36] no problem [19:40:38] She has a personal vendetta against mw1239 [19:40:40] PROBLEM - PHP7 rendering on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:40:52] syncing includes/ApiRecordLint.php now [19:41:03] Reedy: yes, it is the one with 100% APCu fragmentation [19:41:13] (03PS4) 10Dzahn: phabricator: remove wstunnel for aphlict from webserver config [puppet] - 10https://gerrit.wikimedia.org/r/554219 (https://phabricator.wikimedia.org/T238593) [19:41:16] and I want to see how far this will go :p [19:41:38] RECOVERY - PHP7 rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 75811 bytes in 9.056 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:41:58] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Linter/includes/ApiRecordLint.php: SWAT: 7b7f326: Implement ParserLogLinterData hook (3/3, T238456) (duration: 01m 04s) [19:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:22] and same for .8 now [19:42:26] RECOVERY - PHP7 rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 75811 bytes in 9.343 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:43:20] PROBLEM - Nginx local proxy to apache on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:44:16] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Danny Horn 
- https://phabricator.wikimedia.org/T239881 (10colewhite) @DannyH I've moved ahead and added you to the wmf ldap group on the basis of your status as staff. We still need to know what you need this access f... [19:44:32] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Linter/includes/Hooks.php: SWAT: afcfdce: Revert "Revert "Implement ParserLogLinterData hook"" (1/3, T238456) (duration: 01m 11s) [19:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:38] T238456: Missing implementation to post Parsoid/PHP lints to production database - https://phabricator.wikimedia.org/T238456 [19:45:15] 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (10colewhite) @Rxy I've added you to the NDA group which should grant you access to Logstash. Please let me know if you encounter any related issue. [19:45:28] 10Operations, 10SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (10colewhite) 05Open→03Resolved [19:46:10] PROBLEM - PHP7 rendering on mw1239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:46:22] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Linter/includes/ApiRecordLint.php: SWAT: afcfdce: Revert "Revert "Implement ParserLogLinterData hook"" (2/3, T238456) (duration: 01m 09s) [19:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:39] and extension.json [19:46:47] so things seem to be looking okay with this deploy order [19:47:02] yup cscott [19:47:36] !log urbanecm@deploy1001 Synchronized php-1.35.0-wmf.8/extensions/Linter/extension.json: SWAT: afcfdce: Revert "Revert "Implement ParserLogLinterData hook"" (3/3, T238456) (duration: 01m 00s) [19:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:45] cscott: should 
be all [19:47:58] ok! [19:47:59] PROBLEM - Apache HTTP on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:48:28] (03CR) 10Jbond: "LGTM one question and havn't checked the measurment ids" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/554890 (owner: 10CDanis) [19:48:56] Urbanecm: arlolra and i will stick around in case anything crops up later, but lgtm. [19:49:08] ack, thank you [19:49:19] RECOVERY - PHP7 rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 75811 bytes in 2.498 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [19:49:59] RECOVERY - Apache HTTP on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 3.660 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:50:27] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.301e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:50:30] 10Operations, 10Analytics-Kanban, 10Better Use Of Data, 10Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (10mpopov) >>! In T236386#5716552, @Ottomata wrote: > > - events-.wikimedia.org/v1/events > - -events.wiki... [19:51:31] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 4 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:52:54] (03PS1) 10Dzahn: Revert "phabricator: temp. disable automatic rsync of repo data" [puppet] - 10https://gerrit.wikimedia.org/r/554935 [19:53:12] (03Abandoned) 10Dzahn: Revert "phabricator: temp. 
disable automatic rsync of repo data" [puppet] - 10https://gerrit.wikimedia.org/r/554935 (owner: 10Dzahn) [19:53:54] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [19:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:55] PROBLEM - Nginx local proxy to apache on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:56:03] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:04] RECOVERY - Nginx local proxy to apache on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 4.639 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:59:14] 10Operations, 10Parsoid-PHP, 10serviceops, 10Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10Dzahn) This was assigned to me to (make it possible to) raise the memory limit. That has happened now. Of the 2 patches linked one has been abandoned beca... [19:59:26] 10Operations, 10Parsoid-PHP, 10serviceops, 10Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10Dzahn) a:05Dzahn→03None [20:00:04] brennen and twentyafterfour: My dear minions, it's time we take the moon! Just kidding. Time for Mediawiki train - American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191205T2000). 
[20:01:36] a note for anyone following along at home that the train is still blocked by T239877 [20:01:39] T239877: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" - https://phabricator.wikimedia.org/T239877 [20:01:49] thanks for the heads up [20:02:06] PROBLEM - PHP7 rendering on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:02:16] Operations, ops-eqiad, Analytics: Degraded RAID on an-worker1089 - https://phabricator.wikimedia.org/T239365 (Jclark-ctr) Replaced failed drive [20:02:30] looks like we could use the train window then for more phab stuff [20:03:06] RECOVERY - PHP7 rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 75814 bytes in 0.339 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [20:03:55] Operations, Traffic, netops, Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['authdns2001.wikimedia.org'] ` and were **ALL** successful. [20:05:55] Operations, Parsoid-PHP, serviceops, Wikimedia-production-error: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (Dzahn) If you want to change the memory_limit for just Parsoid servers again that is now: ` 18606 'wmgMemoryLimitParsoid' => [ 18607 'default' =>... [20:10:36] !log ns1.wikimedia.org: restoring normal routing to the newly-reimaged authdns2001 [20:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:59] mutante: that's ok by me. i don't _think_ we're looking at a complete resolution on the train blocker in the next little while.
[20:11:04] RECOVERY - Nginx local proxy to apache on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Application_servers [20:11:22] brennen: ok! thanks [20:11:33] !log ban cloudelastic1002 from shard allocation - T230088 [20:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:38] T230088: cloudelastic1002: SMART/disk error - https://phabricator.wikimedia.org/T230088 [20:12:37] !log phab1001 - rebooting to hopefully clear "microcode vuln" icinga alert [20:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:12] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on phab1001 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [20:15:41] :) [20:21:10] (PS2) Hashar: Backports for Buster [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/553780 (https://phabricator.wikimedia.org/T239482) [20:21:17] (PS2) Dzahn: Revert "Revert "switch discovery record for phabricator to 1001 for ATS"" [dns] - https://gerrit.wikimedia.org/r/554661 [20:21:25] (PS2) Dzahn: Revert "Revert "varnish: switch phabricator backend to phab1001"" [puppet] - https://gerrit.wikimedia.org/r/554659 [20:21:30] (PS2) Dzahn: Revert "phabricator: switch prod server to phab1003, enables dumps and ferm holes" [puppet] - https://gerrit.wikimedia.org/r/554660 [20:23:36] (CR) CDanis: Install prometheus-atlas-exporter on netmon hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/554890 (owner: CDanis) [20:25:19] Operations, Analytics, Code-Stewardship-Reviews, Tools, Wikimedia-IRC-RC-Server: IRC RecentChanges feed: code stewardship request - https://phabricator.wikimedia.org/T185319 (Dzahn) Is this really replacing the IRCd from T134271 ?
[20:26:01] Operations, Wikimedia-IRC-RC-Server, Patch-For-Review: Replace ircd-ratbox with something newer/maintained - https://phabricator.wikimedia.org/T134271 (Dzahn) Following the "cookie licking on Phab" discussion i'm unassigning this from me because i am not going to work on it soon and T185319 says that... [20:26:13] Operations, Wikimedia-IRC-RC-Server, Patch-For-Review: Replace ircd-ratbox with something newer/maintained - https://phabricator.wikimedia.org/T134271 (Dzahn) a:Dzahn→None [20:29:13] (CR) Dzahn: [C: +2] Revert "Revert "switch discovery record for phabricator to 1001 for ATS"" [dns] - https://gerrit.wikimedia.org/r/554661 (owner: Dzahn) [20:29:15] !log migrating back to phab1001, minimal downtime expected [20:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:56] Operations, ops-eqiad: (No Need By Date Provided) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (Jclark-ctr) [20:29:59] (CR) Eevans: Update session serialization (Kask) to PHP w/ HMAC (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) (owner: Eevans) [20:30:22] (CR) Dzahn: [C: +2] Revert "Revert "varnish: switch phabricator backend to phab1001"" [puppet] - https://gerrit.wikimedia.org/r/554659 (owner: Dzahn) [20:30:27] (CR) Dzahn: [C: +2] Revert "phabricator: switch prod server to phab1003, enables dumps and ferm holes" [puppet] - https://gerrit.wikimedia.org/r/554660 (owner: Dzahn) [20:30:41] (CR) Hashar: "grmblbl gotta drop that change entirely. It was not ready for merging."
[debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/553780 (https://phabricator.wikimedia.org/T239482) (owner: Hashar) [20:30:50] Operations, Analytics-Kanban, Better Use Of Data, Event-Platform, and 8 others: Set up eventgate-logging-external in production - https://phabricator.wikimedia.org/T236386 (Ottomata) > Is /v1/events plural with the intention that eventually EventGate will support batches of events in the same req... [20:30:56] (PS25) Jforrester: Variant configuration: Pre-calculate config for each wiki on demand [mediawiki-config] - https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) [20:31:12] (CR) Dzahn: [C: +2] Revert "Revert "dumps/phabricator: switch dumps host from phab1003 to phab1001"" [puppet] - https://gerrit.wikimedia.org/r/554657 (owner: Dzahn) [20:31:22] (PS3) Eevans: Update session serialization (Kask) to PHP w/ HMAC [mediawiki-config] - https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) [20:31:27] (CR) Dzahn: [C: +2] Revert "Revert "phabricator: switch mail destination to phab1001"" [puppet] - https://gerrit.wikimedia.org/r/554658 (owner: Dzahn) [20:32:27] running puppet on mx1001 [20:33:05] (CR) Jbond: [C: +1] "lgtm" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/554890 (owner: CDanis) [20:33:16] running puppet on dumps [20:34:19] (PS1) Hashar: Backports for Buster [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/554942 (https://phabricator.wikimedia.org/T239482) [20:34:34] (CR) Hashar: "Send it again as https://gerrit.wikimedia.org/r/#/c/operations/debs/doxygen/+/554942" [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/553780 (https://phabricator.wikimedia.org/T239482) (owner: Hashar) [20:34:46] (CR) CDanis: [C: +2] Install prometheus-atlas-exporter on netmon hosts [puppet] - https://gerrit.wikimedia.org/r/554890 (owner:
CDanis) [20:36:29] !log volker-e@deploy1001 Started deploy [design/style-guide@437023f]: Deploy design/style-guide: [20:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:37] !log volker-e@deploy1001 Finished deploy [design/style-guide@437023f]: Deploy design/style-guide: (duration: 00m 08s) [20:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:44] Prod clear for me to do a no-op config deploy? [20:41:24] !log running puppet on all cp* for phab change [20:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:47] James_F: I think so, we are switching phab servers right now but it shouldn't interfere with config deploy [20:41:53] Cool. [20:42:04] (CR) Jforrester: [C: +2] Variant configuration: Pre-calculate config for each wiki on demand [mediawiki-config] - https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester) [20:42:58] (Merged) jenkins-bot: Variant configuration: Pre-calculate config for each wiki on demand [mediawiki-config] - https://gerrit.wikimedia.org/r/507729 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester) [20:43:18] (PS1) Mholloway: Update wikifeeds to 2019-12-05-202925-production [deployment-charts] - https://gerrit.wikimedia.org/r/554943 (https://phabricator.wikimedia.org/T238942) [20:43:36] (Done.) [20:43:49] !log ns0.wikimedia.org: re-routing auth traffic from authdns1001 (reimaging) to dns1001 [20:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:51] Operations, Traffic, netops, Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['authdns1001.wikimedia.org'] ` The log can be found in `/var/log/wmf-...
[20:45:00] (CR) Mholloway: [C: +2] Update wikifeeds to 2019-12-05-202925-production [deployment-charts] - https://gerrit.wikimedia.org/r/554943 (https://phabricator.wikimedia.org/T238942) (owner: Mholloway) [20:45:16] (Merged) jenkins-bot: Update wikifeeds to 2019-12-05-202925-production [deployment-charts] - https://gerrit.wikimedia.org/r/554943 (https://phabricator.wikimedia.org/T238942) (owner: Mholloway) [20:46:34] !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'wikifeeds' for release 'staging' . [20:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:12] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100% [20:47:51] ns2 is actually fine, from the outside world, AFAICS [20:47:59] !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [20:48:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:08] probably some kind of monitoring issue, or internal-routing issue affecting icinga's pov, debugging a bit.. [20:48:55] !log successfully migrated to phab1001 with no apparent user impact! [20:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:57] !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'wikifeeds' for release 'production' . [20:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:23] (PS5) Dzahn: phabricator: remove wstunnel for aphlict from webserver config [puppet] - https://gerrit.wikimedia.org/r/554219 (https://phabricator.wikimedia.org/T238593) [20:51:16] twentyafterfour https://phabricator.wikimedia.org/source/operations-puppet/ seems that is failing.
[20:51:40] (CR) 20after4: [C: +1] "go for it" [puppet] - https://gerrit.wikimedia.org/r/554219 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [20:52:04] paladox: :( we made double sure /srv/repos was synced to be identical [20:52:19] :( [20:52:57] (PS1) CDanis: atlasexporter: filter out logspam [puppet] - https://gerrit.wikimedia.org/r/554945 [20:53:21] (CR) CDanis: [C: +2] atlasexporter: filter out logspam [puppet] - https://gerrit.wikimedia.org/r/554945 (owner: CDanis) [20:54:56] (CR) Reedy: [V: +2 C: +2] Setup default permissions [software/netbox-extras] (refs/meta/config) - https://gerrit.wikimedia.org/r/554598 (owner: Volans) [20:55:25] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/19826/" [puppet] - https://gerrit.wikimedia.org/r/554219 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [20:56:47] !log cr[12]-eqiad: delete leftover static route of ns2->authdns1001 from esams work, which was blinding icinga to the real ns2 :P [20:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:23] paladox: phd should fix it after a few minutes ! [20:57:32] thanks!
[20:57:53] paladox: fixed by Mukunda with just git fetch [20:57:58] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 83.53 ms [20:58:04] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [20:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:20] paladox: we want to replace rsync completely and let phd do the data syncing [20:58:29] ok :) [20:58:31] it's kind of just to speed things up [21:00:11] !log bblack@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:25] !log phab1001 - reload apache2, removed /ws/ rewrite for wstunnel for aphlict [21:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:00] bblack: imagine a world where we had proper configuration management for our routers 🙃 [21:01:43] for the price of a few thousand cups of coffee, we could have a second network engineer to help with that! [21:03:05] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=phab1001-vcs.eqiad.wmnet [21:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:59] Operations, Traffic, netops, Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['authdns1001.wikimedia.org'] ` and were **ALL** successful.
[21:07:25] (CR) BPirkle: [C: +1] Update session serialization (Kask) to PHP w/ HMAC [mediawiki-config] - https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) (owner: Eevans) [21:07:33] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:07:39] RECOVERY - PyBal IPVS diff check on lvs1014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:07:44] YES, i was hoping for that [21:07:49] that came from git-ssh [21:07:53] (PS1) BBlack: clean up leftover ganeti3003 config [puppet] - https://gerrit.wikimedia.org/r/554946 (https://phabricator.wikimedia.org/T236479) [21:07:59] (PS15) Krinkle: swift: use implicit /dev/swift prefix for swift devices [puppet] - https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: Filippo Giunchedi) [21:08:20] (CR) Hashar: "I have forked the Doxygen repository from debian.org and ported it to Buster (changing clang, adding some lintian override).
CI builds it" [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/554942 (https://phabricator.wikimedia.org/T239482) (owner: Hashar) [21:08:46] (PS15) Krinkle: Scap: scap_source correct gid [puppet] - https://gerrit.wikimedia.org/r/361796 (owner: Thcipriani) [21:09:39] !log ns0.wikimedia.org: restore routing to authdns1001 [21:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:33] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:10:34] (CR) BBlack: [C: +2] clean up leftover ganeti3003 config [puppet] - https://gerrit.wikimedia.org/r/554946 (https://phabricator.wikimedia.org/T236479) (owner: BBlack) [21:10:36] (CR) Krinkle: "And yet, it is still cherry-picked and adding to the work of rebasing. Please remove from beta puppetmaster or re-open." [puppet] - https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: Hashar) [21:11:55] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:13:20] Operations, Traffic, netops, Patch-For-Review: Convert DNS servers to Buster - https://phabricator.wikimedia.org/T239667 (BBlack) Open→Resolved ` (12) authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org...
[21:13:22] Operations, Traffic, netops, Patch-For-Review, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (BBlack) [21:13:59] (PS2) BBlack: Revert "Switch to digicert-2019a in esams, temporarily" [puppet] - https://gerrit.wikimedia.org/r/554535 [21:14:09] (CR) BBlack: [V: +2 C: +2] Revert "Switch to digicert-2019a in esams, temporarily" [puppet] - https://gerrit.wikimedia.org/r/554535 (owner: BBlack) [21:19:18] !log phab1001 - systemctl restart ssh-phab (to make it listen on IPv6, race between puppet adding the IP and starting the service) [21:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:48] RECOVERY - PyBal backends health check on lvs1014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:23:00] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:23:38] (PS1) CDanis: atlasexporter: disable streaming mode [puppet] - https://gerrit.wikimedia.org/r/554949 [21:23:57] (CR) CDanis: [C: +2] atlasexporter: disable streaming mode [puppet] - https://gerrit.wikimedia.org/r/554949 (owner: CDanis) [21:25:51] (PS1) CDanis: atlasexporter: fix typo [puppet] - https://gerrit.wikimedia.org/r/554950 [21:26:33] I have that problem again that i get Icinga alerts that compilation of templates in /srv/config-master/pybal is broken for something. but i don't see an issue in them. [21:26:36] (CR) CDanis: [C: +2] atlasexporter: fix typo [puppet] - https://gerrit.wikimedia.org/r/554950 (owner: CDanis) [21:26:47] mutante: confd templates? [21:27:21] cdanis: yes. and it even has 2 alerts on puppetmaster1001 but only 1 alert on puppetmaster2001 ..
but when i look manually the files are the same [21:27:30] mutante: bblack hit exactly this earlier today :) [21:27:43] and also it hits "codfw" too which i never touched [21:27:56] i only changed eqiad things for that service [21:27:57] long discussion in #-sre about it [21:28:00] ohhh. heh [21:30:14] thanks cdanis.. still looking what he did to fix it [21:30:59] (CR) Krinkle: "Removed from cherry picks for now. It can be added again for testing but was getting stale and generally trying to keep the list of cherry" [puppet] - https://gerrit.wikimedia.org/r/511751 (owner: Ori.livneh) [21:32:13] (CR) Dmaza: [C: +1] Enable CheckUser's Special:Investigate page on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) (owner: Tchanders) [21:33:00] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [21:33:00] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [21:33:26] !log puppetmaster1001: deleting /var/run/confd-template/.git-ssh*.err to fix confd template compilation alerts [21:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:54] (CR) Krinkle: "The last comment implies I think that this doesn't work as intended for Beta, so should be removed from cherry picks?
Or is it doing somet" [puppet] - https://gerrit.wikimedia.org/r/499025 (https://phabricator.wikimedia.org/T219242) (owner: Alex Monk) [21:34:28] !log puppetmaster2001: deleting /var/run/confd-template/.git-ssh*.err to fix confd template compilation alerts [21:34:30] RECOVERY - Confd template for /srv/config-master/pybal/codfw/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [21:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:40] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5974 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:34:54] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/git-ssh on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd [21:36:03] (PS1) CDanis: prometheus/ops: add atlasexporter [puppet] - https://gerrit.wikimedia.org/r/554954 [21:36:14] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 7 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:39:18] (CR) CDanis: [C: +2] prometheus/ops: add atlasexporter [puppet] - https://gerrit.wikimedia.org/r/554954 (owner: CDanis) [21:43:33] (PS4) Jforrester: Enable CheckUser's Special:Investigate page on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) (owner: Tchanders) [21:49:19] (Abandoned) 20after4: Configuration for phabricator to use swift storage.
[puppet] - https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) (owner: 20after4) [21:50:28] !log phab1003 - remove IPv6 service IP for git-ssh from lo:LVS [21:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:34] (CR) Krinkle: "No longer needed on Beta per IRC chat with Tyler. Removed." [puppet] - https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) (owner: 20after4) [21:52:59] (PS1) Jforrester: Make differ exit with code if there's a diff [mediawiki-config] - https://gerrit.wikimedia.org/r/554956 [21:53:42] (PS1) Dzahn: phabricator: enable vcs listen addresses on phab1001, disable on phab1003 [puppet] - https://gerrit.wikimedia.org/r/554957 (https://phabricator.wikimedia.org/T238956) [21:54:23] Operations, serviceops: VP9-enabled ffmpeg doesn't get installed after reimage of mw job runner/video scaler - https://phabricator.wikimedia.org/T239831 (brion) Open→Resolved [21:55:28] !log stopping phd on phab1003 and starting on phab1001 [21:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:46] (PS2) Jforrester: Make differ exit with code if there's a diff [mediawiki-config] - https://gerrit.wikimedia.org/r/554956 [21:56:19] * Krinkle testing on mwdebug1001 - https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/554637/ [21:57:10] (CR) Dzahn: [C: +2] phabricator: enable vcs listen addresses on phab1001, disable on phab1003 [puppet] - https://gerrit.wikimedia.org/r/554957 (https://phabricator.wikimedia.org/T238956) (owner: Dzahn) [21:57:24] Operations, SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (Rxy) >>! In T239494#5716627, @colewhite wrote: > @Rxy I've added you to the NDA group which should grant you access to Logstash. Please let me know if you encounter any related issue. OK,...
[22:00:06] (CR) Jforrester: [C: +2] Make differ exit with code if there's a diff [mediawiki-config] - https://gerrit.wikimedia.org/r/554956 (owner: Jforrester) [22:00:47] !log krinkle@deploy1001 Synchronized php-1.35.0-wmf.8/includes/libs/rdbms/database/: T233342 (duration: 01m 02s) [22:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:55] T233342: Standardise on Logstash field for exceptions with back traces - https://phabricator.wikimedia.org/T233342 [22:00:56] (Merged) jenkins-bot: Make differ exit with code if there's a diff [mediawiki-config] - https://gerrit.wikimedia.org/r/554956 (owner: Jforrester) [22:01:14] !log phab1001 - restarting ssh-phab to listen on additional LVS IP [22:01:19] (CR) Jforrester: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) (owner: Tchanders) [22:01:21] (CR) Krinkle: [C: +2] profiler: Remove arclamp.php (was for HHVM) [mediawiki-config] - https://gerrit.wikimedia.org/r/554836 (https://phabricator.wikimedia.org/T235142) (owner: Krinkle) [22:01:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:12] (Merged) jenkins-bot: profiler: Remove arclamp.php (was for HHVM) [mediawiki-config] - https://gerrit.wikimedia.org/r/554836 (https://phabricator.wikimedia.org/T235142) (owner: Krinkle) [22:02:50] (PS1) 20after4: Run phd on phab1001 instead of 1003 [puppet] - https://gerrit.wikimedia.org/r/554960 [22:03:26] !log phabricator - git-ssh.wikimedia.org has been fixed and is up again (T238956) [22:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:32] T238956: switch prod Phabricator from phab1003 to phab1001 - https://phabricator.wikimedia.org/T238956 [22:04:00] (PS2) 20after4: Run phd on phab1001 instead of 1003 [puppet] - https://gerrit.wikimedia.org/r/554960 (https://phabricator.wikimedia.org/T238956) [22:04:24]
(CR) Dzahn: [C: +2] Run phd on phab1001 instead of 1003 [puppet] - https://gerrit.wikimedia.org/r/554960 (https://phabricator.wikimedia.org/T238956) (owner: 20after4) [22:04:46] (CR) Jforrester: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/550668 (https://phabricator.wikimedia.org/T239936) (owner: Tchanders) [22:05:28] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 115.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [22:07:12] James_F: fancy :) [22:07:37] Krinkle: That's the plan. :-) [22:07:40] !log krinkle@deploy1001 Synchronized wmf-config/: I64e5ebe5fcd6b - removes arclamp.php (duration: 01m 01s) [22:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:00] Krinkle: Thanks for arclamp death. [22:08:24] long-live arclamp, but so long hhvm/xenon. [22:08:45] Yeah yeah. [22:09:22] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5042 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:17:37] (PS3) Jforrester: Turn off redirect on exact search match for Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/554372 (https://phabricator.wikimedia.org/T235263) [22:18:29] mutante, twentyafterfour: Thank you both for all your work on Phabricator!
[22:18:33] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [22:18:38] (PS1) Herron: install_server: switch ganeti[345]* to raid1 layout [puppet] - https://gerrit.wikimedia.org/r/554961 (https://phabricator.wikimedia.org/T226444) [22:19:42] James_F: :) glad it finally worked. it turned into 3 switches instead of 1 :) [22:20:07] and several totally unrelated issues of course [22:20:34] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): cloudelastic1002: SMART/disk error - https://phabricator.wikimedia.org/T230088 (Jclark-ctr) replaced failed drive [22:21:21] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 1.69e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:22:25] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 15 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:24:22] mutante: But it worked eventually.
:-) [22:24:34] (PS3) Thcipriani: Beta: maintenance: skip mediawiki::state function [puppet] - https://gerrit.wikimedia.org/r/462019 (https://phabricator.wikimedia.org/T125976) [22:24:36] (PS3) Thcipriani: Beta: maintenance: no openldap management [puppet] - https://gerrit.wikimedia.org/r/462020 (https://phabricator.wikimedia.org/T125976) [22:24:59] (PS2) Volans: dns: generate DNS snippets from Netbox [software/netbox-extras] - https://gerrit.wikimedia.org/r/554543 (https://phabricator.wikimedia.org/T233183) [22:26:36] James_F: yes, and the fail-over is now simpler than before as a consequence [22:26:52] fewer steps and simplified puppet/hiera change [22:28:05] PROBLEM - Old JVM GC check - cloudelastic1004-cloudelastic-chi-eqiad on cloudelastic1004 is CRITICAL: 100.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [22:28:12] !log herron@cumin1001 START - Cookbook sre.hosts.downtime [22:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:32] ^JVM check is me [22:28:51] (PS1) CRusnov: netbox: move to netbox-extras repository [puppet] - https://gerrit.wikimedia.org/r/554962 [22:30:22] !log herron@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [22:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:39] (CR) jerkins-bot: [V: -1] netbox: move to netbox-extras repository [puppet] - https://gerrit.wikimedia.org/r/554962 (owner: CRusnov) [22:31:01] (CR) CRusnov: "This change is ready for review."
[puppet] - https://gerrit.wikimedia.org/r/554962 (owner: CRusnov) [22:32:35] Operations, ops-eqiad, DC-Ops, cloud-services-team (Kanban): cloudelastic1002: SMART/disk error - https://phabricator.wikimedia.org/T230088 (Mathew.onipe) ` onimisionipe@cloudelastic1002:~$ sudo smartctl -H /dev/sdb smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-9-amd64] (local build) Copyrigh... [22:35:45] ACKNOWLEDGEMENT - MD RAID on cloudelastic1002 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T239957 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:35:47] Operations, ops-eqiad: Degraded RAID on cloudelastic1002 - https://phabricator.wikimedia.org/T239957 (ops-monitoring-bot) [22:44:35] hey James_F, Urbanecm, Reedy: can we document the issues with sync-file vs sync-dir on the SWAT page better? [22:44:45] they are *sort* of mentioned on https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_4:_synchronize_the_changes_to_the_cluster [22:44:48] (CR) Anomie: Update session serialization (Kask) to PHP w/ HMAC (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/554910 (https://phabricator.wikimedia.org/T222099) (owner: Eevans) [22:44:54] fwiw, sync-file and sync-dir are actually the same [22:45:05] you can sync-dir a file and sync-file a dir ;D [22:45:23] Reedy: ssh, you'll give the game away. [22:45:30] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.07083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [22:45:31] yeah, i remember there being better documentation of all the differences back when i joined, before scap swallowed everything i think?
[22:45:35] scap sync-stuff [22:45:41] scap sync-party. [22:45:47] scap sync-something [22:46:17] There's probably better ways to document it though [22:46:36] a sentence in https://wikitech.wikimedia.org/wiki/SWAT_deploys#Doing_the_deploy about maybe preferring file-based syncs, and thinking about the order, may be worthwhile. [22:47:24] (i was just doing an informal postmortem w/ subbu and realized that this stuff wasn't actually documented anywhere but in my head, and probably your heads as well) [22:47:33] I feel this is possibly documented elsewhere [22:48:05] But maybe not obvious... That and we have multiple docs describing the same thing [22:49:20] cscott: There are so many gotchas in deploying. [22:49:34] the closest documentation i could find was https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_4:_synchronize_the_changes_to_the_cluster but it wasn't as explicit as it could be [22:49:46] E.g. sync-file can't sync the deletion of a file/dir, you have to sync the outer directory. [22:50:02] Which in some cases is wmf-config, i.e. entirely unsyncable without a full scap. [22:55:09] The docs used to be at https://wikitech.wikimedia.org/wiki/Wikimedia_binaries [22:55:12] and still are for non-scap [22:55:35] https://wikitech.wikimedia.org/w/index.php?title=Wikimedia_binaries&oldid=111724#sync-dir [23:03:10] (PS1) Jforrester: Variant configuration: Replace symfony/yaml with spyc [mediawiki-config] - https://gerrit.wikimedia.org/r/554967 [23:03:53] (CR) jerkins-bot: [V: -1] Variant configuration: Replace symfony/yaml with spyc [mediawiki-config] - https://gerrit.wikimedia.org/r/554967 (owner: Jforrester) [23:04:08] RECOVERY - Device not healthy -SMART- on cloudelastic1002 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=cloudelastic1002&var-datasource=eqiad+prometheus/ops [23:04:10] !log [cloudelastic-chi] reduce indices.recovery.max_bytes_per_sec from 512mb->128mb [23:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:40] Operations, ops-eqiad: Degraded RAID on cloudelastic1002 - https://phabricator.wikimedia.org/T239957 (wiki_willy) a:Jclark-ctr Looks like this one is a duplicate of T230088 [23:22:05] (PS1) Bstorm: toolforge-calico: Set up yaml and config to use calicoctl as a pod [puppet] - https://gerrit.wikimedia.org/r/554969 (https://phabricator.wikimedia.org/T239406) [23:32:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:35:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:37:19] (PS1) Zoranzoki21: Upload HD logos for aawiki, aawikibooks, aawiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 [23:41:04] brennen: OK for me to do a final quickconfig push? [23:41:22] James_F: go ahead. train's not going anywhere today. [23:41:29] Fun.
[23:41:38] (CR) Jforrester: [C: +2] Turn off redirect on exact search match for Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/554372 (https://phabricator.wikimedia.org/T235263) (owner: Jforrester) [23:41:59] (PS2) Zoranzoki21: Upload HD logos for aawiki, aawikibooks, aawiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) [23:42:30] (Merged) jenkins-bot: Turn off redirect on exact search match for Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/554372 (https://phabricator.wikimedia.org/T235263) (owner: Jforrester) [23:43:07] (CR) jerkins-bot: [V: -1] Upload HD logos for aawiki, aawikibooks, aawiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21) [23:44:14] Hi, is SWAT currently? [23:44:36] Oh it is for 15 minutes, cool [23:44:40] jouncebot: next [23:44:41] In 0 hour(s) and 15 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191206T0000) [23:44:57] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T235263 Turn off redirect on exact search match for Commons (duration: 01m 00s) [23:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:03] T235263: Make it possible to bypass automatic redirection to exact matches in commons - https://phabricator.wikimedia.org/T235263 [23:45:46] James confused me with T235263 [23:47:25] (PS3) Zoranzoki21: Upload HD logos for aawiki, aawikibooks, aawiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) [23:48:11] (CR) Zoranzoki21: "Hmm, what?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21) [23:48:34] (CR) jerkins-bot: [V: -1] Upload HD logos for aawiki, aawikibooks, aawiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/554970 (https://phabricator.wikimedia.org/T150618) (owner: Zoranzoki21) [23:55:54] jouncebot: next [23:55:54] In 0 hour(s) and 4 minute(s): Evening SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191206T0000) [23:57:09] (PS2) Jforrester: Variant configuration: Replace symfony/yaml with spyc [mediawiki-config] - https://gerrit.wikimedia.org/r/554967 [23:57:53] (CR) jerkins-bot: [V: -1] Variant configuration: Replace symfony/yaml with spyc [mediawiki-config] - https://gerrit.wikimedia.org/r/554967 (owner: Jforrester) [23:58:42] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1 [23:59:37] (PS3) Jforrester: Variant configuration: Replace symfony/yaml with spyc [mediawiki-config] - https://gerrit.wikimedia.org/r/554967