[00:00:00] <wikibugs>	 (Merged) jenkins-bot: Use the RequestTimeout library to set time limits [mediawiki-config] - https://gerrit.wikimedia.org/r/672579 (https://phabricator.wikimedia.org/T269326) (owner: Tim Starling)
[00:04:52] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/PhpAutoPrepend.php: use RequestTimeout library step 1: disable old request timeout system (duration: 00m 58s)
[00:04:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (RobH)
[00:04:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:31] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (RobH) a:RobH→Cmjohnson It appears this is plugged into the non cloud switch (after IRC sync).  I chatted with Chris, who...
[00:06:35] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: use RequestTimeout library step 2: enable new system (duration: 00m 57s)
[00:06:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:39] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (Papaul)
[00:07:57] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config: use RequestTimeout library step 3: clean up (duration: 00m 58s)
[00:08:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:05] <wikibugs>	 (CR) Dzahn: "There's of course nothing wrong with changing the port but for the record, changing which port envoy uses for TLS termination would have b" [puppet] - https://gerrit.wikimedia.org/r/673603 (owner: Legoktm)
[00:14:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:05] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:10] <wikibugs>	 SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (wiki_willy) Hi @klausman - just following up here, to see if we can close out this task?  Thanks, Willy
[00:44:25] <wikibugs>	 (PS1) Dzahn: httpd: add parameters and template to allow custom ports.conf [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989)
[00:45:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] httpd: add parameters and template to allow custom ports.conf [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: Dzahn)
[00:46:37] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:01] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:02] <wikibugs>	 (PS2) Dzahn: httpd: add parameters and template to allow custom ports.conf [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989)
[01:02:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] httpd: add parameters and template to allow custom ports.conf [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: Dzahn)
[01:16:33] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:25:18] <wikibugs>	 (PS1) Razzi: site: remove decommissioned node labsdb1012 [puppet] - https://gerrit.wikimedia.org/r/674182 (https://phabricator.wikimedia.org/T269211)
[01:26:48] <wikibugs>	 (CR) Razzi: [C: +2] site: remove decommissioned node labsdb1012 [puppet] - https://gerrit.wikimedia.org/r/674182 (https://phabricator.wikimedia.org/T269211) (owner: Razzi)
[01:29:05] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:20] <wikibugs>	 (CR) Legoktm: "> Patch Set 7:" [puppet] - https://gerrit.wikimedia.org/r/673603 (owner: Legoktm)
[01:45:01] <wikibugs>	 (CR) Legoktm: [C: +1] "Unfortunate that it's not workable for the general case." [puppet] - https://gerrit.wikimedia.org/r/674165 (https://phabricator.wikimedia.org/T276284) (owner: Bstorm)
[01:47:04] <wikibugs>	 (CR) Legoktm: "> Patch Set 2: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) (owner: Bstorm)
[01:48:59] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:59:11] <wikibugs>	 (PS3) Dzahn: httpd: add parameters and template to allow custom ports.conf [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989)
[02:02:05] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:46] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.36 [core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/674183
[02:12:07] <wikibugs>	 (PS2) Jforrester: Branch commit for wmf/1.36.0-wmf.36 [core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/674183 (https://phabricator.wikimedia.org/T274940) (owner: TrainBranchBot)
[02:18:53] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:51:27] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:04:19] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:21] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:37] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 5.527 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:34:11] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:43:59] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:08:31] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:45] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:41] <wikibugs>	 (PS1) Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream [puppet] - https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051)
[04:17:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs: add new admin script to create a new base image based off of upstream [puppet] - https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051) (owner: Andrew Bogott)
[04:25:45] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_navigationtiming_hourly.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:51] <icinga-wm>	 PROBLEM - Disk space on backup1002 is CRITICAL: DISK CRITICAL - free space: /srv 3018164 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup1002&var-datasource=eqiad+prometheus/ops
[04:56:12] <wikibugs>	 (PS2) Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream [puppet] - https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051)
[05:01:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:49] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set weight 0 to db1136 before failover T274336', diff saved to https://phabricator.wikimedia.org/P14992 and previous config saved to /var/cache/conftool/dbconfig/20210323-051210-marostegui.json
[05:12:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:18] <stashbot>	 T274336: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336
[05:13:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1174 to api T274336', diff saved to https://phabricator.wikimedia.org/P14993 and previous config saved to /var/cache/conftool/dbconfig/20210323-051346-marostegui.json
[05:13:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:59] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:02] <wikibugs>	 (PS4) Marostegui: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336)
[05:33:59] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: Marostegui)
[05:50:51] <icinga-wm>	 PROBLEM - SSH on mw2248.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:00:04] <jouncebot>	 marostegui, kormat, and jynus: (Dis)respected human, time to deploy s7 primary master switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T0600). Please do the needful.
[06:00:10] <marostegui>	 let's go?
[06:00:19] <jynus>	 sure
[06:00:23] <marostegui>	 \o/
[06:00:29] <marostegui>	 kormat: you ready?
[06:00:31] <kormat>	 o/
[06:00:38] <marostegui>	 !log Starting s7 eqiad failover from db1086 to db1136 - T274336
[06:00:40] <kormat>	 marostegui: as i'll ever be
[06:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:49] <stashbot>	 T274336: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336
[06:01:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s7 as read-only for maintenance T274336', diff saved to https://phabricator.wikimedia.org/P14994 and previous config saved to /var/cache/conftool/dbconfig/20210323-060104-marostegui.json
[06:01:06] <marostegui>	 RO set
[06:01:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:45] <marostegui>	 I cannot edit, so proceeding
[06:02:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1136 to s7 master and remove read-only from s7 T274336', diff saved to https://phabricator.wikimedia.org/P14995 and previous config saved to /var/cache/conftool/dbconfig/20210323-060216-marostegui.json
[06:02:20] <marostegui>	 all done
[06:02:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:43] <marostegui>	 I can edit eswiki
[06:03:02] <kormat>	 marostegui: nice work :)
[06:03:29] <marostegui>	 I see recentchanges advancing 
[06:03:33] <marostegui>	 (on eswiki)
[06:03:55] <jynus>	 no centralauth errors?
[06:04:35] <marostegui>	 I am going thru kibana now
[06:04:45] <Majavah>	 at least centralauth changes are being made
[06:04:55] <marostegui>	 Majavah: thanks for checking :)
[06:05:13] <jynus>	 I only see "GrowthExperiments\Maintenance\RefreshLinkRecommendations::execute: no transaction to commit, something got out of sync"
[06:05:41] <jynus>	 on testwiki
[06:05:42] <kormat>	 marostegui: i'll run puppet to fix the RO alerts
[06:05:48] <marostegui>	 kormat: just did :)
[06:06:03] <kormat>	 then i'll just _pretend_ i was useful
[06:07:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 25%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P14996 and previous config saved to /var/cache/conftool/dbconfig/20210323-060701-root.json
[06:07:03] <Majavah>	 kormat: don't say the pretend part publicly, makes it much easier
[06:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:12] <kormat>	 :)
[06:07:47] <jynus>	 that's already reported at T277702
[06:07:48] <stashbot>	 T277702: GrowthExperiments\Maintenance\RefreshLinkRecommendations: no transaction to commit, something got out of sync - https://phabricator.wikimedia.org/T277702
[06:07:54] <wikibugs>	 (CR) Marostegui: [C: +2] wmnet: Update s7-master cname [dns] - https://gerrit.wikimedia.org/r/673196 (https://phabricator.wikimedia.org/T274336) (owner: Marostegui)
[06:09:47] <jynus>	 I see a few "Error connecting to 10.64.0.204"
[06:10:11] <jynus>	 but that's s2
[06:16:37] <wikibugs>	 SRE, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[06:18:42] <wikibugs>	 (CR) 20after4: "This is used for sending email to phabricator. I have no idea how it works really and I never use the email->phab feature to create tasks." [puppet] - https://gerrit.wikimedia.org/r/655743 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:19:29] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:20:10] <marostegui>	 !log Upgrade kernel on db1086
[06:20:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1086', diff saved to https://phabricator.wikimedia.org/P14997 and previous config saved to /var/cache/conftool/dbconfig/20210323-062059-marostegui.json
[06:21:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 10%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P14998 and previous config saved to /var/cache/conftool/dbconfig/20210323-062338-root.json
[06:23:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:16] <wikibugs>	 (PS1) Marostegui: wiki-replicas.sql: Add analytics user [puppet] - https://gerrit.wikimedia.org/r/674194 (https://phabricator.wikimedia.org/T269211)
[06:27:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:28:17] <wikibugs>	 (CR) Marostegui: "This change was applied yesterday to the DB, this is just to keep track of it on puppet" [puppet] - https://gerrit.wikimedia.org/r/674194 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[06:28:54] <icinga-wm>	 RECOVERY - Disk space on backup1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup1002&var-datasource=eqiad+prometheus/ops
[06:29:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146:3314', diff saved to https://phabricator.wikimedia.org/P14999 and previous config saved to /var/cache/conftool/dbconfig/20210323-062942-marostegui.json
[06:29:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:20] <wikibugs>	 SRE, Dumps-Generation, SRE-Access-Requests, Patch-For-Review: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (ArielGlenn) This was approved in yesterday's SRE meeting, though I guess someone on that team p...
[06:38:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 25%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15000 and previous config saved to /var/cache/conftool/dbconfig/20210323-063842-root.json
[06:38:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:09] <wikibugs>	 (PS2) Marostegui: wiki-replicas.sql: Add analytics user [puppet] - https://gerrit.wikimedia.org/r/674194 (https://phabricator.wikimedia.org/T269211)
[06:42:40] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:31] <wikibugs>	 (CR) Elukey: [C: +1] "<3" [puppet] - https://gerrit.wikimedia.org/r/674194 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[06:44:45] <wikibugs>	 (CR) Marostegui: [C: +2] wiki-replicas.sql: Add analytics user [puppet] - https://gerrit.wikimedia.org/r/674194 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[06:47:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:49:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:51:00] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:58] <icinga-wm>	 RECOVERY - SSH on mw2248.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:53:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 50%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15001 and previous config saved to /var/cache/conftool/dbconfig/20210323-065345-root.json
[06:53:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:48] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] Rakefile: fix most rubocop violations [deployment-charts] - https://gerrit.wikimedia.org/r/673992 (owner: Giuseppe Lavagetto)
[06:58:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3317 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15002 and previous config saved to /var/cache/conftool/dbconfig/20210323-065836-marostegui.json
[06:58:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:44] <stashbot>	 T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483
[06:59:07] <wikibugs>	 (Merged) jenkins-bot: Rakefile: fix most rubocop violations [deployment-charts] - https://gerrit.wikimedia.org/r/673992 (owner: Giuseppe Lavagetto)
[06:59:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3318 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15003 and previous config saved to /var/cache/conftool/dbconfig/20210323-065947-marostegui.json
[06:59:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:16] <marostegui>	 !log Upgrade kernel on db1101
[07:02:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 25%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15004 and previous config saved to /var/cache/conftool/dbconfig/20210323-070705-root.json
[07:07:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 25%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P15005 and previous config saved to /var/cache/conftool/dbconfig/20210323-070719-root.json
[07:07:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:04] <wikibugs>	 (PS1) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:08:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:08:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 75%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15006 and previous config saved to /var/cache/conftool/dbconfig/20210323-070849-root.json
[07:08:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:57] <wikibugs>	 (PS2) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:13:55] <wikibugs>	 (CR) jerkins-bot: [V: -1] statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:14:58] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:46] <wikibugs>	 (PS3) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:22:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 50%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15007 and previous config saved to /var/cache/conftool/dbconfig/20210323-072209-root.json
[07:22:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 50%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P15008 and previous config saved to /var/cache/conftool/dbconfig/20210323-072223-root.json
[07:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:30] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 100%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15009 and previous config saved to /var/cache/conftool/dbconfig/20210323-072352-root.json
[07:23:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:39] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -2] "Without getting into how cfssl is used, the patch would disrupt all existing installations of etcd in production." (5 comments) [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[07:29:02] <wikibugs>	 (PS4) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:32:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE
[07:32:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:34] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE
[07:34:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:51] <wikibugs>	 (PS5) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:36:37] <elukey>	 !log create a 50g lvm volume on prometheus[12]00[34] for the k8s-mlserve cluster - T272918
[07:36:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:44] <stashbot>	 T272918: Create ml-serve k8s cluster - https://phabricator.wikimedia.org/T272918
[07:37:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 25%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15010 and previous config saved to /var/cache/conftool/dbconfig/20210323-073702-root.json
[07:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 75%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15011 and previous config saved to /var/cache/conftool/dbconfig/20210323-073713-root.json
[07:37:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 75%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P15012 and previous config saved to /var/cache/conftool/dbconfig/20210323-073726-root.json
[07:37:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:48] <wikibugs>	 (PS1) Marostegui: db1165: Specify the future of this host [puppet] - https://gerrit.wikimedia.org/r/674250 (https://phabricator.wikimedia.org/T258361)
[07:42:13] <wikibugs>	 (PS6) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:42:32] <wikibugs>	 (CR) Marostegui: [C: +2] db1165: Specify the future of this host [puppet] - https://gerrit.wikimedia.org/r/674250 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[07:46:28] <wikibugs>	 (CR) Ladsgroup: "PCC success: https://puppet-compiler.wmflabs.org/compiler1002/28708/stat1007.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:52:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 50%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15013 and previous config saved to /var/cache/conftool/dbconfig/20210323-075206-root.json
[07:52:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 100%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15014 and previous config saved to /var/cache/conftool/dbconfig/20210323-075216-root.json
[07:52:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 100%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P15015 and previous config saved to /var/cache/conftool/dbconfig/20210323-075230-root.json
[07:52:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3317 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15016 and previous config saved to /var/cache/conftool/dbconfig/20210323-075253-marostegui.json
[07:53:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:01] <stashbot>	 T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483
[07:54:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 25%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15017 and previous config saved to /var/cache/conftool/dbconfig/20210323-075445-root.json
[07:54:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:00] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 82 probes of 606 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:00:36] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:02:04] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:03:14] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:56] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:58] <elukey>	 --- SREs: if you have to merge please wait a bit, there is an inconsistency (probably triggered/caused by me) in the puppetmaster1001
[08:05:01] <elukey>	 ---
[08:07:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 75%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15019 and previous config saved to /var/cache/conftool/dbconfig/20210323-080709-root.json
[08:07:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 50%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15020 and previous config saved to /var/cache/conftool/dbconfig/20210323-080949-root.json
[08:09:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:44] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 44 probes of 606 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:14:30] <wikibugs>	 (CR) JMeybohm: [C: +1] downtime: Support services and other special icinga host [puppet] - https://gerrit.wikimedia.org/r/674147 (https://phabricator.wikimedia.org/T277191) (owner: Alexandros Kosiaris)
[08:16:32] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:10] <wikibugs>	 (PS3) JMeybohm: kubernetes eqiad: Populate hiera keys for k8s worker updates [puppet] - https://gerrit.wikimedia.org/r/673949 (https://phabricator.wikimedia.org/T277741)
[08:21:49] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] downtime: Support services and other special icinga host [puppet] - https://gerrit.wikimedia.org/r/674147 (https://phabricator.wikimedia.org/T277191) (owner: Alexandros Kosiaris)
[08:22:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 100%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15021 and previous config saved to /var/cache/conftool/dbconfig/20210323-082213-root.json
[08:22:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:23] <wikibugs>	 (PS1) Elukey: prometheus: add the ml-serve clusters settings [puppet] - https://gerrit.wikimedia.org/r/674258 (https://phabricator.wikimedia.org/T272918)
[08:23:44] <wikibugs>	 (PS3) JMeybohm: kubernetes eqiad: Apply role and hiera values to new masters [puppet] - https://gerrit.wikimedia.org/r/673952 (https://phabricator.wikimedia.org/T277741)
[08:24:15] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 18 hosts with reason: Reinitialize eqiad k8s cluster with new etcd
[08:24:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:22] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 18 hosts with reason: Reinitialize eqiad k8s cluster with new etcd
[08:24:24] <moritzm>	 !log installing mariadb-10.3 updates on buster (just client-side libs/tools, unrelated to the main wmf-mariadb packages)
[08:24:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 75%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15022 and previous config saved to /var/cache/conftool/dbconfig/20210323-082454-root.json
[08:25:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:10] <akosiaris>	 !log beginning the k8s upgrade/reinit process. T277741
[08:25:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:17] <stashbot>	 T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741
[08:28:17] <akosiaris>	 !log downtime all services in T277741 for 24H
[08:28:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:40] <wikibugs>	 (PS12) Ayounsi: Add Capirca support to Homer [software/homer] - https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865)
[08:31:29] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=apertium
[08:31:29] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=api-gateway
[08:31:29] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=blubberoid
[08:31:30] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=citoid
[08:31:30] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=cxserver
[08:31:30] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=echostore
[08:31:31] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics
[08:31:31] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics-external
[08:31:31] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-logging-external
[08:31:32] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-main
[08:31:32] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams
[08:31:32] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams-internal
[08:31:33] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=linkrecommendation
[08:31:33] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mathoid
[08:31:34] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mobileapps
[08:31:34] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=proton
[08:31:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:35] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=push-notifications
[08:31:35] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=recommendation-api
[08:31:36] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=sessionstore
[08:31:36] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=similar-users
[08:31:37] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=termbox
[08:31:37] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=wikifeeds
[08:31:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:56] <akosiaris>	 !log eqiad services in k8s depooled. T277741
[08:31:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] prometheus: add the ml-serve clusters settings [puppet] - https://gerrit.wikimedia.org/r/674258 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[08:32:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add Capirca support to Homer [software/homer] - https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: Ayounsi)
[08:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:46] <stashbot>	 T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741
[08:33:53] <wikibugs>	 (PS13) Ayounsi: Add Capirca support to Homer [software/homer] - https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865)
[08:35:46] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:02] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 114.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37
[08:39:05] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=apertium
[08:39:06] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=api-gateway
[08:39:06] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=blubberoid
[08:39:06] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=citoid
[08:39:07] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=cxserver
[08:39:07] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=echostore
[08:39:07] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics
[08:39:08] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics-external
[08:39:08] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-logging-external
[08:39:09] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-main
[08:39:09] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams
[08:39:09] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams-internal
[08:39:10] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=linkrecommendation
[08:39:10] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mathoid
[08:39:10] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mobileapps
[08:39:11] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=proton
[08:39:11] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=push-notifications
[08:39:12] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=recommendation-api
[08:39:12] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=sessionstore
[08:39:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:13] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=similar-users
[08:39:13] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=termbox
[08:39:14] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=wikifeeds
[08:39:14] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=zotero
[08:39:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15023 and previous config saved to /var/cache/conftool/dbconfig/20210323-083957-root.json
[08:40:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:04] <wikibugs>	 (CR) Hashar: [C: +2] Branch commit for wmf/1.36.0-wmf.36 [core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/674183 (https://phabricator.wikimedia.org/T274940) (owner: TrainBranchBot)
[08:43:22] <akosiaris>	 !log poweroff argon and chlorine T277741
[08:43:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:30] <stashbot>	 T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741
[08:45:39] <wikibugs>	 (CR) JMeybohm: [C: +2] Add kubernetes1017 to BGP peers [homer/public] - https://gerrit.wikimedia.org/r/672709 (https://phabricator.wikimedia.org/T277741) (owner: Alexandros Kosiaris)
[08:46:09] <wikibugs>	 (Merged) jenkins-bot: Add kubernetes1017 to BGP peers [homer/public] - https://gerrit.wikimedia.org/r/672709 (https://phabricator.wikimedia.org/T277741) (owner: Alexandros Kosiaris)
[08:48:09] <wikibugs>	 (PS1) Alexandros Kosiaris: Add kubemaster.svc.eqiad.wmnet.cert [puppet] - https://gerrit.wikimedia.org/r/674261 (https://phabricator.wikimedia.org/T277741)
[08:48:40] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add kubemaster.svc.eqiad.wmnet.cert [puppet] - https://gerrit.wikimedia.org/r/674261 (https://phabricator.wikimedia.org/T277741) (owner: Alexandros Kosiaris)
[08:50:09] <wikibugs>	 (PS2) Alexandros Kosiaris: Add kubemaster.svc.eqiad.wmnet.cert [puppet] - https://gerrit.wikimedia.org/r/674261 (https://phabricator.wikimedia.org/T277741)
[08:51:40] <wikibugs>	 (CR) Gehel: [C: +1] "> Patch Set 1:" [cookbooks] - https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: Ryan Kemper)
[08:52:18] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Add kubemaster.svc.eqiad.wmnet.cert [puppet] - https://gerrit.wikimedia.org/r/674261 (https://phabricator.wikimedia.org/T277741) (owner: Alexandros Kosiaris)
[08:55:12] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([argon.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[08:55:22] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([argon.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[08:58:36] <wikibugs>	 (CR) JMeybohm: [V: +2 C: +2] kubernetes eqiad: Populate hiera keys for k8s worker updates [puppet] - https://gerrit.wikimedia.org/r/673949 (https://phabricator.wikimedia.org/T277741) (owner: JMeybohm)
[08:59:55] <wikibugs>	 (PS3) Filippo Giunchedi: alerts: deploy to Prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977)
[09:01:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] alerts: deploy to Prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977) (owner: Filippo Giunchedi)
[09:02:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:04:34] <wikibugs>	 (PS4) Filippo Giunchedi: alerts: deploy to Prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977)
[09:04:36] <akosiaris>	 !log empty etcd T277741 
[09:04:43] <wikibugs>	 (CR) Filippo Giunchedi: alerts: deploy to Prometheus hosts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977) (owner: Filippo Giunchedi)
[09:04:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:45] <stashbot>	 T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741
[09:05:03] <akosiaris>	 !log reboot kubetcd100[456] for kernel upgrades. T277741 T273278
[09:05:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:36] <wikibugs>	 (CR) Volans: "I'm rebasing it on master so that CI can do its course, please rebase your local copy too if you're modifying it ;)" [cookbooks] - https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: Ryan Kemper)
[09:06:39] <wikibugs>	 (PS2) Volans: elasticsearch: combined plugin upgrade + reboot [cookbooks] - https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: Ryan Kemper)
[09:07:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:07:39] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.36 [core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/674183 (https://phabricator.wikimedia.org/T274940) (owner: TrainBranchBot)
[09:13:17] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[09:13:30] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:52] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_cxserver_cluster_eqiad,swagger_check_eventgate_analytics_cluster_eqiad,swagger_check_sessionstore_eqiad,swagger_check_termbox_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:14:21] <wikibugs>	 (PS4) Alexandros Kosiaris: kubernetes eqiad: Apply role and hiera values to new masters [puppet] - https://gerrit.wikimedia.org/r/673952 (https://phabricator.wikimedia.org/T277741) (owner: JMeybohm)
[09:14:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1147', diff saved to https://phabricator.wikimedia.org/P15024 and previous config saved to /var/cache/conftool/dbconfig/20210323-091432-marostegui.json
[09:14:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:41] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] kubernetes eqiad: Apply role and hiera values to new masters [puppet] - https://gerrit.wikimedia.org/r/673952 (https://phabricator.wikimedia.org/T277741) (owner: JMeybohm)
[09:16:06] <wikibugs>	 SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (MoritzMuehlenhoff)
[09:16:33] <wikibugs>	 SRE: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (MoritzMuehlenhoff)
[09:16:42] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1017.eqiad.wmnet
[09:16:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:07] <wikibugs>	 (PS1) Ema: VCL: test Server-Timing response header [puppet] - https://gerrit.wikimedia.org/r/674265 (https://phabricator.wikimedia.org/T277769)
[09:17:24] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,service=kubemaster,cluster=kubernetes
[09:17:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:31] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=kubemaster,cluster=kubernetes
[09:17:31] <wikibugs>	 (CR) Ayounsi: [C: +1] tests: add tests for the configuration files [homer/public] - https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: Volans)
[09:17:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:06] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dc=eqiad,cluster=kubernetes,name=kubernetes1017.eqiad.wmnet
[09:18:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:50] <wikibugs>	 (CR) Volans: [C: +2] tests: add tests for the configuration files [homer/public] - https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: Volans)
[09:20:00] <wikibugs>	 (CR) Volans: [C: +2] tests: generate documentation from schemas [homer/public] - https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688) (owner: Volans)
[09:20:09] <wikibugs>	 (PS2) Volans: tests: update deprecated pytest option [homer/public] - https://gerrit.wikimedia.org/r/674069
[09:20:21] <wikibugs>	 (Merged) jenkins-bot: tests: add tests for the configuration files [homer/public] - https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: Volans)
[09:20:37] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kubemaster is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[09:20:39] <wikibugs>	 (Merged) jenkins-bot: tests: generate documentation from schemas [homer/public] - https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688) (owner: Volans)
[09:20:59] <wikibugs>	 (CR) Volans: [C: +2] tests: update deprecated pytest option [homer/public] - https://gerrit.wikimedia.org/r/674069 (owner: Volans)
[09:21:31] <wikibugs>	 (CR) Elukey: [C: +2] prometheus: add the ml-serve clusters settings [puppet] - https://gerrit.wikimedia.org/r/674258 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[09:21:34] <wikibugs>	 (Merged) jenkins-bot: tests: update deprecated pytest option [homer/public] - https://gerrit.wikimedia.org/r/674069 (owner: Volans)
[09:24:56] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1001.eqiad.wmnet with reason: REIMAGE
[09:24:56] <wikibugs>	 (CR) Ema: [C: +2] VCL: test Server-Timing response header [puppet] - https://gerrit.wikimedia.org/r/674265 (https://phabricator.wikimedia.org/T277769) (owner: Ema)
[09:25:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:10] <icinga-wm>	 PROBLEM - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.16 and port 4969: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:25:26] <XioNoX>	 s'up?
[09:25:32] <jynus>	 k8s maintenance?
[09:25:34] <volans>	 any WIP?
[09:25:38] <XioNoX>	 part of the upgrade?
[09:25:43] <elukey>	 I think so yes
[09:25:48] <moritzm>	 zotero was mentioned as a special case earlier
[09:25:55] <elukey>	 they are upgrading the eqiad k8s cluster
[09:25:55] <jayme>	 probably us, yes
[09:25:58] <akosiaris>	 ignore please
[09:26:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1086 to clone db1181 T275633', diff saved to https://phabricator.wikimedia.org/P15025 and previous config saved to /var/cache/conftool/dbconfig/20210323-092600-marostegui.json
[09:26:03] <_joe_>	 yes, don't worry
[09:26:06] <akosiaris>	 mistake on our side, we forgot to downtime this one
[09:26:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:08] <stashbot>	 T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633
[09:26:58] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1001.eqiad.wmnet with reason: REIMAGE
[09:27:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:10] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1002.eqiad.wmnet with reason: REIMAGE
[09:27:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:38] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1003.eqiad.wmnet with reason: REIMAGE
[09:28:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:58] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1007.eqiad.wmnet with reason: REIMAGE
[09:29:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:11] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1002.eqiad.wmnet with reason: REIMAGE
[09:29:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:20] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1003.eqiad.wmnet with reason: REIMAGE
[09:29:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:27] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1165 to dbctl [puppet] - https://gerrit.wikimedia.org/r/674267 (https://phabricator.wikimedia.org/T258361)
[09:30:58] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: REIMAGE
[09:31:00] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1004.eqiad.wmnet with reason: REIMAGE
[09:31:03] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1165 to dbctl [puppet] - https://gerrit.wikimedia.org/r/674267 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[09:31:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:17] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1007.eqiad.wmnet with reason: REIMAGE
[09:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1165 to dbctl, depooled - T258361', diff saved to https://phabricator.wikimedia.org/P15027 and previous config saved to /var/cache/conftool/dbconfig/20210323-093246-marostegui.json
[09:32:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:54] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[09:32:56] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: REIMAGE
[09:33:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:17] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: REIMAGE
[09:33:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:56] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1017.eqiad.wmnet
[09:34:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:55] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: REIMAGE
[09:35:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:12] <wikibugs>	 (PS1) Marostegui: db1165: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/674268 (https://phabricator.wikimedia.org/T258361)
[09:35:16] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: REIMAGE
[09:35:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:34] <wikibugs>	 SRE: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (MoritzMuehlenhoff)
[09:35:58] <wikibugs>	 (CR) Marostegui: [C: +2] db1165: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/674268 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[09:36:00] <logmsgbot>	 !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1004.eqiad.wmnet with reason: REIMAGE
[09:36:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:18] <wikibugs>	 SRE: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete
[09:36:48] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: REIMAGE
[09:36:50] <wikibugs>	 (PS1) JMeybohm: contool-data: Add kubernetes1017.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/674269 (https://phabricator.wikimedia.org/T277741)
[09:36:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:58] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1011 is CRITICAL: 3630 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:37:14] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: REIMAGE
[09:37:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:03] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1003 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:07] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2006 is CRITICAL: 3698 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:38:09] <icinga-wm>	 PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-sidecar@k8s-mlserve.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:38:44] <elukey>	 this is me --^
[09:38:48] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: REIMAGE
[09:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:05] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:39:14] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: REIMAGE
[09:39:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:44] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] contool-data: Add kubernetes1017.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/674269 (https://phabricator.wikimedia.org/T277741) (owner: JMeybohm)
[09:40:09] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] admin_ng: Enable eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/673955 (https://phabricator.wikimedia.org/T277741) (owner: JMeybohm)
[09:40:20] <icinga-wm>	 PROBLEM - Host kubernetes1003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:40:36] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1005.eqiad.wmnet
[09:40:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:55] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1015.eqiad.wmnet
[09:40:56] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: REIMAGE
[09:41:00] <icinga-wm>	 RECOVERY - Host kubernetes1003 is UP: PING WARNING - Packet loss = 50%, RTA = 0.30 ms
[09:41:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:04] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1016.eqiad.wmnet
[09:41:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:15] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: REIMAGE
[09:41:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:30] <wikibugs>	 (Merged) jenkins-bot: admin_ng: Enable eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/673955 (https://phabricator.wikimedia.org/T277741) (owner: JMeybohm)
[09:42:24] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2005 is CRITICAL: 3956 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:42:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1165 into s6 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P15028 and previous config saved to /var/cache/conftool/dbconfig/20210323-094257-marostegui.json
[09:43:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:06] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[09:43:17] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: REIMAGE
[09:43:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:26] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1014.eqiad.wmnet with reason: REIMAGE
[09:43:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:08] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs2004 is CRITICAL: 4060 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:44:20] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1003 is CRITICAL: 4072 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:44:23] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1165 pooled with minimal weight for now, once it looks good, I will start the automatic pooling
[09:44:26] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1008 is CRITICAL: 4078 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:44:32] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[09:44:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[09:44:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:24] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1014.eqiad.wmnet with reason: REIMAGE
[09:45:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:35] <wikibugs>	 (CR) Ema: [C: -1] "See inline comments. I've also just added https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/673295 (https://phabricator.wikimedia.org/T277769) (owner: Gilles)
[09:45:37] <wikibugs>	 (CR) Michael Große: [C: +1] "Would be really useful to have 6.14 🙌" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/674134 (https://phabricator.wikimedia.org/T278180) (owner: Addshore)
[09:45:43] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[09:45:46] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[09:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:53] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[09:46:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:02] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[09:46:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:49] <wikibugs>	 (PS1) JMeybohm: contool-data: Add kubernetes2017.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/674270 (https://phabricator.wikimedia.org/T277191)
[09:47:43] <wikibugs>	 (PS2) JMeybohm: conftool-data: Add kubernetes2017.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/674270 (https://phabricator.wikimedia.org/T277191)
[09:48:20] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] conftool-data: Add kubernetes2017.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/674270 (https://phabricator.wikimedia.org/T277191) (owner: JMeybohm)
[09:49:44] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[09:49:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:10] <icinga-wm>	 PROBLEM - Check systemd state on prometheus1004 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-sidecar@k8s-mlserve.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:50:24] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,service=kubesvc,name=kubernetes1017.eqiad.wmnet
[09:50:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:30] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=kubesvc,name=kubernetes1017.eqiad.wmnet
[09:50:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:02] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[09:51:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:10] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/weight=10; selector: dc=codfw,service=kubesvc,name=kubernetes2017.codfw.wmnet
[09:53:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:15] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=kubesvc,name=kubernetes2017.codfw.wmnet
[09:53:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:23] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[09:53:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:42] <hashar>	 !log scap prep 1.36.0-wmf.36 # T274940
[09:53:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:50] <stashbot>	 T274940: 1.36.0-wmf.36 deployment blockers - https://phabricator.wikimedia.org/T274940
[09:53:57] <akosiaris>	 !log deploy helmfile.d/admin_ng for eqiad T277741
[09:53:57] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1016.eqiad.wmnet
[09:54:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:04] <stashbot>	 T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741
[09:54:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:31] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[09:54:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1165 into s6 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P15029 and previous config saved to /var/cache/conftool/dbconfig/20210323-095437-marostegui.json
[09:54:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:46] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[09:54:51] <logmsgbot>	 !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1006.eqiad.wmnet
[09:54:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:11] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1015.eqiad.wmnet
[09:56:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:31] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1006.eqiad.wmnet
[09:57:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:01] <hashar>	 !log Applied security patches for 1.36.0-wmf.36 # T274940
[10:01:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:09] <stashbot>	 T274940: 1.36.0-wmf.36 deployment blockers - https://phabricator.wikimedia.org/T274940
[10:02:34] <hashar>	 !log scap clean --delete 1.36.0-wmf.32  # T274940
[10:02:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:07] <icinga-wm>	 PROBLEM - Check systemd state on prometheus2003 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-sidecar@k8s-mlserve.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:03:25] <icinga-wm>	 PROBLEM - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kubemaster is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[10:07:49] <icinga-wm>	 PROBLEM - Check systemd state on prometheus2004 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-sidecar@k8s-mlserve.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:08:47] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "Thanks for this!" (10 comments) [puppet] - https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051) (owner: Andrew Bogott)
[10:10:10] <wikibugs>	 SRE, Traffic, Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (ayounsi) @volans, @crusnov, what do you think about changing the `status` of those IPs from `active` to `SLAAC`? See for example https://netbox-next.wikimedia.org/ipam/ip-addresses/4540/ Th...
[10:10:23] <logmsgbot>	 !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1005.eqiad.wmnet
[10:10:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:12:11] <wikibugs>	 (PS1) Elukey: prometheus: change port for k8s-mlserve clusters [puppet] - https://gerrit.wikimedia.org/r/674273 (https://phabricator.wikimedia.org/T272918)
[10:15:41] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 65.08 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37
[10:16:45] <logmsgbot>	 !log hashar@deploy1002 Pruned MediaWiki: 1.36.0-wmf.32 (duration: 14m 47s)
[10:16:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 25%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15030 and previous config saved to /var/cache/conftool/dbconfig/20210323-101836-root.json
[10:18:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:13] <logmsgbot>	 !log hashar@deploy1002 Pruned MediaWiki: 1.36.0-wmf.33 (duration: 01m 48s)
[10:19:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[10:19:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[10:19:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[10:19:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[10:19:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[10:19:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[10:19:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[10:19:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[10:19:26] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
[10:19:26] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
[10:19:27] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[10:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:23] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' .
[10:20:23] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'staging' .
[10:20:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 5%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15031 and previous config saved to /var/cache/conftool/dbconfig/20210323-102119-root.json
[10:21:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:28] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[10:21:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' .
[10:21:29] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' .
[10:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:47] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[10:22:09] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[10:22:09] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[10:22:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:20] <logmsgbot>	 !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=kubesvc
[10:22:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:37] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'staging' .
[10:22:37] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' .
[10:22:40] <wikibugs>	 (CR) Elukey: [C: +2] prometheus: change port for k8s-mlserve clusters [puppet] - https://gerrit.wikimedia.org/r/674273 (https://phabricator.wikimedia.org/T272918) (owner: Elukey)
[10:22:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:31] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' .
[10:23:31] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .
[10:23:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:54] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'staging' .
[10:23:54] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' .
[10:24:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:37] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:25:39] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[10:25:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[10:25:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[10:25:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:25] <icinga-wm>	 RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:26:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' .
[10:26:39] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'staging' .
[10:26:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:21] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
[10:27:21] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[10:27:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:08] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[10:28:08] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[10:28:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:59] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[10:28:59] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[10:29:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:46] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[10:29:46] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[10:29:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:29:53] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[10:29:53] <icinga-wm>	 RECOVERY - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[10:29:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:42] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' .
[10:31:42] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' .
[10:31:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:22] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' .
[10:32:22] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' .
[10:32:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:43] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[10:32:43] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[10:32:43] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[10:32:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:29] <wikibugs>	 SRE, Mail: Domains of most projects do not have DMARC policy - https://phabricator.wikimedia.org/T211403 (Beeloser) The domain wikipedia.org does have a DMARC record however it has been applied incorrectly and therefore is not working. If the policy is not to deploy DMARC records then the record for wiki...
[10:33:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 50%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15033 and previous config saved to /var/cache/conftool/dbconfig/20210323-103340-root.json
[10:33:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:06] <wikibugs>	 (03PS7) 10Ayounsi: Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865)
[10:34:19] <icinga-wm>	 RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:35:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi)
[10:36:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15034 and previous config saved to /var/cache/conftool/dbconfig/20210323-103623-root.json
[10:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:31] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_4001: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled: recommendation-api_4632: Servers kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernete
[10:36:31] <icinga-wm>	 , kubernetes1017.eqiad.wmnet are marked down but pooled: push-notifications_4104: Servers kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled: proton_4030: Servers kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled: mobileapps_4102: Servers kubernetes1012.eqia
[10:36:31] <icinga-wm>	 es1014.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled: similar-users_4110: Servers kubernetes1012.eqiad.wmnet https://wikitech.wikimedia.org/wiki/PyBal
[10:36:31] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[10:37:49] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' .
[10:37:49] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
[10:37:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:31] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[10:39:31] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[10:39:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:13] <wikibugs>	 (03PS1) 10DCausse: [wdqs] switch reporting topic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/674278
[10:41:11] <wikibugs>	 (03PS8) 10Ayounsi: Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865)
[10:41:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' .
[10:41:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:42:41] <wikibugs>	 (03CR) 10Ayounsi: "And fixed the output indentation 😊" (035 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi)
[10:42:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' .
[10:42:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'canary' .
[10:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:02] <wikibugs>	 (03CR) 10Gehel: [C: 03+2] [wdqs] switch reporting topic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/674278 (owner: 10DCausse)
[10:43:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:05] <icinga-wm>	 RECOVERY - Check systemd state on prometheus2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:43:17] <wikibugs>	 (03Abandoned) 10Arturo Borrero Gonzalez: openstack: nova: disable /etc/host management from cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/673475 (https://phabricator.wikimedia.org/T277866) (owner: 10Arturo Borrero Gonzalez)
[10:43:32] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' .
[10:43:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:04] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' .
[10:44:04] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'staging' .
[10:44:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:31] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' .
[10:44:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:14] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[10:45:14] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' .
[10:45:14] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'test' .
[10:45:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:43] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[10:45:43] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[10:45:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:15] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' .
[10:46:15] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[10:46:19] <icinga-wm>	 RECOVERY - Check systemd state on prometheus2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:46:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:42] <icinga-wm>	 RECOVERY - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 1.103 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:47:00] <volans>	 welcome back zotero
[10:48:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 75%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15035 and previous config saved to /var/cache/conftool/dbconfig/20210323-104843-root.json
[10:48:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:41] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:50:47] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 28.16 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:51:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 15%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15036 and previous config saved to /var/cache/conftool/dbconfig/20210323-105126-root.json
[10:51:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:35] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[10:53:09] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 57.36 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:53:11] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 17.66 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:53:31] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1003 is CRITICAL: 8224 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:56:14] <jayme>	 !log all services re-deployed to k8s eqiad - T277741
[10:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:22] <stashbot>	 T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741
[10:58:19] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 52.4 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:58:45] <wikibugs>	 (03PS1) 10Elukey: Add config for prometheus@k8s-mlserve [puppet] - 10https://gerrit.wikimedia.org/r/674279 (https://phabricator.wikimedia.org/T272918)
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1100).
[11:00:04] <jouncebot>	 MatmaRex: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:55] <MatmaRex>	 hi
[11:01:05] <wikibugs>	 (03CR) 10Elukey: "Filippo: I can split  the change if you prefer (lvs part first, then grafana)" [puppet] - 10https://gerrit.wikimedia.org/r/674279 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[11:01:07] <moritzm>	 !log installing tomcat8 security updates
[11:01:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:13] <Lucas_WMDE>	 I’m in a meeting, but I can probably take over the window later if no one else is around
[11:01:19] <dcausse>	 jayme: hi, when will you switch changeprop back to eqiad?
[11:01:59] <jayme>	 dcausse: probably some time tomorrow EU morning
[11:02:07] <dcausse>	 ok thanks!
[11:03:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 100%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15037 and previous config saved to /var/cache/conftool/dbconfig/20210323-110347-root.json
[11:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:22] <wikibugs>	 (03PS1) 10DCausse: Revert "[wdqs] switch reporting topic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/674118
[11:05:41] <wikibugs>	 (03CR) 10DCausse: [C: 04-1] "not before changeprop is moved back to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/674118 (owner: 10DCausse)
[11:05:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1148', diff saved to https://phabricator.wikimedia.org/P15038 and previous config saved to /var/cache/conftool/dbconfig/20210323-110553-marostegui.json
[11:05:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 20%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15039 and previous config saved to /var/cache/conftool/dbconfig/20210323-110630-root.json
[11:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:38] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[11:08:08] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Productionize db1181 [puppet] - 10https://gerrit.wikimedia.org/r/674280 (https://phabricator.wikimedia.org/T275633)
[11:08:13] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 49.72 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:08:39] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:08:59] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1181 [puppet] - 10https://gerrit.wikimedia.org/r/674280 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui)
[11:10:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:10:41] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs1011 is OK: (C)3600 ge (W)1200 ge 5.633 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[11:11:50] <wikibugs>	 (03PS1) 10Hnowlan: aqs_next: add more hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119)
[11:14:21] <wikibugs>	 (03CR) 10Elukey: "Hugh: Does this need hiera config too?" [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan)
[11:16:03] <wikibugs>	 (03PS2) 10Hnowlan: aqs_next: add more hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119)
[11:16:56] <wikibugs>	 (03CR) 10Hnowlan: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan)
[11:17:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: deploy to Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi)
[11:17:49] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28710/console" [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan)
[11:19:47] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on db1086 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:4 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T278226 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[11:19:50] <wikibugs>	 10SRE, 10ops-eqiad: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (10ops-monitoring-bot)
[11:21:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (10Marostegui) a:03wiki_willy @wiki_willy this host will be decommissioned in a few weeks, but I would like this disk to be replaced (it is out of warranty) with some used ones if we still have. This host is a st...
[11:21:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (10Marostegui) p:05Triage→03Medium
[11:21:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15040 and previous config saved to /var/cache/conftool/dbconfig/20210323-112133-root.json
[11:21:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:42] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[11:22:16] <wikibugs>	 (03PS1) 10Effie Mouzeli: hiera/nodes: put mc1037 into our session redis and memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/674282 (https://phabricator.wikimedia.org/T278225)
[11:22:25] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] aqs_next: add more hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan)
[11:22:50] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:28] <MatmaRex>	 Lucas_WMDE: please let me know if you're able to do the deployment
[11:23:36] <MatmaRex>	 or if i should reschedule
[11:24:36] <Lucas_WMDE>	 I can probably do it now
[11:24:48] <Lucas_WMDE>	 MatmaRex: I assume you’ll be able to test it on mwdebug?
[11:25:02] <MatmaRex>	 yeah
[11:25:54] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Enable DiscussionTools' beta features on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674098 (https://phabricator.wikimedia.org/T276494) (owner: 10Bartosz Dziewoński)
[11:26:00] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable DiscussionTools' beta features on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674098 (https://phabricator.wikimedia.org/T276494) (owner: 10Bartosz Dziewoński)
[11:27:21] <wikibugs>	 (03Merged) 10jenkins-bot: Enable DiscussionTools' beta features on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674098 (https://phabricator.wikimedia.org/T276494) (owner: 10Bartosz Dziewoński)
[11:27:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, yeah I think that's fine as it is" [puppet] - 10https://gerrit.wikimedia.org/r/674279 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[11:27:44] <Lucas_WMDE>	 MatmaRex: alright, it should be live on mwdebug1001 now
[11:27:49] * Lucas_WMDE also tries it
[11:28:21] <MatmaRex>	 seems to be working
[11:29:00] <Lucas_WMDE>	 yup, syncing
[11:29:40] <wikibugs>	 (03PS3) 10Hnowlan: aqs_next: add more hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119)
[11:30:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:674098|Enable DiscussionTools' beta features on dewiki (T276494)]] (duration: 00m 58s)
[11:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:34] <stashbot>	 T276494: Make the Reply and New Discussion Tools available as Beta Features at de.wiki - https://phabricator.wikimedia.org/T276494
[11:30:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1005.eqiad.
[11:30:39] <icinga-wm>	 down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:30:41] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1006.eqiad.
[11:30:41] <icinga-wm>	 down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[11:30:45] <MatmaRex>	 thanks Lucas_WMDE!
[11:31:17] <Lucas_WMDE>	 np, excited to see this rolling out to more wikis :)
[11:31:41] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] aqs_next: add more hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan)
[11:31:49] <Lucas_WMDE>	 !log EU backport&config window done
[11:31:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:53] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:32:55] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:34:15] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "LGTM, I had a chat with Effie on IRC about the rollout plan (disable puppet on mw nodes first, run puppet on mc1037 since it is still with" [puppet] - 10https://gerrit.wikimedia.org/r/674282 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli)
[11:36:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 35%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15041 and previous config saved to /var/cache/conftool/dbconfig/20210323-113637-root.json
[11:36:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:46] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[11:37:39] <wikibugs>	 (03PS2) 10Effie Mouzeli: hiera/nodes: put mc1037 into our session redis and memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/674282 (https://phabricator.wikimedia.org/T278225)
[11:38:29] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on aqs[1012-1015].eqiad.wmnet with reason: New buster hosts, not in use
[11:38:30] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Remove duplicate role declaration [puppet] - 10https://gerrit.wikimedia.org/r/674287 (https://phabricator.wikimedia.org/T273566)
[11:38:30] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on aqs[1012-1015].eqiad.wmnet with reason: New buster hosts, not in use
[11:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:38:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:03] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:40:23] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] site.pp: Remove duplicate role declaration [puppet] - 10https://gerrit.wikimedia.org/r/674287 (https://phabricator.wikimedia.org/T273566) (owner: 10Marostegui)
[11:40:38] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: Remove duplicate role declaration [puppet] - 10https://gerrit.wikimedia.org/r/674287 (https://phabricator.wikimedia.org/T273566) (owner: 10Marostegui)
[11:42:03] <MatmaRex>	 Lucas_WMDE: oh, you also posted on-wiki about it, thanks :D
[11:42:16] <Lucas_WMDE>	 yeah, I figured I might as well since it was dewiki ^
[11:42:20] <Lucas_WMDE>	 *^^
[11:44:25] <wikibugs>	 (03CR) 10Effie Mouzeli: "pcc https://puppet-compiler.wmflabs.org/compiler1003/28713/mc1034.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/674282 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli)
[11:51:23] <wikibugs>	 (03PS1) 10Effie Mouzeli: hiera/nodes: put mc1038 into our session redis and memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/674290 (https://phabricator.wikimedia.org/T278225)
[11:51:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15042 and previous config saved to /var/cache/conftool/dbconfig/20210323-115141-root.json
[11:51:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:49] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[11:55:02] <moritzm>	 !log installing libcaca security updates
[11:55:02] <wikibugs>	 10SRE, 10Proton, 10Product-Infrastructure-Team-Backlog (Kanban): Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10MSantos) p:05Triage→03High a:03Jgiannelos
[11:55:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:55:17] <wikibugs>	 10SRE, 10Proton, 10Product-Infrastructure-Team-Backlog (Kanban): Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10MSantos)
[11:55:54] <wikibugs>	 (03PS1) 10Hnowlan: aqs_next: add additional hiera config for new nodes [puppet] - 10https://gerrit.wikimedia.org/r/674291 (https://phabricator.wikimedia.org/T274119)
[11:57:47] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28714/console" [puppet] - 10https://gerrit.wikimedia.org/r/674291 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan)
[12:00:05] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1200)
[12:05:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1016.eqiad.
[12:05:34] <icinga-wm>	 down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:06:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 60%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15043 and previous config saved to /var/cache/conftool/dbconfig/20210323-120644-root.json
[12:06:47] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: make cpp installation more general [puppet] - 10https://gerrit.wikimedia.org/r/674292
[12:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:53] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[12:07:08] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:07:21] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: install cpp at the same time as 'gridengine-common' [puppet] - 10https://gerrit.wikimedia.org/r/674292
[12:07:52] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: install cpp at the same time as 'gridengine-common' [puppet] - 10https://gerrit.wikimedia.org/r/674292 (owner: 10Arturo Borrero Gonzalez)
[12:09:14] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:09:23] <moritzm>	 !log drain ganeti2016
[12:09:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:16:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 25%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15044 and previous config saved to /var/cache/conftool/dbconfig/20210323-121642-root.json
[12:16:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:02] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet
[12:17:03] <akosiaris>	 !log remove all scheduled downtimes for k8s cluster. T277741
[12:17:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:17:15] <stashbot>	 T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741
[12:17:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:19:04] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[12:19:40] <wikibugs>	 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[12:20:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 25%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15045 and previous config saved to /var/cache/conftool/dbconfig/20210323-122032-root.json
[12:20:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15046 and previous config saved to /var/cache/conftool/dbconfig/20210323-122148-root.json
[12:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:58] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[12:22:10] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[12:22:12] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikime
[12:22:12] <icinga-wm>	 l
[12:24:59] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet
[12:25:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:26] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:26:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[12:27:53] <moritzm>	 !log drain ganeti2017
[12:27:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:31:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:31:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 50%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15047 and previous config saved to /var/cache/conftool/dbconfig/20210323-123146-root.json
[12:31:52] <wikibugs>	 (03PS1) 10Esanders: Remove redundant wgDiscussionToolsEnable overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674300
[12:31:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:34:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove helmfile.d/admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/673956 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm)
[12:35:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 50%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15048 and previous config saved to /var/cache/conftool/dbconfig/20210323-123535-root.json
[12:35:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:33] <wikibugs>	 (03Merged) 10jenkins-bot: Remove helmfile.d/admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/673956 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm)
[12:36:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 85%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15049 and previous config saved to /var/cache/conftool/dbconfig/20210323-123651-root.json
[12:37:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:03] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[12:40:22] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.8571 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[12:41:22] <wikibugs>	 (03PS1) 10Marostegui: install_server: Do not format db1181 [puppet] - 10https://gerrit.wikimedia.org/r/674301 (https://phabricator.wikimedia.org/T275633)
[12:41:56] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[12:42:20] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet
[12:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:12] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Do not format db1181 [puppet] - 10https://gerrit.wikimedia.org/r/674301 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui)
[12:43:32] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:43:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover tendril to dbmonitor1002 [dns] - 10https://gerrit.wikimedia.org/r/674303 (https://phabricator.wikimedia.org/T224589)
[12:46:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 75%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15050 and previous config saved to /var/cache/conftool/dbconfig/20210323-124650-root.json
[12:46:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:18] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet
[12:50:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:36] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[12:50:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 75%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15051 and previous config saved to /var/cache/conftool/dbconfig/20210323-125039-root.json
[12:50:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15052 and previous config saved to /var/cache/conftool/dbconfig/20210323-125155-root.json
[12:52:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:05] <stashbot>	 T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361
[12:55:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "Let's merge it tomorrow early morning as we agreed on IRC" [dns] - 10https://gerrit.wikimedia.org/r/674303 (https://phabricator.wikimedia.org/T224589) (owner: 10Muehlenhoff)
[12:56:34] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:58:00] <akosiaris>	 !log remove and decommission argon, chlorine, acrab, acrux T277741, T277191
[12:58:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:09] <stashbot>	 T277191: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191
[12:58:11] <stashbot>	 T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741
[12:58:13] <moritzm>	 !log drain ganeti2018
[12:58:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:58:50] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Add config for prometheus@k8s-mlserve [puppet] - 10https://gerrit.wikimedia.org/r/674279 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[13:00:05] <jouncebot>	 hashar and dancy: That opportune time is upon us again. Time for a Mediawiki train - European+American Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1300).
[13:00:22] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Decommission argon, chlorine, acrab, acrux [puppet] - 10https://gerrit.wikimedia.org/r/674307 (https://phabricator.wikimedia.org/T277741)
[13:00:57] <wikibugs>	 (03PS38) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[13:00:58] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:01:14] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1003.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikime
[13:01:14] <icinga-wm>	 l
[13:01:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikime
[13:01:17] <icinga-wm>	 l
[13:01:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 100%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15053 and previous config saved to /var/cache/conftool/dbconfig/20210323-130153-root.json
[13:02:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:32] <icinga-wm>	 PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4006: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:04:50] <icinga-wm>	 RECOVERY - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 7.257 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:05:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 100%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15054 and previous config saved to /var/cache/conftool/dbconfig/20210323-130543-root.json
[13:05:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:58] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09524 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[13:12:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:13:07] <elukey>	 jayme, akosiaris are the linkrecommendation flaps expected?
[13:13:24] <akosiaris>	 I think we got a task for that already
[13:13:32] <akosiaris>	 it's weird though
[13:13:46] <akosiaris>	 I'll double check
[13:14:03] <akosiaris>	 https://phabricator.wikimedia.org/T278223
[13:14:42] <elukey>	 akosiaris: ah ok sorry I thought it was related to the k8s upgrade
[13:14:44] <elukey>	 thanks :)
[13:15:34] <ema>	 !log cp3054: install varnishkafka built explicitly against varnish 6.0.1-1wm2 to fix broken dpkg status T264398
[13:15:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:48] <stashbot>	 T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398
[13:16:14] <ema>	 moritzm: this fixes dpkg on cp3054, I've tried upgrading screen and it worked fine ^
[13:17:32] <hasharLunch>	 going to run train for test wikis
[13:17:37] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet
[13:17:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:57] <moritzm>	 ema: ack, thx!
[13:18:41] <wikibugs>	 (03PS1) 10Hashar: testwikis wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674310
[13:18:43] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674310 (owner: 10Hashar)
[13:19:22] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:19:30] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674310 (owner: 10Hashar)
[13:20:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:21:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm some minor nits" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup)
[13:23:10] <logmsgbot>	 !log hashar@deploy1002 Started scap: Promote testwikis from 1.36.0-wmf.35 to 1.36.0-wmf.36 - T274940
[13:23:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:18] <stashbot>	 T274940: 1.36.0-wmf.36 deployment blockers - https://phabricator.wikimedia.org/T274940
[13:25:55] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet
[13:26:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:35] <wikibugs>	 (03PS1) 10Elukey: role::ml_k8s::master: add prometheus instance endpoint [puppet] - 10https://gerrit.wikimedia.org/r/674313 (https://phabricator.wikimedia.org/T272918)
[13:27:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, few typos inline." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond)
[13:27:54] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:00] <moritzm>	 !log drain ganeti2008
[13:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:16] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28715/console" [puppet] - 10https://gerrit.wikimedia.org/r/674313 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[13:31:18] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] aqs_next: add additional hiera config for new nodes [puppet] - 10https://gerrit.wikimedia.org/r/674291 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan)
[13:31:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:32:58] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28716/console" [puppet] - 10https://gerrit.wikimedia.org/r/674313 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[13:35:00] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:35:51] <wikibugs>	 (03CR) 10Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup)
[13:37:04] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:40:40] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1003.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1005.eqiad.
[13:40:40] <icinga-wm>	 down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:40:46] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1003.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1016.eqiad.
[13:40:46] <icinga-wm>	 down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:41:20] <akosiaris>	 I'll silence this while working on https://phabricator.wikimedia.org/T278223
[13:41:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:44:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:50:34] <wikibugs>	 (03PS1) 10Ayounsi: JSON schema, add coverage to secrets [homer/public] - 10https://gerrit.wikimedia.org/r/674318 (https://phabricator.wikimedia.org/T272688)
[13:52:09] <wikibugs>	 (03PS1) 10Palak199: Fix:: InvalidQueryException handling [software/transferpy] - 10https://gerrit.wikimedia.org/r/674319 (https://phabricator.wikimedia.org/T268258)
[13:52:25] <wikibugs>	 (03PS2) 10Ayounsi: JSON schema, add coverage to secrets [homer/public] - 10https://gerrit.wikimedia.org/r/674318 (https://phabricator.wikimedia.org/T272688)
[13:53:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: read alerts from /srv/alerts [puppet] - 10https://gerrit.wikimedia.org/r/674321 (https://phabricator.wikimedia.org/T272977)
[13:54:33] <elukey>	 !log sudo systemctl reload apache2 on prometheus[12]00[34] to pick up new k8s-mlserve instance settings
[13:54:34] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:54:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:05] <logmsgbot>	 !log hashar@deploy1002 Finished scap: Promote testwikis from 1.36.0-wmf.35 to 1.36.0-wmf.36 - T274940 (duration: 31m 57s)
[13:55:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:13] <stashbot>	 T274940: 1.36.0-wmf.36 deployment blockers - https://phabricator.wikimedia.org/T274940
[13:55:28] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:55:32] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:55:40] <wikibugs>	 (03CR) 10Jbond: etcd: Use cfssl for peer-to-peer communication (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah)
[13:58:08] <wikibugs>	 (03CR) 10Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051) (owner: 10Andrew Bogott)
[13:59:24] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2008.codfw.wmnet
[13:59:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:34] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:01:13] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (10wiki_willy) a:05wiki_willy→03Cmjohnson Hi @Cmjohnson - since this host is out of warranty, can you grab a drive from a decom'd server for this one?  Thanks, Willy
[14:04:49] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2008.codfw.wmnet
[14:04:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:59] <hashar>	 stupid train
[14:05:12] <hashar>	 now doing group0
[14:05:14] <moritzm>	 !log installing pygments security updates on stretch
[14:05:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:35] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=apertium
[14:05:35] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=api-gateway
[14:05:36] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=blubberoid
[14:05:36] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=citoid
[14:05:36] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=cxserver
[14:05:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:46] <wikibugs>	 (03PS1) 10Hashar: group0 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674325
[14:05:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:50] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674325 (owner: 10Hashar)
[14:05:52] <hashar>	 now doing group 0
[14:05:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:54] <akosiaris>	 !log pool a few services in eqiad k8s. T277741
[14:05:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:13] <stashbot>	 T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741
[14:06:30] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674325 (owner: 10Hashar)
[14:07:01] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] statistics: Migrate wmde cronjobs to systemd timers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup)
[14:07:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikime
[14:07:34] <icinga-wm>	 l
[14:07:48] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.36
[14:07:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:42] <wikibugs>	 (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_k8s::master: add prometheus instance endpoint [puppet] - 10https://gerrit.wikimedia.org/r/674313 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey)
[14:10:08] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/674165 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm)
[14:10:38] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3175 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[14:11:42] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1003.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.
[14:11:42] <icinga-wm>	 down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:12:25] <wikibugs>	 (03PS39) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146)
[14:12:42] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:12:45] <wikibugs>	 (03CR) 10Jbond: "thanks fixed" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond)
[14:14:46] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:17:27] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks great!" [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond)
[14:17:44] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:17:48] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:19:29] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=similar-users
[14:19:29] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=termbox
[14:19:29] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=wikifeeds
[14:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:03] <akosiaris>	 !log pool a few more services in eqiad k8s. T277741
[14:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:20:11] <stashbot>	 T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741
[14:20:36] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "Very minor nits 😊" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) (owner: 10Jgiannelos)
[14:25:24] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 04-1] "just noticed it was -1'ed as it is WIP, sorry!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) (owner: 10Jgiannelos)
[14:29:35] <wikibugs>	 (03CR) 10Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup)
[14:30:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] wmcs.ceph.codfw: Upgrade to latest 5.X kernel [puppet] - 10https://gerrit.wikimedia.org/r/674074 (https://phabricator.wikimedia.org/T274565) (owner: 10David Caro)
[14:31:47] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.619 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[14:33:10] <wikibugs>	 (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/674350 (https://phabricator.wikimedia.org/T278223)
[14:34:37] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:34:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:34:43] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07937 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[14:34:43] <wikibugs>	 (03CR) 10Kosta Harlan: "I don't have time to deploy this now, but could do later this evening. In the meantime, Alexandros or Janis: if you want to deploy this ea" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674350 (https://phabricator.wikimedia.org/T278223) (owner: 10Kosta Harlan)
[14:41:34] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Decommission argon, chlorine, acrab, acrux [puppet] - 10https://gerrit.wikimedia.org/r/674307 (https://phabricator.wikimedia.org/T277741) (owner: 10Alexandros Kosiaris)
[14:42:32] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventstreams
[14:42:33] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventstreams-internal
[14:42:33] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=linkrecommendation
[14:42:33] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mathoid
[14:42:34] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mobileapps
[14:42:34] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=proton
[14:42:34] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=push-notifications
[14:42:35] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=recommendation-api
[14:42:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:01] <akosiaris>	 !log pool more services in eqiad k8s. T277741. Only the very large ones traffic wise are still on codfw
[14:43:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:35] <stashbot>	 T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741
[14:44:13] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Sure, I 'll take this over and deploy to see if it fixes it. If not, I 'll revert." [deployment-charts] - 10https://gerrit.wikimedia.org/r/674350 (https://phabricator.wikimedia.org/T278223) (owner: 10Kosta Harlan)
[14:44:18] <wikibugs>	 10SRE, 10MediaWiki-General, 10observability, 10serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10AMooney) @lmata, This needs PET code review correct?
[14:44:27] <wikibugs>	 (03PS4) 10ArielGlenn: [WIP] needs more tests [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396)
[14:45:38] <wikibugs>	 (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/674350 (https://phabricator.wikimedia.org/T278223) (owner: 10Kosta Harlan)
[14:46:31] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[14:46:31] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[14:46:31] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[14:46:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:15] <wikibugs>	 (03PS1) 10Nikerabbit: MassMessage: Unbreak remote content fetching [extensions/MassMessage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674366 (https://phabricator.wikimedia.org/T276936)
[14:48:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:49:02] <wikibugs>	 (03PS1) 10Nikerabbit: MassMessage: Unbreak remote content fetching [extensions/MassMessage] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/674367 (https://phabricator.wikimedia.org/T276936)
[14:49:11] <wikibugs>	 10SRE, 10MediaWiki-General, 10observability, 10serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10lmata) @AMooney yes please
[14:53:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[14:53:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[14:53:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[14:53:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:53:58] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:04:40] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:23] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' .
[15:10:23] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' .
[15:10:23] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' .
[15:10:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:40] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:43] <papaul>	 i am doing some work in rack A3 in case you see mgmt issue on some servers 
[15:22:33] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674369
[15:24:50] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:25:41] <Trey314159>	 !log reindexing Italian wikis on elastic@eqiad, elastic@codfw, and cloudelastic complete (T274200)
[15:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:53] <stashbot>	 T274200: Reindex English and Italian wikis to enable homoglyph plugin - https://phabricator.wikimedia.org/T274200
[15:26:25] <wikibugs>	 (03PS1) 10Ahmon Dancy: logspam.pl: Update execution time limit regexp [puppet] - 10https://gerrit.wikimedia.org/r/674387
[15:26:43] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674369
[15:27:32] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Partially revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674369
[15:27:36] <wikibugs>	 (03PS4) 10Alexandros Kosiaris: Partially revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674369
[15:29:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Partially revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674369 (owner: 10Alexandros Kosiaris)
[15:31:02] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=echostore
[15:31:03] <wikibugs>	 (03Merged) 10jenkins-bot: Partially revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674369 (owner: 10Alexandros Kosiaris)
[15:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:31:19] <akosiaris>	 !log pool echostore for eqiad (the first of the larger services traffic wise)
[15:31:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:35] <moritzm>	 !log installing libsdl2 security updates
[15:32:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:31] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[15:36:31] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[15:36:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:19] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[15:38:19] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[15:38:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:13] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[15:39:13] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[15:39:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:56] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for libsdl2 [puppet] - 10https://gerrit.wikimedia.org/r/674391
[15:51:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove deployment-logstash2 [puppet] - 10https://gerrit.wikimedia.org/r/674392 (https://phabricator.wikimedia.org/T238707)
[15:52:17] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] Remove deployment-logstash2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674392 (https://phabricator.wikimedia.org/T238707) (owner: 10Muehlenhoff)
[15:53:33] <wikibugs>	 (03CR) 10Muehlenhoff: Remove deployment-logstash2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674392 (https://phabricator.wikimedia.org/T238707) (owner: 10Muehlenhoff)
[15:54:07] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove deployment-logstash2 [puppet] - 10https://gerrit.wikimedia.org/r/674392 (https://phabricator.wikimedia.org/T238707)
[15:55:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:00:05] <jouncebot>	 jbond42 and cdanis: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1600).
[16:00:05] <jouncebot>	 tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:26] <tgr_>	 o/
[16:01:01] <rzl>	 j.bond may be out, I'm happy to deploy if cdanis would like
[16:01:50] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:37] <jbond42>	 rzl: thanks fyi lgtm 
[16:04:00] <wikibugs>	 (03CR) 10Muehlenhoff: "deployment-logstash2 will be going away tomorrow: I'm adding a few people who might have created/used the striker, phabricator, ores, wiki" [puppet] - 10https://gerrit.wikimedia.org/r/674392 (https://phabricator.wikimedia.org/T238707) (owner: 10Muehlenhoff)
[16:05:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libsdl2 [puppet] - 10https://gerrit.wikimedia.org/r/674391 (owner: 10Muehlenhoff)
[16:05:20] <wikibugs>	 (03PS1) 10Ladsgroup: mailman3: Add ferm [puppet] - 10https://gerrit.wikimedia.org/r/674393 (https://phabricator.wikimedia.org/T277286)
[16:06:14] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+1] logspam.pl: Update execution time limit regexp [puppet] - 10https://gerrit.wikimedia.org/r/674387 (owner: 10Ahmon Dancy)
[16:07:06] <rzl>	 cdanis: I am boldly taking over your deploy window :)
[16:07:09] <rzl>	 tgr_: going ahead, stand by
[16:07:48] <rzl>	 I guess for a maintenance job there's not much for you to verify, but thanks for being around anyhow!
[16:08:14] <tgr_>	 yeah, it isn't testable
[16:08:26] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] Update GrowthExperiments cronjob parameters [puppet] - 10https://gerrit.wikimedia.org/r/673631 (https://phabricator.wikimedia.org/T275171) (owner: 10Gergő Tisza)
[16:10:00] <rzl>	 merged -- the next puppet run on mwmaint1002 is scheduled for ~7 minutes, before the next time refreshLinkRecommendations fires, so I won't bother running it manually
[16:11:02] <tgr_>	 thanks!
[16:11:10] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3651 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:12:32] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:14:42] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:22:00] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3016 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:22:43] <elukey>	 ouch --^
[16:23:31] <_joe_>	 ok, lemme take a look
[16:24:13] <_joe_>	 the latency doesn't show anything worrisome
[16:25:28] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:26:17] <_joe_>	 elukey: I can't see the problem looking at grafana
[16:26:39] <_joe_>	 there was some at 16:10
[16:30:27] <elukey>	 _joe_ yeah seems that the alert is a false positive?
[16:32:00] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:55] <wikibugs>	 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) Trying to break down my current thoughts:  ### Onhost memcached  In terms of functionality, I don't see a difference between being a DaemonSet and running on the host its...
[16:41:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (10Cmjohnson) 05Open→03Resolved Disk replaced with a disk from decom'd db host
[16:43:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Cmjohnson) Fixed the primary port for cloudgw1001  The secondary port cable number is 5321 in xe-0/0/19
[16:44:16] <wikibugs>	 (03Abandoned) 10Hnowlan: aqs: move password to hieradata rather than password module [labs/private] - 10https://gerrit.wikimedia.org/r/672441 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan)
[16:48:29] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3016 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:56:22] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: rely on the global_id, not username for paws [puppet] - 10https://gerrit.wikimedia.org/r/674165 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm)
[16:58:17] <wikibugs>	 (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.13.0-a29 [vendor] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674373 (https://phabricator.wikimedia.org/T275918)
[16:59:18] <wikibugs>	 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Slaporte) On April 5, 2020, we will update the [Wikimedia Maps Terms of Use](https://meta.wikimedia.org/wiki/Revised_Maps_Term...
[16:59:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10wiki_willy) Hi @ayounsi - just to follow up on this, we should probably wait a bit longer on determining which racks to convert to 10g (after John and Chris can wrap up all the...
[17:00:04] <jouncebot>	 chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1700).
[17:00:58] <wikibugs>	 10SRE, 10vm-requests, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10wkandek) 05Stalled→03Resolved Let's close and reopen if a second server becomes necessary.
[17:02:01] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] pontoon: initial hiera config for pontoon env [puppet] - 10https://gerrit.wikimedia.org/r/669968 (owner: 10Cwhite)
[17:04:18] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07937 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:05:31] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] Bump wikimedia/parsoid to 0.13.0-a29 [vendor] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674373 (https://phabricator.wikimedia.org/T275918) (owner: 10C. Scott Ananian)
[17:08:55] <wikibugs>	 10SRE, 10MediaWiki-General, 10observability, 10serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10AMooney)
[17:11:20] <wikibugs>	 10SRE: Rewrite http://download.wikimedia.org/mediawiki/ -> https://releases.wikimedia.org/mediawiki in less than 3 redirects - https://phabricator.wikimedia.org/T119679 (10hashar)
[17:12:34] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:14:50] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:29] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] prometheus: read alerts from /srv/alerts [puppet] - 10https://gerrit.wikimedia.org/r/674321 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi)
[17:21:23] <wikibugs>	 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10JMeybohm) ### onhost memcached >It's still an open question how we will inject the node IP into the mcrouter configuration. it would mean we'd need to pass the host IP as an e...
[17:38:28] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6825 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:43:11] <mutante>	 Amir1: hello, would you like to do the wmde/statistics cron switch now?
[17:43:48] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[17:45:02] <Amir1>	 mutante: sure!
[17:45:21] <mutante>	 Amir1: so, I know you have shell but not root shell, so let's do it together
[17:45:50] <Amir1>	 cool
[17:45:55] <Amir1>	 it's only on stat1007
[17:46:15] <mutante>	 that's what I was trying to grep right now, thx
[17:46:22] <mutante>	 dont see where the class is used
[17:46:57] <Amir1>	 it's a bit complicated, it's in misc jobs of statistics explorer 
[17:47:10] <Amir1>	 and that changes using a hiera config 
[17:47:33] <mutante>	 compiling on 1007. it's just unusual that I cant even find anything including wmde::graphite
[17:47:36] <mutante>	 one sec
[17:47:39] <Amir1>	 modules/profile/manifests/statistics/explorer/misc_jobs.pp
[17:47:41] <Amir1>	 look
[17:47:57] <Amir1>	 modules/profile/manifests/statistics/explorer/misc_jobs.pp
[17:48:06] <Amir1>	 sorry double paste
[17:48:07] <Amir1>	 if $::hostname in $hosts_with_jobs
[17:48:22] <Amir1>	 and that came from profile::statistics::explorer::misc_jobs::hosts_with_jobs
[17:48:36] <Amir1>	 which has "statistics::wmde"
[17:48:39] <Amir1>	 mutante: ^
[17:49:01] <mutante>	 ACK, i see it now, thx
[17:49:33] <mutante>	 https://puppet-compiler.wmflabs.org/compiler1001/28717/stat1007.eqiad.wmnet/index.html
[17:49:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/28717/stat1007.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup)
[17:50:45] <wikibugs>	 (03PS9) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775)
[17:51:26] <Amir1>	 where are the systemd timer logs?
[17:51:29] <Amir1>	 I keep forgetting
[17:52:13] <mutante>	 Amir1: first: https://phabricator.wikimedia.org/P15055
[17:52:35] <mutante>	 Amir1: [stat1007:~] $ sudo systemctl list-timers | grep wmde
[17:53:10] <Amir1>	 I can't run that command 
[17:53:11] <mutante>	 [stat1007:~] $ sudo systemctl status wmde-analytics-daily-noon.service
[17:53:58] <mutante>	 Amir1: let me paste more things.. sec
[17:55:28] <wikibugs>	 (03PS10) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775)
[17:59:38] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1800).
[18:00:05] <jouncebot>	 cscott and legoktm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:22] <legoktm>	 o/
[18:00:25] <Urbanecm>	 legoktm: hi!
[18:00:30] <Urbanecm>	 legoktm: will you deploy, or should I?
[18:01:31] <legoktm>	 I can take care of it
[18:01:35] <Urbanecm>	 legoktm: go ahead then
[18:02:39] <cscott>	 o/
[18:03:18] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Bump wikimedia/parsoid to 0.13.0-a29 [vendor] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674373 (https://phabricator.wikimedia.org/T275918) (owner: 10C. Scott Ananian)
[18:04:26] <wikibugs>	 (03PS2) 10Legoktm: Add irc2001.wikimedia.org (running buster) as second irc server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672687 (https://phabricator.wikimedia.org/T224579) (owner: 10Muehlenhoff)
[18:04:29] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] Add irc2001.wikimedia.org (running buster) as second irc server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672687 (https://phabricator.wikimedia.org/T224579) (owner: 10Muehlenhoff)
[18:05:00] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:06:42] <wikibugs>	 (03Merged) 10jenkins-bot: Add irc2001.wikimedia.org (running buster) as second irc server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672687 (https://phabricator.wikimedia.org/T224579) (owner: 10Muehlenhoff)
[18:07:49] <legoktm>	 testing the irc2001 change on mwdebug1002...
[18:08:53] <wikibugs>	 (03PS1) 10MSantos: mobileapps: bump to 2021-03-19-113347-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674406
[18:10:03] <legoktm>	 syncing
[18:10:38] <logmsgbot>	 !log legoktm@deploy1002 Synchronized wmf-config/ProductionServices.php: Add irc2001.wikimedia.org (running buster) as second irc server (T224579) (duration: 01m 08s)
[18:10:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:49] <stashbot>	 T224579: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579
[18:12:50] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-03-19-113347-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674406 (owner: 10MSantos)
[18:14:04] <wikibugs>	 (03PS3) 10Ottomata: Remove schema overrides for 6 finished EL migrations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673075 (https://phabricator.wikimedia.org/T267347)
[18:14:12] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:14:16] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: bump to 2021-03-19-113347-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674406 (owner: 10MSantos)
[18:15:14] <ottomata>	 Urbanecm: legoktm i have a config patch i'd like to deploy after the window is done, let me know when it looks clear. :)
[18:15:15] <wikibugs>	 (03PS1) 10MSantos: wikifeeds: bump to 2021-03-19-113019-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674409
[18:15:22] <wikibugs>	 10SRE, 10Wikimedia-IRC-RC-Server: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10Legoktm) Events are now going to irc2001.wikimedia.org. I watched `#en.wikipedia` on both kraz and irc2001 for a few minutes and saw identical output (note that channels won't exist on the...
[18:15:34] <legoktm>	 ack, just waiting on CI right now
[18:16:38] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[18:16:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:27] <wikibugs>	 (03CR) 10MSantos: [C: 03+2] wikifeeds: bump to 2021-03-19-113019-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674409 (owner: 10MSantos)
[18:18:08] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[18:18:09] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) install second SSD into payments100[5-8] - https://phabricator.wikimedia.org/T278250 (10RobH)
[18:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:17] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) install second SSD into payments100[5-8] - https://phabricator.wikimedia.org/T278250 (10RobH)
[18:18:58] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: bump to 2021-03-19-113019-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674409 (owner: 10MSantos)
[18:19:38] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09524 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[18:20:17] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[18:20:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:50] <legoktm>	 what's msantos's IRC nick?
[18:21:24] <mutante>	 @seen mbsantos
[18:21:24] <wm-bot>	 mutante: Last time I saw mbsantos they were changing the nickname to thesocialdev and thesocialdev is still in the channel #wikimedia-overflow at 2/26/2021 10:18:33 AM (25d8h2m51s ago)
[18:21:31] * thesocialdev msantos
[18:21:32] <mutante>	 legoktm: ^
[18:22:07] <mutante>	 legoktm: information extracted from officewiki::Contacts 
[18:22:42] <thesocialdev>	 @legoktm I just realised I mistook the service deploy window
[18:23:32] <legoktm>	 thesocialdev: ok, that was what I was going to ask about :)
[18:23:38] <legoktm>	 mutante: ty
[18:24:10] <thesocialdev>	 I'll update the contact in officewiki as well, thanks for the reminder
[18:24:44] <mutante>	 :)
[18:28:13] <wikibugs>	 (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.13.0-a29 [vendor] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674373 (https://phabricator.wikimedia.org/T275918) (owner: 10C. Scott Ananian)
[18:31:37] <legoktm>	 whee
[18:31:48] <legoktm>	 cscott: does mwdebug work for parsoid changes? or do I just sync it out?
[18:32:46] <wikibugs>	 (03CR) 10Dzahn: "[stat1007:~] $ for wmdetimer in analytics-minutely analytics-daily-early analytics-daily-noon analytics-weekly toolkit-analyzer-build; do " [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup)
[18:34:23] <wikibugs>	 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10wkandek) 1. Option - "Gitlab default" use 22 on the IP for Gitlab, mix with admin traffic in 2. Option - "Gerrit...
[18:36:51] <legoktm>	 subbu: around? 
[18:37:14] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:37:34] <subbu>	 yes?
[18:38:14] <legoktm>	 subbu: trying to verify the -a29 backport, can it be tested on mwdebug or do I just sync it out?
[18:38:18] <subbu>	 oh i see your qn. about to cscott  ... 
[18:38:18] <legoktm>	 I think c.scott went afk
[18:38:45] <subbu>	 sync it out. it doesn't work with mwdebug since it is a different cluster.
[18:38:50] <legoktm>	 ack
[18:40:12] <legoktm>	 syncing
[18:40:52] <logmsgbot>	 !log legoktm@deploy1002 Synchronized php-1.36.0-wmf.36/vendor/: Bump wikimedia/parsoid to 0.13.0-a29 (duration: 01m 16s)
[18:40:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:06] <legoktm>	 ottomata: all done
[18:42:31] <ottomata>	 ok thank you
[18:42:41] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Remove schema overrides for 6 finished EL migrations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673075 (https://phabricator.wikimedia.org/T267347) (owner: 10Ottomata)
[18:44:10] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:44:19] <cscott>	 Sorry, my client didn't give me a ping.  Yeah, like subbu says, since group0 isn't even deployed yet there's no way to test this.
[18:44:30] <cscott>	 We'll be watching the train
[18:45:05] <logmsgbot>	 !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Remove schema overrides for 6 finished EL migrations - T267347 T271164 T267351 T267348 T267343 T267353 (duration: 01m 07s)
[18:45:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:24] <stashbot>	 T267348: PrefUpdate Event Platform Migration - https://phabricator.wikimedia.org/T267348
[18:45:24] <stashbot>	 T271164: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164
[18:45:24] <stashbot>	 T267351: SuggestedTagsAction Event Platform Migration - https://phabricator.wikimedia.org/T267351
[18:45:24] <stashbot>	 T267343: EditAttemptStep Event Platform Migration - https://phabricator.wikimedia.org/T267343
[18:45:25] <stashbot>	 T267353: VisualEditorFeatureUse Event Platform Migration - https://phabricator.wikimedia.org/T267353
[18:45:25] <stashbot>	 T267347: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347
[18:48:50] <wikibugs>	 (03PS3) 10Ahmon Dancy: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274)
[18:50:36] <wikibugs>	 (03PS4) 10Ahmon Dancy: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274)
[18:53:44] <cscott>	 legoktm: since it's to the 1.36-wmf.36 branch, it wouldn't even be testable on mwdebug, would it?
[18:53:52] <cscott>	 as nothing's running -wmf.36 yet
[18:54:35] <legoktm>	 group0 wikis are on wmf.36 already
[18:54:59] <cscott>	 i thought that train wasn't for another 6 minutes
[18:55:22] <legoktm>	 it ran in European time today
[18:55:34] <cscott>	 ah, the deployment calendar lied
[18:55:45] <legoktm>	 https://sal.toolforge.org/log/nmdeX3gB8Fs0LHO5GBoC
[18:55:46] <subbu>	 cscott, parsoid runs on wtp* ... so, we cannot use mwdebug* anyway for verifying.
[18:55:48] <cscott>	 is the sync finished?  as soon as it is I can smoke test group0 at least.
[18:55:57] <legoktm>	 yes
[18:56:02] <legoktm>	  11:40:52 <+logmsgbot> !log legoktm@deploy1002 Synchronized php-1.36.0-wmf.36/vendor/: Bump wikimedia/parsoid to 0.13.0-a29 (duration: 01m 16s)
[18:56:15] <cscott>	 subbu: yeah, but my point is even mwdebug doesn't work if you're deploying to an undeployed branch
[18:56:24] <subbu>	 ok
[18:56:47] <cscott>	 but anyway, group0 is deployed so i'm going to go run some null edit tests
[18:57:04] <cscott>	 subbu: could you take a quick look at the logs from group0 and verify there's nothing unexpected
[18:58:37] <subbu>	 not sure what you mean by logs from group0 .. but i can look at logstash.
[19:00:04] <jouncebot>	 hashar and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1900).
[19:00:49] <wikibugs>	 10SRE, 10Wikimedia-IRC-RC-Server: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10Legoktm) From {T123729}:  * Announce in Tech/News, wikitech-l, wikitech-ambassadors that we'll be switching irc.wikimedia.org over to a new server on XX. Include a reminder that clients sho...
[19:01:31] <cscott>	 subbu: anything unusual in logstash caused by a request to mediawiki.org or another group0 wiki.
[19:03:06] <subbu>	 all good. there is an unrelated issue which I'll bring up in #mediawiki-core since it pertains to an ongoing flag there already.
[19:03:46] <cscott>	 basic edit tests on mediawiki.org look good as well.  nothing's on fire at least.
[19:05:17] <wikibugs>	 10SRE, 10Wikimedia-IRC-RC-Server: Set up spare irc1001.wikimedia.org in eqiad - https://phabricator.wikimedia.org/T278255 (10Legoktm)
[19:06:38] <wikibugs>	 10SRE, 10Sustainability: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Legoktm)
[19:06:41] <wikibugs>	 10SRE, 10Wikimedia-IRC-RC-Server: Set up spare irc1001.wikimedia.org in eqiad - https://phabricator.wikimedia.org/T278255 (10Legoktm)
[19:09:32] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:11:18] <Majavah>	 legoktm: https://phabricator.wikimedia.org/T244542 has a patch in review, let's see if we're faster than you removing kraz :P
[19:11:48] <legoktm>	 heh
[19:11:57] <legoktm>	 <3
[19:14:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:19:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:20:24] <wikibugs>	 (03CR) 10Legoktm: Include patches in restricted image (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy)
[19:22:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:24:02] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28718/console" [puppet] - 10https://gerrit.wikimedia.org/r/674393 (https://phabricator.wikimedia.org/T277286) (owner: 10Ladsgroup)
[19:25:19] <twentyafterfour>	 am I clear to deploy phatality?  Everything stable currently?
[19:25:40] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:26:28] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3333 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:27:58] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:06] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3333 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:33:03] <wikibugs>	 10SRE: Rewrite http://download.wikimedia.org/mediawiki/ -> https://releases.wikimedia.org/mediawiki in less than 3 redirects - https://phabricator.wikimedia.org/T119679 (10Krinkle) >>! In T119679#1833203, @Krinkle wrote: > […] >  > * GET https://download.wikimedia.org/mediawiki >  > 1. 301 Permanent Redirect to...
[19:34:19] <wikibugs>	 (03PS5) 10Ahmon Dancy: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274)
[19:35:18] <wikibugs>	 (03CR) 10Ahmon Dancy: Include patches in restricted image (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy)
[19:36:54] <wikibugs>	 (03CR) 10Krinkle: Include patches in restricted image (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy)
[19:38:19] <wikibugs>	 (03CR) 10Ahmon Dancy: Include patches in restricted image (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy)
[19:38:22] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:39:34] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:40:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:44:10] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:47:31] <wikibugs>	 (03CR) 10Krinkle: Include patches in restricted image (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy)
[19:51:47] <logmsgbot>	 !log jforrester@deploy1002 Started deploy [integration/docroot@9de8c9d]: Add homer-public listing, added by volans
[19:51:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:51:55] <logmsgbot>	 !log jforrester@deploy1002 Finished deploy [integration/docroot@9de8c9d]: Add homer-public listing, added by volans (duration: 00m 08s)
[19:52:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:48] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:58:06] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:42] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts auth1002.eqiad.wmnet
[20:01:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:12] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts auth1002.eqiad.wmnet
[20:02:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:03:26] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts auth1002.eqiad.wmnet
[20:03:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10RobH)
[20:12:06] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:13:07] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts auth1002.eqiad.wmnet
[20:13:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `auth1002.eqiad.wmnet` - auth1002.eqiad.wmnet (**PASS**)   - Downtimed host on Ici...
[20:13:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:24] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:21:05] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[20:21:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:12] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09524 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[20:24:23] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:24:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:49] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[20:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (10Jclark-ctr) moss-fe1001  A2. U42. Port33 ID5341 moss-fe1002. D4. U41  Port41   ID5342
[20:26:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[20:32:18] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite)
[20:36:25] <wikibugs>	 (03PS1) 10Cwhite: logstash: only add dlq cleanup if enabled [puppet] - 10https://gerrit.wikimedia.org/r/674429
[20:38:18] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: only add dlq cleanup if enabled [puppet] - 10https://gerrit.wikimedia.org/r/674429 (owner: 10Cwhite)
[20:39:52] <wikibugs>	 (03PS3) 10Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream [puppet] - 10https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051)
[20:39:54] <wikibugs>	 (03PS1) 10Andrew Bogott: fullstack: switch back to the standard image [puppet] - 10https://gerrit.wikimedia.org/r/674430 (https://phabricator.wikimedia.org/T278051)
[20:40:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] fullstack: switch back to the standard image [puppet] - 10https://gerrit.wikimedia.org/r/674430 (https://phabricator.wikimedia.org/T278051) (owner: 10Andrew Bogott)
[20:41:57] <wikibugs>	 (03PS5) 10ArielGlenn: [WIP] needs more tests and some cleanup, ewww [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396)
[20:42:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] needs more tests and some cleanup, ewww [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn)
[20:45:20] <wikibugs>	 (03CR) 10Hoo man: [C: 03+1] "Tested with testwikidata (and a small enough batch size that makes sure we need to separate batches):" [puppet] - 10https://gerrit.wikimedia.org/r/673679 (https://phabricator.wikimedia.org/T277300) (owner: 10Hoo man)
[20:48:09] <wikibugs>	 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10wkandek) To provide as little obstacles for developers as possible access through port 22 is the preferred optio...
[20:48:28] <wikibugs>	 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10wkandek) a:03Dzahn
[20:50:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (10Jclark-ctr) bast1003 Rack D1 U41 Port24  ID3161
[20:50:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[20:53:32] <wikibugs>	 (03PS1) 10Cwhite: logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431
[20:56:40] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10Volans) a:03Volans
[20:59:59] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:00:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:32] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.dns.netbox
[21:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:02:18] <wikibugs>	 (03PS1) 10Subramanya Sastry: Checkout master branch of testreduce always [puppet] - 10https://gerrit.wikimedia.org/r/674433
[21:02:20] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3016 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[21:04:14] <wikibugs>	 (03PS2) 10Cwhite: logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431
[21:04:39] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431 (owner: 10Cwhite)
[21:04:52] <wikibugs>	 (03PS6) 10ArielGlenn: [WIP] needs more tests and some cleanup, ewww [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396)
[21:04:58] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:05:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:49] <wikibugs>	 (03PS3) 10Cwhite: logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431
[21:08:12] <wikibugs>	 (03PS1) 10RobH: pki-root1001 taking over auth1002 chassis [puppet] - 10https://gerrit.wikimedia.org/r/674434 (https://phabricator.wikimedia.org/T276625)
[21:09:43] <logmsgbot>	 !log ppchelko@deploy1002 Started deploy [restbase/deploy@531c474]: Add pageviews top-per-country endpoint
[21:09:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:10:15] <wikibugs>	 (03CR) 10RobH: [C: 03+2] pki-root1001 taking over auth1002 chassis [puppet] - 10https://gerrit.wikimedia.org/r/674434 (https://phabricator.wikimedia.org/T276625) (owner: 10RobH)
[21:10:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:10:43] <wikibugs>	 (03PS4) 10Cwhite: logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431
[21:12:26] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431 (owner: 10Cwhite)
[21:16:46] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:27:41] <logmsgbot>	 !log ppchelko@deploy1002 Finished deploy [restbase/deploy@531c474]: Add pageviews top-per-country endpoint (duration: 17m 58s)
[21:27:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:52] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:28:33] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm)
[21:30:00] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:33:09] <mutante>	 volans: is there a workflow to get a second IP on an existing host? like the netbox part of it
[21:40:56] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:41:31] <mutante>	 volans: already talked about it with Cas on another channel, good for now, thx
[21:43:55] <wikibugs>	 (03PS4) 10Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream [puppet] - 10https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051)
[21:45:18] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:46:51] <wikibugs>	 (03PS5) 10Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream [puppet] - 10https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051)
[21:48:06] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07937 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[21:48:43] <wikibugs>	 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) Should it be gitlab.wikimedia.org for both, https and ssh? (so both the webserver and second sshd would l...
[21:54:26] <wikibugs>	 (03PS1) 10Dzahn: drop gitlab CNAME, in favor of service name on separate IP [dns] - 10https://gerrit.wikimedia.org/r/674439 (https://phabricator.wikimedia.org/T276148)
[21:55:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:55:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "not used yet" [dns] - 10https://gerrit.wikimedia.org/r/674439 (https://phabricator.wikimedia.org/T276148) (owner: 10Dzahn)
[21:56:06] <wikibugs>	 (03PS2) 10Dzahn: drop gitlab CNAME, in favor of service name on separate IP [dns] - 10https://gerrit.wikimedia.org/r/674439 (https://phabricator.wikimedia.org/T276148)
[21:56:32] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:58:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:03:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:05:56] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[22:06:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:25] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:09:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:08] <wikibugs>	 (03CR) 10Arlolra: [C: 03+1] Checkout master branch of testreduce always [puppet] - 10https://gerrit.wikimedia.org/r/674433 (owner: 10Subramanya Sastry)
[22:13:12] <wikibugs>	 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) ` +gitlab                                   1H IN A 208.80.154.14...
[22:13:26] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Checkout master branch of testreduce always [puppet] - 10https://gerrit.wikimedia.org/r/674433 (owner: 10Subramanya Sastry)
[22:14:20] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:15:27] <wikibugs>	 (03CR) 10Dzahn: "deployed on testreduce1001, it's not a puppet change unless you delete the dir and let it reclone" [puppet] - 10https://gerrit.wikimedia.org/r/674433 (owner: 10Subramanya Sastry)
[22:16:18] <wikibugs>	 (03PS1) 10Legoktm: releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443
[22:16:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:17:07] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@3fd7d7b]: partition ores dumps by namespace
[22:17:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:14] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@3fd7d7b]: partition ores dumps by namespace (duration: 02m 07s)
[22:19:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:23:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:23:03] <wikibugs>	 (03PS2) 10Legoktm: releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443
[22:26:03] <wikibugs>	 (03PS3) 10Legoktm: releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443
[22:26:17] <wikibugs>	 (03CR) 10Dzahn: httpd: add parameters and template to allow custom ports.conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn)
[22:27:01] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28728/console" [puppet] - 10https://gerrit.wikimedia.org/r/674443 (owner: 10Legoktm)
[22:27:32] <wikibugs>	 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Sergey.Trofimovsky.SF) >>! In T276148#6939506, @Dzahn wrote: > Should it be gitlab.wikimedia.org for both, https...
[22:28:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:29:21] <wikibugs>	 (03PS1) 10Dzahn: gitlab: add gitlab.wikimedia.org service IP with interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148)
[22:31:15] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/674443 (owner: 10Legoktm)
[22:35:10] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:39:34] <wikibugs>	 (03CR) 10Dzahn: "compiler output looks like it should not change anything about what actually syncs to what. so as long as all hosts still from the primary" [puppet] - 10https://gerrit.wikimedia.org/r/674443 (owner: 10Legoktm)
[22:41:52] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/28729/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148) (owner: 10Dzahn)
[22:42:31] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] gitlab: add gitlab.wikimedia.org service IP with interface::alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148) (owner: 10Dzahn)
[22:44:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` pki-root1001.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-...
[22:44:19] <wikibugs>	 10SRE, 10Wikimedia-IRC-RC-Server, 10User-notice: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10Legoktm) >>! In T224579#6939027, @Legoktm wrote: > When should XX be?  Moritz is going to switch DNS and reboot kraz "Thursday during the European morning", announcement to...
[22:45:00] <legoktm>	 mutante: "rsync on the primary releases server" ?
[22:45:06] <legoktm>	 (instead of "to")
[22:45:42] <legoktm>	 or "from"?
[22:45:59] <mutante>	 legoktm: hmmm.... yes, "pulling from the deployment server"
[22:46:18] <mutante>	 everyone just pulls from primary_deploy
[22:46:37] <legoktm>	 I'll put "...from the deployment server" in the class-level comment
[22:46:43] <mutante>	 had a little chat about it with security-team as well
[22:46:51] <mutante>	 not so long ago when they were upgraded
[22:47:00] <mutante>	 sounds great, thanks
[22:47:38] <wikibugs>	 (03PS4) 10Legoktm: releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443
[22:48:24] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443 (owner: 10Legoktm)
[22:49:21] <mutante>	 for this class ::security it's not "between releases servers" at all, but too nitpicky
[22:50:22] <legoktm>	 I will leave that for the next time we tweak that file :p
[22:50:26] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443 (owner: 10Legoktm)
[22:50:33] <legoktm>	 thanks for the review
[22:50:59] <mutante>	 it's there because that changed and I did not update the comment, i take the blame
[22:51:02] <mutante>	 yw
[22:54:40] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:55:20] <wikibugs>	 (03PS2) 10Legoktm: acme_chief::cert: Use normal spaces in documentation [puppet] - 10https://gerrit.wikimedia.org/r/673642
[22:56:33] <wikibugs>	 (03CR) 10Legoktm: [C: 03+2] acme_chief::cert: Use normal spaces in documentation [puppet] - 10https://gerrit.wikimedia.org/r/673642 (owner: 10Legoktm)
[22:57:48] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pki-root1001.eqiad.wmnet with reason: REIMAGE
[22:57:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:52] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki-root1001.eqiad.wmnet with reason: REIMAGE
[22:59:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T2300).
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:00:22] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:05:59] <wikibugs>	 (03PS2) 10Dzahn: gitlab: add gitlab.wm.org service IP, with lookup from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148)
[23:06:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pki-root1001.eqiad.wmnet'] `  and were **ALL** successful.
[23:07:19] <wikibugs>	 (03CR) 10Legoktm: [C: 04-1] "Minor PHP code comments inline." (036 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/672767 (https://phabricator.wikimedia.org/T276908) (owner: 10Giuseppe Lavagetto)
[23:08:01] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/28730/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148) (owner: 10Dzahn)
[23:10:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:13:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:15:02] <wikibugs>	 (03CR) 10Legoktm: "I don't think this meets what the performance.wikimedia.org site needs...I guess we could have apache listen some non-443 port for HTTPS, " [puppet] - 10https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn)
[23:23:36] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:25:12] <wikibugs>	 (03CR) 10Dzahn: "I see...will amend" [puppet] - 10https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn)
[23:30:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:30:38] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10RobH)
[23:31:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: apply new hostname label for pki-root1001 (was auth1002) - https://phabricator.wikimedia.org/T278273 (10RobH)
[23:32:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10RobH) @jbond: be aware pki-root1001 is now staged for your use.
[23:33:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10RobH) 05Open→03Resolved
[23:34:44] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: apply new hostname label for pki-root1001 (was auth1002) - https://phabricator.wikimedia.org/T278273 (10RobH) p:05Triage→03Low
[23:35:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: apply new hostname label for pki-root1001 (was auth1002) - https://phabricator.wikimedia.org/T278273 (10RobH)
[23:42:52] <wikibugs>	 (03PS1) 10Papaul: DHCP: Add MAC address for mw2377 and mw2378 [puppet] - 10https://gerrit.wikimedia.org/r/674454 (https://phabricator.wikimedia.org/T274171)
[23:44:45] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for mw2377 and mw2378 [puppet] - 10https://gerrit.wikimedia.org/r/674454 (https://phabricator.wikimedia.org/T274171) (owner: 10Papaul)
[23:48:30] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Have puppet httpd class support enabling mod_ssl without having apache listen on port 443 - https://phabricator.wikimedia.org/T277989 (10Krinkle)
[23:48:38] <wikibugs>	 (03CR) 10Dave Pifke: "Having Envoy act as a HTTPS to HTTP proxy between Varnish and Apache, and having Apache also listening for HTTPS requests (unused) sounds " [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm)
[23:52:18] <wikibugs>	 (03PS3) 10Dzahn: gitlab: open firewall on 22,80,443. use drange to limit to service IP [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144)
[23:52:43] <wikibugs>	 (03CR) 10Legoktm: [V: 03+1] "I realized role::lists3 isn't actually including the standard profile nor the base firewall, going to fix that in a minute and then rebase" [puppet] - 10https://gerrit.wikimedia.org/r/674393 (https://phabricator.wikimedia.org/T277286) (owner: 10Ladsgroup)
[23:53:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] gitlab: open firewall on 22,80,443. use drange to limit to service IP [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn)
[23:53:52] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:56:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:56:32] <wikibugs>	 (03PS4) 10Dzahn: gitlab: open firewall on 22,80,443. use drange to limit to service IP [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144)
[23:58:20] <wikibugs>	 (03PS1) 10Papaul: Add mw237[7-8] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/674455 (https://phabricator.wikimedia.org/T274171)
[23:59:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] Add mw237[7-8] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/674455 (https://phabricator.wikimedia.org/T274171) (owner: 10Papaul)
[23:59:52] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add mw237[7-8] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/674455 (https://phabricator.wikimedia.org/T274171) (owner: 10Papaul)