[00:00:00] (Merged) jenkins-bot: Use the RequestTimeout library to set time limits [mediawiki-config] - https://gerrit.wikimedia.org/r/672579 (https://phabricator.wikimedia.org/T269326) (owner: Tim Starling)
[00:04:52] !log tstarling@deploy1002 Synchronized wmf-config/PhpAutoPrepend.php: use RequestTimeout library step 1: disable old request timeout system (duration: 00m 58s)
[00:04:52] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (RobH)
[00:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:31] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (RobH) a:RobH→Cmjohnson It appears this is plugged into the non cloud switch (after IRC sync). I chatted with Chris, who...
[00:06:35] !log tstarling@deploy1002 Synchronized wmf-config/CommonSettings.php: use RequestTimeout library step 2: enable new system (duration: 00m 57s)
[00:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:07:39] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install cumin2002.codfw.wmnet - https://phabricator.wikimedia.org/T276587 (Papaul)
[00:07:57] !log tstarling@deploy1002 Synchronized wmf-config: use RequestTimeout library step 3: clean up (duration: 00m 58s)
[00:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:05] (CR) Dzahn: "There's of course nothing wrong with changing the port but for the record, changing which port envoy uses for TLS termination would have b" [puppet] - https://gerrit.wikimedia.org/r/673603 (owner: Legoktm)
[00:14:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:10] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install ml-serve200[1-4] - https://phabricator.wikimedia.org/T267670 (wiki_willy) Hi @klausman - just following up here, to see if we can close out this task? Thanks, Willy
[00:44:25] (PS1) Dzahn: httpd: add parameters and template to allow custom ports.conf [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989)
[00:45:28] (CR) jerkins-bot: [V: -1] httpd: add parameters and template to allow custom ports.conf [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: Dzahn)
[00:46:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:58:01] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:02] (PS2) Dzahn: httpd: add parameters and template to allow custom ports.conf [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989)
[01:02:09] (CR) jerkins-bot: [V: -1] httpd: add parameters and template to allow custom ports.conf [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: Dzahn)
[01:16:33] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:25:18] (PS1) Razzi: site: remove decommissioned node labsdb1012 [puppet] -
https://gerrit.wikimedia.org/r/674182 (https://phabricator.wikimedia.org/T269211)
[01:26:48] (CR) Razzi: [C: +2] site: remove decommissioned node labsdb1012 [puppet] - https://gerrit.wikimedia.org/r/674182 (https://phabricator.wikimedia.org/T269211) (owner: Razzi)
[01:29:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:20] (CR) Legoktm: "> Patch Set 7:" [puppet] - https://gerrit.wikimedia.org/r/673603 (owner: Legoktm)
[01:45:01] (CR) Legoktm: [C: +1] "Unfortunate that it's not workable for the general case." [puppet] - https://gerrit.wikimedia.org/r/674165 (https://phabricator.wikimedia.org/T276284) (owner: Bstorm)
[01:47:04] (CR) Legoktm: "> Patch Set 2: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/674151 (https://phabricator.wikimedia.org/T276284) (owner: Bstorm)
[01:48:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:59:11] (PS3) Dzahn: httpd: add parameters and template to allow custom ports.conf [puppet] - https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989)
[02:02:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:46] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.36 [core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/674183
[02:12:07] (PS2) Jforrester: Branch commit for wmf/1.36.0-wmf.36 [core] (wmf/1.36.0-wmf.36) - https://gerrit.wikimedia.org/r/674183 (https://phabricator.wikimedia.org/T274940) (owner: TrainBranchBot)
[02:18:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK -
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:51:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:04:19] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:21] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:22:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:37] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 5.527 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:34:11] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:43:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:08:31] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:16:41] (PS1) Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream [puppet] - https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051)
[04:17:15] (CR) jerkins-bot: [V: -1] wmcs: add new admin script to create a new base image based off of upstream [puppet] - https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051) (owner: Andrew Bogott)
[04:25:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_navigationtiming_hourly.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:51] PROBLEM - Disk space on backup1002 is CRITICAL: DISK CRITICAL - free space: /srv 3018164 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup1002&var-datasource=eqiad+prometheus/ops
[04:56:12] (PS2) Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream [puppet] - https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051)
[05:01:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set weight 0 to db1136 before failover T274336', diff saved to https://phabricator.wikimedia.org/P14992 and previous config saved to /var/cache/conftool/dbconfig/20210323-051210-marostegui.json
[05:12:17] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:18] T274336: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336
[05:13:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1174 to api T274336', diff saved to https://phabricator.wikimedia.org/P14993 and previous config saved to /var/cache/conftool/dbconfig/20210323-051346-marostegui.json
[05:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:02] (PS4) Marostegui: mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336)
[05:33:59] (CR) Marostegui: [C: +2] mariadb: Promote db1136 to s7 master [puppet] - https://gerrit.wikimedia.org/r/673195 (https://phabricator.wikimedia.org/T274336) (owner: Marostegui)
[05:50:51] PROBLEM - SSH on mw2248.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:00:04] marostegui, kormat, and jynus: (Dis)respected human, time to deploy s7 primary master switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T0600). Please do the needful.
[06:00:10] let's go?
[06:00:19] sure
[06:00:23] \o/
[06:00:29] kormat: you ready?
[06:00:31] o/
[06:00:38] !log Starting s7 eqiad failover from db1086 to db1136 - T274336
[06:00:40] marostegui: as i'll ever be
[06:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:49] T274336: Switchover s7 from db1086 to db1136 - https://phabricator.wikimedia.org/T274336
[06:01:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s7 as read-only for maintenance T274336', diff saved to https://phabricator.wikimedia.org/P14994 and previous config saved to /var/cache/conftool/dbconfig/20210323-060104-marostegui.json
[06:01:06] RO set
[06:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:45] I cannot edit, so proceeding
[06:02:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1136 to s7 master and remove read-only from s7 T274336', diff saved to https://phabricator.wikimedia.org/P14995 and previous config saved to /var/cache/conftool/dbconfig/20210323-060216-marostegui.json
[06:02:20] all done
[06:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:43] I can edit eswiki
[06:03:02] marostegui: nice work :)
[06:03:29] I see recentchanges advancing
[06:03:33] (on eswiki)
[06:03:55] no centralauth errors?
[06:04:35] I am going thru kibana now
[06:04:45] at least centralauth changes are being made
[06:04:55] Majavah: thanks for checking :)
[06:05:13] I only see "GrowthExperiments\Maintenance\RefreshLinkRecommendations::execute: no transaction to commit, something got out of sync"
[06:05:41] on testwiki
[06:05:42] marostegui: i'll run puppet to fix the RO alerts
[06:05:48] kormat: just did :)
[06:06:03] then i'll just _pretend_ i was useful
[06:07:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 25%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P14996 and previous config saved to /var/cache/conftool/dbconfig/20210323-060701-root.json
[06:07:03] kormat: don't say the pretend part publicly, makes it much easier
[06:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:12] :)
[06:07:47] thats already reported at T277702
[06:07:48] T277702: GrowthExperiments\Maintenance\RefreshLinkRecommendations: no transaction to commit, something got out of sync - https://phabricator.wikimedia.org/T277702
[06:07:54] (CR) Marostegui: [C: +2] wmnet: Update s7-master cname [dns] - https://gerrit.wikimedia.org/r/673196 (https://phabricator.wikimedia.org/T274336) (owner: Marostegui)
[06:09:47] I see a few "Error connecting to 10.64.0.204"
[06:10:11] but that s2
[06:16:37] SRE, DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (Marostegui)
[06:18:42] (CR) 20after4: "This is used for sending email to phabricator. I have no idea how it works really and I never use the email->phab feature to create tasks."
[puppet] - https://gerrit.wikimedia.org/r/655743 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[06:19:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:20:10] !log Upgrade kernel on db1086
[06:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1086', diff saved to https://phabricator.wikimedia.org/P14997 and previous config saved to /var/cache/conftool/dbconfig/20210323-062059-marostegui.json
[06:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 10%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P14998 and previous config saved to /var/cache/conftool/dbconfig/20210323-062338-root.json
[06:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:16] (PS1) Marostegui: wiki-replicas.sql: Add analytics user [puppet] - https://gerrit.wikimedia.org/r/674194 (https://phabricator.wikimedia.org/T269211)
[06:27:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:28:17] (CR) Marostegui: "This change was applied yesterday to the DB, this is just to keep track of it on puppet" [puppet] - https://gerrit.wikimedia.org/r/674194 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[06:28:54] RECOVERY - Disk space on backup1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup1002&var-datasource=eqiad+prometheus/ops
[06:29:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1146:3314', diff saved to https://phabricator.wikimedia.org/P14999 and previous config saved to /var/cache/conftool/dbconfig/20210323-062942-marostegui.json
[06:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:20] SRE, Dumps-Generation, SRE-Access-Requests, Patch-For-Review: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (ArielGlenn) This was approved in yesterday's SRE meeting, though I guess someone on that team p...
[06:38:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 25%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15000 and previous config saved to /var/cache/conftool/dbconfig/20210323-063842-root.json
[06:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:09] (PS2) Marostegui: wiki-replicas.sql: Add analytics user [puppet] - https://gerrit.wikimedia.org/r/674194 (https://phabricator.wikimedia.org/T269211)
[06:42:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:31] (CR) Elukey: [C: +1] "<3" [puppet] - https://gerrit.wikimedia.org/r/674194 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[06:44:45] (CR) Marostegui: [C: +2] wiki-replicas.sql: Add analytics user [puppet] - https://gerrit.wikimedia.org/r/674194 (https://phabricator.wikimedia.org/T269211) (owner: Marostegui)
[06:47:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:49:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:51:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:58] RECOVERY - SSH on mw2248.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:53:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 50%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15001 and previous config saved to /var/cache/conftool/dbconfig/20210323-065345-root.json
[06:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:48] (CR) Giuseppe Lavagetto: [C: +2] Rakefile: fix most rubocop violations [deployment-charts] - https://gerrit.wikimedia.org/r/673992 (owner: Giuseppe Lavagetto)
[06:58:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3317 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15002 and previous config saved to /var/cache/conftool/dbconfig/20210323-065836-marostegui.json
[06:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:44] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483
[06:59:07] (Merged) jenkins-bot: Rakefile: fix most rubocop violations [deployment-charts] - https://gerrit.wikimedia.org/r/673992 (owner: Giuseppe Lavagetto)
[06:59:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1101:3318 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15003 and previous config saved to /var/cache/conftool/dbconfig/20210323-065947-marostegui.json
[06:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:16] !log
Upgrade kernel on db1101
[07:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 25%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15004 and previous config saved to /var/cache/conftool/dbconfig/20210323-070705-root.json
[07:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 25%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P15005 and previous config saved to /var/cache/conftool/dbconfig/20210323-070719-root.json
[07:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:04] (PS1) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:08:29] (CR) jerkins-bot: [V: -1] statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:08:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 75%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15006 and previous config saved to /var/cache/conftool/dbconfig/20210323-070849-root.json
[07:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:57] (PS2) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:13:55] (CR) jerkins-bot: [V: -1] statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:14:58] PROBLEM - Check systemd state on
an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:46] (PS3) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:22:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 50%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15007 and previous config saved to /var/cache/conftool/dbconfig/20210323-072209-root.json
[07:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 50%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P15008 and previous config saved to /var/cache/conftool/dbconfig/20210323-072223-root.json
[07:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 100%: Slowly repool db1086 after removing it from master', diff saved to https://phabricator.wikimedia.org/P15009 and previous config saved to /var/cache/conftool/dbconfig/20210323-072352-root.json
[07:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:39] (CR) Giuseppe Lavagetto: [C: -2] "Without getting into how cfssl is used, the patch would disrupt all existing installations of etcd in production."
(5 comments) [puppet] - https://gerrit.wikimedia.org/r/674077 (owner: Majavah)
[07:29:02] (PS4) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:32:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE
[07:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE
[07:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:51] (PS5) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:36:37] !log create a 50g lvm volume on prometheus[12]00[34] for the k8s-mlserve cluster - T272918
[07:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:44] T272918: Create ml-serve k8s cluster - https://phabricator.wikimedia.org/T272918
[07:37:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 25%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15010 and previous config saved to /var/cache/conftool/dbconfig/20210323-073702-root.json
[07:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 75%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15011 and previous config saved to /var/cache/conftool/dbconfig/20210323-073713-root.json
[07:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 75%: Slowly repool db1101:3318',
diff saved to https://phabricator.wikimedia.org/P15012 and previous config saved to /var/cache/conftool/dbconfig/20210323-073726-root.json
[07:37:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:48] (PS1) Marostegui: db1165: Specify the future of this host [puppet] - https://gerrit.wikimedia.org/r/674250 (https://phabricator.wikimedia.org/T258361)
[07:42:13] (PS6) Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673)
[07:42:32] (CR) Marostegui: [C: +2] db1165: Specify the future of this host [puppet] - https://gerrit.wikimedia.org/r/674250 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui)
[07:46:28] (CR) Ladsgroup: "PCC success: https://puppet-compiler.wmflabs.org/compiler1002/28708/stat1007.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:52:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 50%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15013 and previous config saved to /var/cache/conftool/dbconfig/20210323-075206-root.json
[07:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3317 (re)pooling @ 100%: Slowly repool db1101:3317', diff saved to https://phabricator.wikimedia.org/P15014 and previous config saved to /var/cache/conftool/dbconfig/20210323-075216-root.json
[07:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1101:3318 (re)pooling @ 100%: Slowly repool db1101:3318', diff saved to https://phabricator.wikimedia.org/P15015 and previous config saved to /var/cache/conftool/dbconfig/20210323-075230-root.json
[07:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1098:3317 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15016 and previous config saved to /var/cache/conftool/dbconfig/20210323-075253-marostegui.json
[07:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:01] T266483: Enable report_host for mariadb - https://phabricator.wikimedia.org/T266483
[07:54:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 25%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15017 and previous config saved to /var/cache/conftool/dbconfig/20210323-075445-root.json
[07:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:00] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 82 probes of 606 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:00:36] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:02:04] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:03:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:56] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:58] --- SREs: if you have to merge please wait a bit, there is an
inconsistency (probably triggered/caused by me) in the puppetmaster1001
[08:05:01] ---
[08:07:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1146:3314 (re)pooling @ 75%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15019 and previous config saved to /var/cache/conftool/dbconfig/20210323-080709-root.json
[08:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 50%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15020 and previous config saved to /var/cache/conftool/dbconfig/20210323-080949-root.json
[08:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:44] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 44 probes of 606 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:14:30] (CR) JMeybohm: [C: +1] downtime: Support services and other special icinga host [puppet] - https://gerrit.wikimedia.org/r/674147 (https://phabricator.wikimedia.org/T277191) (owner: Alexandros Kosiaris)
[08:16:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:10] (PS3) JMeybohm: kubernetes eqiad: Populate hiera keys for k8s worker updates [puppet] - https://gerrit.wikimedia.org/r/673949 (https://phabricator.wikimedia.org/T277741)
[08:21:49] (CR) Alexandros Kosiaris: [C: +2] downtime: Support services and other special icinga host [puppet] - https://gerrit.wikimedia.org/r/674147 (https://phabricator.wikimedia.org/T277191) (owner: Alexandros Kosiaris)
[08:22:13] !log marostegui@cumin1001 dbctl commit
(dc=all): 'db1146:3314 (re)pooling @ 100%: Slowly repool db1146:3314 after schema change', diff saved to https://phabricator.wikimedia.org/P15021 and previous config saved to /var/cache/conftool/dbconfig/20210323-082213-root.json [08:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:23] (03PS1) 10Elukey: prometheus: add the ml-serve clusters settings [puppet] - 10https://gerrit.wikimedia.org/r/674258 (https://phabricator.wikimedia.org/T272918) [08:23:44] (03PS3) 10JMeybohm: kubernetes eqiad: Apply role and hiera values to new masters [puppet] - 10https://gerrit.wikimedia.org/r/673952 (https://phabricator.wikimedia.org/T277741) [08:24:15] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 18 hosts with reason: Reinitialize eqiad k8s cluster with new etcd [08:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:22] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 18 hosts with reason: Reinitialize eqiad k8s cluster with new etcd [08:24:24] !log installing mariadb-10.3 updates on buster (just client-side libs/tools, unrelated to the main wmf-mariadb packages) [08:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 75%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15022 and previous config saved to /var/cache/conftool/dbconfig/20210323-082454-root.json [08:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:10] !log beginning the k8s upgrade/reinit process. 
T277741 [08:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:17] T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 [08:28:17] !log downtime all services in T277741 for 24H [08:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:40] (03PS12) 10Ayounsi: Add Capirca support to Homer [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) [08:31:29] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=apertium [08:31:29] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=api-gateway [08:31:29] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=blubberoid [08:31:30] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=citoid [08:31:30] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=cxserver [08:31:30] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=echostore [08:31:31] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics [08:31:31] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics-external [08:31:31] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-logging-external [08:31:32] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-main [08:31:32] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams [08:31:32] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams-internal [08:31:33] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: 
name=eqiad,dnsdisc=linkrecommendation [08:31:33] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mathoid [08:31:34] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mobileapps [08:31:34] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=proton [08:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:35] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=push-notifications [08:31:35] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=recommendation-api [08:31:36] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=sessionstore [08:31:36] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=similar-users [08:31:37] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=termbox [08:31:37] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=wikifeeds [08:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:56] !log eqiad services in k8s depooled. 
T277741 [08:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:11] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add the ml-serve clusters settings [puppet] - 10https://gerrit.wikimedia.org/r/674258 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [08:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:59] (03CR) 10jerkins-bot: [V: 04-1] Add Capirca support to Homer [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [08:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:46] T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 [08:33:53] (03PS13) 10Ayounsi: Add Capirca support to Homer [software/homer] - 10https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) [08:35:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:02] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: 114.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37 [08:39:05] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=apertium [08:39:06] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=api-gateway [08:39:06] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=blubberoid [08:39:06] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=citoid [08:39:07] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=cxserver [08:39:07] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=echostore [08:39:07] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics [08:39:08] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-analytics-external [08:39:08] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-logging-external [08:39:09] !log akosiaris@cumin1001 
conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventgate-main [08:39:09] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams [08:39:09] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=eventstreams-internal [08:39:10] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=linkrecommendation [08:39:10] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mathoid [08:39:10] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=mobileapps [08:39:11] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=proton [08:39:11] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=push-notifications [08:39:12] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=recommendation-api [08:39:12] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=sessionstore [08:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:13] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=similar-users [08:39:13] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=termbox [08:39:14] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=wikifeeds [08:39:14] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=zotero [08:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[08:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1098:3317 (re)pooling @ 100%: Slowly repool db1098:3317', diff saved to https://phabricator.wikimedia.org/P15023 and previous config saved to /var/cache/conftool/dbconfig/20210323-083957-root.json [08:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:04] (03CR) 10Hashar: [C: 03+2] Branch commit for wmf/1.36.0-wmf.36 [core] (wmf/1.36.0-wmf.36) - 
10https://gerrit.wikimedia.org/r/674183 (https://phabricator.wikimedia.org/T274940) (owner: 10TrainBranchBot) [08:43:22] !log poweroff argon and chlorine T277741 [08:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:30] T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 [08:45:39] (03CR) 10JMeybohm: [C: 03+2] Add kubernetes1017 to BGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/672709 (https://phabricator.wikimedia.org/T277741) (owner: 10Alexandros Kosiaris) [08:46:09] (03Merged) 10jenkins-bot: Add kubernetes1017 to BGP peers [homer/public] - 10https://gerrit.wikimedia.org/r/672709 (https://phabricator.wikimedia.org/T277741) (owner: 10Alexandros Kosiaris) [08:48:09] (03PS1) 10Alexandros Kosiaris: Add kubemaster.svc.eqiad.wmnet.cert [puppet] - 10https://gerrit.wikimedia.org/r/674261 (https://phabricator.wikimedia.org/T277741) [08:48:40] (03CR) 10jerkins-bot: [V: 04-1] Add kubemaster.svc.eqiad.wmnet.cert [puppet] - 10https://gerrit.wikimedia.org/r/674261 (https://phabricator.wikimedia.org/T277741) (owner: 10Alexandros Kosiaris) [08:50:09] (03PS2) 10Alexandros Kosiaris: Add kubemaster.svc.eqiad.wmnet.cert [puppet] - 10https://gerrit.wikimedia.org/r/674261 (https://phabricator.wikimedia.org/T277741) [08:51:40] (03CR) 10Gehel: [C: 03+1] "> Patch Set 1:" [cookbooks] - 10https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: 10Ryan Kemper) [08:52:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add kubemaster.svc.eqiad.wmnet.cert [puppet] - 10https://gerrit.wikimedia.org/r/674261 (https://phabricator.wikimedia.org/T277741) (owner: 10Alexandros Kosiaris) [08:55:12] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([argon.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [08:55:22] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: 
set([argon.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [08:58:36] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] kubernetes eqiad: Populate hiera keys for k8s worker updates [puppet] - 10https://gerrit.wikimedia.org/r/673949 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [08:59:55] (03PS3) 10Filippo Giunchedi: alerts: deploy to Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977) [09:01:03] (03CR) 10jerkins-bot: [V: 04-1] alerts: deploy to Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [09:02:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:04:34] (03PS4) 10Filippo Giunchedi: alerts: deploy to Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977) [09:04:36] !log empty etcd T277741 [09:04:43] (03CR) 10Filippo Giunchedi: alerts: deploy to Prometheus hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [09:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:45] T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 [09:05:03] !log reboot kubetcd100[456] for kernel upgrades. 
T277741 T273278 [09:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:36] (03CR) 10Volans: "I'm rebasing it on master so that CI can do it's course, please rebase your local copy too if you're modifying it ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: 10Ryan Kemper) [09:06:39] (03PS2) 10Volans: elasticsearch: combined plugin upgrade + reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/673343 (https://phabricator.wikimedia.org/T277792) (owner: 10Ryan Kemper) [09:07:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:07:39] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.36 [core] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674183 (https://phabricator.wikimedia.org/T274940) (owner: 10TrainBranchBot) [09:13:17] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [09:13:30] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:13:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_cxserver_cluster_eqiad,swagger_check_eventgate_analytics_cluster_eqiad,swagger_check_sessionstore_eqiad,swagger_check_termbox_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:14:21] (03PS4) 10Alexandros Kosiaris: kubernetes eqiad: Apply role and hiera values to new masters [puppet] - 10https://gerrit.wikimedia.org/r/673952 
(https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [09:14:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1147', diff saved to https://phabricator.wikimedia.org/P15024 and previous config saved to /var/cache/conftool/dbconfig/20210323-091432-marostegui.json [09:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] kubernetes eqiad: Apply role and hiera values to new masters [puppet] - 10https://gerrit.wikimedia.org/r/673952 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [09:16:06] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [09:16:33] 10SRE: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (10MoritzMuehlenhoff) [09:16:42] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1017.eqiad.wmnet [09:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:07] (03PS1) 10Ema: VCL: test Server-Timing response header [puppet] - 10https://gerrit.wikimedia.org/r/674265 (https://phabricator.wikimedia.org/T277769) [09:17:24] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,service=kubemaster,cluster=kubernetes [09:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:31] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=kubemaster,cluster=kubernetes [09:17:31] (03CR) 10Ayounsi: [C: 03+1] tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [09:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:06] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dc=eqiad,cluster=kubernetes,name=kubernetes1017.eqiad.wmnet [09:18:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:50] (03CR) 10Volans: [C: 03+2] tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [09:20:00] (03CR) 10Volans: [C: 03+2] tests: generate documentation from schemas [homer/public] - 10https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [09:20:09] (03PS2) 10Volans: tests: update deprecated pytest option [homer/public] - 10https://gerrit.wikimedia.org/r/674069 [09:20:21] (03Merged) 10jenkins-bot: tests: add tests for the configuration files [homer/public] - 10https://gerrit.wikimedia.org/r/672765 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [09:20:37] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kubemaster is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [09:20:39] (03Merged) 10jenkins-bot: tests: generate documentation from schemas [homer/public] - 10https://gerrit.wikimedia.org/r/673071 (https://phabricator.wikimedia.org/T272688) (owner: 10Volans) [09:20:59] (03CR) 10Volans: [C: 03+2] tests: update deprecated pytest option [homer/public] - 10https://gerrit.wikimedia.org/r/674069 (owner: 10Volans) [09:21:31] (03CR) 10Elukey: [C: 03+2] prometheus: add the ml-serve clusters settings [puppet] - 10https://gerrit.wikimedia.org/r/674258 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [09:21:34] (03Merged) 10jenkins-bot: tests: update deprecated pytest option [homer/public] - 10https://gerrit.wikimedia.org/r/674069 (owner: 10Volans) [09:24:56] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1001.eqiad.wmnet with reason: REIMAGE [09:24:56] (03CR) 10Ema: [C: 03+2] VCL: test Server-Timing response header [puppet] - 10https://gerrit.wikimedia.org/r/674265 
(https://phabricator.wikimedia.org/T277769) (owner: 10Ema) [09:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:10] PROBLEM - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.16 and port 4969: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:25:26] s'up? [09:25:32] k8s maintenance? [09:25:34] any WIP? [09:25:38] part of the upgrade? [09:25:43] I think so yes [09:25:48] zotero was mentioned as a special case earlier [09:25:55] they are upgrading the eqiad k8s cluster [09:25:55] probably us, yes [09:25:58] ignore please [09:26:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1086 to clone db1181 T275633', diff saved to https://phabricator.wikimedia.org/P15025 and previous config saved to /var/cache/conftool/dbconfig/20210323-092600-marostegui.json [09:26:03] <_joe_> yes, don't worry [09:26:06] mistake on our side, we forgot to downtime this one [09:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:08] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [09:26:58] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1001.eqiad.wmnet with reason: REIMAGE [09:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:10] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1002.eqiad.wmnet with reason: REIMAGE [09:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:38] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1003.eqiad.wmnet with reason: REIMAGE [09:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:58] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
kubernetes1007.eqiad.wmnet with reason: REIMAGE [09:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:11] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1002.eqiad.wmnet with reason: REIMAGE [09:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:20] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1003.eqiad.wmnet with reason: REIMAGE [09:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:27] (03PS1) 10Marostegui: instances.yaml: Add db1165 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/674267 (https://phabricator.wikimedia.org/T258361) [09:30:58] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: REIMAGE [09:31:00] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1004.eqiad.wmnet with reason: REIMAGE [09:31:03] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1165 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/674267 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [09:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:17] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1007.eqiad.wmnet with reason: REIMAGE [09:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1165 to dbctl, depooled - T258361', diff saved to https://phabricator.wikimedia.org/P15027 and previous config saved to /var/cache/conftool/dbconfig/20210323-093246-marostegui.json [09:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:54] T258361: 
Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [09:32:56] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: REIMAGE [09:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:17] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: REIMAGE [09:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:56] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1017.eqiad.wmnet [09:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:55] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: REIMAGE [09:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:12] (03PS1) 10Marostegui: db1165: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/674268 (https://phabricator.wikimedia.org/T258361) [09:35:16] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1009.eqiad.wmnet with reason: REIMAGE [09:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:34] 10SRE: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (10MoritzMuehlenhoff) [09:35:58] (03CR) 10Marostegui: [C: 03+2] db1165: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/674268 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [09:36:00] !log jayme@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on kubernetes1004.eqiad.wmnet with reason: REIMAGE [09:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:18] 10SRE: Integrate Buster 10.7 point 
update - https://phabricator.wikimedia.org/T269558 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [09:36:48] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: REIMAGE [09:36:50] (03PS1) 10JMeybohm: contool-data: Add kubernetes1017.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/674269 (https://phabricator.wikimedia.org/T277741) [09:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:58] PROBLEM - WDQS high update lag on wdqs1011 is CRITICAL: 3630 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:37:14] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1010.eqiad.wmnet with reason: REIMAGE [09:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:03] PROBLEM - Check systemd state on kubernetes1003 is CRITICAL: CRITICAL - degraded: The following units failed: nic-saturation-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:38:07] PROBLEM - WDQS high update lag on wdqs2006 is CRITICAL: 3698 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:38:09] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-sidecar@k8s-mlserve.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:38:44] this is me --^ [09:38:48] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: REIMAGE [09:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:05] RECOVERY - Check systemd state on 
kubernetes1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:14] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1011.eqiad.wmnet with reason: REIMAGE [09:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:44] (03CR) 10Alexandros Kosiaris: [C: 03+2] contool-data: Add kubernetes1017.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/674269 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [09:40:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin_ng: Enable eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/673955 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [09:40:20] PROBLEM - Host kubernetes1003 is DOWN: PING CRITICAL - Packet loss = 100% [09:40:36] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1005.eqiad.wmnet [09:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:55] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1015.eqiad.wmnet [09:40:56] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: REIMAGE [09:41:00] RECOVERY - Host kubernetes1003 is UP: PING WARNING - Packet loss = 50%, RTA = 0.30 ms [09:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:04] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1016.eqiad.wmnet [09:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:15] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1012.eqiad.wmnet with reason: REIMAGE [09:41:21] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:30] (03Merged) 10jenkins-bot: admin_ng: Enable eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/673955 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [09:42:24] PROBLEM - WDQS high update lag on wdqs2005 is CRITICAL: 3956 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:42:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1165 into s6 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P15028 and previous config saved to /var/cache/conftool/dbconfig/20210323-094257-marostegui.json [09:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:06] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [09:43:17] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1013.eqiad.wmnet with reason: REIMAGE [09:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:26] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1014.eqiad.wmnet with reason: REIMAGE [09:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:08] PROBLEM - WDQS high update lag on wdqs2004 is CRITICAL: 4060 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:44:20] PROBLEM - WDQS high update lag on wdqs1003 is CRITICAL: 4072 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:44:23] 10SRE, 10DBA, 10Patch-For-Review: 
Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1165 pooled with minimal weight for now, once it looks good, I will start the automatic pooling [09:44:26] PROBLEM - WDQS high update lag on wdqs1008 is CRITICAL: 4078 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:44:32] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [09:44:35] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [09:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:24] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1014.eqiad.wmnet with reason: REIMAGE [09:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:35] (03CR) 10Ema: [C: 04-1] "See inline comments. I've also just added https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/673295 (https://phabricator.wikimedia.org/T277769) (owner: 10Gilles) [09:45:37] (03CR) 10Michael Große: [C: 03+1] "Would be really useful to have 6.14 🙌" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/674134 (https://phabricator.wikimedia.org/T278180) (owner: 10Addshore) [09:45:43] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [09:45:46] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [09:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:53] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. 
[09:46:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:02] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [09:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:49] (03PS1) 10JMeybohm: contool-data: Add kubernetes2017.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/674270 (https://phabricator.wikimedia.org/T277191) [09:47:43] (03PS2) 10JMeybohm: conftool-data: Add kubernetes2017.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/674270 (https://phabricator.wikimedia.org/T277191) [09:48:20] (03CR) 10Alexandros Kosiaris: [C: 03+2] conftool-data: Add kubernetes2017.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/674270 (https://phabricator.wikimedia.org/T277191) (owner: 10JMeybohm) [09:49:44] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [09:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:10] PROBLEM - Check systemd state on prometheus1004 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-sidecar@k8s-mlserve.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:50:24] !log jayme@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,service=kubesvc,name=kubernetes1017.eqiad.wmnet [09:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:30] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=kubesvc,name=kubernetes1017.eqiad.wmnet [09:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:02] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[09:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:10] !log jayme@cumin1001 conftool action : set/weight=10; selector: dc=codfw,service=kubesvc,name=kubernetes2017.codfw.wmnet [09:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:15] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,service=kubesvc,name=kubernetes2017.codfw.wmnet [09:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:23] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [09:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:42] !log scap prep 1.36.0-wmf.36 # T274940 [09:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:50] T274940: 1.36.0-wmf.36 deployment blockers - https://phabricator.wikimedia.org/T274940 [09:53:57] !log deploy helmfile.d/admin_ng for eqiad T277741 [09:53:57] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1016.eqiad.wmnet [09:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:04] T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 [09:54:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:31] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. 
[09:54:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1165 into s6 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P15029 and previous config saved to /var/cache/conftool/dbconfig/20210323-095437-marostegui.json [09:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:46] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [09:54:51] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1006.eqiad.wmnet [09:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:11] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1015.eqiad.wmnet [09:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:31] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1006.eqiad.wmnet [09:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:01] !log Applied security patches for 1.36.0-wmf.36 # T274940 [10:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:09] T274940: 1.36.0-wmf.36 deployment blockers - https://phabricator.wikimedia.org/T274940 [10:02:34] !log scap clean --delete 1.36.0-wmf.32 # T274940 [10:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:07] PROBLEM - Check systemd state on prometheus2003 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-sidecar@k8s-mlserve.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:25] PROBLEM - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/kubemaster 
is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:07:49] PROBLEM - Check systemd state on prometheus2004 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-sidecar@k8s-mlserve.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:47] (03CR) 10Arturo Borrero Gonzalez: "Thanks for this!" (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051) (owner: 10Andrew Bogott) [10:10:10] 10SRE, 10Traffic, 10Patch-For-Review: Remove SLAAC IPs from Ganeti hosts - https://phabricator.wikimedia.org/T265904 (10ayounsi) @volans, @crusnov, what do you think about changing the `status` of those IPs from `active` to `SLAAC`? See for example https://netbox-next.wikimedia.org/ipam/ip-addresses/4540/ Th... [10:10:23] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1005.eqiad.wmnet [10:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:11] (03PS1) 10Elukey: prometheus: change port for k8s-mlserve clusters [puppet] - 10https://gerrit.wikimedia.org/r/674273 (https://phabricator.wikimedia.org/T272918) [10:15:41] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1005-cloudelastic-chi-eqiad on cloudelastic1005 is OK: (C)100 gt (W)80 gt 65.08 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1005&panelId=37 [10:16:45] !log hashar@deploy1002 Pruned MediaWiki: 1.36.0-wmf.32 (duration: 14m 47s) [10:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 25%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15030 and previous config saved to 
/var/cache/conftool/dbconfig/20210323-101836-root.json [10:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:13] !log hashar@deploy1002 Pruned MediaWiki: 1.36.0-wmf.33 (duration: 01m 48s) [10:19:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:24] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [10:19:24] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [10:19:24] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [10:19:24] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [10:19:24] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [10:19:24] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [10:19:25] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [10:19:25] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [10:19:26] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [10:19:26] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [10:19:27] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[10:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:23] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'production' . [10:20:23] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'apertium' for release 'staging' . [10:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 5%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15031 and previous config saved to /var/cache/conftool/dbconfig/20210323-102119-root.json [10:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:28] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [10:21:29] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[10:21:29] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [10:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:47] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [10:22:09] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [10:22:09] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [10:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:20] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=kubesvc [10:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:37] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'staging' . [10:22:37] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' . [10:22:40] (03CR) 10Elukey: [C: 03+2] prometheus: change port for k8s-mlserve clusters [puppet] - 10https://gerrit.wikimedia.org/r/674273 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [10:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:31] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'staging' . [10:23:31] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . 
[10:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:54] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'staging' . [10:23:54] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [10:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:37] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:25:39] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [10:25:39] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [10:25:39] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [10:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:25] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:26:39] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' . [10:26:39] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'staging' . 
[10:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:21] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [10:27:21] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [10:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:08] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [10:28:08] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [10:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:59] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [10:28:59] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [10:29:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:46] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [10:29:46] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . 
[10:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:53] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:29:53] RECOVERY - Confd template for /srv/config-master/pybal/codfw/kubemaster on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [10:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:42] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [10:31:42] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [10:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:22] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [10:32:22] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams-internal' for release 'canary' . [10:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:43] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [10:32:43] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [10:32:43] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[10:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:29] 10SRE, 10Mail: Domains of most projects do not have DMARC policy - https://phabricator.wikimedia.org/T211403 (10Beeloser) The domain wikipedia.org does have a DMARC record however it has been applied incorrectly and therefore is not working. If the policy is not to deploy DMARC records then the record for wiki... [10:33:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 50%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15033 and previous config saved to /var/cache/conftool/dbconfig/20210323-103340-root.json [10:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:06] (03PS7) 10Ayounsi: Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) [10:34:19] RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:35:18] (03CR) 10jerkins-bot: [V: 04-1] Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [10:36:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 10%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15034 and previous config saved to /var/cache/conftool/dbconfig/20210323-103623-root.json [10:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:31] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - mathoid_4001: Servers kubernetes1008.eqiad.wmnet, 
kubernetes1012.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled: recommendation-api_4632: Servers kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernete [10:36:31] , kubernetes1017.eqiad.wmnet are marked down but pooled: push-notifications_4104: Servers kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled: proton_4030: Servers kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled: mobileapps_4102: Servers kubernetes1012.eqia [10:36:31] es1014.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet are marked down but pooled: similar-users_4110: Servers kubernetes1012.eqiad.wmnet https://wikitech.wikimedia.org/wiki/PyBal [10:36:31] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [10:37:49] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' . [10:37:49] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [10:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:31] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [10:39:31] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[10:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:13] (03PS1) 10DCausse: [wdqs] switch reporting topic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/674278 [10:41:11] (03PS8) 10Ayounsi: Add Capirca definitions exporter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) [10:41:33] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [10:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:41] (03CR) 10Ayounsi: "And fixed the output indentation 😊" (035 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/666876 (https://phabricator.wikimedia.org/T273865) (owner: 10Ayounsi) [10:42:50] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'main' . [10:42:50] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'push-notifications' for release 'canary' . 
[10:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:02] (03CR) 10Gehel: [C: 03+2] [wdqs] switch reporting topic to codfw [puppet] - 10https://gerrit.wikimedia.org/r/674278 (owner: 10DCausse) [10:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:05] RECOVERY - Check systemd state on prometheus2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:17] (03Abandoned) 10Arturo Borrero Gonzalez: openstack: nova: disable /etc/host management from cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/673475 (https://phabricator.wikimedia.org/T277866) (owner: 10Arturo Borrero Gonzalez) [10:43:32] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'recommendation-api' for release 'production' . [10:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:04] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [10:44:04] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . [10:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:31] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' . [10:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:14] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'staging' . [10:45:14] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . [10:45:14] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'test' . 
[10:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:43] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [10:45:43] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [10:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:15] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . [10:46:15] !log jayme@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'staging' . [10:46:19] RECOVERY - Check systemd state on prometheus2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:42] RECOVERY - LVS zotero eqiad port 4969/tcp - Zotero- zotero.svc.eqiad.wmnet IPv4 #page on zotero.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 197 bytes in 1.103 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:47:00] welcome back zotero [10:48:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 75%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15035 and previous config saved to /var/cache/conftool/dbconfig/20210323-104843-root.json [10:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:41] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within 
thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:50:47] RECOVERY - WDQS high update lag on wdqs2004 is OK: (C)3600 ge (W)1200 ge 28.16 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:51:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 15%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15036 and previous config saved to /var/cache/conftool/dbconfig/20210323-105126-root.json [10:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:35] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [10:53:09] RECOVERY - WDQS high update lag on wdqs1008 is OK: (C)3600 ge (W)1200 ge 57.36 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:53:11] RECOVERY - WDQS high update lag on wdqs2006 is OK: (C)3600 ge (W)1200 ge 17.66 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:53:31] PROBLEM - WDQS high update lag on wdqs1003 is CRITICAL: 8224 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:56:14] !log all services re-deployed to k8s eqiad - T277741 [10:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:22] T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 [10:58:19] RECOVERY - WDQS high update 
lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 52.4 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [10:58:45] (03PS1) 10Elukey: Add config for prometheus@k8s-mlserve [puppet] - 10https://gerrit.wikimedia.org/r/674279 (https://phabricator.wikimedia.org/T272918) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: How many deployers does it take to do European mid-day backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1100). [11:00:04] MatmaRex: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:55] hi [11:01:05] (03CR) 10Elukey: "Filippo: I can split the change if you prefer (lvs part first, then grafana)" [puppet] - 10https://gerrit.wikimedia.org/r/674279 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [11:01:07] !log installing tomcat8 security updates [11:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:13] I’m in a meeting, but I can probably take over the window later if no one else is around [11:01:19] jayme: hi, when will you switch changeprop back to eqiad? [11:01:59] dcausse: probably some time tomorrow EU morning [11:02:07] ok thanks! 
[11:03:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1147 (re)pooling @ 100%: Slowly repool db1147 after schema change', diff saved to https://phabricator.wikimedia.org/P15037 and previous config saved to /var/cache/conftool/dbconfig/20210323-110347-root.json [11:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:22] (03PS1) 10DCausse: Revert "[wdqs] switch reporting topic to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/674118 [11:05:41] (03CR) 10DCausse: [C: 04-1] "not before changeprop is moved back to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/674118 (owner: 10DCausse) [11:05:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1148', diff saved to https://phabricator.wikimedia.org/P15038 and previous config saved to /var/cache/conftool/dbconfig/20210323-110553-marostegui.json [11:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 20%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15039 and previous config saved to /var/cache/conftool/dbconfig/20210323-110630-root.json [11:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:38] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [11:08:08] (03PS1) 10Marostegui: mariadb: Productionize db1181 [puppet] - 10https://gerrit.wikimedia.org/r/674280 (https://phabricator.wikimedia.org/T275633) [11:08:13] RECOVERY - WDQS high update lag on wdqs2005 is OK: (C)3600 ge (W)1200 ge 49.72 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:08:39] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:08:59] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1181 [puppet] - 10https://gerrit.wikimedia.org/r/674280 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [11:10:25] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:10:41] RECOVERY - WDQS high update lag on wdqs1011 is OK: (C)3600 ge (W)1200 ge 5.633 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [11:11:50] (03PS1) 10Hnowlan: aqs_next: add more hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) [11:14:21] (03CR) 10Elukey: "Hugh: Does this need hiera config too?" [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [11:16:03] (03PS2) 10Hnowlan: aqs_next: add more hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) [11:16:56] (03CR) 10Hnowlan: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [11:17:45] (03CR) 10Filippo Giunchedi: [C: 03+2] alerts: deploy to Prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/674025 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [11:17:49] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28710/console" [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [11:19:47] ACKNOWLEDGEMENT - HP RAID on db1086 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:4 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - 
Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T278226 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [11:19:50] 10SRE, 10ops-eqiad: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (10ops-monitoring-bot) [11:21:11] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (10Marostegui) a:03wiki_willy @wiki_willy this host will be decommissioned in a few weeks, but I would like this disk to be replaced (it is out of warranty) with some used ones if we still have. This host is a st... [11:21:28] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (10Marostegui) p:05Triage→03Medium [11:21:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 25%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15040 and previous config saved to /var/cache/conftool/dbconfig/20210323-112133-root.json [11:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:42] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [11:22:16] (03PS1) 10Effie Mouzeli: hiera/nodes: put mc1037 into our session redis and memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/674282 (https://phabricator.wikimedia.org/T278225) [11:22:25] (03CR) 10Elukey: [C: 03+1] aqs_next: add more hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [11:22:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:28] Lucas_WMDE: please let me know if you're able to do the 
deployment [11:23:36] or if i should reschedule [11:24:36] I can probably do it now [11:24:48] MatmaRex: I assume you’ll be able to test it on mwdebug? [11:25:02] yeah [11:25:54] (03PS2) 10Lucas Werkmeister (WMDE): Enable DiscussionTools' beta features on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674098 (https://phabricator.wikimedia.org/T276494) (owner: 10Bartosz Dziewoński) [11:26:00] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Enable DiscussionTools' beta features on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674098 (https://phabricator.wikimedia.org/T276494) (owner: 10Bartosz Dziewoński) [11:27:21] (03Merged) 10jenkins-bot: Enable DiscussionTools' beta features on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674098 (https://phabricator.wikimedia.org/T276494) (owner: 10Bartosz Dziewoński) [11:27:23] (03CR) 10Filippo Giunchedi: "LGTM, yeah I think that's fine as it is" [puppet] - 10https://gerrit.wikimedia.org/r/674279 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [11:27:44] MatmaRex: alright, it should be live on mwdebug1001 now [11:27:49] * Lucas_WMDE also tries it [11:28:21] seems to be working [11:29:00] yup, syncing [11:29:40] (03PS3) 10Hnowlan: aqs_next: add more hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) [11:30:25] !log lucaswerkmeister-wmde@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:674098|Enable DiscussionTools' beta features on dewiki (T276494)]] (duration: 00m 58s) [11:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:34] T276494: Make the Reply and New Discussion Tools available as Beta Features at de.wiki - https://phabricator.wikimedia.org/T276494 [11:30:39] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1016.eqiad.wmnet, 
kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1005.eqiad. [11:30:39] down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:30:41] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1006.eqiad. [11:30:41] down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:30:45] thanks Lucas_WMDE! [11:31:17] np, excited to see this rolling out to more wikis :) [11:31:41] (03CR) 10Hnowlan: [C: 03+2] aqs_next: add more hosts to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/674281 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [11:31:49] !log EU backport&config window done [11:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:53] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:32:55] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:34:15] (03CR) 10Elukey: [C: 03+1] "LGTM, I had a chat with Effie on IRC about the rollout plan (disable puppet on mw nodes first, run puppet on mc1037 since it is still with" [puppet] - 10https://gerrit.wikimedia.org/r/674282 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [11:36:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 35%: Slowly pool db1165 into s6 T258361', diff saved to 
https://phabricator.wikimedia.org/P15041 and previous config saved to /var/cache/conftool/dbconfig/20210323-113637-root.json [11:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:46] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [11:37:39] (03PS2) 10Effie Mouzeli: hiera/nodes: put mc1037 into our session redis and memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/674282 (https://phabricator.wikimedia.org/T278225) [11:38:29] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on aqs[1012-1015].eqiad.wmnet with reason: New buster hosts, not in use [11:38:30] (03PS1) 10Marostegui: site.pp: Remove duplicate role declaration [puppet] - 10https://gerrit.wikimedia.org/r/674287 (https://phabricator.wikimedia.org/T273566) [11:38:30] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on aqs[1012-1015].eqiad.wmnet with reason: New buster hosts, not in use [11:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:40:23] (03CR) 10Kormat: [C: 03+1] site.pp: Remove duplicate role declaration [puppet] - 10https://gerrit.wikimedia.org/r/674287 (https://phabricator.wikimedia.org/T273566) (owner: 10Marostegui) [11:40:38] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove duplicate role declaration [puppet] - 10https://gerrit.wikimedia.org/r/674287 (https://phabricator.wikimedia.org/T273566) (owner: 10Marostegui) [11:42:03] Lucas_WMDE: oh, you also posted on-wiki about it, thanks :D [11:42:16] yeah, I figured I might as well since it was dewiki ^ [11:42:20] *^^ [11:44:25] (03CR) 
10Effie Mouzeli: "pcc https://puppet-compiler.wmflabs.org/compiler1003/28713/mc1034.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/674282 (https://phabricator.wikimedia.org/T278225) (owner: 10Effie Mouzeli) [11:51:23] (03PS1) 10Effie Mouzeli: hiera/nodes: put mc1038 into our session redis and memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/674290 (https://phabricator.wikimedia.org/T278225) [11:51:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 50%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15042 and previous config saved to /var/cache/conftool/dbconfig/20210323-115141-root.json [11:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:49] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [11:55:02] !log installing libcaca security updates [11:55:02] 10SRE, 10Proton, 10Product-Infrastructure-Team-Backlog (Kanban): Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10MSantos) p:05Triage→03High a:03Jgiannelos [11:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:17] 10SRE, 10Proton, 10Product-Infrastructure-Team-Backlog (Kanban): Proton metrics broken - https://phabricator.wikimedia.org/T277857 (10MSantos) [11:55:54] (03PS1) 10Hnowlan: aqs_next: add additional hiera config for new nodes [puppet] - 10https://gerrit.wikimedia.org/r/674291 (https://phabricator.wikimedia.org/T274119) [11:57:47] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28714/console" [puppet] - 10https://gerrit.wikimedia.org/r/674291 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1200) 
[12:05:34] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1016.eqiad. [12:05:34] down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:06:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 60%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15043 and previous config saved to /var/cache/conftool/dbconfig/20210323-120644-root.json [12:06:47] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: make cpp installation more general [puppet] - 10https://gerrit.wikimedia.org/r/674292 [12:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:53] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [12:07:08] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:07:21] (03PS2) 10Arturo Borrero Gonzalez: sonofgridengine: install cpp at the same time as 'gridengine-common' [puppet] - 10https://gerrit.wikimedia.org/r/674292 [12:07:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: install cpp at the same time as 'gridengine-common' [puppet] - 10https://gerrit.wikimedia.org/r/674292 (owner: 10Arturo Borrero Gonzalez) [12:09:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:09:23] !log drain ganeti2016 [12:09:29] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:16:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 25%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15044 and previous config saved to /var/cache/conftool/dbconfig/20210323-121642-root.json [12:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:02] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet [12:17:03] !log remove all schedule downtimes for k8s cluster. T277741 [12:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:15] T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 [12:17:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:19:04] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [12:19:40] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [12:20:32] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 25%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15045 and previous config saved to /var/cache/conftool/dbconfig/20210323-122032-root.json [12:20:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 75%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15046 and previous config saved to /var/cache/conftool/dbconfig/20210323-122148-root.json [12:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:58] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [12:22:10] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:22:12] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikime [12:22:12] l [12:24:59] !log 
jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet [12:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:26] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:26:28] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:27:53] !log drain ganeti2017 [12:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:24] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:31:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:31:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 50%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15047 and previous config saved to /var/cache/conftool/dbconfig/20210323-123146-root.json [12:31:52] (03PS1) 10Esanders: Remove redundant wgDiscussionToolsEnable overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674300 [12:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove helmfile.d/admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/673956 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [12:35:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 50%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15048 and 
previous config saved to /var/cache/conftool/dbconfig/20210323-123535-root.json [12:35:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:33] (03Merged) 10jenkins-bot: Remove helmfile.d/admin [deployment-charts] - 10https://gerrit.wikimedia.org/r/673956 (https://phabricator.wikimedia.org/T277741) (owner: 10JMeybohm) [12:36:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 85%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15049 and previous config saved to /var/cache/conftool/dbconfig/20210323-123651-root.json [12:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:03] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [12:40:22] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.8571 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [12:41:22] (03PS1) 10Marostegui: install_server: Do not format db1181 [puppet] - 10https://gerrit.wikimedia.org/r/674301 (https://phabricator.wikimedia.org/T275633) [12:41:56] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [12:42:20] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet [12:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:12] (03CR) 10Marostegui: [C: 03+2] install_server: Do not format db1181 
[puppet] - 10https://gerrit.wikimedia.org/r/674301 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [12:43:32] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:43:33] (03PS1) 10Muehlenhoff: Failover tendril to dbmonitor1002 [dns] - 10https://gerrit.wikimedia.org/r/674303 (https://phabricator.wikimedia.org/T224589) [12:46:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1148 (re)pooling @ 75%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15050 and previous config saved to /var/cache/conftool/dbconfig/20210323-124650-root.json [12:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet [12:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:36] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [12:50:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 75%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15051 and previous config saved to /var/cache/conftool/dbconfig/20210323-125039-root.json [12:50:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1165 (re)pooling @ 100%: Slowly pool db1165 into s6 T258361', diff saved to https://phabricator.wikimedia.org/P15052 and previous config saved to /var/cache/conftool/dbconfig/20210323-125155-root.json [12:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:05] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [12:55:37] (03CR) 10Marostegui: [C: 03+1] "Let's merge it tomorrow early morning as we agreed on IRC" [dns] - 10https://gerrit.wikimedia.org/r/674303 (https://phabricator.wikimedia.org/T224589) (owner: 10Muehlenhoff) [12:56:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:58:00] !log remove and decomission argon, chroline, acrab, acrux T277741, T277191 [12:58:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:09] T277191: Update Kubernetes cluster codfw to kubernetes 1.16 - https://phabricator.wikimedia.org/T277191 [12:58:11] T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 [12:58:13] !log drain ganeti2018 [12:58:18] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:50] (03CR) 10Elukey: [C: 03+2] Add config for prometheus@k8s-mlserve [puppet] - 10https://gerrit.wikimedia.org/r/674279 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [13:00:05] hashar and dancy: That opportune time is upon us again. Time for a Mediawiki train - European+American Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1300). [13:00:22] (03PS1) 10Alexandros Kosiaris: Decommission argon, chlorine, acrab, acrux [puppet] - 10https://gerrit.wikimedia.org/r/674307 (https://phabricator.wikimedia.org/T277741) [13:00:57] (03PS38) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [13:00:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:14] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1003.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1006.eqiad.wmnet are marked down but pooled https://wikitech.wikime [13:01:14] l [13:01:16] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikime [13:01:17] l [13:01:54] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db1148 (re)pooling @ 100%: Slowly repool db1148 after schema change', diff saved to https://phabricator.wikimedia.org/P15053 and previous config saved to /var/cache/conftool/dbconfig/20210323-130153-root.json [13:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:32] PROBLEM - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.23 and port 4006: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:04:50] RECOVERY - LVS linkrecommendation-external eqiad port 4006/tcp - Link Recommendation- public release- linkrecommendation.svc.eqiad.wmnet IPv4 on linkrecommendation.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 196 bytes in 7.257 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:05:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1086 (re)pooling @ 100%: Slowly repool db1086 after cloning db1181', diff saved to https://phabricator.wikimedia.org/P15054 and previous config saved to /var/cache/conftool/dbconfig/20210323-130543-root.json [13:05:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:58] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09524 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [13:12:30] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:13:07] jayme, akosiaris are the linkreccomendation flaps expected? 
[13:13:24] I think we got a task for that already [13:13:32] it's weird though [13:13:46] I'll double check [13:14:03] https://phabricator.wikimedia.org/T278223 [13:14:42] akosiaris: ah ok sorry I thought it was related to the k8s upgrade [13:14:44] thanks :) [13:15:34] !log cp3054: install varnishkafka built explicitly against varnish 6.0.1-1wm2 to fix broken dpkg status T264398 [13:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:48] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 [13:16:14] moritzm: this fixes dpkg on cp3054, I've tried upgrading screen and it worked fine ^ [13:17:32] going to run train for test wikis [13:17:37] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet [13:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:57] ema: ack, thx! [13:18:41] (03PS1) 10Hashar: testwikis wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674310 [13:18:43] (03CR) 10Hashar: [C: 03+2] testwikis wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674310 (owner: 10Hashar) [13:19:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:19:30] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674310 (owner: 10Hashar) [13:20:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:21:51] (03CR) 10Jbond: [C: 03+1] "lgtm some minor nits" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [13:23:10] !log hashar@deploy1002 Started scap: Promote testwikis from 1.36.0-wmf.35 to 1.36.0-wmf.36 - T274940 [13:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:18] T274940: 1.36.0-wmf.36 deployment blockers - https://phabricator.wikimedia.org/T274940 [13:25:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet [13:26:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:35] (03PS1) 10Elukey: role::ml_k8s::master: add prometheus instance endpoint [puppet] - 10https://gerrit.wikimedia.org/r/674313 (https://phabricator.wikimedia.org/T272918) [13:27:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, few typos inline." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [13:27:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:00] !log drain ganeti2008 [13:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:16] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28715/console" [puppet] - 10https://gerrit.wikimedia.org/r/674313 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [13:31:18] (03CR) 10Elukey: [C: 03+1] aqs_next: add additional hiera config for new nodes [puppet] - 10https://gerrit.wikimedia.org/r/674291 (https://phabricator.wikimedia.org/T274119) (owner: 10Hnowlan) [13:31:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:30] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:32:58] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28716/console" [puppet] - 10https://gerrit.wikimedia.org/r/674313 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [13:35:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:35:51] (03CR) 10Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 
10Ladsgroup) [13:37:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:40:40] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1003.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1005.eqiad. [13:40:40] down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:40:46] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1003.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1016.eqiad. 
[13:40:46] down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:41:20] I 'll silence this while working on https://phabricator.wikimedia.org/T278223 [13:41:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:34] (03PS1) 10Ayounsi: JSON schema, add coverage to secrets [homer/public] - 10https://gerrit.wikimedia.org/r/674318 (https://phabricator.wikimedia.org/T272688) [13:52:09] (03PS1) 10Palak199: Fix:: InvalidQueryException handling [software/transferpy] - 10https://gerrit.wikimedia.org/r/674319 (https://phabricator.wikimedia.org/T268258) [13:52:25] (03PS2) 10Ayounsi: JSON schema, add coverage to secrets [homer/public] - 10https://gerrit.wikimedia.org/r/674318 (https://phabricator.wikimedia.org/T272688) [13:53:55] (03PS1) 10Filippo Giunchedi: prometheus: read alerts from /srv/alerts [puppet] - 10https://gerrit.wikimedia.org/r/674321 (https://phabricator.wikimedia.org/T272977) [13:54:33] !log sudo systemctl reload apache2 on prometheus[12]00[34] to pick up new k8s-mlserve instance settings [13:54:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:05] !log hashar@deploy1002 Finished scap: Promote testwikis from 1.36.0-wmf.35 to 1.36.0-wmf.36 - T274940 (duration: 31m 57s) [13:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:13] T274940: 1.36.0-wmf.36 deployment blockers - https://phabricator.wikimedia.org/T274940 [13:55:28] RECOVERY - 
PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:55:32] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:55:40] (03CR) 10Jbond: etcd: Use cfssl for peer-to-peer communication (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674077 (owner: 10Majavah) [13:58:08] (03CR) 10Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051) (owner: 10Andrew Bogott) [13:59:24] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host ganeti2008.codfw.wmnet [13:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:34] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:01:13] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (10wiki_willy) a:05wiki_willy→03Cmjohnson Hi @Cmjohnson - since this host is out of warranty, can you grab a drive from a decom'd server for this one? 
Thanks, Willy [14:04:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2008.codfw.wmnet [14:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:59] stupid train [14:05:12] no doing group0 [14:05:14] !log installing pygments security updates on stretch [14:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:35] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=apertium [14:05:35] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=api-gateway [14:05:36] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=blubberoid [14:05:36] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=citoid [14:05:36] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=cxserver [14:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:46] (03PS1) 10Hashar: group0 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674325 [14:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:50] (03CR) 10Hashar: [C: 03+2] group0 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674325 (owner: 10Hashar) [14:05:52] now doing group 0 [14:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:54] !log pool a few services in eqiad k8s. 
T277741 [14:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:13] T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 [14:06:30] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.36 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674325 (owner: 10Hashar) [14:07:01] (03CR) 10Jbond: [C: 03+1] statistics: Migrate wmde cronjobs to systemd timers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:07:34] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikime [14:07:34] l [14:07:48] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.36 [14:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:42] (03CR) 10Elukey: [V: 03+1 C: 03+2] role::ml_k8s::master: add prometheus instance endpoint [puppet] - 10https://gerrit.wikimedia.org/r/674313 (https://phabricator.wikimedia.org/T272918) (owner: 10Elukey) [14:10:08] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/674165 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [14:10:38] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3175 gt 0.3 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:11:42] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1003.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad. [14:11:42] down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:12:25] (03PS39) 10Jbond: netbase: add new module to manage /etc/services [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) [14:12:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:45] (03CR) 10Jbond: "thanks fixed" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [14:14:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks great!" 
[puppet] - 10https://gerrit.wikimedia.org/r/670917 (https://phabricator.wikimedia.org/T277146) (owner: 10Jbond) [14:17:44] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:17:48] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:19:29] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=similar-users [14:19:29] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=termbox [14:19:29] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=wikifeeds [14:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:03] !log pool a few more services in eqiad k8s. T277741 [14:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:11] T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 [14:20:36] (03CR) 10Effie Mouzeli: [C: 03+1] "Very minor nits 😊" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) (owner: 10Jgiannelos) [14:25:24] (03CR) 10Effie Mouzeli: [C: 04-1] "just noticed it was -1'ed as it is WIP, sorry!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) (owner: 10Jgiannelos) [14:29:35] (03CR) 10Ladsgroup: statistics: Migrate wmde cronjobs to systemd timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:30:09] (03CR) 10Andrew Bogott: [C: 03+1] wmcs.ceph.codfw: Upgrade to latest 5.X kernel [puppet] - 10https://gerrit.wikimedia.org/r/674074 (https://phabricator.wikimedia.org/T274565) (owner: 10David Caro) [14:31:47] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.619 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:33:10] (03PS1) 10Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/674350 (https://phabricator.wikimedia.org/T278223) [14:34:37] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:34:39] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - linkrecommendation-external_4006: Servers kubernetes1010.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:34:43] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07937 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:34:43] (03CR) 10Kosta Harlan: "I don't have time to deploy this now, but could do later this evening. In the meantime, Alexandros or Janis: if you want to deploy this ea" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674350 (https://phabricator.wikimedia.org/T278223) (owner: 10Kosta Harlan) [14:41:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] Decommission argon, chlorine, acrab, acrux [puppet] - 10https://gerrit.wikimedia.org/r/674307 (https://phabricator.wikimedia.org/T277741) (owner: 10Alexandros Kosiaris) [14:42:32] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventstreams [14:42:33] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=eventstreams-internal [14:42:33] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=linkrecommendation [14:42:33] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mathoid [14:42:34] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=mobileapps [14:42:34] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=proton [14:42:34] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=push-notifications [14:42:35] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=recommendation-api [14:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:01] !log pool more services in eqiad k8s. T277741. 
Only the very large ones traffic wise are still on codfw [14:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:35] T277741: Update Kubernetes cluster eqiad to kubernetes 1.16 - https://phabricator.wikimedia.org/T277741 [14:44:13] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Sure, I 'll take this over and deploy to see if it fixes it. If not, I 'll revert." [deployment-charts] - 10https://gerrit.wikimedia.org/r/674350 (https://phabricator.wikimedia.org/T278223) (owner: 10Kosta Harlan) [14:44:18] 10SRE, 10MediaWiki-General, 10observability, 10serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10AMooney) @lmata, This needs PET code review correct? [14:44:27] (03PS4) 10ArielGlenn: [WIP] needs more tests [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396) [14:45:38] (03Merged) 10jenkins-bot: linkrecommendation: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/674350 (https://phabricator.wikimedia.org/T278223) (owner: 10Kosta Harlan) [14:46:31] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [14:46:31] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [14:46:31] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[14:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:15] (03PS1) 10Nikerabbit: MassMessage: Unbreak remote content fetching [extensions/MassMessage] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674366 (https://phabricator.wikimedia.org/T276936) [14:48:33] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:49:02] (03PS1) 10Nikerabbit: MassMessage: Unbreak remote content fetching [extensions/MassMessage] (wmf/1.36.0-wmf.35) - 10https://gerrit.wikimedia.org/r/674367 (https://phabricator.wikimedia.org/T276936) [14:49:11] 10SRE, 10MediaWiki-General, 10observability, 10serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10lmata) @AMooney yes please [14:53:35] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . [14:53:35] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [14:53:35] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . 
[14:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:58] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:04:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:23] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'external' . [15:10:23] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'staging' . [15:10:23] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'linkrecommendation' for release 'production' . 
[15:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:43] i am doing some work in rack A3 in case you see mgmt issue on some servers [15:22:33] (03PS1) 10Alexandros Kosiaris: Revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674369 [15:24:50] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:41] !log reindexing Italian wikis on elastic@eqiad, elastic@codfw, and cloudelastic complete (T274200) [15:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:53] T274200: Reindex English and Italian wikis to enable homoglyph plugin - https://phabricator.wikimedia.org/T274200 [15:26:25] (03PS1) 10Ahmon Dancy: logspam.pl: Update execution time limit regexp [puppet] - 10https://gerrit.wikimedia.org/r/674387 [15:26:43] (03PS2) 10Alexandros Kosiaris: Revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674369 [15:27:32] (03PS3) 10Alexandros Kosiaris: Partially revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674369 [15:27:36] (03PS4) 10Alexandros Kosiaris: Partially revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674369 [15:29:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] Partially revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/674369 (owner: 10Alexandros Kosiaris) [15:31:02] !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: name=eqiad,dnsdisc=echostore [15:31:03] (03Merged) 10jenkins-bot: Partially revert "Update cxserver to 2021-03-15-131520-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/674369 (owner: 10Alexandros Kosiaris) [15:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:19] !log pool echostore for eqiad (the first of the larger services traffic wise) [15:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:35] !log installing libsdl2 security updates [15:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:31] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [15:36:31] !log akosiaris@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [15:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:19] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [15:38:19] !log akosiaris@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [15:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:13] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'production' . [15:39:13] !log akosiaris@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . 
[15:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:56] (03PS1) 10Muehlenhoff: Add library hint for libsdl2 [puppet] - 10https://gerrit.wikimedia.org/r/674391 [15:51:21] (03PS1) 10Muehlenhoff: Remove deployment-logstash2 [puppet] - 10https://gerrit.wikimedia.org/r/674392 (https://phabricator.wikimedia.org/T238707) [15:52:17] (03CR) 10Majavah: [C: 04-1] Remove deployment-logstash2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674392 (https://phabricator.wikimedia.org/T238707) (owner: 10Muehlenhoff) [15:53:33] (03CR) 10Muehlenhoff: Remove deployment-logstash2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674392 (https://phabricator.wikimedia.org/T238707) (owner: 10Muehlenhoff) [15:54:07] (03PS2) 10Muehlenhoff: Remove deployment-logstash2 [puppet] - 10https://gerrit.wikimedia.org/r/674392 (https://phabricator.wikimedia.org/T238707) [15:55:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:05] jbond42 and cdanis: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1600). [16:00:05] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[16:00:26] o/ [16:01:01] j.bond may be out, I'm happy to deploy if cdanis would like [16:01:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:37] rzl: thanks fyi lgtm [16:04:00] (03CR) 10Muehlenhoff: "deployment-logstash2 will be going away tomorrow: I'm adding a few people who might have created/used the striker, phabricator, ores, wiki" [puppet] - 10https://gerrit.wikimedia.org/r/674392 (https://phabricator.wikimedia.org/T238707) (owner: 10Muehlenhoff) [16:05:18] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libsdl2 [puppet] - 10https://gerrit.wikimedia.org/r/674391 (owner: 10Muehlenhoff) [16:05:20] (03PS1) 10Ladsgroup: mailman3: Add ferm [puppet] - 10https://gerrit.wikimedia.org/r/674393 (https://phabricator.wikimedia.org/T277286) [16:06:14] (03CR) 10Brennen Bearnes: [C: 03+1] logspam.pl: Update execution time limit regexp [puppet] - 10https://gerrit.wikimedia.org/r/674387 (owner: 10Ahmon Dancy) [16:07:06] cdanis: I am boldly taking over your deploy window :) [16:07:09] tgr_: going ahead, stand by [16:07:48] I guess for a maintenance job there's not much for you to verify, but thanks for being around anyhow! [16:08:14] yeah, it isn't testable [16:08:26] (03CR) 10RLazarus: [C: 03+2] Update GrowthExperiments cronjob parameters [puppet] - 10https://gerrit.wikimedia.org/r/673631 (https://phabricator.wikimedia.org/T275171) (owner: 10Gergő Tisza) [16:10:00] merged -- the next puppet run on mwmaint1002 is scheduled for ~7 minutes, before the next time refreshLinkRecommendations fires, so I won't bother running it manually [16:11:02] thanks! 
[16:11:10] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3651 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:12:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:42] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:22:00] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3016 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:22:43] ouch --^ [16:23:31] <_joe_> ok, lemme take a look [16:24:13] <_joe_> the latency doesn't show anything worrisome [16:25:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:26:17] <_joe_> elukey: I can't see the problem looking at grafana [16:26:39] <_joe_> there was some at 16:10 [16:30:27] _joe_ yeah seems that the alert is a false positive? T [16:32:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:55] 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10Joe) Trying to break down my current thoughts: ### Onhost memcached In terms of functionality, I don't see a difference between being a DaemonSet and running on the host its... 
[16:41:58] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1086 - https://phabricator.wikimedia.org/T278226 (10Cmjohnson) 05Open→03Resolved Disk replaced with a disk from decom'd db host [16:43:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: 2021-03-31) rack/setup/install cloudgw100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T272403 (10Cmjohnson) Fixed the primary port for cloudgw1001 The secondary port cable number is 5321 in xe-0/0/19 [16:44:16] (03Abandoned) 10Hnowlan: aqs: move password to hieradata rather than password module [labs/private] - 10https://gerrit.wikimedia.org/r/672441 (https://phabricator.wikimedia.org/T257572) (owner: 10Hnowlan) [16:48:29] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3016 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:56:22] (03CR) 10Bstorm: [C: 03+2] maintain-dbusers: rely on the global_id, not username for paws [puppet] - 10https://gerrit.wikimedia.org/r/674165 (https://phabricator.wikimedia.org/T276284) (owner: 10Bstorm) [16:58:17] (03PS1) 10C. Scott Ananian: Bump wikimedia/parsoid to 0.13.0-a29 [vendor] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674373 (https://phabricator.wikimedia.org/T275918) [16:59:18] 10SRE, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Traffic, 10Epic: Support maps serving for affiliate sites via an allow list - https://phabricator.wikimedia.org/T261694 (10Slaporte) On April 5, 2020, we will update the [Wikimedia Maps Terms of Use](https://meta.wikimedia.org/wiki/Revised_Maps_Term... 
[16:59:50] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install (2) new 10G switches - https://phabricator.wikimedia.org/T277340 (10wiki_willy) Hi @ayounsi - just to follow up on this, we should probably wait a bit longer on determining which racks to convert to 10g (after John and Chris can wrap up all the... [17:00:04] chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1700). [17:00:58] 10SRE, 10vm-requests, 10GitLab (Initialization), 10Patch-For-Review, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10wkandek) 05Stalled→03Resolved Let's close and reopen if a second server becomes necessary. [17:02:01] (03CR) 10Cwhite: [C: 03+2] pontoon: initial hiera config for pontoon env [puppet] - 10https://gerrit.wikimedia.org/r/669968 (owner: 10Cwhite) [17:04:18] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07937 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:05:31] (03CR) 10Subramanya Sastry: [C: 03+1] Bump wikimedia/parsoid to 0.13.0-a29 [vendor] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674373 (https://phabricator.wikimedia.org/T275918) (owner: 10C. 
Scott Ananian) [17:08:55] 10SRE, 10MediaWiki-General, 10observability, 10serviceops, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (10AMooney) [17:11:20] 10SRE: Rewrite http://download.wikimedia.org/mediawiki/ -> https://releases.wikimedia.org/mediawiki in less than 3 redirects - https://phabricator.wikimedia.org/T119679 (10hashar) [17:12:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:29] (03CR) 10Cwhite: [C: 03+1] prometheus: read alerts from /srv/alerts [puppet] - 10https://gerrit.wikimedia.org/r/674321 (https://phabricator.wikimedia.org/T272977) (owner: 10Filippo Giunchedi) [17:21:23] 10SRE, 10serviceops: Memcached, mcrouter, nutcracker's future in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10JMeybohm) ### onhost memcached >It's still an open question how we will inject the node IP into the mcrouter configuration. it would mean we'd need to pass the host IP as an e... [17:38:28] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6825 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:43:11] Amir1: hello, would you like to do the wmde/statistics cron switch now? 
[17:43:48] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [17:45:02] mutante: sure! [17:45:21] Amir1: so, I know you have shell but not root shell, so let's do it together [17:45:50] cool [17:45:55] it's only on stat1007 [17:46:15] that's what I was trying to grep right now, thx [17:46:22] dont see where the class is used [17:46:57] it's a bit complicated, it's in misc jobs of statistcs explorer [17:47:10] and that changes using a hiera config [17:47:33] compiling on 1007. it's just unusual that I cant even find anything including wmde::graphite [17:47:36] one sec [17:47:39] modules/profile/manifests/statistics/explorer/misc_jobs.pp [17:47:41] look [17:47:57] modules/profile/manifests/statistics/explorer/misc_jobs.pp [17:48:06] sorry double paste [17:48:07] if $::hostname in $hosts_with_jobs [17:48:22] and that came from profile::statistics::explorer::misc_jobs::hosts_with_jobs [17:48:36] which has "statistics::wmde" [17:48:39] mutante: ^ [17:49:01] ACK, i see it now, thx [17:49:33] https://puppet-compiler.wmflabs.org/compiler1001/28717/stat1007.eqiad.wmnet/index.html [17:49:45] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/28717/stat1007.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [17:50:45] (03PS9) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) [17:51:26] where are the sytemd timer logs? 
[17:51:29] I keep forgetting [17:52:13] Amir1: first: https://phabricator.wikimedia.org/P15055 [17:52:35] Amir1: [stat1007:~] $ sudo systemctl list-timers | grep wmde [17:53:10] I can't run that command [17:53:11] [stat1007:~] $ sudo systemctl status wmde-analytics-daily-noon.service [17:53:58] Amir1: let me paste more things.. sec [17:55:28] (03PS10) 10Cwhite: logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) [17:59:38] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [18:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1800). [18:00:05] cscott and legoktm: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:22] o/ [18:00:25] legoktm: hi! [18:00:30] legoktm: will you deploy, or should I? [18:01:31] I can take care of it [18:01:35] legoktm: go ahead then [18:02:39] o/ [18:03:18] (03CR) 10Legoktm: [C: 03+2] Bump wikimedia/parsoid to 0.13.0-a29 [vendor] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674373 (https://phabricator.wikimedia.org/T275918) (owner: 10C. 
Scott Ananian) [18:04:26] (03PS2) 10Legoktm: Add irc2001.wikimedia.org (running buster) as second irc server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672687 (https://phabricator.wikimedia.org/T224579) (owner: 10Muehlenhoff) [18:04:29] (03CR) 10Legoktm: [C: 03+2] Add irc2001.wikimedia.org (running buster) as second irc server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672687 (https://phabricator.wikimedia.org/T224579) (owner: 10Muehlenhoff) [18:05:00] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:06:42] (03Merged) 10jenkins-bot: Add irc2001.wikimedia.org (running buster) as second irc server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/672687 (https://phabricator.wikimedia.org/T224579) (owner: 10Muehlenhoff) [18:07:49] testing the irc2001 change on mwdebug1002... [18:08:53] (03PS1) 10MSantos: mobileapps: bump to 2021-03-19-113347-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674406 [18:10:03] syncing [18:10:38] !log legoktm@deploy1002 Synchronized wmf-config/ProductionServices.php: Add irc2001.wikimedia.org (running buster) as second irc server (T224579) (duration: 01m 08s) [18:10:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:49] T224579: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 [18:12:50] (03CR) 10MSantos: [C: 03+2] mobileapps: bump to 2021-03-19-113347-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674406 (owner: 10MSantos) [18:14:04] (03PS3) 10Ottomata: Remove schema overrides for 6 finished EL migrations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673075 (https://phabricator.wikimedia.org/T267347) [18:14:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:14:16] (03Merged) 10jenkins-bot: mobileapps: bump to 2021-03-19-113347-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674406 (owner: 10MSantos) [18:15:14] Urbanecm: legoktm i have a config patch i'd like to deploy after the window is done, let me know when it looks clear. :) [18:15:15] (03PS1) 10MSantos: wikifeeds: bump to 2021-03-19-113019-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674409 [18:15:22] 10SRE, 10Wikimedia-IRC-RC-Server: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10Legoktm) Events are now going to irc2001.wikimedia.org. I watched `#en.wikipedia` on both kraz and irc2001 for a few minutes and saw identical output (note that channels won't exist on the... [18:15:34] ack, just waiting on CI right now [18:16:38] !log mbsantos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [18:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:27] (03CR) 10MSantos: [C: 03+2] wikifeeds: bump to 2021-03-19-113019-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674409 (owner: 10MSantos) [18:18:08] !log mbsantos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . 
[18:18:09] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) install second SSD into payments100[5-8] - https://phabricator.wikimedia.org/T278250 (10RobH) [18:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:17] 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) install second SSD into payments100[5-8] - https://phabricator.wikimedia.org/T278250 (10RobH) [18:18:58] (03Merged) 10jenkins-bot: wikifeeds: bump to 2021-03-19-113019-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/674409 (owner: 10MSantos) [18:19:38] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09524 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [18:20:17] !log mbsantos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [18:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:50] what's msantos's IRC nick? [18:21:24] @seen mbsantos [18:21:24] mutante: Last time I saw mbsantos they were changing the nickname to thesocialdev and thesocialdev is still in the channel #wikimedia-overflow at 2/26/2021 10:18:33 AM (25d8h2m51s ago) [18:21:31] * thesocialdev msantos [18:21:32] legoktm: ^ [18:22:07] legoktm: information extracted from officewiki::Contacts [18:22:42] @legoktm I just realised I mistaken the service deploy window [18:23:32] thesocialdev: ok, that was what I was going to ask about :) [18:23:38] mutante: ty [18:24:10] I'll update the contact in officewiki as well, thanks for the reminder [18:24:44] :) [18:28:13] (03Merged) 10jenkins-bot: Bump wikimedia/parsoid to 0.13.0-a29 [vendor] (wmf/1.36.0-wmf.36) - 10https://gerrit.wikimedia.org/r/674373 (https://phabricator.wikimedia.org/T275918) (owner: 10C. 
Scott Ananian) [18:31:37] whee [18:31:48] cscott: does mwdebug work for parsoid changes? or do I just sync it out? [18:32:46] (03CR) 10Dzahn: "[stat1007:~] $ for wmdetimer in analytics-minutely analytics-daily-early analytics-daily-noon analytics-weekly toolkit-analyzer-build; do " [puppet] - 10https://gerrit.wikimedia.org/r/674195 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [18:34:23] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10wkandek) 1. Option - "Gitlab default" use 22 on the IP for Gitlab, mix with admin traffic in 2. Option - "Gerrit... [18:36:51] subbu: around? [18:37:14] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:37:34] yes? [18:38:14] subbu: trying to verify the -a29 backport, can it be tested on mwdebug or do I just sync it out? [18:38:18] oh i see your qn. about to cscott ... [18:38:18] I think c.scott went afk [18:38:45] sync it out. it doesn't work with mwdebug since it is a different cluster. 
[18:38:50] ack [18:40:12] syncing [18:40:52] !log legoktm@deploy1002 Synchronized php-1.36.0-wmf.36/vendor/: Bump wikimedia/parsoid to 0.13.0-a29 (duration: 01m 16s) [18:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:06] ottomata: all done [18:42:31] ok thank youy [18:42:41] (03CR) 10Ottomata: [C: 03+2] Remove schema overrides for 6 finished EL migrations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/673075 (https://phabricator.wikimedia.org/T267347) (owner: 10Ottomata) [18:44:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:19] Sorry, my client didnt give me a ping. Yeah, like subbu says, since group0 isn't even deployed yet there's no way to test this. [18:44:30] We'll be watching the train [18:45:05] !log otto@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Remove schema overrides for 6 finished EL migrations - T267347 T271164 T267351 T267348 T267343 T267353 (duration: 01m 07s) [18:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:24] T267348: PrefUpdate Event Platform Migration - https://phabricator.wikimedia.org/T267348 [18:45:24] T271164: DesktopWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T271164 [18:45:24] T267351: SuggestedTagsAction Event Platform Migration - https://phabricator.wikimedia.org/T267351 [18:45:24] T267343: EditAttemptStep Event Platform Migration - https://phabricator.wikimedia.org/T267343 [18:45:25] T267353: VisualEditorFeatureUse Event Platform Migration - https://phabricator.wikimedia.org/T267353 [18:45:25] T267347: MobileWebUIActionsTracking Event Platform Migration - https://phabricator.wikimedia.org/T267347 [18:48:50] (03PS3) 10Ahmon Dancy: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 
(https://phabricator.wikimedia.org/T271274) [18:50:36] (03PS4) 10Ahmon Dancy: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) [18:53:44] legoktm: since it's to the 1.36-wmf.36 branch, it wouldn't even be testable on mwdebug, would it? [18:53:52] as nothing's running -wmf.36 yet [18:54:35] group0 wikis are on wmf.36 already [18:54:59] i thought that train wasn't for another 6 minutes [18:55:22] it ran in European time today [18:55:34] ah, the deployment calendar lied [18:55:45] https://sal.toolforge.org/log/nmdeX3gB8Fs0LHO5GBoC [18:55:46] cscott, parsoid runs on wtp* ... so, we cannot use mwdebug* anyway for verifying. [18:55:48] is the sync finished? as soon as it is I can smoke test group0 at least. [18:55:57] yes [18:56:02] 11:40:52 <+logmsgbot> !log legoktm@deploy1002 Synchronized php-1.36.0-wmf.36/vendor/: Bump wikimedia/parsoid to 0.13.0-a29 (duration: 01m 16s) [18:56:15] subbu: yeah, but my point is even mwdebug doesn't work if you're deploying to an undeployed branch [18:56:24] ok [18:56:47] but anyway, group0 is deployed so i'm going to go run some null edit tests [18:57:04] subbu: could you take a quick look at the logs from group0 and verify there's nothing unexpected [18:58:37] not sure what you mean by logs from group0 .. but i can look at logstash. [19:00:04] hashar and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T1900). [19:00:49] 10SRE, 10Wikimedia-IRC-RC-Server: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10Legoktm) From {T123729}: * Announce in Tech/News, wikitech-l, wikitech-ambassadors that we'll be switching irc.wikimedia.org over to a new server on XX. Include a reminder that clients sho... 
[19:01:31] subbu: anything unusual in logstash caused by a request to mediawiki.org or another group0 wiki. [19:03:06] all good. there is an unrelatd issue which I'll bring up in #mediawiki-core since it pertains to an ongoing flag there already. [19:03:46] basic edit tests on mediawiki.org look good as well. nothing's on fire at least. [19:05:17] 10SRE, 10Wikimedia-IRC-RC-Server: Set up spare irc1001.wikimedia.org in eqiad - https://phabricator.wikimedia.org/T278255 (10Legoktm) [19:06:38] 10SRE, 10Sustainability: Add redundancy to IRC recent changes service - https://phabricator.wikimedia.org/T128592 (10Legoktm) [19:06:41] 10SRE, 10Wikimedia-IRC-RC-Server: Set up spare irc1001.wikimedia.org in eqiad - https://phabricator.wikimedia.org/T278255 (10Legoktm) [19:09:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:11:18] legoktm: https://phabricator.wikimedia.org/T244542 has a patch in review, let's see if we're faster than you removing kraz :P [19:11:48] heh [19:11:57] <3 [19:14:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:20:24] (03CR) 10Legoktm: Include patches in restricted image (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [19:22:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:24:02] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28718/console" [puppet] - 10https://gerrit.wikimedia.org/r/674393 (https://phabricator.wikimedia.org/T277286) (owner: 10Ladsgroup) [19:25:19] am I clear to deploy phatality? Everything stable currently? [19:25:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:26:28] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3333 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:27:58] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:06] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3333 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:33:03] 10SRE: Rewrite http://download.wikimedia.org/mediawiki/ -> https://releases.wikimedia.org/mediawiki in less than 3 redirects - https://phabricator.wikimedia.org/T119679 (10Krinkle) >>! In T119679#1833203, @Krinkle wrote: > […] > > * GET https://download.wikimedia.org/mediawiki > > 1. 301 Permanent Redirect to... 
[19:34:19] (03PS5) 10Ahmon Dancy: Include patches in restricted image [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) [19:35:18] (03CR) 10Ahmon Dancy: Include patches in restricted image (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [19:36:54] (03CR) 10Krinkle: Include patches in restricted image (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [19:38:19] (03CR) 10Ahmon Dancy: Include patches in restricted image (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [19:38:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:39:34] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:44:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:31] (03CR) 10Krinkle: Include patches in restricted image (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674132 (https://phabricator.wikimedia.org/T271274) (owner: 10Ahmon Dancy) [19:51:47] !log jforrester@deploy1002 Started deploy [integration/docroot@9de8c9d]: Add homer-public listing, added by volans [19:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:55] !log jforrester@deploy1002 Finished deploy [integration/docroot@9de8c9d]: Add homer-public listing, added by volans (duration: 00m 08s) [19:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:58:06] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:42] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts auth1002.eqiad.wmnet [20:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:12] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts auth1002.eqiad.wmnet [20:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:26] !log robh@cumin1001 START - Cookbook sre.hosts.decommission for hosts auth1002.eqiad.wmnet [20:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:17] 10SRE, 10ops-eqiad, 10DC-Ops: (Need 
By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10RobH) [20:12:06] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:07] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts auth1002.eqiad.wmnet [20:13:11] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `auth1002.eqiad.wmnet` - auth1002.eqiad.wmnet (**PASS**) - Downtimed host on Ici... [20:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:24] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:05] !log robh@cumin1001 START - Cookbook sre.dns.netbox [20:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:12] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09524 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:24:23] !log robh@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:24:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:49] !log robh@cumin1001 START - Cookbook sre.dns.netbox [20:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:27] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (10Jclark-ctr) moss-fe1001 A2. U42. Port33 ID5341 moss-fe1002. D4. 
U41 Port41 ID5342 [20:26:43] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install moss-fe100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T275511 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [20:32:18] (03CR) 10Cwhite: [C: 03+2] logstash: add and enable dlq max_bytes workaround [puppet] - 10https://gerrit.wikimedia.org/r/673377 (https://phabricator.wikimedia.org/T277775) (owner: 10Cwhite) [20:36:25] (03PS1) 10Cwhite: logstash: only add dlq cleanup if enabled [puppet] - 10https://gerrit.wikimedia.org/r/674429 [20:38:18] (03CR) 10Cwhite: [C: 03+2] logstash: only add dlq cleanup if enabled [puppet] - 10https://gerrit.wikimedia.org/r/674429 (owner: 10Cwhite) [20:39:52] (03PS3) 10Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream [puppet] - 10https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051) [20:39:54] (03PS1) 10Andrew Bogott: fullstack: switch back to the standard image [puppet] - 10https://gerrit.wikimedia.org/r/674430 (https://phabricator.wikimedia.org/T278051) [20:40:53] (03CR) 10Andrew Bogott: [C: 03+2] fullstack: switch back to the standard image [puppet] - 10https://gerrit.wikimedia.org/r/674430 (https://phabricator.wikimedia.org/T278051) (owner: 10Andrew Bogott) [20:41:57] (03PS5) 10ArielGlenn: [WIP] needs more tests and some cleanup, ewww [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396) [20:42:45] (03CR) 10jerkins-bot: [V: 04-1] [WIP] needs more tests and some cleanup, ewww [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [20:45:20] (03CR) 10Hoo man: [C: 03+1] "Tested with testwikidata (and a small enough batch size that makes sure we need to separate batches):" [puppet] - 10https://gerrit.wikimedia.org/r/673679 (https://phabricator.wikimedia.org/T277300) (owner: 10Hoo man) [20:48:09] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 
2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10wkandek) To provide as little obstacles for developers as possible access through port 22 is the preferred optio... [20:48:28] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10wkandek) a:03Dzahn [20:50:03] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (10Jclark-ctr) bast1003 Rack D1 U41 Port24 ID3161 [20:50:05] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install bast1003.wikimedia.org - https://phabricator.wikimedia.org/T276396 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [20:53:32] (03PS1) 10Cwhite: logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431 [20:56:40] 10SRE, 10Prod-Kubernetes, 10SRE-tools: Support downtiming services in our cookbooks - https://phabricator.wikimedia.org/T277740 (10Volans) a:03Volans [20:59:59] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:32] !log robh@cumin1001 START - Cookbook sre.dns.netbox [21:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:18] (03PS1) 10Subramanya Sastry: Checkout master branch of testreduce always [puppet] - 10https://gerrit.wikimedia.org/r/674433 [21:02:20] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3016 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [21:04:14] (03PS2) 10Cwhite: logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431 
[21:04:39] (03CR) 10jerkins-bot: [V: 04-1] logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431 (owner: 10Cwhite) [21:04:52] (03PS6) 10ArielGlenn: [WIP] needs more tests and some cleanup, ewww [dumps] - 10https://gerrit.wikimedia.org/r/646998 (https://phabricator.wikimedia.org/T252396) [21:04:58] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:49] (03PS3) 10Cwhite: logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431 [21:08:12] (03PS1) 10RobH: pki-root1001 taking over auth1002 chassis [puppet] - 10https://gerrit.wikimedia.org/r/674434 (https://phabricator.wikimedia.org/T276625) [21:09:43] !log ppchelko@deploy1002 Started deploy [restbase/deploy@531c474]: Add pageviews top-per-country endpoint [21:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:10:15] (03CR) 10RobH: [C: 03+2] pki-root1001 taking over auth1002 chassis [puppet] - 10https://gerrit.wikimedia.org/r/674434 (https://phabricator.wikimedia.org/T276625) (owner: 10RobH) [21:10:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:43] (03PS4) 10Cwhite: logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431 [21:12:26] (03CR) 10Cwhite: [C: 03+2] logstash: arrange systemd::timer::job resource around cron_splay [puppet] - 10https://gerrit.wikimedia.org/r/674431 (owner: 10Cwhite) [21:16:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:27:41] !log ppchelko@deploy1002 Finished deploy 
[restbase/deploy@531c474]: Add pageviews top-per-country endpoint (duration: 17m 58s) [21:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:52] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:28:33] (03CR) 10Dzahn: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm) [21:30:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:33:09] volans: is there a workflow to get a second IP on an existing host? like the netbox part of it [21:40:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:31] volans: already talked about it with Cas on another channel, good for now, thx [21:43:55] (03PS4) 10Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream [puppet] - 10https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051) [21:45:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:51] (03PS5) 10Andrew Bogott: wmcs: add new admin script to create a new base image based off of upstream [puppet] - 10https://gerrit.wikimedia.org/r/674184 (https://phabricator.wikimedia.org/T278051) [21:48:06] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.07937 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [21:48:43] 10SRE, 10GitLab
(Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) Should it be gitlab.wikimedia.org for both, https and ssh? (so both the webserver and second sshd would l... [21:54:26] (03PS1) 10Dzahn: drop gitlab CNAME, in favor of service name on separate IP [dns] - 10https://gerrit.wikimedia.org/r/674439 (https://phabricator.wikimedia.org/T276148) [21:55:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:55:59] (03CR) 10Dzahn: [C: 03+2] "not used yet" [dns] - 10https://gerrit.wikimedia.org/r/674439 (https://phabricator.wikimedia.org/T276148) (owner: 10Dzahn) [21:56:06] (03PS2) 10Dzahn: drop gitlab CNAME, in favor of service name on separate IP [dns] - 10https://gerrit.wikimedia.org/r/674439 (https://phabricator.wikimedia.org/T276148) [21:56:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:58:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:03:20] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:05:56] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [22:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:08] (03CR) 10Arlolra: [C: 03+1] Checkout master branch of testreduce always [puppet] - 10https://gerrit.wikimedia.org/r/674433 (owner: 10Subramanya Sastry) [22:13:12] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Dzahn) ` +gitlab 1H IN A 208.80.154.14... 
[22:13:26] (03CR) 10Dzahn: [C: 03+2] Checkout master branch of testreduce always [puppet] - 10https://gerrit.wikimedia.org/r/674433 (owner: 10Subramanya Sastry) [22:14:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:15:27] (03CR) 10Dzahn: "deployed on testreduce1001, it's not a puppet change unless you delete the dir and let it reclone" [puppet] - 10https://gerrit.wikimedia.org/r/674433 (owner: 10Subramanya Sastry) [22:16:18] (03PS1) 10Legoktm: releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443 [22:16:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:07] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@3fd7d7b]: partition ores dumps by namespace [22:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:14] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@3fd7d7b]: partition ores dumps by namespace (duration: 02m 07s) [22:19:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:23:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:23:03] (03PS2) 10Legoktm: releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443 [22:26:03] (03PS3) 10Legoktm: releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443 [22:26:17] (03CR) 10Dzahn: httpd: add parameters and template to allow custom ports.conf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [22:27:01] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28728/console" [puppet] - 10https://gerrit.wikimedia.org/r/674443 (owner: 10Legoktm) [22:27:32] 10SRE, 10GitLab (Initialization), 10Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10Sergey.Trofimovsky.SF) >>! In T276148#6939506, @Dzahn wrote: > Should it be gitlab.wikimedia.org for both, https... [22:28:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:21] (03PS1) 10Dzahn: gitlab: add gitlab.wikimedia.org service IP with interface::alias [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148) [22:31:15] (03CR) 10Ahmon Dancy: [C: 03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/674443 (owner: 10Legoktm) [22:35:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:34] (03CR) 10Dzahn: "compiler output looks like it should not change anything about what actually syncs to what. so as long as all hosts still from the primary" [puppet] - 10https://gerrit.wikimedia.org/r/674443 (owner: 10Legoktm) [22:41:52] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/28729/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148) (owner: 10Dzahn) [22:42:31] (03CR) 10Dzahn: [V: 03+1 C: 03+1] gitlab: add gitlab.wikimedia.org service IP with interface::alias (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148) (owner: 10Dzahn) [22:44:02] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` pki-root1001.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-... [22:44:19] 10SRE, 10Wikimedia-IRC-RC-Server, 10User-notice: Migrate irc.wikimedia.org/kraz to Buster - https://phabricator.wikimedia.org/T224579 (10Legoktm) >>! In T224579#6939027, @Legoktm wrote: > When should XX be? Moritz is going to switch DNS and reboot kraz "Thursday during the European morning", announcement to... [22:45:00] mutante: "rsync on the primary releases server" ? [22:45:06] (instead of "to") [22:45:42] or "from"? [22:45:59] legoktm: hmmm.... 
yes, "pulling from the deployment server" [22:46:18] everyone just pulls from primary_deploy [22:46:37] I'll put "...from the deployment server" in the class-level comment [22:46:43] had a little chat about it with security-team as well [22:46:51] not so long ago when they were upgraded [22:47:00] sounds great, thanks [22:47:38] (03PS4) 10Legoktm: releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443 [22:48:24] (03CR) 10Dzahn: [C: 03+1] releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443 (owner: 10Legoktm) [22:49:21] for this class ::security it's not "between releases servers" at all, but too nitpicky [22:50:22] I will leave that for the next time we tweak that file :p [22:50:26] (03CR) 10Legoktm: [C: 03+2] releases: Provide server-agnostic rsync for security patches [puppet] - 10https://gerrit.wikimedia.org/r/674443 (owner: 10Legoktm) [22:50:33] thanks for the review [22:50:59] it's there because that changed and I did not update the comment, i take the blame [22:51:02] yw [22:54:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:55:20] (03PS2) 10Legoktm: acme_chief::cert: Use normal spaces in documentation [puppet] - 10https://gerrit.wikimedia.org/r/673642 [22:56:33] (03CR) 10Legoktm: [C: 03+2] acme_chief::cert: Use normal spaces in documentation [puppet] - 10https://gerrit.wikimedia.org/r/673642 (owner: 10Legoktm) [22:57:48] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pki-root1001.eqiad.wmnet with reason: REIMAGE [22:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:52] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pki-root1001.eqiad.wmnet with reason: REIMAGE [22:59:58] Logged
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] RoanKattouw, Niharika, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210323T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:22] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:59] (03PS2) 10Dzahn: gitlab: add gitlab.wm.org service IP, with lookup from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148) [23:06:38] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['pki-root1001.eqiad.wmnet'] ` and were **ALL** successful. [23:07:19] (03CR) 10Legoktm: [C: 04-1] "Minor PHP code comments inline." 
(036 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/672767 (https://phabricator.wikimedia.org/T276908) (owner: 10Giuseppe Lavagetto) [23:08:01] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/28730/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/674446 (https://phabricator.wikimedia.org/T276148) (owner: 10Dzahn) [23:10:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:15:02] (03CR) 10Legoktm: "I don't think this meets what the performance.wikimedia.org site needs...I guess we could have apache listen some non-443 port for HTTPS, " [puppet] - 10https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [23:23:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:25:12] (03CR) 10Dzahn: "I see...will amend" [puppet] - 10https://gerrit.wikimedia.org/r/674169 (https://phabricator.wikimedia.org/T277989) (owner: 10Dzahn) [23:30:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:30:38] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10RobH) [23:31:24] 10SRE, 10ops-eqiad, 10DC-Ops: apply new hostname label for pki-root1001 (was auth1002) - https://phabricator.wikimedia.org/T278273 (10RobH) [23:32:57] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) 
rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10RobH) @jbond: be aware pki-root1001 is now staged for your use. [23:33:05] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install pki-root1001.eqiad.wmnet - https://phabricator.wikimedia.org/T276625 (10RobH) 05Open→03Resolved [23:34:44] 10SRE, 10ops-eqiad, 10DC-Ops: apply new hostname label for pki-root1001 (was auth1002) - https://phabricator.wikimedia.org/T278273 (10RobH) p:05Triage→03Low [23:35:12] 10SRE, 10ops-eqiad, 10DC-Ops: apply new hostname label for pki-root1001 (was auth1002) - https://phabricator.wikimedia.org/T278273 (10RobH) [23:42:52] (03PS1) 10Papaul: DHCP: Add MAC address for mw2377 and mw2378 [puppet] - 10https://gerrit.wikimedia.org/r/674454 (https://phabricator.wikimedia.org/T274171) [23:44:45] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address for mw2377 and mw2378 [puppet] - 10https://gerrit.wikimedia.org/r/674454 (https://phabricator.wikimedia.org/T274171) (owner: 10Papaul) [23:48:30] 10Puppet, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Have puppet httpd class support enabling mod_ssl without having apache listen on port 443 - https://phabricator.wikimedia.org/T277989 (10Krinkle) [23:48:38] (03CR) 10Dave Pifke: "Having Envoy act as a HTTPS to HTTP proxy between Varnish and Apache, and having Apache also listening for HTTPS requests (unused) sounds " [puppet] - 10https://gerrit.wikimedia.org/r/673603 (owner: 10Legoktm) [23:52:18] (03PS3) 10Dzahn: gitlab: open firewall on 22,80,443. 
use drange to limit to service IP [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) [23:52:43] (03CR) 10Legoktm: [V: 03+1] "I realized role::lists3 isn't actually including the standard profile nor the base firewall, going to fix that in a minute and then rebase" [puppet] - 10https://gerrit.wikimedia.org/r/674393 (https://phabricator.wikimedia.org/T277286) (owner: 10Ladsgroup) [23:53:24] (03CR) 10jerkins-bot: [V: 04-1] gitlab: open firewall on 22,80,443. use drange to limit to service IP [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) (owner: 10Dzahn) [23:53:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:56:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:56:32] (03PS4) 10Dzahn: gitlab: open firewall on 22,80,443. use drange to limit to service IP [puppet] - 10https://gerrit.wikimedia.org/r/670331 (https://phabricator.wikimedia.org/T276144) [23:58:20] (03PS1) 10Papaul: Add mw237[7-8] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/674455 (https://phabricator.wikimedia.org/T274171) [23:59:28] (03CR) 10Dzahn: [C: 03+1] Add mw237[7-8] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/674455 (https://phabricator.wikimedia.org/T274171) (owner: 10Papaul) [23:59:52] (03CR) 10Papaul: [C: 03+2] Add mw237[7-8] to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/674455 (https://phabricator.wikimedia.org/T274171) (owner: 10Papaul)