[00:03:40] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01001 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:36:16] PROBLEM - Disk space on mwlog1001 is CRITICAL: DISK CRITICAL - free space: /srv 266784 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops [00:39:50] RECOVERY - snapshot of s2 in codfw on alert1001 is OK: Last snapshot for s2 at codfw (db2098.codfw.wmnet:3312) taken on 2021-04-25 23:13:47 (883 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [03:03:34] PROBLEM - PHD should be supervising processes on phab1001 is CRITICAL: PROCS CRITICAL: 2 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [03:06:02] RECOVERY - PHD should be supervising processes on phab1001 is OK: PROCS OK: 4 processes with UID = 497 (phd) https://wikitech.wikimedia.org/wiki/Phabricator [03:17:02] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:18:56] RECOVERY - snapshot of s4 in codfw on alert1001 is OK: Last snapshot for s4 at codfw (db2099.codfw.wmnet:3314) taken on 2021-04-26 01:02:12 (1723 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [03:25:07] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2021-04-21-044024-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682032 (https://phabricator.wikimedia.org/T279045) (owner: 10KartikMistry) [03:26:36] (03Merged) 10jenkins-bot: Update cxserver to 2021-04-21-044024-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/682032 (https://phabricator.wikimedia.org/T279045) (owner: 10KartikMistry) [03:32:35] !log kartik@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 
'staging' . [03:32:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:37:05] !log kartik@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [03:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:40:34] RECOVERY - snapshot of s7 in codfw on alert1001 is OK: Last snapshot for s7 at codfw (db2100.codfw.wmnet:3317) taken on 2021-04-26 02:04:07 (1075 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [03:41:08] !log kartik@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [03:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:43:04] !log Updated cxserver to 2021-04-21-044024-production (T279045) [03:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:43:12] T279045: Update the section suggestion database from latest CX corpus dump - https://phabricator.wikimedia.org/T279045 [04:26:33] (03PS1) 10Andrew Bogott: radosgw: refresh when ceph.conf changes [puppet] - 10https://gerrit.wikimedia.org/r/682350 [04:26:35] (03PS1) 10Andrew Bogott: radosgw: try generating a per-host keyring [puppet] - 10https://gerrit.wikimedia.org/r/682351 [04:27:08] (03CR) 10jerkins-bot: [V: 04-1] radosgw: refresh when ceph.conf changes [puppet] - 10https://gerrit.wikimedia.org/r/682350 (owner: 10Andrew Bogott) [04:27:44] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 73 probes of 636 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:28:07] (03CR) 10jerkins-bot: [V: 04-1] radosgw: try generating a per-host keyring [puppet] - 10https://gerrit.wikimedia.org/r/682351 (owner: 10Andrew Bogott) [04:32:09] (03PS2) 10Andrew Bogott: radosgw: refresh when ceph.conf changes [puppet] - 
10https://gerrit.wikimedia.org/r/682350 [04:32:11] (03PS2) 10Andrew Bogott: radosgw: try generating a per-host keyring [puppet] - 10https://gerrit.wikimedia.org/r/682351 [04:33:23] (03CR) 10Andrew Bogott: [C: 03+2] radosgw: refresh when ceph.conf changes [puppet] - 10https://gerrit.wikimedia.org/r/682350 (owner: 10Andrew Bogott) [04:35:18] (03PS3) 10Andrew Bogott: radosgw: try generating a per-host keyring [puppet] - 10https://gerrit.wikimedia.org/r/682351 [04:36:53] (03CR) 10Andrew Bogott: [C: 03+2] radosgw: try generating a per-host keyring [puppet] - 10https://gerrit.wikimedia.org/r/682351 (owner: 10Andrew Bogott) [04:38:49] (03PS1) 10Andrew Bogott: radosgw profiles: remove keydata param [puppet] - 10https://gerrit.wikimedia.org/r/682352 [04:40:04] (03CR) 10Andrew Bogott: [C: 03+2] radosgw profiles: remove keydata param [puppet] - 10https://gerrit.wikimedia.org/r/682352 (owner: 10Andrew Bogott) [04:40:44] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 56 probes of 636 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [04:50:02] RECOVERY - snapshot of s5 in codfw on alert1001 is OK: Last snapshot for s5 at codfw (db2099.codfw.wmnet:3315) taken on 2021-04-26 03:55:07 (699 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [05:15:51] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1124.eqiad.wmnet'] ` The log ca... 
[05:17:56] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1124.eqiad.wmnet'] ` Of which those **FAILED**: ` ['db1124.eqiad.wmnet'] ` [05:18:25] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1124.eqiad.wmnet'] ` The log ca... [05:26:40] (03PS1) 10Marostegui: db1158: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/682353 (https://phabricator.wikimedia.org/T258361) [05:28:15] (03CR) 10Marostegui: [C: 03+2] db1158: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/682353 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [05:30:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1124.eqiad.wmnet with reason: REIMAGE [05:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1124.eqiad.wmnet with reason: REIMAGE [05:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:44] (03PS1) 10Marostegui: instances.yaml: Add db1158 [puppet] - 10https://gerrit.wikimedia.org/r/682355 (https://phabricator.wikimedia.org/T258361) [05:42:05] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1158 [puppet] - 10https://gerrit.wikimedia.org/r/682355 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [05:42:42] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` 
['db1124.eqiad.wmnet'] ` and were **ALL** successful. [05:47:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1158 to dbctl, depooled, T258361', diff saved to https://phabricator.wikimedia.org/P15521 and previous config saved to /var/cache/conftool/dbconfig/20210426-054700-marostegui.json [05:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:10] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [05:53:36] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:00:03] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [06:00:31] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [06:03:20] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:34] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:08:07] (03PS5) 10Giuseppe Lavagetto: helmfile: install a simple deployment shell [puppet] - 10https://gerrit.wikimedia.org/r/681432 [06:10:22] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:11:02] RECOVERY - Router 
interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:15:41] there is a Telia unplanned email about --^ [06:24:08] !log reboot an-coord1001 to pick up kernel security settings (after reimage) [06:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:28] RECOVERY - snapshot of s6 in codfw on alert1001 is OK: Last snapshot for s6 at codfw (db2097.codfw.wmnet:3316) taken on 2021-04-26 05:31:16 (572 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [06:45:04] (03PS3) 10Elukey: Enable the Yarn Capacity scheduler for Hadoop Analytics [puppet] - 10https://gerrit.wikimedia.org/r/681700 (https://phabricator.wikimedia.org/T277062) [06:45:50] (03PS14) 10Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [06:48:34] (03PS1) 10Marostegui: db-eqiad.php: Disable writes on es4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682358 (https://phabricator.wikimedia.org/T279281) [06:57:45] (03PS1) 10Marostegui: mariadb: Productionize db1183 [puppet] - 10https://gerrit.wikimedia.org/r/682386 (https://phabricator.wikimedia.org/T275633) [06:58:38] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db1183 [puppet] - 10https://gerrit.wikimedia.org/r/682386 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [06:59:21] (03CR) 10Muehlenhoff: "Ah, I just realised in git log that this was intentionally removed in f70a2dc in favour of the new DHCP automation work. 
Given that it fai" [puppet] - 10https://gerrit.wikimedia.org/r/682173 (owner: 10Jbond) [07:08:02] 10SRE: try planet on bullseye - https://phabricator.wikimedia.org/T280989 (10MoritzMuehlenhoff) >>! In T280989#7030571, @MoritzMuehlenhoff wrote: > We'll need to find a different aggregator, then: rawdog was removed from Bullseye since it was never ported to Python 2, see https://bugs.debian.org/cgi-bin/bugrepor... [07:08:04] (03PS1) 10JMeybohm: Add new tlsproxy cert for configcluster etcd [labs/private] - 10https://gerrit.wikimedia.org/r/682403 (https://phabricator.wikimedia.org/T271573) [07:09:20] (03PS3) 10Giuseppe Lavagetto: helmfile: create module [puppet] - 10https://gerrit.wikimedia.org/r/681428 [07:09:23] (03PS6) 10Giuseppe Lavagetto: helmfile: install a simple deployment shell [puppet] - 10https://gerrit.wikimedia.org/r/681432 [07:09:36] !log removed rawdog from bullseye-wikimedia, needs Py2 T280989 [07:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:45] T280989: try planet on bullseye - https://phabricator.wikimedia.org/T280989 [07:11:07] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29182/console" [puppet] - 10https://gerrit.wikimedia.org/r/681428 (owner: 10Giuseppe Lavagetto) [07:12:56] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] helmfile: create module [puppet] - 10https://gerrit.wikimedia.org/r/681428 (owner: 10Giuseppe Lavagetto) [07:14:13] (03PS1) 10Marostegui: db1183: Specify testing section [puppet] - 10https://gerrit.wikimedia.org/r/682489 [07:15:06] (03CR) 10Marostegui: [C: 03+2] db1183: Specify testing section [puppet] - 10https://gerrit.wikimedia.org/r/682489 (owner: 10Marostegui) [07:16:17] (03PS1) 10JMeybohm: configcluster: Add new tlsproxy certificate [puppet] - 10https://gerrit.wikimedia.org/r/682493 (https://phabricator.wikimedia.org/T271573) [07:18:37] (03CR) 10Giuseppe Lavagetto: [C: 03+1] configcluster: 
Add new tlsproxy certificate [puppet] - 10https://gerrit.wikimedia.org/r/682493 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [07:21:14] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add new tlsproxy cert for configcluster etcd [labs/private] - 10https://gerrit.wikimedia.org/r/682403 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [07:22:56] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29184/console" [puppet] - 10https://gerrit.wikimedia.org/r/682493 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [07:24:09] !log installing pear security updates [07:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:17] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] configcluster: Add new tlsproxy certificate [puppet] - 10https://gerrit.wikimedia.org/r/682493 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [07:25:36] (03CR) 10Elukey: [C: 03+2] Enable the Yarn Capacity scheduler for Hadoop Analytics [puppet] - 10https://gerrit.wikimedia.org/r/681700 (https://phabricator.wikimedia.org/T277062) (owner: 10Elukey) [07:31:26] (03CR) 10David Caro: "@andrew Can you keep me in the loop about this please? 
thanks" [puppet] - 10https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: 10Andrew Bogott) [07:32:56] !log swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 [07:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:05] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [07:33:14] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:37:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:39:23] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) checking tables on db1124 after the transfer from db1158 [07:45:46] (03PS1) 10Marostegui: install_server: Do not format db1124 [puppet] - 10https://gerrit.wikimedia.org/r/682495 (https://phabricator.wikimedia.org/T258361) [07:46:29] (03CR) 10Marostegui: [C: 03+2] install_server: Do not format db1124 [puppet] - 10https://gerrit.wikimedia.org/r/682495 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [07:47:05] 10SRE, 10Internet-Archive, 10Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (10fgiunchedi) From my tests the culprit seems to be webproxy hosts closing the transfer after ~4MB, though using urldownloader works as expected, which proxy were you using... 
[07:47:10] PROBLEM - Check systemd state on prometheus1004 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:51:38] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe1001.eqiad.wmnet with reason: REIMAGE [07:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:19] (03PS1) 10JMeybohm: configcluster: Enable replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/682497 (https://phabricator.wikimedia.org/T271573) [07:53:41] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe1001.eqiad.wmnet with reason: REIMAGE [07:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:04] (03PS1) 10Giuseppe Lavagetto: etcdmirror: add script to reload the cluster [puppet] - 10https://gerrit.wikimedia.org/r/682498 [07:54:32] (03CR) 10jerkins-bot: [V: 04-1] etcdmirror: add script to reload the cluster [puppet] - 10https://gerrit.wikimedia.org/r/682498 (owner: 10Giuseppe Lavagetto) [07:54:36] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on conf2005.codfw.wmnet with reason: for initial etcd replication [07:54:37] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on conf2005.codfw.wmnet with reason: for initial etcd replication [07:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:46] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 1:00:00 1 host(s) and their services with reason: for initial etcd replication ` conf2005.codfw.wmnet ` [07:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:41] (03PS2) 10Giuseppe Lavagetto: etcdmirror: 
add script to reload the cluster [puppet] - 10https://gerrit.wikimedia.org/r/682498 [07:57:43] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29187/console" [puppet] - 10https://gerrit.wikimedia.org/r/682498 (owner: 10Giuseppe Lavagetto) [07:59:53] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] etcdmirror: add script to reload the cluster [puppet] - 10https://gerrit.wikimedia.org/r/682498 (owner: 10Giuseppe Lavagetto) [08:02:44] (03CR) 10Alexandros Kosiaris: [C: 03+1] "don't forget the puppet private and labs/private patches!" [puppet] - 10https://gerrit.wikimedia.org/r/681500 (owner: 10Dzahn) [08:08:17] (03Abandoned) 10ArielGlenn: * re-use the initially-created batches file for all later runs of the same job * if secondary workers have failed one or more batches (rerun max times with errors) while the primary worker was running batches, the primary worker will throw an exception even if all its batches ran without errors * on a rerun of a job, all previously failed batch entries are marked as unclaimed by the primary [08:08:18] * new config setting for time between primary worker checks of whether secondary workers are finished up * handle aborted batches as primary worker running batched jobs * additional tests for the batch module, clean up the pagecontentbatch tests [dumps] - 10https://gerrit.wikimedia.org/r/675118 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [08:11:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, do you need further +1s or just privileges to merge on puppet?" 
[puppet] - 10https://gerrit.wikimedia.org/r/680021 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight) [08:13:21] 10SRE, 10vm-requests: eqiad/codfw: 4 VMs requested for LDAP replicas - https://phabricator.wikimedia.org/T281089 (10MoritzMuehlenhoff) [08:13:51] !log installing lxml security updates on stretch [08:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:00] 10SRE, 10SRE-tools: Various debmonitor-client systemdtimer errors starting April 21st - https://phabricator.wikimedia.org/T281090 (10ema) [08:19:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:19:37] (03PS1) 10Marostegui: db1183: Add role master [puppet] - 10https://gerrit.wikimedia.org/r/682499 [08:20:14] (03CR) 10Marostegui: [C: 03+2] db1183: Add role master [puppet] - 10https://gerrit.wikimedia.org/r/682499 (owner: 10Marostegui) [08:20:57] (03CR) 10Klausman: [C: 03+1] profile::thanos::swift: add fake credentials for mlserve_prod [labs/private] - 10https://gerrit.wikimedia.org/r/682125 (https://phabricator.wikimedia.org/T280773) (owner: 10Elukey) [08:23:52] (03CR) 10Awight: "> do you need [...] privileges to merge on puppet?" 
[puppet] - 10https://gerrit.wikimedia.org/r/680021 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight) [08:24:07] (03PS1) 10Marostegui: install_server: Do not format db1183 [puppet] - 10https://gerrit.wikimedia.org/r/682500 (https://phabricator.wikimedia.org/T275633) [08:24:50] (03CR) 10Marostegui: [C: 03+2] install_server: Do not format db1183 [puppet] - 10https://gerrit.wikimedia.org/r/682500 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [08:26:25] 10SRE, 10SRE-tools: Various debmonitor-client systemdtimer errors starting April 21st - https://phabricator.wikimedia.org/T281090 (10MoritzMuehlenhoff) The "ImportError: cannot import name 'JSONDecodeError'" errors are from our five remaining jessie hosts, there was a patch by @jbond to address this, which is... [08:28:24] 10SRE, 10SRE-tools: Various debmonitor-client systemdtimer errors starting April 21st - https://phabricator.wikimedia.org/T281090 (10MoritzMuehlenhoff) The "unsupported operand type(s) for -=: 'Retry' and 'int'" should all be fixed, the log errors should predate the rollout of the fixed version (although some... [08:28:56] !log update debmonitor to 0.2.9 on remaining hosts T281090 [08:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:09] T281090: Various debmonitor-client systemdtimer errors starting April 21st - https://phabricator.wikimedia.org/T281090 [08:31:10] 10SRE, 10SRE-tools: Various debmonitor-client systemdtimer errors starting April 21st - https://phabricator.wikimedia.org/T281090 (10jbond) To confirm i have just pushed out 0.2.9 which should fix the `JSONDecodeError` and ` 'Retry' and 'int'` issues. The expiry for sretest1002 was valid as it was missing... [08:31:29] (03PS1) 10JMeybohm: Log the exception message in case cleaup fails [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/682501 [08:32:48] (03CR) 10Hashar: [C: 03+1] "Indeed! 
From stable-3.2 branch:" [puppet] - 10https://gerrit.wikimedia.org/r/679447 (owner: 10Paladox) [08:33:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Log the exception message in case cleaup fails [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/682501 (owner: 10JMeybohm) [08:33:35] (03Merged) 10jenkins-bot: Log the exception message in case cleaup fails [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/682501 (owner: 10JMeybohm) [08:37:31] 10SRE, 10netops: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055 (10fgiunchedi) I agree, we should be restricting `#page` to alerts that page folks, not sure of an alternative tag though (or remove the tag altogether for now) cc @ayounsi [08:38:59] godog: ಠ_ಠ [08:39:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:39:10] jouncebot: now [08:39:10] No deployments scheduled for the next 1 hour(s) and 50 minute(s) [08:39:13] jouncebot: next [08:39:13] In 1 hour(s) and 50 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210426T1030) [08:39:16] #kormat [08:39:33] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01001 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:39:53] * jbond42 looking [08:40:12] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Do not enable community configuration outside of beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682339 (https://phabricator.wikimedia.org/T274520) (owner: 10Urbanecm) [08:40:38] kormat: haha! wikibugs did # p a g e didn't it? 
my apologies, I realized only later [08:41:02] (03Merged) 10jenkins-bot: GrowthExperiments: Do not enable community configuration outside of beta wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682339 (https://phabricator.wikimedia.org/T274520) (owner: 10Urbanecm) [08:42:15] (03PS1) 10Hashar: gerrit: remove Apache 720s timeout [puppet] - 10https://gerrit.wikimedia.org/r/682502 (https://phabricator.wikimedia.org/T277127) [08:42:41] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: NOOP: 88da8226823e59d1d19db9aeca3b5a5140c0c60c: GrowthExperiments: Do not enable community configuration outside of beta wikis (T274520) (duration: 00m 59s) [08:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:50] T274520: Move Growth configuration to on-wiki JSON file - https://phabricator.wikimedia.org/T274520 [08:43:30] 10SRE, 10Performance-Team, 10Platform Engineering, 10observability, 10Patch-For-Review: mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (10fgiunchedi) FWIW +1 on lowering debug level, AFAIK mwlog1001 is indeed quite close to being replaced by mwlog1002 in... [08:44:18] 10SRE, 10Thumbor, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (10jijiki) 05Open→03Resolved It has been deployed to production, I forgot to close that task, sorry! [08:44:24] (03PS1) 10Urbanecm: GrowthExperiments: Enable community configuration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682504 (https://phabricator.wikimedia.org/T274520) [08:44:39] (03CR) 10Hashar: "That addresses a disparity in timeout settings mentioned by Elukey at T277127#6936982 . 
It is simply way too high and does not seem to m" [puppet] - 10https://gerrit.wikimedia.org/r/682502 (https://phabricator.wikimedia.org/T277127) (owner: 10Hashar) [08:46:14] (03CR) 10Elukey: [C: 03+1] "We can merge anytime if you want :)" [puppet] - 10https://gerrit.wikimedia.org/r/682502 (https://phabricator.wikimedia.org/T277127) (owner: 10Hashar) [08:46:22] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: 10Reedy) [08:46:52] (03CR) 10Urbanecm: [C: 03+2] "no-op ATM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682504 (https://phabricator.wikimedia.org/T274520) (owner: 10Urbanecm) [08:47:45] (03Merged) 10jenkins-bot: GrowthExperiments: Enable community configuration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682504 (https://phabricator.wikimedia.org/T274520) (owner: 10Urbanecm) [08:49:21] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: NOOP: f01a6dab70f74938dd51668809a181a8f551b6c8: GrowthExperiments: Enable community configuration on testwiki (T274520) (duration: 00m 57s) [08:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:31] T274520: Move Growth configuration to on-wiki JSON file - https://phabricator.wikimedia.org/T274520 [08:50:07] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:13] * Urbanecm done [08:52:34] 10SRE, 10netops: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055 (10ayounsi) @CDanis set it up, there is a Icinga check that pulls the LibreNMS api and should page where #page is present. But should not page for management routers. @fgiunche... [08:52:59] can we just have wikibugs not relay that keyword to here? 
[08:53:31] ahahahah [08:55:14] 10SRE, 10netops: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055 (10elukey) WARNING: don't use the "hash + page" keyword in here since Wikibugs will display it on IRC :D (I already made a mistake when creating the task) [08:55:55] Majavah: this should help --^ :) [08:55:58] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host ldap-replica2005failoid1002.wikimedia.org [08:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:46] elukey: as long as people read warnings ;) [08:57:13] ignoring wikibugs is the proper fix ;) [08:57:28] (03PS1) 10Muehlenhoff: Add Cumin alias for bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/682535 [08:57:34] does it reads the content of the comments as well? [08:57:44] it displays a preview [08:58:18] indeed, I don't like that [09:00:42] (03CR) 10Muehlenhoff: [C: 03+2] Add Cumin alias for bullseye hosts [puppet] - 10https://gerrit.wikimedia.org/r/682535 (owner: 10Muehlenhoff) [09:01:48] 10SRE, 10netops: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055 (10fgiunchedi) >>! In T281055#7032875, @ayounsi wrote: > @CDanis set it up, there is a Icinga check that pulls the LibreNMS api and should page where # page is present. But sho... 
[09:02:43] (03PS2) 10JMeybohm: configcluster: Enable replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/682497 (https://phabricator.wikimedia.org/T271573) [09:04:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Remember to stop the current replication daemon before running puppet 😊" [puppet] - 10https://gerrit.wikimedia.org/r/682497 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [09:04:58] !log imported etcd-mirror_0.0.6-1 to buster-wikimedia [09:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:07] 10SRE, 10Analytics-Radar, 10Patch-For-Review, 10Services (watching), 10User-herron: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10elukey) @herron ping, we should start working on this :) [09:07:26] !log jmm@cumin2001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ldap-replica2005failoid1002.wikimedia.org [09:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:19] (03PS5) 10Jcrespo: WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 [09:08:21] (03PS4) 10Jcrespo: Add logger functionality to recover-dump, add logger statements, added unit test to test initializing logging [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [09:08:23] (03PS1) 10Jcrespo: Increase default memory usage of xtrabackup --prepare to 40GB [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682536 (https://phabricator.wikimedia.org/T281094) [09:08:26] (03PS1) 10Jcrespo: Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) [09:08:55] (03CR) 10jerkins-bot: [V: 04-1] WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 (owner: 10Jcrespo) [09:08:57] (03CR) 
10jerkins-bot: [V: 04-1] Add logger functionality to recover-dump, add logger statements, added unit test to test initializing logging [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [09:08:59] (03CR) 10jerkins-bot: [V: 04-1] Increase default memory usage of xtrabackup --prepare to 40GB [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682536 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo) [09:09:09] (03CR) 10jerkins-bot: [V: 04-1] Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) (owner: 10Jcrespo) [09:09:26] (03CR) 10Jcrespo: "Sorry, I reverted your latest version of your patch without intending it, reverting my change." [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [09:10:11] PROBLEM - Check for expired certificates debmonitor_discovery_wmnet on pki1001 is CRITICAL: CRITICAL - 2 certs expiry in 2 days, 1 certs expiry in 1 days https://wikitech.wikimedia.org/wiki/PKI/Debugging [09:10:16] (03PS2) 10David Caro: wmcs.openstack: add cloudvirt maintenance cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) [09:10:18] (03CR) 10David Caro: wmcs.openstack: add cloudvirt maintenance cookbooks (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [09:10:21] (03PS2) 10Jcrespo: Increase default memory usage of xtrabackup --prepare to 40GB [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682536 (https://phabricator.wikimedia.org/T281094) [09:10:21] PROBLEM - Check for expired certificates debmonitor_discovery_wmnet on pki2001 is CRITICAL: CRITICAL - 2 certs expiry in 2 days, 1 certs expiry in 1 days 
https://wikitech.wikimedia.org/wiki/PKI/Debugging [09:10:36] (03PS2) 10Jcrespo: Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) [09:10:44] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host ldap-replica2005.wikimedia.org [09:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:01] Majavah: thanks! [09:12:56] (03PS5) 10Jcrespo: Add logger functionality to recover-dump, add logger statements, added unit test to test initializing logging [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [09:13:08] (03PS6) 10Jcrespo: Add logger functionality to recover-dump, add logger statements, added unit test to test initializing logging [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [09:13:28] !log imported etcd-mirror_0.0.6-2 to buster-wikimedia [09:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:16] (03PS3) 10Jcrespo: Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) [09:15:48] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on conf2005.codfw.wmnet with reason: for initial etcd replication [09:15:49] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on conf2005.codfw.wmnet with reason: for initial etcd replication [09:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:56] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 1:00:00 1 host(s) and their 
services with reason: for initial etcd replication ` conf2005.codfw.wmnet ` [09:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:21] (03PS4) 10Jcrespo: Xtrabackup: Increase default open-files-limit to match production [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/682537 (https://phabricator.wikimedia.org/T281094) [09:16:28] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "Temporarily disable some reportupdater jobs" [puppet] - 10https://gerrit.wikimedia.org/r/680021 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight) [09:17:11] (03CR) 10JMeybohm: [C: 03+2] configcluster: Enable replication on conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/682497 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [09:17:25] (03CR) 10Filippo Giunchedi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/680021 (https://phabricator.wikimedia.org/T279046) (owner: 10Awight) [09:21:01] PROBLEM - very high load average likely xfs on ms-be1062 is CRITICAL: CRITICAL - load average: 167.72, 112.46, 60.18 https://wikitech.wikimedia.org/wiki/Swift [09:21:39] PROBLEM - SSH on ms-be1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:21:54] PROBLEM - LVS thumbor eqiad port 8800/tcp - Thumbor image scaling IPv4 #page on thumbor.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.0 503 Service Unavailable - 212 bytes in 10.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:22:17] ^ looking [09:22:27] * volans acked page [09:22:29] * akosiaris around [09:22:35] * moritzm too [09:22:38] ugh, taking a look too, it might be the recent swift rebalance [09:22:55] <_joe_> seems likely [09:22:59] I am checking commons newfiles for impact [09:23:20] (03PS1) 10Jbond: systemd::timer::job: update the program name of systemd-timer-mail-wrapper [puppet] - 10https://gerrit.wikimedia.org/r/682540 [09:23:21] looking at ms-be1062 
[09:23:24] it is a bit slow, but progressing (not competely broken) [09:23:58] <_joe_> I'll take IC [09:24:14] thank you _joe_ [09:24:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-replica2005.wikimedia.org [09:24:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:35] I'm +1 to depool swift eqiad [09:24:58] <_joe_> godog: depool swift will only affect reads or also writes? [09:25:32] _joe_: reads [09:25:37] varnish is showing increased 503 to upload domain [09:25:46] <_joe_> godog: go on then [09:25:51] <_joe_> jynus: expected [09:25:54] around 1200 per minute [09:26:07] not a huge rate, but much larger than normal [09:26:18] * per 5 minutes [09:26:22] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host ldap-replica2006.wikimedia.org [09:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:30] RECOVERY - LVS thumbor eqiad port 8800/tcp - Thumbor image scaling IPv4 #page on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 367 bytes in 5.290 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:26:56] !log filippo@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad [09:27:01] {{done}} [09:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:11] was the recovery because of the depool? [09:27:12] 10SRE, 10SRE-tools: Various debmonitor-client systemdtimer errors starting April 21st - https://phabricator.wikimedia.org/T281090 (10jbond) > Failed to execute DebMonitor CLI: [SSL] PEM lib (_ssl.c:2947) >'PEM routines', 'get_name', 'no start line'), ('SSL routines', 'use_certificate_chain_file', 'PEM lib' Th... 
[09:27:19] jynus: no [09:27:21] <_joe_> jynus: nope [09:27:44] <_joe_> also the response came back in 5 seconds [09:27:48] <_joe_> so not ok still [09:27:53] (03PS2) 10Jbond: systemd::timer::job: update the program name of systemd-timer-mail-wrapper [puppet] - 10https://gerrit.wikimedia.org/r/682540 (https://phabricator.wikimedia.org/T281090) [09:28:00] yeah, 503s are still high [09:28:26] <_joe_> jynus: can you check if there is a pattern in the 503s / an increased request rate? [09:28:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29188/console" [puppet] - 10https://gerrit.wikimedia.org/r/682540 (https://phabricator.wikimedia.org/T281090) (owner: 10Jbond) [09:28:46] * jbond42 here to help [09:28:52] I saw a typical batck upload process but on latest files, nothing out of the ordinary [09:29:02] <_joe_> jbond42: can you check if pybal is depooling thumbor servers? [09:29:03] the ISS photos [09:29:05] <_joe_> it seems not [09:29:15] ack [09:29:32] _joe_: I am checking it [09:29:33] <_joe_> jynus: yeah i mean in the 5xx logs [09:29:40] (already I mean) [09:29:52] yeah, I am checking [09:29:52] <_joe_> effie: ack [09:29:59] <_joe_> jbond42: then nevermind :P [09:30:07] still not seeing traffic picking up in swift codfw, should be soon though [09:30:11] ack [09:30:21] no [09:30:23] ah yeah there we go [09:30:23] the pattern so far is "they are thumbnails" :-) [09:30:46] thumbor logs just mention that requests are being throttled by poolcounter [09:30:55] no ip, file or cache server special that I can see so far [09:30:58] but no servers are depooled [09:31:16] also 503s are now at its normal level [09:31:35] since around 9:29 [09:31:40] <_joe_> effie: so maybe we should check what url pybal requests [09:32:25] yeah traffic is picking up in swift codfw [09:32:54] 
https://upload.wikimedia.org/wikipedia/commons/thumb/1/17/Aimee_Challenor,_2016_(cropped).jpg/1200px-Aimee_Challenor,_2016_(cropped).jpg is failing consistently it seems [09:33:11] maybe the pattern is a specific swift shard or shards? [09:33:13] <_joe_> ok, I think we can easily resolve the incident, once we have a better idea of what was starving [09:33:22] <_joe_> swift? thumbor cpu? [09:33:33] yeah, latencies are begining to drop and 503s approach zero [09:33:37] thumbors 503s are going down [09:33:38] <_joe_> godog: can you take a look at swift, and effie can you take a look at thumbor? [09:33:52] another one at: https://upload.wikimedia.org/wikipedia/en/thumb/4/4f/HLN_(TV_network)_2017_logo.svg/768px-HLN_(TV_network)_2017_logo.svg.png [09:33:52] <_joe_> I mean grafana data trying to explain why the 503s? [09:34:08] _joe_: yeah I think swift eqiad was the culprit in this case [09:34:12] <_joe_> jynus: yeah not important then [09:34:21] <_joe_> godog: can you add graphs to the incident doc? 
[09:34:38] jynus: the first image was (Aimee Challenor, 2016 (cropped).jpg) was deleted late last month [09:34:54] thumbor is back to work [09:35:04] _joe_: yeah, I'm waiting a bit for codfw traffic to stabilize first [09:35:28] _joe_: there was swift latency observed from thumbor [09:35:33] which piled up requests [09:36:53] there is virtually no upload error on varnish right now [09:37:53] thumbor-swift latency https://grafana-rw.wikimedia.org/d/Pukjw6cWk/thumbor?viewPanel=5&orgId=1&from=now-1h&to=now&refresh=30s [09:38:11] <_joe_> effie: add to the doc please :) [09:38:37] !log reboot ms-be1062, kernel backtrace saved [09:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:48] <_joe_> still pretty high tbh [09:39:10] <_joe_> higher than I'd expect even with the switch to codfw [09:40:04] <_joe_> oh no I see, it's now reading from codfw only, that graph is vbaguely misleading tbh [09:40:39] yeah, 500ms is about 300ms higher that I 'd expect for writes and more than 400ms for read [09:41:13] <_joe_> https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?viewPanel=35&orgId=1&from=now-1h&to=now&refresh=30s this is the graph you should look at [09:42:45] !log installing clamav security updates on otrs1001 [09:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:15] <_joe_> ok, godog do you think we should keep the incident open until we can move traffic back to eqiad? 
[09:44:35] _joe_: I'm not sure, I'll know more in ~5 min when ms-be1062 comes back [09:44:42] <_joe_> ack [09:45:31] my working theory so far is that ms-be1062's nic/driver crashed and blackholed traffic, timeouts ensued [09:45:55] <_joe_> oh that seems plausible, yes [09:46:04] ah, even swift observed latency levels are now dropping to expected levels [09:46:16] so https://grafana-rw.wikimedia.org/d/Pukjw6cWk/thumbor?viewPanel=5&orgId=1&from=now-1h&to=now&refresh=30s makes more sense now [09:46:35] I wonder what caused that plateau to 500ms [09:46:41] backlog? [09:47:17] but yeah once ms-be1062 is back I think we can put traffic back to eqiad and resolve the incident [09:49:40] 10SRE, 10SRE-swift-storage: ms-be1062 fell off the network, causing swift timeouts - https://phabricator.wikimedia.org/T281107 (10fgiunchedi) [09:49:45] PROBLEM - Host ms-be1062 is DOWN: PING CRITICAL - Packet loss = 100% [09:51:05] ? [09:51:06] RECOVERY - SSH on ms-be1062 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:51:08] RECOVERY - Host ms-be1062 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [09:51:24] ok, why was that delayed by ~10m? [09:51:40] <_joe_> akosiaris: freenode has some issues with services [09:51:43] the PROBLEM - Host ms-be1062 is DOWN: PING CRITICAL I mean [09:52:06] <_joe_> oh that's because I guess it took 10 minnutes to shutdown [09:52:10] <_joe_> godog: ? [09:52:22] ah, that would explain it. 
10m for a shutdown, lol [09:52:24] 10SRE, 10SRE-swift-storage, 10Wikimedia-Incident: ms-be1062 fell off the network, causing swift timeouts - https://phabricator.wikimedia.org/T281107 (10jijiki) [09:52:54] yeah I issued 'reboot' from the console but shutting down was taking forever [09:52:54] <_joe_> (also godog: please paste the relevant part of the backtrace in the doc at your convenience) [09:53:02] <_joe_> not surprised [09:53:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-replica2006.wikimedia.org [09:53:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:54] yeah, I'll repool swift eqiad in 5 min, sounds good _joe_ ? [09:54:08] <_joe_> +1 [09:54:19] (03PS2) 10Hnowlan: site: set eventlog1003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/681704 (https://phabricator.wikimedia.org/T280679) [09:58:11] 10SRE, 10Wikimedia-Mailing-lists: See if we can drop the extra lists.wikimedia.org in mailman3 URLs - https://phabricator.wikimedia.org/T280996 (10akosiaris) p:05Triage→03Medium [09:58:21] 10SRE, 10observability, 10CAS-SSO: grafana-rw SSO redirect breaks template parameters due to double encoding - https://phabricator.wikimedia.org/T281004 (10akosiaris) p:05Triage→03Medium [09:58:41] 10SRE, 10Internet-Archive, 10Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (10akosiaris) p:05Triage→03Medium [09:58:56] 10SRE, 10Performance-Team, 10Platform Engineering, 10observability, 10Patch-For-Review: mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (10akosiaris) p:05Triage→03Medium [09:59:07] 10SRE, 10netops: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055 (10akosiaris) p:05Triage→03Medium [09:59:24] (03PS2) 10Arturo Borrero Gonzalez: wikimediacloud.org: prepare DNS records for cloudgw @ eqiad [dns] - 
10https://gerrit.wikimedia.org/r/681322 (https://phabricator.wikimedia.org/T270704) [09:59:26] 10SRE, 10SRE-tools, 10Patch-For-Review: Various debmonitor-client systemdtimer errors starting April 21st - https://phabricator.wikimedia.org/T281090 (10akosiaris) p:05Triage→03Medium [09:59:27] (03CR) 10David Caro: [C: 03+1] "nit: you might want to add some hint in the proctitle that it's the wrapper running (maybe as a 'fake' arg) so when doing ps you can still" [puppet] - 10https://gerrit.wikimedia.org/r/682540 (https://phabricator.wikimedia.org/T281090) (owner: 10Jbond) [09:59:36] 10SRE, 10netops, 10observability, 10User-fgiunchedi: Move paging for librenms from icinga to AM - https://phabricator.wikimedia.org/T281095 (10akosiaris) p:05Triage→03Medium [09:59:50] 10SRE, 10SRE-swift-storage, 10Wikimedia-Incident: ms-be1062 fell off the network, causing swift timeouts - https://phabricator.wikimedia.org/T281107 (10akosiaris) p:05Triage→03High [10:00:06] (03CR) 10Volans: wmcs.ceph: Added mon upgrade cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/682099 (owner: 10David Caro) [10:00:36] !log filippo@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=eqiad [10:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:59] _joe_: ^ another 10 minutes and I think we're good to move on [10:01:09] <_joe_> ack [10:01:58] akosiaris: the Y axis is capped at 1s, it was 500ms before [10:02:47] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/682540 (https://phabricator.wikimedia.org/T281090) (owner: 10Jbond) [10:03:09] akosiaris: scratch that, I just saw what you mean [10:03:16] :D [10:03:59] akosiaris: if I were to guess, I would say poolcounter [10:04:01] it's interesting, sin't it? at 500ms for ~10m with no clear explanation [10:04:20] hmm that could be, /me verifying [10:04:32] wait, what poolcounter? 
[10:04:57] that's the latency that thumber observed, I hope poolcounter isn't included in that latency [10:05:02] thumbor* [10:05:11] no but poolcounter throttled requests [10:05:29] so thumbor didnt send even more writes that would possibly increase latency even more [10:05:34] that would be my guess [10:05:49] (03CR) 10David Caro: wmcs.ceph: Added mon upgrade cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/682099 (owner: 10David Caro) [10:06:03] ah, meaning that poolcounter saved us from higher latency? Hmmm [10:06:12] (03PS1) 10Jbond: P:rsyslog: send pybal logs to ELK [puppet] - 10https://gerrit.wikimedia.org/r/682566 [10:06:28] could be [10:06:55] (03CR) 10Volans: wmcs.ceph: Added mon upgrade cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/682099 (owner: 10David Caro) [10:07:16] I will try to think any other reason that would explain this plateau in writes [10:07:34] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host ldap-replica1003.wikimedia.org [10:07:40] 10SRE, 10DBA, 10Pybal, 10Sustainability: Create a backend check for pybal to monitor the MySQL protocol being up - https://phabricator.wikimedia.org/T165677 (10LSobanski) p:05Medium→03Low [10:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:01] 10SRE, 10DBA, 10Privacy Engineering, 10WMF-Legal, and 3 others: dbtree loads third party resources (from google.com/jsapi) - https://phabricator.wikimedia.org/T96499 (10LSobanski) p:05Medium→03Lowest [10:08:14] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [10:08:26] 10SRE, 10DBA, 10Pybal, 10Sustainability: Create a backend check for pybal to monitor the MySQL protocol being up - https://phabricator.wikimedia.org/T165677 
(10LSobanski) p:05Low→03Lowest [10:08:40] RECOVERY - very high load average likely xfs on ms-be1062 is OK: OK - load average: 34.57, 27.79, 16.18 https://wikitech.wikimedia.org/wiki/Swift [10:10:04] (03PS3) 10JMeybohm: Swap zookeeper from conf2001 to conf2004 [puppet] - 10https://gerrit.wikimedia.org/r/680875 (https://phabricator.wikimedia.org/T271573) (owner: 10Elukey) [10:11:40] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [10:11:59] _joe_: yeah I think we're back, I'm +1 to resolve the incident [10:13:12] (03CR) 10Hnowlan: [C: 03+2] site: set eventlog1003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/681704 (https://phabricator.wikimedia.org/T280679) (owner: 10Hnowlan) [10:13:55] hopefully I haven't jinxed it heh [10:14:28] (03PS1) 10Marostegui: mariadb: Move db1125 from sanitarium to testing [puppet] - 10https://gerrit.wikimedia.org/r/682570 (https://phabricator.wikimedia.org/T258361) [10:14:41] <_joe_> ack [10:15:00] <_joe_> looks like you didn't [10:15:19] (03PS2) 10Marostegui: mariadb: Move db1125 from sanitarium to testing [puppet] - 10https://gerrit.wikimedia.org/r/682570 (https://phabricator.wikimedia.org/T258361) [10:16:26] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1125 from sanitarium to testing [puppet] - 10https://gerrit.wikimedia.org/r/682570 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [10:16:41] <_joe_> ok incident is closed [10:17:01] (03CR) 10Ayounsi: [C: 03+1] wikimediacloud.org: prepare DNS records for cloudgw @ eqiad (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/681322 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [10:17:11] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - 
https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1125.eqiad.wmnet'] ` The log ca... [10:17:50] thank you [10:17:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: prepare DNS records for cloudgw @ eqiad [dns] - 10https://gerrit.wikimedia.org/r/681322 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [10:18:10] <_joe_> I'll write up an incident report today, hopefully [10:18:17] !log installing systemd updates from buster 10.9 point release [10:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:57] (03PS2) 10Reedy: Add wmgUseFooterTechCodeOfConductLink to replace wmgUseFooterCodeOfConductLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681835 (https://phabricator.wikimedia.org/T280886) [10:19:08] jouncebot: now [10:19:09] No deployments scheduled for the next 0 hour(s) and 10 minute(s) [10:19:10] jouncebot: next [10:19:11] In 0 hour(s) and 10 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210426T1030) [10:19:21] (03CR) 10Reedy: [C: 03+2] Add wmgUseFooterTechCodeOfConductLink to replace wmgUseFooterCodeOfConductLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681835 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [10:19:46] (03CR) 10David Caro: wmcs.ceph: Added mon upgrade cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/682099 (owner: 10David Caro) [10:21:09] (03Merged) 10jenkins-bot: Add wmgUseFooterTechCodeOfConductLink to replace wmgUseFooterCodeOfConductLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681835 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [10:22:25] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki/httpd: adapt to kubernetes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/679362 (owner: 10Giuseppe 
Lavagetto) [10:22:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-replica1003.wikimedia.org [10:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:45] (03PS1) 10Marostegui: check_private_data_report: Remove db1125 [puppet] - 10https://gerrit.wikimedia.org/r/682571 (https://phabricator.wikimedia.org/T258361) [10:22:55] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Add wmgUseFooterTechCodeOfConductLink (duration: 00m 59s) [10:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:51] (03CR) 10Marostegui: [C: 03+2] check_private_data_report: Remove db1125 [puppet] - 10https://gerrit.wikimedia.org/r/682571 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [10:23:56] (03PS2) 10Reedy: Flip variables in wmgUseFooterCodeOfConductLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681836 (https://phabricator.wikimedia.org/T280886) [10:24:02] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm for new host ldap-replica1004.wikimedia.org [10:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:29] !log reedy@deploy1002 Synchronized wmf-config/CommonSettings.php: Use wmgUseFooterTechCodeOfConductLink instead of wmgUseFooterCodeOfConductLink (duration: 00m 57s) [10:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:02] (03PS3) 10Reedy: Flip variables in wmgUseFooterCodeOfConductLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681836 (https://phabricator.wikimedia.org/T280886) [10:25:10] !log Deploy schema change on s4 codfw, lag will appear T276292 [10:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:19] T276292: Schema change for renaming new_name_timestamp to rc_new_name_timestamp in recentchanges - https://phabricator.wikimedia.org/T276292 [10:25:54] (03CR) 10Reedy: [C: 03+2] Flip 
variables in wmgUseFooterCodeOfConductLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681836 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [10:26:26] !log upgrading mw* servers php7.2 in codfw [10:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:49] (03PS1) 10Hnowlan: site: set eventlog1003 role to eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) [10:29:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1125.eqiad.wmnet with reason: REIMAGE [10:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:16] PROBLEM - DPKG on mw2272 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [10:30:05] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210426T1030). 
[10:31:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1125.eqiad.wmnet with reason: REIMAGE [10:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:27] (03PS23) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [10:32:28] (03PS9) 10Jcrespo: mariadb: Setup 2 new host as temporary metadata database for media backups [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) [10:32:30] (03PS6) 10Jcrespo: mediabackup: Setup the storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/681117 (https://phabricator.wikimedia.org/T276442) [10:32:33] (03PS1) 10Jcrespo: dbbackups: Add s2 to db1102, a buster backup source [puppet] - 10https://gerrit.wikimedia.org/r/682574 (https://phabricator.wikimedia.org/T280492) [10:32:35] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682575 (https://phabricator.wikimedia.org/T128546) [10:32:41] (03PS2) 10Jcrespo: dbbackups: Add s2 to db1102, a buster backup source [puppet] - 10https://gerrit.wikimedia.org/r/682574 (https://phabricator.wikimedia.org/T280492) [10:33:48] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682575 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:34:31] (03CR) 10jerkins-bot: [V: 04-1] dbbackups: Add s2 to db1102, a buster backup source [puppet] - 10https://gerrit.wikimedia.org/r/682574 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [10:34:40] (03PS3) 10Jcrespo: dbbackups: Add s2 to db1102, a buster backup source [puppet] - 10https://gerrit.wikimedia.org/r/682574 (https://phabricator.wikimedia.org/T280492) [10:35:07] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Setup the storage hosts [puppet] - 
10https://gerrit.wikimedia.org/r/681117 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [10:35:54] 10Puppet, 10SRE: Determine safe concurrent puppet run batches via cumin - https://phabricator.wikimedia.org/T280622 (10Volans) >>! In T280622#7029474, @jbond wrote: >> How long did the run take? My reading of the graphs says ~40m (from 12:00 to 12:38), is that correct? > This sounds about right to me but unfor... [10:36:22] (03Merged) 10jenkins-bot: Flip variables in wmgUseFooterCodeOfConductLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681836 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [10:36:42] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682575 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:37:24] (03PS2) 10Reedy: Update messages used for tech CoC [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681834 (https://phabricator.wikimedia.org/T280886) [10:37:39] (03CR) 10Reedy: [C: 04-2] "Rebased back ontop, but can't land until the messages do. 
-2 for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681834 (https://phabricator.wikimedia.org/T280886) (owner: 10Reedy) [10:37:44] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Setup wmgUseFooterCodeOfConductLink for later usage (duration: 00m 57s) [10:37:46] (03PS4) 10Jcrespo: dbbackups: Add s2 to db1102, a buster backup source [puppet] - 10https://gerrit.wikimedia.org/r/682574 (https://phabricator.wikimedia.org/T280492) [10:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:09] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.062 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:38:14] (03PS5) 10Jcrespo: dbbackups: Add s2 to db1102, a buster backup source [puppet] - 10https://gerrit.wikimedia.org/r/682574 (https://phabricator.wikimedia.org/T280492) [10:38:33] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ldap-replica1004.wikimedia.org [10:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:38] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1125.eqiad.wmnet'] ` and were **ALL** successful. 
[10:39:48] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Add s2 to db1102, a buster backup source [puppet] - 10https://gerrit.wikimedia.org/r/682574 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [10:42:18] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10Pginer-WMF) [10:44:48] (03PS3) 10Jbond: systemd::timer::job: update the program name of systemd-timer-mail-wrapper [puppet] - 10https://gerrit.wikimedia.org/r/682540 (https://phabricator.wikimedia.org/T281090) [10:45:26] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:682575| Bumping portals to master (T128546)]] (duration: 00m 57s) [10:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:38] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:46:24] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:682575| Bumping portals to master (T128546)]] (duration: 00m 57s) [10:46:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:21] (03CR) 10Jbond: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/682540 (https://phabricator.wikimedia.org/T281090) (owner: 10Jbond) [10:48:48] (03CR) 10Jbond: "pcc https://puppet-compiler.wmflabs.org/compiler1001/29193/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/682540 (https://phabricator.wikimedia.org/T281090) (owner: 10Jbond) [10:51:04] (03PS1) 10Volans: clustershell: instantiate progress bar earlier [software/cumin] - 10https://gerrit.wikimedia.org/r/682588 [10:51:16] (03CR) 10Volans: "replies inline" (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [10:51:39] (03CR) 10Volans: [C: 03+2] 
"Trivial, test-only, self-merging" [software/cumin] - 10https://gerrit.wikimedia.org/r/681690 (owner: 10Volans) [10:54:12] 10SRE, 10Internet-Archive, 10Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (10Urbanecm) >>! In T281019#7032660, @fgiunchedi wrote: > From my tests the culprit seems to be webproxy hosts closing the transfer after ~4MB, though using urldownloader wor... [10:54:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, SyslogIdentifier is also supported even on jessie." [puppet] - 10https://gerrit.wikimedia.org/r/682540 (https://phabricator.wikimedia.org/T281090) (owner: 10Jbond) [10:54:29] (03CR) 10Volans: "reply inline" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/681692 (https://phabricator.wikimedia.org/T212783) (owner: 10Volans) [10:58:08] !log restarting php-fpm in mw* clusters in codfw to pick up php7.2 update [10:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:37] RECOVERY - DPKG on mw2272 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210426T1100). [11:00:05] CFisch_WMDE and kart_: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] CFisch_WMDE: o/ [11:00:24] \o [11:00:27] * kart_ is here [11:00:37] well, I can deploy today [11:00:39] * CFisch_WMDE a bit too distracted at home to deploy myself [11:01:08] CFisch_WMDE: don't worry :) [11:01:14] :-) [11:01:46] jenkins appears to be busy today [11:04:00] Yeah! 
[11:04:13] PROBLEM - SSH on mw1270.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:05:40] 5 minutes since merge and still not even running :/ [11:05:50] https://integration.wikimedia.org/zuul/ ouch, looks like it might take some time [11:05:58] yeah yeah [11:09:09] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01059 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:11:14] kart_: finally. Please test on mwdebug1002 [11:11:17] *mwdebug1001 [11:11:22] Yeah. [11:16:01] Urbanecm: Looks good! [11:16:07] cool, syncing [11:17:03] PROBLEM - Check for expired certificates debmonitor_discovery_wmnet on pki1001 is CRITICAL: CRITICAL - 1702 certs expiry in 5 days https://wikitech.wikimedia.org/wiki/PKI/Debugging [11:17:30] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2b5b640ad28bce1df20c2ca82654996d9cfc7630: Enable ContentTranslation as a default tool for 11 Wikipedias (T279422) (duration: 00m 57s) [11:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:38] kart_: should be live [11:17:39] T279422: Deploy Content Translation tool in 11 Wikis - https://phabricator.wikimedia.org/T279422 [11:17:45] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:17:49] Urbanecm: Awesome, thanks!! [11:18:06] any time [11:22:43] PROBLEM - Check for expired certificates debmonitor_discovery_wmnet on pki2001 is CRITICAL: CRITICAL - 1702 certs expiry in 5 days https://wikitech.wikimedia.org/wiki/PKI/Debugging [11:23:02] Urbanecm: Are you done SWATing? 
[11:23:29] hoo: currently waiting for a backport to get merged [11:23:42] If you have something quick, go ahead and ping me when done hoo [11:23:50] Yes, I have a quick config patch [11:23:57] Go ahead then :) [11:28:25] moritzm: you happy for me to merge [11:29:43] jbond42: ack, please do! [11:29:47] XioNoX, arturo: the dns changes are for cloudgw1002, you should run the dns.netbox cookbook [11:29:55] moritzm: done [11:30:04] volans: ack [11:30:06] doing now [11:30:08] thx [11:30:26] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:20] !log hoo@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:681137|Revert "Set wgPageImagesAPIDefaultLicense to 'any' for wikidata"]] (duration: 00m 57s) [11:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:50] I'm done :) [11:32:16] cool :) [11:32:21] * Urbanecm still waiting for CI [11:32:39] zZzZzZ [11:39:20] CFisch_WMDE: finally [11:39:26] \o/ [11:39:50] CFisch_WMDE: pulled onto mwdebug1001, please test [11:40:40] Urbanecm: yeah not much to test there :-) go on [11:40:46] okay, syncing then [11:45:04] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/TemplateData/modules/ext.templateDataGenerator.editTemplatePage/Dialog.js: a347517f906b07b2503ae559c6cc714e1c50e4aa: Fix suggested values not being shown when the params type isnt specified (T280688) (duration: 00m 57s) [11:45:13] CFisch_WMDE: should be live [11:45:15] anything else? [11:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:18] T280688: Show Suggested Values field if the type should allow it - https://phabricator.wikimedia.org/T280688 [11:46:06] Urbanecm: Thanks a lot. I'm fine! 
[11:46:14] cool :) [11:46:17] !log EU B&C done [11:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:31] !log hashar@deploy1002 Started deploy [integration/docroot@c2e48c9]: doc: Explain that VE is both stand-alone and integrated into MediaWiki [11:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:41] jouncebot: next [11:49:41] In 5 hour(s) and 10 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210426T1700) [11:49:44] !log hashar@deploy1002 Finished deploy [integration/docroot@c2e48c9]: doc: Explain that VE is both stand-alone and integrated into MediaWiki (duration: 00m 13s) [11:49:49] Urbanecm: Can I deploy? [11:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:15] marostegui: certainly [11:50:21] Urbanecm: thanks! [11:50:25] Np [11:51:27] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:10] !log marostegui@deploy1002 Synchronized wmf-config/db-eqiad.php: Disable writes on es4 T279281 (duration: 00m 56s) [11:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:27] T279281: Upgrade 10.4.13 hosts to a higher version - https://phabricator.wikimedia.org/T279281 [11:55:27] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:56:34] !log Restart es4 primary master - T279281 [11:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:44] !log marostegui@deploy1002 Synchronized wmf-config/db-eqiad.php: Enable writes on es4 T279281 (duration: 00m 56s) [12:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:00] T279281: Upgrade 10.4.13 
hosts to a higher version - https://phabricator.wikimedia.org/T279281 [12:05:07] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,service=nginx,name=mw1338.eqiad.wmnet [12:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:21] RECOVERY - SSH on mw1270.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:06:42] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,service=nginx,name=mw1338.eqiad.wmnet [12:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:20:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:24:12] !log cleaning watchlist of QuickStatementsBot in wikidatawiki [12:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:45] RECOVERY - Check for expired certificates debmonitor_discovery_wmnet on pki1001 is OK: OK - No certificates due to expire https://wikitech.wikimedia.org/wiki/PKI/Debugging [12:27:33] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,service=nginx,name=mw1338.eqiad.wmnet [12:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:09] RECOVERY - Disk space on mwlog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops [12:28:16] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: cluster=jobrunner,service=nginx,name=mw1338.eqiad.wmnet [12:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:11] RECOVERY - Check for expired certificates debmonitor_discovery_wmnet on pki2001 is OK: OK - No certificates due to expire https://wikitech.wikimedia.org/wiki/PKI/Debugging [12:30:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1135 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15524 and previous config saved to /var/cache/conftool/dbconfig/20210426-123020-marostegui.json [12:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 25%: Repool db1135', diff saved to https://phabricator.wikimedia.org/P15525 and previous config saved to /var/cache/conftool/dbconfig/20210426-124022-root.json [12:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:41:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1105 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P15526 and previous config saved to /var/cache/conftool/dbconfig/20210426-124141-marostegui.json [12:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:28] (03PS2) 10David Caro: openstack.victoria: update libvird defaults file to force listen [puppet] - 10https://gerrit.wikimedia.org/r/682605 [12:52:11] (03Merged) 10jenkins-bot: tests: fix minimum dependency and pytest warning [software/cumin] - 10https://gerrit.wikimedia.org/r/681691 (owner: 10Volans) [12:52:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I see:" [puppet] - 10https://gerrit.wikimedia.org/r/682605 (owner: 10David Caro) [12:53:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 25%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P15527 and previous config saved to /var/cache/conftool/dbconfig/20210426-125354-root.json [12:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 25%: Repool db1105:3311', diff saved to https://phabricator.wikimedia.org/P15528 and previous config saved to /var/cache/conftool/dbconfig/20210426-125409-root.json [12:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 50%: Repool db1135', diff saved to https://phabricator.wikimedia.org/P15529 and previous config saved to /var/cache/conftool/dbconfig/20210426-125526-root.json [12:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:50] !log installing gst-plugins-base1.0 security updates [12:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:46] (03PS3) 10David Caro: wmcs.openstack: add cloudvirt 
maintenance cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) [13:01:05] (03CR) 10ArielGlenn: [C: 03+2] snapshot: Drop absented crons [puppet] - 10https://gerrit.wikimedia.org/r/682614 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [13:01:20] (03CR) 10JMeybohm: [C: 04-1] rdf-streaming-updater: create helmfile.d structure (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [13:03:33] (03CR) 10David Caro: [C: 03+2] wmcs.openstack: add cloudvirt maintenance cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [13:03:40] (03CR) 10David Caro: [C: 03+2] "Just renamed the cookbook" [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [13:04:10] 10SRE, 10Internet-Archive, 10Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (10Zabe) >>! In T281019#7033780, @Languageseeker wrote: > I got a 503 error with web.archive.org/web/20150905070709if_/http://www.quartos.org/quarto_images/ham-1625-22278x-fo... 
[13:04:32] (03CR) 10David Caro: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/682605 (owner: 10David Caro) [13:06:39] (03CR) 10Elukey: [C: 03+1] "Ok to proceed with a pcc run first if possible, just to verify that everything is good :)" [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) (owner: 10Hnowlan) [13:06:57] (03Merged) 10jenkins-bot: wmcs.openstack: add cloudvirt maintenance cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [13:08:04] (03CR) 10David Caro: [C: 03+2] openstack.victoria: update libvird defaults file to force listen [puppet] - 10https://gerrit.wikimedia.org/r/682605 (owner: 10David Caro) [13:08:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 50%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P15530 and previous config saved to /var/cache/conftool/dbconfig/20210426-130858-root.json [13:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 50%: Repool db1105:3311', diff saved to https://phabricator.wikimedia.org/P15531 and previous config saved to /var/cache/conftool/dbconfig/20210426-130913-root.json [13:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 75%: Repool db1135', diff saved to https://phabricator.wikimedia.org/P15532 and previous config saved to /var/cache/conftool/dbconfig/20210426-131029-root.json [13:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:59] (03PS1) 10Alexandros Kosiaris: safe-service-restart: Only verify in scope services [puppet] - 10https://gerrit.wikimedia.org/r/682619 (https://phabricator.wikimedia.org/T279100) [13:12:07] (03CR) 10Volans: "Post merge -2." 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [13:12:42] (03PS1) 10Volans: Revert "wmcs.openstack: add cloudvirt maintenance cookbooks" [cookbooks] - 10https://gerrit.wikimedia.org/r/682628 [13:13:52] 10SRE, 10Internet-Archive, 10Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (10Languageseeker) That's not the same file. 002 uploaded while 010 failed. It seems that the error occurs during the publishing stage. [13:14:37] !log installing ldap-replica2005/2006 [13:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:21:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:22:15] (03PS1) 10David Caro: openstack.nova: libvirt use systemd socket instead of --listen [puppet] - 10https://gerrit.wikimedia.org/r/682621 [13:24:02] PROBLEM - snapshot of s2 in eqiad on alert1001 is CRITICAL: Last snapshot for s2 at eqiad (db1156.eqiad.wmnet) taken on 2021-04-26 11:58:44 is 985 GB, but previous one was 834 GB, a change of 18.2% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [13:24:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 75%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P15533 and previous config saved to /var/cache/conftool/dbconfig/20210426-132402-root.json [13:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 75%: Repool db1105:3311', diff saved to https://phabricator.wikimedia.org/P15534 and previous config saved to /var/cache/conftool/dbconfig/20210426-132417-root.json [13:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 100%: Repool db1135', diff saved to https://phabricator.wikimedia.org/P15535 and previous config saved to /var/cache/conftool/dbconfig/20210426-132533-root.json [13:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:10] 10SRE, 10homer, 10netops: Homer: merge all system.conf templates in one - https://phabricator.wikimedia.org/T269345 (10ayounsi) 05Open→03Resolved [13:27:23] 10SRE, 10Internet-Archive, 10Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (10Zabe) >>! In T281019#7033945, @Languageseeker wrote: > That's not the same file. 002 uploaded while 010 failed. 
It seems that the error occurs during the publishing stage.... [13:27:58] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: for zookeeper migration [13:28:01] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: for zookeeper migration [13:28:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:07] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 2:00:00 6 host(s) and their services with reason: for zookeeper migration ` conf[2001-2006].codfw.wmnet ` [13:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:24] (03CR) 10JMeybohm: [C: 03+2] Swap zookeeper from conf2001 to conf2004 [puppet] - 10https://gerrit.wikimedia.org/r/680875 (https://phabricator.wikimedia.org/T271573) (owner: 10Elukey) [13:30:08] (03CR) 10Hnowlan: [C: 03+2] api-gateway: use envoy 1.15.4 temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/681336 (https://phabricator.wikimedia.org/T280317) (owner: 10Hnowlan) [13:31:43] PROBLEM - MariaDB Replica SQL: db_inventory #page on db2093 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1062, Errmsg: Could not execute Write_rows_v1 event on table zarcillo.instances: Duplicate entry db1125 for key PRIMARY, Error_code: 1062: handler error HA_ERR_FOUND_DUPP_KEY: the events master log db1115-bin.000017, end_log_pos 5773 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling [13:31:51] (03Merged) 10jenkins-bot: api-gateway: use envoy 1.15.4 temporarily [deployment-charts] - 10https://gerrit.wikimedia.org/r/681336 (https://phabricator.wikimedia.org/T280317) (owner: 10Hnowlan) [13:31:56] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:10] RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:24] * volans looking [13:32:35] good morning! can't do much right now, I'm afk for another few minutes [13:32:39] volans: you can ignore it [13:32:54] volans: I am on it [13:32:54] ack [13:32:57] marostegui: ack [13:36:09] 10SRE, 10netops, 10Patch-For-Review: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10jbond) >>! In T221388#6822771, @Volans wrote: >>>! In T221388#6822703, @BBlack wrote: >> I'm probably not up to date on concrete plans built on top of this, but it seems like having the numeric vlan id mig... [13:36:27] RECOVERY - MariaDB Replica SQL: db_inventory #page on db2093 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:36:47] (03PS1) 10Jbond: Option 82: don't use use-vlan-id [homer/public] - 10https://gerrit.wikimedia.org/r/682629 (https://phabricator.wikimedia.org/T221388) [13:38:42] (03CR) 10Andrew Bogott: [C: 03+1] "Does this correspond to the latest libvirt.default that's packaged with V?" 
[puppet] - 10https://gerrit.wikimedia.org/r/682621 (owner: 10David Caro) [13:39:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3312 (re)pooling @ 100%: Repool db1105:3312', diff saved to https://phabricator.wikimedia.org/P15536 and previous config saved to /var/cache/conftool/dbconfig/20210426-133905-root.json [13:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1105:3311 (re)pooling @ 100%: Repool db1105:3311', diff saved to https://phabricator.wikimedia.org/P15537 and previous config saved to /var/cache/conftool/dbconfig/20210426-133922-root.json [13:39:29] (03PS1) 10JMeybohm: configcluster: Update zookeeper version to debian upstream [puppet] - 10https://gerrit.wikimedia.org/r/682655 (https://phabricator.wikimedia.org/T271573) [13:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:45] 10SRE, 10Thumbor, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (10Gilles) [13:40:17] (03CR) 10David Caro: [C: 04-2] "Discussing why this would be needed" [cookbooks] - 10https://gerrit.wikimedia.org/r/682628 (owner: 10Volans) [13:41:00] (03CR) 10David Caro: [C: 03+2] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/682621 (owner: 10David Caro) [13:41:03] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29205/console" [puppet] - 10https://gerrit.wikimedia.org/r/682655 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [13:41:37] (03CR) 10Elukey: [C: 03+1] configcluster: Update zookeeper version to debian upstream [puppet] - 10https://gerrit.wikimedia.org/r/682655 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [13:42:03] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] 
configcluster: Update zookeeper version to debian upstream [puppet] - 10https://gerrit.wikimedia.org/r/682655 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [13:42:58] (03CR) 10Ayounsi: [C: 03+1] Option 82: don't use use-vlan-id [homer/public] - 10https://gerrit.wikimedia.org/r/682629 (https://phabricator.wikimedia.org/T221388) (owner: 10Jbond) [13:45:20] (03PS2) 10Esanders: Make DiscussionTool's sourcemodetoolbar available on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682184 (https://phabricator.wikimedia.org/T281011) [13:45:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_zookeeper site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:47:54] (03CR) 10Jbond: [C: 03+2] Option 82: don't use use-vlan-id [homer/public] - 10https://gerrit.wikimedia.org/r/682629 (https://phabricator.wikimedia.org/T221388) (owner: 10Jbond) [13:49:20] (03Merged) 10jenkins-bot: Option 82: don't use use-vlan-id [homer/public] - 10https://gerrit.wikimedia.org/r/682629 (https://phabricator.wikimedia.org/T221388) (owner: 10Jbond) [14:01:42] 10SRE, 10MediaWiki-General, 10Traffic, 10Browser-Support-Apple-Safari: File:Chessboard480.svg WEBP thumbnail version not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10Gilles) @ema I've just realised that the net was too wide. Chrome has "Safari" in its UA stri... [14:01:58] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (10Ottomata) Approved! 
[14:03:26] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on conf2001.codfw.wmnet with reason: for zookeeper migration [14:03:28] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on conf2001.codfw.wmnet with reason: for zookeeper migration [14:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:36] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10ops-monitoring-bot) Icinga downtime set by jayme@cumin1001 for 1 day, 0:00:00 1 host(s) and their services with reason: for zookeeper migration ` conf2001.codfw.wmnet ` [14:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:50] (03PS1) 10David Caro: wmcs.openstack: Use Icinga directly [cookbooks] - 10https://gerrit.wikimedia.org/r/682663 [14:05:03] (03PS2) 10David Caro: wmcs.openstack: Use Icinga directly [cookbooks] - 10https://gerrit.wikimedia.org/r/682663 [14:05:05] (03CR) 10Volans: [C: 03+1] "LGTM, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/682663 (owner: 10David Caro) [14:05:39] (03Abandoned) 10Volans: Revert "wmcs.openstack: add cloudvirt maintenance cookbooks" [cookbooks] - 10https://gerrit.wikimedia.org/r/682628 (owner: 10Volans) [14:06:10] (03CR) 10Volans: wmcs.openstack: add cloudvirt maintenance cookbooks (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/682169 (https://phabricator.wikimedia.org/T280641) (owner: 10David Caro) [14:06:19] (03PS3) 10David Caro: wmcs.openstack: Use Icinga directly [cookbooks] - 10https://gerrit.wikimedia.org/r/682663 [14:06:19] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:52] (03PS1) 10JMeybohm: Swap zookeeper from conf2002 to conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/682666 
(https://phabricator.wikimedia.org/T271573) [14:11:54] (03PS1) 10JMeybohm: Swap zookeeper from conf2003 to conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/682667 (https://phabricator.wikimedia.org/T271573) [14:12:51] (03CR) 10Elukey: [C: 03+1] Swap zookeeper from conf2002 to conf2005 [puppet] - 10https://gerrit.wikimedia.org/r/682666 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [14:13:07] (03CR) 10Elukey: [C: 03+1] Swap zookeeper from conf2003 to conf2006 [puppet] - 10https://gerrit.wikimedia.org/r/682667 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [14:13:14] (03PS24) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [14:13:16] (03PS10) 10Jcrespo: mariadb: Setup 2 new host as temporary metadata database for media backups [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) [14:13:18] (03PS7) 10Jcrespo: mediabackup: Setup the storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/681117 (https://phabricator.wikimedia.org/T276442) [14:13:20] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db1156 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/682668 (https://phabricator.wikimedia.org/T280492) [14:13:34] (03PS2) 10Jcrespo: mariadb: Reenable notifications on db1156 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/682668 (https://phabricator.wikimedia.org/T280492) [14:15:33] (03CR) 10jerkins-bot: [V: 04-1] wmcs.openstack: Use Icinga directly [cookbooks] - 10https://gerrit.wikimedia.org/r/682663 (owner: 10David Caro) [14:15:36] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Setup the storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/681117 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [14:15:56] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 
(10JMeybohm) Switched zookeeper from conf2001 to conf2004. We decided to leave it like this for today and see if anything comes up. [14:16:50] 10SRE, 10Commons, 10MediaWiki-File-management, 10Thumbor, and 2 others: Thumbor doesn't save Content-Disposition: inline headers to Swift for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) Confirming that this header now gets saved in Swift for new files: ` gilles@ms-fe1005:~$ curl... [14:16:58] 10SRE, 10Thumbor, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (10Gilles) [14:17:02] 10SRE, 10Commons, 10MediaWiki-File-management, 10Thumbor, and 2 others: Thumbor doesn't save Content-Disposition: inline headers to Swift for webp thumbnails - https://phabricator.wikimedia.org/T254557 (10Gilles) 05Open→03Resolved [14:17:59] (03CR) 10Volans: [C: 03+1] "LGTM, thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/682663 (owner: 10David Caro) [14:19:06] 10SRE, 10Thumbor, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (10Gilles) [14:20:51] RECOVERY - Check whether microcode mitigations for CPU vulnerabilities are applied on an-coord1001 is OK: OK - All expected CPU flags found https://wikitech.wikimedia.org/wiki/Microcode [14:28:29] !log installing ldap-replica1003/1004 [14:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:42] (03PS1) 10JMeybohm: configcluster: No longer include zookeeper in old configcluster role [puppet] - 10https://gerrit.wikimedia.org/r/682669 (https://phabricator.wikimedia.org/T271573) [14:32:56] 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [14:34:51] PROBLEM - Long running screen/tmux on aqs1010 is CRITICAL: CRIT: Long 
running SCREEN process. (user: root PID: 31400, 1741449s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [14:35:39] 10SRE, 10WMF-JobQueue, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (10akosiaris) >>! In T279100#7033661, @jijiki wrote: >>>! In T279100#7019419, @akosiaris wrote: >> So,... [14:36:00] 10SRE, 10Traffic: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 - https://phabricator.wikimedia.org/T274431 (10ssingh) [14:36:58] hnowlan: tmux --^ :) [14:37:34] elukey: oops :) [14:38:49] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 66 probes of 637 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:45:01] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 34759776 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:45:17] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 53 probes of 637 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:47:14] (03CR) 10Elukey: [C: 03+1] configcluster: No longer include zookeeper in old configcluster role [puppet] - 10https://gerrit.wikimedia.org/r/682669 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [14:47:19] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 468368 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:47:46] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29206/console" [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) (owner: 10Hnowlan) [14:47:59] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [14:47:59] !log hnowlan@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [14:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:02] (03PS1) 10Ema: cache: Safari regex refinement [puppet] - 10https://gerrit.wikimedia.org/r/682672 (https://phabricator.wikimedia.org/T280439) [14:52:00] (03CR) 10jerkins-bot: [V: 04-1] cache: Safari regex refinement [puppet] - 10https://gerrit.wikimedia.org/r/682672 (https://phabricator.wikimedia.org/T280439) (owner: 10Ema) [14:54:10] (03PS2) 10Ema: cache: Safari regex refinement [puppet] - 10https://gerrit.wikimedia.org/r/682672 (https://phabricator.wikimedia.org/T280439) [14:54:18] (03CR) 10Gilles: [C: 03+1] "Looks good to me, seems like CI is unhappy with the commit message, not the contents of the change." [puppet] - 10https://gerrit.wikimedia.org/r/682672 (https://phabricator.wikimedia.org/T280439) (owner: 10Ema) [14:54:28] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [14:54:29] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . 
[14:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:18] (03CR) 10Ema: [C: 03+2] cache: Safari regex refinement [puppet] - 10https://gerrit.wikimedia.org/r/682672 (https://phabricator.wikimedia.org/T280439) (owner: 10Ema) [14:57:35] 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10fgiunchedi) [15:00:48] (03CR) 10JMeybohm: [C: 04-1] rdf-streaming-updater: enable HA capability (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: 10Mstyles) [15:01:46] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:01:46] !log hnowlan@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:52] !log installing jquery security updates on stretch [15:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:27] 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) p:05Triage→03Medium [15:20:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Cmjohnson) Dell sent an email with a list of things they want to be done, considering that they've had 2 technicians out to fix the issue with zero resolution, I replied... 
[15:21:08] 10SRE, 10Wikimedia-Logstash, 10observability: Update saved / short links with objects in ELK7 - https://phabricator.wikimedia.org/T272016 (10lmata) 05Open→03Resolved a:03lmata closing please reopen if unresolved [15:21:11] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 (10lmata) [15:21:54] !log restart zookeeper on conf2004 to pick up the -javaagent setting for the prometheus exporter [15:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:25] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [15:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:33] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [15:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:10] 10SRE, 10netops, 10Patch-For-Review: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10jbond) Just to confirm after removing `use-vlan-id` re-imaging of sretest1002 worked fine [15:29:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (10akosiaris) [15:29:53] (03CR) 10Alexandros Kosiaris: [C: 03+2] Adding akhatun [puppet] - 10https://gerrit.wikimedia.org/r/682601 (https://phabricator.wikimedia.org/T280967) (owner: 10Alexandros Kosiaris) [15:33:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:35:48] 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [15:37:19] 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [15:38:22] 10SRE, 10Services, 10Wikidata, 10Wikidata-Query-Service, and 3 others: New service request: WDQS Flink based Streaming Updater - https://phabricator.wikimedia.org/T280579 (10MPhamWMF) [15:43:00] 10SRE, 10netops, 10Patch-For-Review: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10Volans) Thanks for the confirmation and sorry about the trouble @jbond [15:43:23] 10SRE, 10ops-codfw: codfw: Relocate servers in 10G racks - https://phabricator.wikimedia.org/T281135 (10Papaul) [15:43:47] (03PS1) 10Alexandros Kosiaris: admin: indicate that akhatun now has a kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/682676 (https://phabricator.wikimedia.org/T280967) [15:44:45] (03PS4) 10Herron: replace mwlog1001 with new mwlog[12]002 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) [15:44:55] (03CR) 10jerkins-bot: [V: 04-1] admin: indicate that akhatun now has a kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/682676 (https://phabricator.wikimedia.org/T280967) (owner: 10Alexandros Kosiaris) [15:53:08] (03PS2) 10Alexandros Kosiaris: admin: indicate that akhatun now has a kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/682676 (https://phabricator.wikimedia.org/T280967) [15:57:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: indicate that akhatun now has a kerberos principal [puppet] - 10https://gerrit.wikimedia.org/r/682676 (https://phabricator.wikimedia.org/T280967) (owner: 10Alexandros Kosiaris) [16:00:18] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access 
to Wikimedia Analytics Data for Aisha Khatun - https://phabricator.wikimedia.org/T280967 (10akosiaris) 05Open→03Resolved a:03akosiaris @AKhatun_WMF Access has been granted. I 'll resolve this task, feel free to reopen though if problem... [16:09:08] (03PS1) 10Andrew Bogott: nova vendordata: wipe the sssd cache after final puppet run [puppet] - 10https://gerrit.wikimedia.org/r/682679 (https://phabricator.wikimedia.org/T280514) [16:10:25] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: wipe the sssd cache after final puppet run [puppet] - 10https://gerrit.wikimedia.org/r/682679 (https://phabricator.wikimedia.org/T280514) (owner: 10Andrew Bogott) [16:10:44] (03PS2) 10CRusnov: check-raid.py: Port to Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) [16:11:30] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] nova vendordata: wipe the sssd cache after final puppet run (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682679 (https://phabricator.wikimedia.org/T280514) (owner: 10Andrew Bogott) [16:12:22] 10SRE, 10Wikimedia-SVG-rendering, 10serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (10akosiaris) Reading the answers (thanks!!!) I understand that > Having a list of fonts available on Thumbor server... 
[16:13:13] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:44] (03PS1) 10Andrew Bogott: nova vendordata: trivial #inline comment fix [puppet] - 10https://gerrit.wikimedia.org/r/682681 [16:14:52] (03CR) 10Andrew Bogott: [C: 03+2] nova vendordata: trivial #inline comment fix [puppet] - 10https://gerrit.wikimedia.org/r/682681 (owner: 10Andrew Bogott) [16:15:03] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:03] (03PS3) 10Legoktm: lists: Fix mailman3 apache config [puppet] - 10https://gerrit.wikimedia.org/r/681785 (https://phabricator.wikimedia.org/T278612) [16:18:32] (03CR) 10Legoktm: [C: 03+2] lists: Fix mailman3 apache config [puppet] - 10https://gerrit.wikimedia.org/r/681785 (https://phabricator.wikimedia.org/T278612) (owner: 10Legoktm) [16:20:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudvirt1038: PCIe error - https://phabricator.wikimedia.org/T276922 (10Andrew) Thanks for the update! I agree that we should leave this in Dell's hands as much as possible. I'm not impatient to have it back online so do whatever you need to... [16:20:53] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db1102 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/682682 (https://phabricator.wikimedia.org/T280492) [16:21:57] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable notifications on db1102 after maintenance [puppet] - 10https://gerrit.wikimedia.org/r/682682 (https://phabricator.wikimedia.org/T280492) (owner: 10Jcrespo) [16:23:51] 10SRE, 10netops: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055 (10CDanis) Moving to AM sounds good to me. 
But if needed, in the interim we could change the magic string we use in `check_librenms` to something else instead of hash page, wh... [16:24:47] (03PS3) 10Legoktm: mailman3: Use backported packages from component/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905) [16:27:55] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Lucas_Werkmeister_WMDE) Feel free to move the wdqs-gui-build mailing list early (it’s currently in Group F), it’s pretty much written by bots and read by nobody (see T189810 / T2... [16:30:08] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1025.eqiad.wmnet [16:30:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:24] (03PS2) 10Hnowlan: site: set eventlog1003 role to eventlogging, allow eventlog1003 to pull from kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) [16:31:45] (03PS1) 10Dzahn: thumbor: add a timer that writes the output of fc-list to /srv [puppet] - 10https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) [16:31:53] (03CR) 10jerkins-bot: [V: 04-1] site: set eventlog1003 role to eventlogging, allow eventlog1003 to pull from kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) (owner: 10Hnowlan) [16:32:07] 10SRE, 10Wikimedia-Site-requests, 10Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (10cscott) Note that $wgAPIMaxResultSize has a comment that says it depends on $wgMaxArticleSize, so if you bumped $wgMaxArticleSize you probably ne... 
[16:32:12] (03CR) 10jerkins-bot: [V: 04-1] thumbor: add a timer that writes the output of fc-list to /srv [puppet] - 10https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [16:32:14] (03CR) 10Elukey: site: set eventlog1003 role to eventlogging, allow eventlog1003 to pull from kafka-jumbo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) (owner: 10Hnowlan) [16:34:00] (03PS2) 10Dzahn: thumbor: add a timer that writes the output of fc-list to /srv [puppet] - 10https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) [16:34:51] (03PS3) 10Hnowlan: site: set eventlog1003 role to eventlogging, allow eventlog1003 to pull from kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) [16:35:20] (03CR) 10jerkins-bot: [V: 04-1] site: set eventlog1003 role to eventlogging, allow eventlog1003 to pull from kafka-jumbo [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) (owner: 10Hnowlan) [16:35:22] (03PS4) 10Hnowlan: site: eventlog1003 role to eventlogging, allow access to kafka [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) [16:35:31] (03CR) 10Hnowlan: site: eventlog1003 role to eventlogging, allow access to kafka (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/682573 (https://phabricator.wikimedia.org/T278137) (owner: 10Hnowlan) [16:35:33] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/29208/thumbor1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/682685 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [16:36:15] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1025.eqiad.wmnet [16:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:30] (03CR) 10Dzahn: [C: 03+2] gerrit: 
remove Apache 720s timeout [puppet] - 10https://gerrit.wikimedia.org/r/682502 (https://phabricator.wikimedia.org/T277127) (owner: 10Hashar) [16:37:44] (03CR) 10Dzahn: [C: 03+2] gerrit: Remove disableReverseDnsLookup [puppet] - 10https://gerrit.wikimedia.org/r/679447 (owner: 10Paladox) [16:40:14] !log gerrit1001 - reload apache2 [16:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:23] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10Papaul) @fgiunchedi please see below for the information we received from the vendor Please find the unity MIB attached here. Our team uses a MIB browser, such as iReasoning to ac... [16:49:30] !log gerrit - restarted apache (hard) to remove time out from gerrit:682502 [16:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:48] (03PS1) 10Ahmon Dancy: Fixed a few minor typos in README.md [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/682689 [16:49:58] (03CR) 10Dzahn: "did an apache2 restart in prod" [puppet] - 10https://gerrit.wikimedia.org/r/682502 (https://phabricator.wikimedia.org/T277127) (owner: 10Hashar) [17:00:04] ryankemper: (Dis)respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210426T1700). Please do the needful. 
[17:01:58] 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10RobH) [17:02:04] (03CR) 10Dzahn: "I think this might have caused https://phabricator.wikimedia.org/T281176" [puppet] - 10https://gerrit.wikimedia.org/r/675124 (owner: 10Jbond) [17:02:20] 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10RobH) [17:02:52] 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb2003.frack.codfw.wmnet - https://phabricator.wikimedia.org/T281177 (10RobH) a:03Papaul [17:06:46] !log imported postorius_1.3.4-2~bpo10+2 to apt.wm.o [17:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:16] (03CR) 10Herron: "If this looks good could I ask service ops for a deploy? This will help move forward decom of mwlog1001." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [17:13:02] MatmaRex: hey, all beta wikis sounds to be down mentioning DiscussionTools in the traceback. Can you have a look please? [17:13:16] (03PS1) 10Jbond: C:ssh::server: Update the type checking to String [puppet] - 10https://gerrit.wikimedia.org/r/682693 (https://phabricator.wikimedia.org/T281176) [17:13:49] huh [17:14:00] Urbanecm: in a meeting right now, i'll look in 15 min [17:14:13] MatmaRex: thanks, I'll fill a train blocker in the meanwhile. [17:15:45] MatmaRex: filled as T281180. 
[17:15:45] T281180: DiscussionTools: Precondition failed: This Title instance does not represent a proper page, but merely a link target - https://phabricator.wikimedia.org/T281180 [17:15:48] (03CR) 10Dzahn: [C: 03+1] "confirmed (http://man.openbsd.org/sshd_config.5)" [puppet] - 10https://gerrit.wikimedia.org/r/682693 (https://phabricator.wikimedia.org/T281176) (owner: 10Jbond) [17:16:21] (03CR) 10Jeena Huneidi: [C: 03+2] blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/681780 (owner: 10PipelineBot) [17:17:04] Urbanecm: the exception seems to occur on special pages [17:17:51] (03Merged) 10jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/681780 (owner: 10PipelineBot) [17:18:50] !log dduvall@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [17:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:33] !log dduvall@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [17:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:18] MatmaRex: yeah, indeed. Still an issue, IMO :D [17:28:05] yes for sure [17:28:06] !log dduvall@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . 
[17:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:17] (03Abandoned) 10Ahmon Dancy: Test emailing notification of security patch failure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679015 (owner: 10Ahmon Dancy) [17:31:26] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [17:32:04] (03PS3) 10Bartosz Dziewoński: Make DiscussionTool's sourcemodetoolbar available on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682184 (https://phabricator.wikimedia.org/T281011) (owner: 10Esanders) [17:38:16] (03PS10) 10Volans: clustershell: allow to choose different reporters [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [17:38:18] (03PS3) 10Volans: CLI/clustershell: allow to disable progress bars [software/cumin] - 10https://gerrit.wikimedia.org/r/681692 (https://phabricator.wikimedia.org/T212783) [17:38:20] (03PS3) 10Volans: setup.py: support more recent PyParsing versions [software/cumin] - 10https://gerrit.wikimedia.org/r/681758 [17:38:22] (03PS2) 10Volans: clustershell: instantiate progress bar earlier [software/cumin] - 10https://gerrit.wikimedia.org/r/682588 [17:40:04] (03CR) 10Volans: "addressed comments" (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [17:43:18] RECOVERY - snapshot of x1 in codfw on alert1001 is OK: Last snapshot for x1 at codfw (db2101.codfw.wmnet:3320) taken on 2021-04-26 16:43:19 (276 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [17:59:35] jouncebot: next [17:59:35] In 0 hour(s) and 0 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210426T1800) [17:59:38] jouncebot: refresh [17:59:38] I refreshed my knowledge about deployments. [18:00:04] RoanKattouw, Niharika, and Urbanecm: I seem to be stuck in Groundhog week. Sigh. 
Time for (yet another) Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210426T1800). [18:00:05] dancy and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:13] I'm here! [18:00:49] hmm, i realize now that i can't actually test the DiscussionTools config patch, because of the exception on preferences [18:00:54] I can deploy today [18:01:17] and i can't fix that, because there's an unrelated issue in CentralAuth (?) blocking our merges [18:01:31] MatmaRex: it's labs only anyway, so I can just +2 it and let it auto-deploy, if that's fine with you. [18:01:39] (afaik we don't have a way to test labs-only patches) [18:01:44] yeah, it's low risk i guess [18:01:51] yeah [18:02:00] (03CR) 10Urbanecm: [C: 03+2] Make DiscussionTool's sourcemodetoolbar available on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682184 (https://phabricator.wikimedia.org/T281011) (owner: 10Esanders) [18:02:02] let's hope :) [18:02:03] thanks [18:02:22] dancy: want to self-service, or should I deploy for you? [18:02:24] Urbanecm: My commit is easy to test. Just run 'mwscript' without args and see how it responds. The old code will respond `This script can only be run from the command line.` and the new code will have a longer, more informative message. [18:02:37] I'll leave it to you to deploy [18:02:50] okay. Sounds like a cool patch, btw :) [18:03:01] (03CR) 10Urbanecm: [C: 03+2] "thanks for this patch :). Let's ship it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/679517 (owner: 10Ahmon Dancy) [18:03:22] (03Merged) 10jenkins-bot: Make DiscussionTool's sourcemodetoolbar available on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682184 (https://phabricator.wikimedia.org/T281011) (owner: 10Esanders) [18:03:52] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/29209/bast1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/682693 (https://phabricator.wikimedia.org/T281176) (owner: 10Jbond) [18:04:58] (03Merged) 10jenkins-bot: Fix error message if MWScript.php is run without arguments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679517 (owner: 10Ahmon Dancy) [18:05:19] it works! yay :) https://www.irccloud.com/pastebin/jomCZj4J/ [18:05:31] excellent. Thanks for the deploy [18:06:15] np :). Running a sync to get it everywhere [18:06:49] !log urbanecm@deploy1002 Synchronized multiversion/MWScript.php: 5ace4e1b806bcfc4ea059f9e9cae9aa94c0bdbd1: Fix error message if MWScript.php is run without arguments (duration: 00m 58s) [18:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:59] DannyS712: ad https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/682710/, I'd prefer bumping the MW requirement to 1.37 [18:12:48] (03PS1) 10Urbanecm: elwiki: Update Growth experiments configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682700 (https://phabricator.wikimedia.org/T280172) [18:13:48] (03CR) 10Urbanecm: [C: 03+2] elwiki: Update Growth experiments configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682700 (https://phabricator.wikimedia.org/T280172) (owner: 10Urbanecm) [18:14:38] (03PS1) 10Ladsgroup: mailman3: Disable gravatar [puppet] - 10https://gerrit.wikimedia.org/r/682701 [18:15:08] 10SRE, 10Performance-Team, 10serviceops, 10User-jijiki: Run latest Thumbor on Docker with Buster + Python 3 - https://phabricator.wikimedia.org/T267327 
(10Gilles) a:05Gilles→03cmassaro [18:15:48] (03Merged) 10jenkins-bot: elwiki: Update Growth experiments configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682700 (https://phabricator.wikimedia.org/T280172) (owner: 10Urbanecm) [18:17:05] (03CR) 10H.krishna123: "No problem -- just wondering, should I close off the original tickets seeing as the GSoC application period is over? (T277160 and T277162?" [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/673714 (https://phabricator.wikimedia.org/T277162) (owner: 10H.krishna123) [18:18:51] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 2d16f6251a67cf13cef02bbdcb3c9f5c1c505d16: elwiki: Update Growth experiments configuration (T280172) (duration: 00m 58s) [18:18:56] * Urbanecm done [18:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:01] T280172: Deploy Growth features on Greek Wikipedia - https://phabricator.wikimedia.org/T280172 [18:19:53] (03CR) 10Legoktm: [C: 03+2] mailman3: Disable gravatar [puppet] - 10https://gerrit.wikimedia.org/r/682701 (owner: 10Ladsgroup) [18:20:47] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install backup200[4-7] - https://phabricator.wikimedia.org/T277323 (10Papaul) [18:23:13] (03PS4) 10Legoktm: mailman3: Use backported packages from component/mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905) [18:24:50] (03CR) 10Legoktm: [V: 03+1] "Added two more packages after cloud testing." 
[puppet] - 10https://gerrit.wikimedia.org/r/678134 (https://phabricator.wikimedia.org/T278905) (owner: 10Legoktm) [18:26:15] (03PS1) 10RobH: wcqs100[123] setup info [puppet] - 10https://gerrit.wikimedia.org/r/682704 (https://phabricator.wikimedia.org/T276644) [18:26:47] (03CR) 10RobH: [C: 03+2] wcqs100[123] setup info [puppet] - 10https://gerrit.wikimedia.org/r/682704 (https://phabricator.wikimedia.org/T276644) (owner: 10RobH) [18:30:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['wcqs1001.eqiad... [18:31:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (10RobH) [18:32:16] (03CR) 10Gehel: [C: 04-1] "One more comment :/" (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [18:36:02] RECOVERY - Long running screen/tmux on aqs1010 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [18:36:14] 10SRE, 10Platform Engineering, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (10Gilles) [18:45:20] 10SRE, 10ops-eqiad: Rack/power audit in eqiad c8/d5 - https://phabricator.wikimedia.org/T280977 (10wiki_willy) a:03Cmjohnson Hey @ayounsi - we think it's best if we add a 2nd 10g TOR to both C8 and D5. Currently, there's 12u of available rack space in C8 and 11u available in D5. Additional, there are 14x n... 
[18:45:21] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs1001.eqiad.wmnet with reason: REIMAGE [18:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:50] (03PS1) 10Andrew Bogott: Ceph: Try to re-centralize radosgw keyrings [puppet] - 10https://gerrit.wikimedia.org/r/682729 (https://phabricator.wikimedia.org/T276961) [18:47:08] (03CR) 10jerkins-bot: [V: 04-1] Ceph: Try to re-centralize radosgw keyrings [puppet] - 10https://gerrit.wikimedia.org/r/682729 (https://phabricator.wikimedia.org/T276961) (owner: 10Andrew Bogott) [18:47:20] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs1002.eqiad.wmnet with reason: REIMAGE [18:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:35] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs1001.eqiad.wmnet with reason: REIMAGE [18:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:42] (03PS2) 10Andrew Bogott: Ceph: Try to re-centralize radosgw keyrings [puppet] - 10https://gerrit.wikimedia.org/r/682729 (https://phabricator.wikimedia.org/T276961) [18:49:23] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wcqs1003.eqiad.wmnet with reason: REIMAGE [18:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:44] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs1002.eqiad.wmnet with reason: REIMAGE [18:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:55] (03CR) 10jerkins-bot: [V: 04-1] Ceph: Try to re-centralize radosgw keyrings [puppet] - 10https://gerrit.wikimedia.org/r/682729 (https://phabricator.wikimedia.org/T276961) (owner: 10Andrew Bogott) [18:50:02] 10SRE, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 
(10AntiCompositeNumber) [18:51:18] (03PS3) 10Andrew Bogott: Ceph: Try to re-centralize radosgw keyrings [puppet] - 10https://gerrit.wikimedia.org/r/682729 (https://phabricator.wikimedia.org/T276961) [18:51:39] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wcqs1003.eqiad.wmnet with reason: REIMAGE [18:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:04] (03PS4) 10Andrew Bogott: Ceph: Try to re-centralize radosgw keyrings [puppet] - 10https://gerrit.wikimedia.org/r/682729 (https://phabricator.wikimedia.org/T276961) [18:56:25] (03CR) 10Andrew Bogott: [C: 03+2] Ceph: Try to re-centralize radosgw keyrings [puppet] - 10https://gerrit.wikimedia.org/r/682729 (https://phabricator.wikimedia.org/T276961) (owner: 10Andrew Bogott) [18:59:52] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work), 10Patch-For-Review: (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['wcqs1001.eqiad.wmnet', 'wcqs1002.eqiad.wmnet', 'wcqs1003.eqiad... 
[19:06:22] (03PS1) 10Herron: install_server: kafka-main[12]00[1-5] use default release installer [puppet] - 10https://gerrit.wikimedia.org/r/682731 (https://phabricator.wikimedia.org/T225005) [19:06:56] (03PS1) 10Andrew Bogott: Ceph: rename radosgw service [puppet] - 10https://gerrit.wikimedia.org/r/682732 [19:07:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): (Need By: 2021-04-30) rack/setup/install wcqs100[123] - https://phabricator.wikimedia.org/T276644 (10RobH) 05Open→03Resolved [19:07:25] (03CR) 10jerkins-bot: [V: 04-1] Ceph: rename radosgw service [puppet] - 10https://gerrit.wikimedia.org/r/682732 (owner: 10Andrew Bogott) [19:10:18] (03PS2) 10Andrew Bogott: Ceph: rename radosgw service [puppet] - 10https://gerrit.wikimedia.org/r/682732 [19:11:09] (03CR) 10Andrew Bogott: [C: 03+2] Ceph: rename radosgw service [puppet] - 10https://gerrit.wikimedia.org/r/682732 (owner: 10Andrew Bogott) [19:14:12] (03PS1) 10Ryan Kemper: wdqs: add missing raid0 dependency [puppet] - 10https://gerrit.wikimedia.org/r/682735 (https://phabricator.wikimedia.org/T280382) [19:15:26] (03CR) 10Gehel: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/682735 (https://phabricator.wikimedia.org/T280382) (owner: 10Ryan Kemper) [19:22:22] (03CR) 10Bstorm: [C: 03+2] cloudstore: set up secondary_drbd classes [puppet] - 10https://gerrit.wikimedia.org/r/681800 (https://phabricator.wikimedia.org/T224747) (owner: 10Bstorm) [19:41:07] (03CR) 10Aaron Schulz: Move ExternalStore log group from debug to error (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: 10Reedy) [19:45:55] !log uploaded python3-falcon, python3-mimeparse, python3-mujson, openstack-pkg-tools to mailman3 component on apt.wm.o [19:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:19] (03PS1) 10Ladsgroup: lists: Send error logs of apache2/exim4 to logstash [puppet] - 
10https://gerrit.wikimedia.org/r/682736 (https://phabricator.wikimedia.org/T276697) [19:49:21] (03PS1) 10Ladsgroup: mailman3: Increase the log level to WARNING and send them to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682737 (https://phabricator.wikimedia.org/T276697) [19:50:33] (03CR) 10jerkins-bot: [V: 04-1] lists: Send error logs of apache2/exim4 to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682736 (https://phabricator.wikimedia.org/T276697) (owner: 10Ladsgroup) [19:50:53] (03CR) 10jerkins-bot: [V: 04-1] mailman3: Increase the log level to WARNING and send them to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682737 (https://phabricator.wikimedia.org/T276697) (owner: 10Ladsgroup) [19:56:58] RECOVERY - Long running screen/tmux on centrallog1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [19:59:42] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:00:04] chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210426T2000) [20:04:01] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: add missing raid0 dependency [puppet] - 10https://gerrit.wikimedia.org/r/682735 (https://phabricator.wikimedia.org/T280382) (owner: 10Ryan Kemper) [20:08:10] (03PS2) 10Ladsgroup: lists: Send error logs of apache2/exim4 to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682736 (https://phabricator.wikimedia.org/T276697) [20:08:12] (03PS2) 10Ladsgroup: mailman3: Increase the log level to WARNING and send them to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682737 (https://phabricator.wikimedia.org/T276697) [20:12:08] (03PS1) 10Dzahn: site/DHCP: remove planet1003 [puppet] - 10https://gerrit.wikimedia.org/r/682739 (https://phabricator.wikimedia.org/T280989) [20:13:52] (03CR) 10Dzahn: [C: 03+2] site/DHCP: remove planet1003 [puppet] - 10https://gerrit.wikimedia.org/r/682739 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [20:15:33] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts planet1003.eqiad.wmnet [20:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:40] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005869 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:25:07] cloudvirt* [20:26:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts planet1003.eqiad.wmnet [20:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:30] 10SRE, 10Patch-For-Review: try planet on bullseye - https://phabricator.wikimedia.org/T280989 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `planet1003.eqiad.wmnet` - planet1003.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found Ganeti VM - VM shutd... 
[20:35:10] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host people1003.eqiad.wmnet [20:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:04] !log restarting php-fpm on phab1001 to deploy phabricator hotfix d238db85b8d8072d99f31805aa4a8a7cf0c09941 [20:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:45] PROBLEM - Disk space on mwlog1001 is CRITICAL: DISK CRITICAL - free space: /srv 272454 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops [20:59:53] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:00:04] Reedy and sbassett: That opportune time is upon us again. Time for a Weekly Security deployment window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210426T2100). 
[21:01:04] (03PS1) 10Ssingh: Improve tests/test_dns.py [no code change] [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/682750 [21:03:27] (03CR) 10Ssingh: [C: 03+2] Improve tests/test_dns.py [no code change] [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/682750 (owner: 10Ssingh) [21:03:44] (03CR) 10Ssingh: [V: 03+1 C: 03+2] Improve tests/test_dns.py [no code change] [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/682750 (owner: 10Ssingh) [21:03:49] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Improve tests/test_dns.py [no code change] [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/682750 (owner: 10Ssingh) [21:05:20] (03PS1) 10Andrew Bogott: wmcs-image-create: Add half-hearted support for new daily Bullseye builds [puppet] - 10https://gerrit.wikimedia.org/r/682751 (https://phabricator.wikimedia.org/T280801) [21:06:29] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-image-create: Add half-hearted support for new daily Bullseye builds [puppet] - 10https://gerrit.wikimedia.org/r/682751 (https://phabricator.wikimedia.org/T280801) (owner: 10Andrew Bogott) [21:17:12] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10wiki_willy) Just following up here. @Papaul, @Cmjohnson, and @RobH - can you guys fill this out by end of this week? 
Thanks, Willy [21:18:32] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [21:20:15] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [21:21:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host people1003.eqiad.wmnet [21:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:35] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [21:23:58] (03PS1) 10Dzahn: site/DHCP: add people1003 with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682755 (https://phabricator.wikimedia.org/T280989) [21:24:24] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [21:25:14] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [21:26:13] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops, 10procurement: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) This won't affect cumin execution rights, correct? 
[21:29:06] 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) [21:29:16] (03CR) 10Dzahn: [C: 03+2] site/DHCP: add people1003 with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682755 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [21:29:21] (03PS2) 10Dzahn: site/DHCP: add people1003 with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/682755 (https://phabricator.wikimedia.org/T280989) [21:30:04] (03PS3) 10Ladsgroup: mailman3: Increase the log level to WARNING and send them to logstash [puppet] - 10https://gerrit.wikimedia.org/r/682737 (https://phabricator.wikimedia.org/T276697) [21:30:46] (03PS1) 10Jdlrobson: Enable language in header for office and testwiki logged in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682757 (https://phabricator.wikimedia.org/T280526) [21:30:50] (03PS1) 10Jdlrobson: Enable new language button for all logged in users outside test projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526) [21:33:45] (03CR) 10Ladsgroup: "Tested on mailman03. The files have been created and the logs are there. Can't say if it will end up in logstash or not." [puppet] - 10https://gerrit.wikimedia.org/r/682737 (https://phabricator.wikimedia.org/T276697) (owner: 10Ladsgroup) [21:41:29] (03PS1) 10Ladsgroup: mailman3: Improve the logging directory permissions [puppet] - 10https://gerrit.wikimedia.org/r/682760 [21:45:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) a:05RobH→03Jclark-ctr I neglected to reassign this last week, these can be moved according to the updated per host che... 
[21:47:04] (03PS7) 10Legoktm: lists: Move renamed lists into hiera [puppet] - 10https://gerrit.wikimedia.org/r/682192 [21:48:29] (03CR) 10Legoktm: [C: 03+2] lists: Move renamed lists into hiera [puppet] - 10https://gerrit.wikimedia.org/r/682192 (owner: 10Legoktm) [21:50:43] (03CR) 10Legoktm: [C: 03+2] mailman: Don't un-advertise a list when disabling it [puppet] - 10https://gerrit.wikimedia.org/r/682185 (owner: 10Legoktm) [21:51:22] (03PS4) 10Legoktm: lists: Update mod_security rules for mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/681244 [21:52:13] (03CR) 10Legoktm: [C: 03+2] lists: Update mod_security rules for mailman3 [puppet] - 10https://gerrit.wikimedia.org/r/681244 (owner: 10Legoktm) [21:52:31] (03CR) 10Ladsgroup: [C: 04-1] "breaks uwsgi" [puppet] - 10https://gerrit.wikimedia.org/r/682760 (owner: 10Ladsgroup) [21:58:52] (03CR) 10Legoktm: "Verified with:" [puppet] - 10https://gerrit.wikimedia.org/r/681244 (owner: 10Legoktm) [22:11:28] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1006.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` [22:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:42] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [22:17:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:20:45] (03PS1) 10RobH: snapshot10[1-5] setup info [puppet] - 10https://gerrit.wikimedia.org/r/682765 (https://phabricator.wikimedia.org/T272509) [22:21:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:21:34] (03CR) 10RobH: [C: 03+2] snapshot10[1-5] setup info [puppet] - 10https://gerrit.wikimedia.org/r/682765 (https://phabricator.wikimedia.org/T272509) (owner: 10RobH) [22:24:16] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1006.eqiad.wmnet with reason: REIMAGE [22:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['snapshot1011.eqiad.wmnet'... [22:26:26] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1006.eqiad.wmnet with reason: REIMAGE [22:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:30] (03PS1) 10Dzahn: site: add peopleweb role to people1003 [puppet] - 10https://gerrit.wikimedia.org/r/682766 (https://phabricator.wikimedia.org/T280989) [22:39:20] (03PS1) 10Andrew Bogott: radosgw: enable => true to the service resource [puppet] - 10https://gerrit.wikimedia.org/r/682767 [22:39:26] (03CR) 10Dzahn: [C: 03+2] site: add peopleweb role to people1003 [puppet] - 10https://gerrit.wikimedia.org/r/682766 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [22:39:32] (03PS2) 10Dzahn: site: add peopleweb role to people1003 [puppet] - 10https://gerrit.wikimedia.org/r/682766 (https://phabricator.wikimedia.org/T280989) [22:42:05] (03PS2) 10Andrew Bogott: radosgw: enable => true to the service resource [puppet] - 10https://gerrit.wikimedia.org/r/682767 [22:42:48] (03CR) 10jerkins-bot: [V: 04-1] radosgw: enable => true to the service resource [puppet] - 
10https://gerrit.wikimedia.org/r/682767 (owner: 10Andrew Bogott) [22:46:47] (03PS3) 10Andrew Bogott: radosgw: enable => true to the service resource [puppet] - 10https://gerrit.wikimedia.org/r/682767 [22:50:32] (03PS1) 10Dzahn: peopleweb: set rsync_dst host to people1003 [puppet] - 10https://gerrit.wikimedia.org/r/682769 [22:50:34] (03CR) 10Andrew Bogott: [C: 03+2] radosgw: enable => true to the service resource [puppet] - 10https://gerrit.wikimedia.org/r/682767 (owner: 10Andrew Bogott) [22:50:59] (03CR) 10jerkins-bot: [V: 04-1] peopleweb: set rsync_dst host to people1003 [puppet] - 10https://gerrit.wikimedia.org/r/682769 (owner: 10Dzahn) [22:51:16] (03PS1) 10Dzahn: switch peopleweb backend to people1003 [dns] - 10https://gerrit.wikimedia.org/r/682770 [22:51:37] (03CR) 10Razzi: [C: 03+2] sqoop: switch to single grouped_wikis.csv [puppet] - 10https://gerrit.wikimedia.org/r/681498 (https://phabricator.wikimedia.org/T280549) (owner: 10Razzi) [22:58:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (10RobH) [23:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210426T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:06:13] (03PS1) 10Dzahn: DHCP: let people1003 use bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/682771 (https://phabricator.wikimedia.org/T280989) [23:06:57] (03PS2) 10Dzahn: DHCP: let people1003 use bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/682771 (https://phabricator.wikimedia.org/T280989) [23:07:49] (03CR) 10Dzahn: [C: 03+2] DHCP: let people1003 use bullseye installer [puppet] - 10https://gerrit.wikimedia.org/r/682771 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [23:17:48] PROBLEM - HTTPS-peopleweb on people1003 is CRITICAL: connect to address 10.64.0.155 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/People.wikimedia.org [23:18:14] (03CR) 10Krinkle: "This is now ready to go. The first change went out in 1.36.0-wmf.38, the second one in wmf/1.37.0-wmf.1, which has been out for a full wee" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677734 (owner: 10Krinkle) [23:18:18] (03PS5) 10Krinkle: mc: Remove unused mcrouterAware/cluster/coalesceKeys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677734 [23:20:16] PROBLEM - people.wikimedia.org requires authentication on people1003 is CRITICAL: connect to address 10.64.0.155 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:20:50] ^ eh, that's me. and wasnt supposed to happen but also not being used. 
fixing [23:21:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on people1003.eqiad.wmnet with reason: new host [23:21:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on people1003.eqiad.wmnet with reason: new host [23:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:04] ACKNOWLEDGEMENT - HTTPS-peopleweb on people1003 is CRITICAL: connect to address 10.64.0.155 and port 443: Connection refused daniel_zahn fresh install https://wikitech.wikimedia.org/wiki/People.wikimedia.org [23:23:04] ACKNOWLEDGEMENT - people.wikimedia.org requires authentication on people1003 is CRITICAL: connect to address 10.64.0.155 and port 443: Connection refused daniel_zahn fresh install https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:26:16] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:28:28] !log renewing TLS cert for peopleweb.discovery.wmnet, adding *3 hosts [23:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:34] (03PS1) 10Andrew Bogott: radosgw: rgw_swift_account_in_url = true [puppet] - 10https://gerrit.wikimedia.org/r/682775 [23:34:44] (03PS1) 10Dzahn: ssl: update cert for peopleweb.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/682776 (https://phabricator.wikimedia.org/T280989) [23:34:59] (03CR) 10Andrew Bogott: [C: 03+2] radosgw: rgw_swift_account_in_url = true [puppet] - 10https://gerrit.wikimedia.org/r/682775 (owner: 10Andrew Bogott) [23:35:02] (03PS2) 10Dzahn: ssl: update cert for peopleweb.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/682776 (https://phabricator.wikimedia.org/T280989) [23:35:21] (03PS2) 10Dzahn: peopleweb: set 
rsync_dst host to people1003 [puppet] - 10https://gerrit.wikimedia.org/r/682769 [23:36:42] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "openssl x509 -in peopleweb.discovery.wmnet.crt -text -noout | grep DNS" [puppet] - 10https://gerrit.wikimedia.org/r/682776 (https://phabricator.wikimedia.org/T280989) (owner: 10Dzahn) [23:37:12] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is OK: HTTP OK: HTTP/1.0 200 OK - 23641 bytes in 0.329 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:40:03] (03PS12) 10Mstyles: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) [23:40:17] (03CR) 10Mstyles: rdf-streaming-updater: create helmfile.d structure (036 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [23:42:04] (03CR) 10jerkins-bot: [V: 04-1] rdf-streaming-updater: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [23:57:26] (03CR) 10Mstyles: "not sure what's happening with the envoy_template linting error" [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles)