[00:01:41] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:13] SRE, Traffic, HTTPS, Performance-Team (Radar): Enable QUIC support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Bugreporter)
[00:05:46] SRE, Traffic, HTTPS, Performance-Team (Radar): Enable HTTP/3 (QUIC) support on Wikimedia servers - https://phabricator.wikimedia.org/T238034 (Bugreporter)
[00:09:15] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:20:03] RECOVERY - Check systemd state on elastic2046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:25:45] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[00:45:41] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[00:47:59] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:26] (PS1) Andrew Bogott: Trove: add policy.yaml override [puppet] - https://gerrit.wikimedia.org/r/684136 (https://phabricator.wikimedia.org/T281655)
[01:44:02] (CR) jerkins-bot: [V: -1] Trove: add policy.yaml override [puppet] -
https://gerrit.wikimedia.org/r/684136 (https://phabricator.wikimedia.org/T281655) (owner: Andrew Bogott)
[01:45:23] (PS2) Andrew Bogott: Trove: add policy.yaml override [puppet] - https://gerrit.wikimedia.org/r/684136 (https://phabricator.wikimedia.org/T281655)
[01:46:05] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[01:47:30] (CR) Andrew Bogott: [C: +2] Trove: add policy.yaml override [puppet] - https://gerrit.wikimedia.org/r/684136 (https://phabricator.wikimedia.org/T281655) (owner: Andrew Bogott)
[01:54:15] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:05:16] (PS1) Andrew Bogott: Horizon: install trove policy file for trove-dashboard [puppet] - https://gerrit.wikimedia.org/r/684137 (https://phabricator.wikimedia.org/T281655)
[02:10:49] (CR) Andrew Bogott: [C: +2] Horizon: install trove policy file for trove-dashboard [puppet] - https://gerrit.wikimedia.org/r/684137 (https://phabricator.wikimedia.org/T281655) (owner: Andrew Bogott)
[02:49:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:52:35] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[02:54:31] RECOVERY - Prometheus jobs reduced availability on
alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:55:33] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:03:49] PROBLEM - Host wdqs1013 is DOWN: PING CRITICAL - Packet loss = 100%
[03:04:35] RECOVERY - Host wdqs1013 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[03:24:51] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:51:41] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[03:54:15] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:44:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:49:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:54:49] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[05:32:09] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 37817808 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:37:05] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 610800 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:42:19] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:51] (CR) Elukey: [C: +1] "This can be merged anytime, so all the install servers will update way before our maintenance :)" [puppet] - https://gerrit.wikimedia.org/r/682785 (https://phabricator.wikimedia.org/T278423) (owner: Razzi)
[06:47:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:49:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:20:57] (CR) Giuseppe Lavagetto: [C: -1] "I'm not sure I get the point of this PS."
[puppet] - https://gerrit.wikimedia.org/r/683837 (owner: Jbond)
[07:29:45] (CR) Giuseppe Lavagetto: [C: +2] Change owner of /srv/patches to mwdeploy (from root) [puppet] - https://gerrit.wikimedia.org/r/683989 (https://phabricator.wikimedia.org/T245184) (owner: Ahmon Dancy)
[07:31:43] !log installing libimage-exiftool-perl security updates
[07:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:19] (CR) Majavah: [C: +1] "> Patch Set 2: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/683837 (owner: Jbond)
[07:50:04] (CR) Giuseppe Lavagetto: safe-service-restart: Only verify in scope services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/682619 (https://phabricator.wikimedia.org/T279100) (owner: Alexandros Kosiaris)
[07:58:23] SRE, ops-codfw, Discovery, Discovery-Search (Current work): elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (Gehel) Note that elastic2033 is using software RAID. The data should be on RAID0, but the root partition on RAID1.
[07:59:02] SRE, Wikimedia-Planet: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (MoritzMuehlenhoff) https://github.com/rubys/venus/issues/37 points to https://github.com/feedreader/pluto which is written in Ruby and still seems to be actively maintained.
[08:01:19] !log installing edk2 security updates
[08:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:12:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:13:05] SRE, ops-codfw, Discovery, Discovery-Search (Current work): elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (elukey) The other thing that may happen is that the mbr was installed only on one of the two disks of the RAID1, so now nothing boots. IIRC PXE w...
[08:15:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:37:55] (PS1) Gergő Tisza: Handle DB readonly errors [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/684078 (https://phabricator.wikimedia.org/T281382)
[08:46:24] (PS1) WMDE-Fisch: [beta] Enable new search feature for the template dialog [mediawiki-config] - https://gerrit.wikimedia.org/r/684284 (https://phabricator.wikimedia.org/T271802)
[08:51:29] (PS1) Giuseppe Lavagetto: safe-service-restart: only verify pooled services [puppet] - https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100)
[08:51:41] (CR) Lucas Werkmeister (WMDE): [C: +1] wikidata: post edit constraint jobs on 70% of edits [mediawiki-config] - https://gerrit.wikimedia.org/r/682608 (https://phabricator.wikimedia.org/T204031) (owner: Tonina Zhelyazkova)
[08:52:53] !log joal@deploy1002 Started deploy [analytics/refinery@584ed6a]: Hotfix analytics deploy (monthly sqoop) [analytics/refinery@584ed6a]
[08:53:00] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:00] (CR) jerkins-bot: [V: -1] safe-service-restart: only verify pooled services [puppet] - https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: Giuseppe Lavagetto)
[08:53:02] (CR) WMDE-Fisch: "This change is ready for review." [extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/684079 (https://phabricator.wikimedia.org/T281352) (owner: WMDE-Fisch)
[08:56:24] (PS2) WMDE-Fisch: [beta] Enable new search features for the template dialog [mediawiki-config] - https://gerrit.wikimedia.org/r/684284 (https://phabricator.wikimedia.org/T271802)
[08:57:35] (PS2) Giuseppe Lavagetto: safe-service-restart: only verify pooled services [puppet] - https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100)
[08:58:06] (PS3) Tonina Zhelyazkova: wikidata: post edit constraint jobs on 70% of edits [mediawiki-config] - https://gerrit.wikimedia.org/r/682608 (https://phabricator.wikimedia.org/T204031)
[08:58:36] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29345/console" [puppet] - https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: Giuseppe Lavagetto)
[08:59:01] (CR) jerkins-bot: [V: -1] safe-service-restart: only verify pooled services [puppet] - https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: Giuseppe Lavagetto)
[08:59:12] (CR) Arturo Borrero Gonzalez: [C: +2] wikimediacloud.org: add more FQDNs in prepartion for the cloudgw migration [dns] - https://gerrit.wikimedia.org/r/683855 (owner: Arturo Borrero Gonzalez)
[09:02:55] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29346/console" [puppet] - https://gerrit.wikimedia.org/r/684287
(https://phabricator.wikimedia.org/T279100) (owner: Giuseppe Lavagetto)
[09:05:19] (Abandoned) Muehlenhoff: Include grub::defaults unconditionally [puppet] - https://gerrit.wikimedia.org/r/505869 (https://phabricator.wikimedia.org/T140100) (owner: Muehlenhoff)
[09:07:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:08:57] (PS1) Volans: doc: fix sphinx warning in docstring [software/cumin] - https://gerrit.wikimedia.org/r/684295
[09:09:00] !log joal@deploy1002 Finished deploy [analytics/refinery@584ed6a]: Hotfix analytics deploy (monthly sqoop) [analytics/refinery@584ed6a] (duration: 16m 06s)
[09:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:10:12] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29347/console" [puppet] - https://gerrit.wikimedia.org/r/683997 (owner: Jbond)
[09:10:16] !log joal@deploy1002 Started deploy [analytics/refinery@584ed6a] (thin): Hotfix analytics deploy (monthly sqoop) THIN [analytics/refinery@584ed6a]
[09:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:22] (CR) Arturo Borrero Gonzalez: P::toolforge::mailrelay: support multiple domains (1 comment) [puppet] - https://gerrit.wikimedia.org/r/684032 (https://phabricator.wikimedia.org/T278109) (owner: Majavah)
[09:10:23] !log joal@deploy1002 Finished deploy [analytics/refinery@584ed6a] (thin): Hotfix analytics deploy (monthly sqoop) THIN [analytics/refinery@584ed6a] (duration: 00m 07s)
[09:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:46] (CR) Jbond: [V: +1 C: +2] P:pki::multirootca: use hardcoded sources for pki certs [puppet] - https://gerrit.wikimedia.org/r/683997 (owner: Jbond)
[09:12:28] !log joal@deploy1002 Started deploy [analytics/refinery@584ed6a] (hadoop-test): Hotfix analytics deploy (monthly sqoop) HADOOP-TEST [analytics/refinery@584ed6a]
[09:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:02] (CR) Filippo Giunchedi: "> Patch Set 1:" [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676383 (owner: Filippo Giunchedi)
[09:15:25] (PS1) Jbond: P:pki::multirootca: drop ca bundle file as its not used [puppet] - https://gerrit.wikimedia.org/r/684297
[09:16:04] (PS29) Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - https://gerrit.wikimedia.org/r/668380
(https://phabricator.wikimedia.org/T276442)
[09:16:06] (PS1) Jcrespo: backups: Fix typo on fileset name, resulting on no backups scheduled [puppet] - https://gerrit.wikimedia.org/r/684298 (https://phabricator.wikimedia.org/T281369)
[09:16:17] (PS2) Jcrespo: backups: Fix typo on fileset name, resulting on no backups scheduled [puppet] - https://gerrit.wikimedia.org/r/684298 (https://phabricator.wikimedia.org/T281369)
[09:16:19] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29348/console" [puppet] - https://gerrit.wikimedia.org/r/684297 (owner: Jbond)
[09:17:53] (CR) jerkins-bot: [V: -1] backups: Fix typo on fileset name, resulting on no backups scheduled [puppet] - https://gerrit.wikimedia.org/r/684298 (https://phabricator.wikimedia.org/T281369) (owner: Jcrespo)
[09:17:59] (CR) Jbond: [V: +1 C: +2] P:pki::multirootca: drop ca bundle file as its not used [puppet] - https://gerrit.wikimedia.org/r/684297 (owner: Jbond)
[09:18:01] (CR) Jcrespo: [C: -1] "This has a typo, plus some grep impact" [puppet] - https://gerrit.wikimedia.org/r/683676 (owner: Jbond)
[09:18:14] (CR) jerkins-bot: [V: -1] backups: Fix typo on fileset name, resulting on no backups scheduled [puppet] - https://gerrit.wikimedia.org/r/684298 (https://phabricator.wikimedia.org/T281369) (owner: Jcrespo)
[09:19:27] (PS1) Jcrespo: backups: Fix typo on fileset name, resulting on no backups scheduled [puppet] - https://gerrit.wikimedia.org/r/684300 (https://phabricator.wikimedia.org/T281369)
[09:19:37] (PS2) Jcrespo: backups: Fix typo on fileset name, resulting on no backups scheduled [puppet] - https://gerrit.wikimedia.org/r/684300 (https://phabricator.wikimedia.org/T281369)
[09:21:02] (CR) jerkins-bot: [V: -1] backups: Fix typo on fileset name, resulting on no backups scheduled [puppet] - https://gerrit.wikimedia.org/r/684300
(https://phabricator.wikimedia.org/T281369) (owner: Jcrespo)
[09:21:34] (PS1) Filippo Giunchedi: pontoon: add hosts_for_role function [puppet] - https://gerrit.wikimedia.org/r/684301
[09:22:17] (CR) Filippo Giunchedi: [C: +2] "Merging right away since this is only a cherry-pick in production" [puppet] - https://gerrit.wikimedia.org/r/684301 (owner: Filippo Giunchedi)
[09:22:37] (PS3) Jcrespo: backups: Fix typo on fileset name, resulting on no backups scheduled [puppet] - https://gerrit.wikimedia.org/r/684298 (https://phabricator.wikimedia.org/T281369)
[09:23:04] (CR) Volans: [C: +2] doc: fix sphinx warning in docstring [software/cumin] - https://gerrit.wikimedia.org/r/684295 (owner: Volans)
[09:24:08] (PS3) Jbond: P::envoy: allow users to run tlsproxy without service proxy [puppet] - https://gerrit.wikimedia.org/r/683837 (https://phabricator.wikimedia.org/T277990)
[09:25:26] (PS3) Majavah: P::toolforge::mailrelay: support multiple domains [puppet] - https://gerrit.wikimedia.org/r/684032 (https://phabricator.wikimedia.org/T278109)
[09:25:52] (PS2) Filippo Giunchedi: hieradata: introduce 'public_domain' variable [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676383
[09:25:54] (PS4) Filippo Giunchedi: wmflib: add role/public_endpoint to wmflib::service [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676385
[09:25:56] (PS4) Filippo Giunchedi: pontoon: enable sso for alerts in cloud [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676386
[09:25:58] (PS4) Filippo Giunchedi: pontoon: use public_domain for alerts/icinga [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676387
[09:26:00] (PS5) Filippo Giunchedi: pontoon: introduce public_certs [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676388
[09:26:02] (PS5) Filippo Giunchedi: pontoon: add public LB class [puppet]
(sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676389
[09:26:04] (PS8) Filippo Giunchedi: role: add pontoon::frontend role/profile [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676390
[09:26:05] sorry for the spam ^
[09:26:06] (PS8) Filippo Giunchedi: hieradata: add public o11y services to service::catalog [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676391
[09:26:08] (CR) Majavah: P::toolforge::mailrelay: support multiple domains (1 comment) [puppet] - https://gerrit.wikimedia.org/r/684032 (https://phabricator.wikimedia.org/T278109) (owner: Majavah)
[09:26:24] (CR) Jbond: "> Patch Set 2: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/683837 (https://phabricator.wikimedia.org/T277990) (owner: Jbond)
[09:28:42] (Merged) jenkins-bot: doc: fix sphinx warning in docstring [software/cumin] - https://gerrit.wikimedia.org/r/684295 (owner: Volans)
[09:32:50] (CR) Jcrespo: [C: -1] "This would need a deeper bacula classes refactor."
[puppet] - https://gerrit.wikimedia.org/r/683675 (owner: Jbond)
[09:35:55] (CR) Arturo Borrero Gonzalez: [C: +1] hieradata: introduce 'public_domain' variable [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676383 (owner: Filippo Giunchedi)
[09:36:32] SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (MoritzMuehlenhoff)
[09:36:44] (CR) Jbond: [C: +1] "lgtm thanks" (1 comment) [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676383 (owner: Filippo Giunchedi)
[09:38:39] (CR) Jbond: "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/683675 (owner: Jbond)
[09:38:53] (Abandoned) Jbond: P:backup::host: add sets parameter [puppet] - https://gerrit.wikimedia.org/r/683675 (owner: Jbond)
[09:39:07] (Abandoned) Jbond: O:pki::root: move backup sets to hiera [puppet] - https://gerrit.wikimedia.org/r/683676 (owner: Jbond)
[09:40:18] (CR) Jbond: [C: +1] "lgtm thanks" [puppet] - https://gerrit.wikimedia.org/r/684298 (https://phabricator.wikimedia.org/T281369) (owner: Jcrespo)
[09:41:29] (CR) Filippo Giunchedi: hieradata: introduce 'public_domain' variable (1 comment) [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676383 (owner: Filippo Giunchedi)
[09:41:52] !log joal@deploy1002 Finished deploy [analytics/refinery@584ed6a] (hadoop-test): Hotfix analytics deploy (monthly sqoop) HADOOP-TEST [analytics/refinery@584ed6a] (duration: 29m 24s)
[09:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:50] !log installing python3.7 security updates
[09:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:21] (PS3) Filippo Giunchedi: hieradata: introduce 'public_domain' variable [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676383
[09:43:23] (PS5) Filippo Giunchedi: wmflib: add
role/public_endpoint to wmflib::service [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676385
[09:44:40] (PS1) Filippo Giunchedi: hieradata: introduce 'public_domain' variable [puppet] - https://gerrit.wikimedia.org/r/684309
[09:45:51] (CR) Filippo Giunchedi: [C: +2] hieradata: introduce 'public_domain' variable [puppet] (sandbox/filippo/pontoon-o11y) - https://gerrit.wikimedia.org/r/676383 (owner: Filippo Giunchedi)
[09:48:57] (CR) Jcrespo: "Jbond: one thing that could be done now, and it is "expected/documented" is to put the "include profile::backup::host" on the role (althou" [puppet] - https://gerrit.wikimedia.org/r/684298 (https://phabricator.wikimedia.org/T281369) (owner: Jcrespo)
[09:49:16] (CR) Jcrespo: [C: +2] backups: Fix typo on fileset name, resulting on no backups scheduled [puppet] - https://gerrit.wikimedia.org/r/684298 (https://phabricator.wikimedia.org/T281369) (owner: Jcrespo)
[09:51:26] (CR) Filippo Giunchedi: [C: +2] "Merging right away since this only a cherry pick in production" [puppet] - https://gerrit.wikimedia.org/r/684309 (owner: Filippo Giunchedi)
[09:51:28] (CR) Arturo Borrero Gonzalez: wmcs.drain_hypervisor: skip all VMs in the canary project (1 comment) [puppet] - https://gerrit.wikimedia.org/r/683857 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[09:54:21] (CR) David Caro: wmcs.drain_hypervisor: skip all VMs in the canary project (1 comment) [puppet] - https://gerrit.wikimedia.org/r/683857 (https://phabricator.wikimedia.org/T280641) (owner: David Caro)
[09:54:41] (CR) Awight: [C: +1] "Looks right. I'd be scared to trim it down any further."
[extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/684079 (https://phabricator.wikimedia.org/T281352) (owner: WMDE-Fisch)
[09:57:53] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:58:07] (CR) Jbond: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/684298 (https://phabricator.wikimedia.org/T281369) (owner: Jcrespo)
[09:59:54] ^ jbond42 that's the pki backups running!! :-)
[10:00:35] jynus: great thanks
[10:13:29] (PS1) JMeybohm: Rename configcluster_stretch to configcluster in hiera [puppet] - https://gerrit.wikimedia.org/r/684315 (https://phabricator.wikimedia.org/T271573)
[10:13:31] (PS1) JMeybohm: Remove unused profile::etcd and related classes [puppet] - https://gerrit.wikimedia.org/r/684316 (https://phabricator.wikimedia.org/T271573)
[10:14:51] (PS2) JMeybohm: Rename configcluster_stretch to configcluster in hiera [puppet] - https://gerrit.wikimedia.org/r/684315 (https://phabricator.wikimedia.org/T271573)
[10:14:53] (PS2) JMeybohm: Remove unused profile::etcd and related classes [puppet] - https://gerrit.wikimedia.org/r/684316 (https://phabricator.wikimedia.org/T271573)
[10:15:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:19:27] SRE, ops-codfw, DC-Ops, decommission-hardware, serviceops: decommission conf200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T281374 (JMeybohm) a:JMeybohm→Papaul
[10:25:11] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29349/console" [puppet] - https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: Giuseppe Lavagetto)
[10:26:19] (PS1) Arturo Borrero Gonzalez: Drop 56.15.185.in-addr.arpa zone [dns] -
https://gerrit.wikimedia.org/r/684320
[10:26:47] (PS3) Giuseppe Lavagetto: safe-service-restart: only verify pooled services [puppet] - https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100)
[10:27:03] (PS2) Majavah: Add grafana-cloud.{wm.o,d.wmnet} to replace labs [dns] - https://gerrit.wikimedia.org/r/684099
[10:28:07] (CR) jerkins-bot: [V: -1] safe-service-restart: only verify pooled services [puppet] - https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: Giuseppe Lavagetto)
[10:28:39] (CR) Arturo Borrero Gonzalez: [C: +2] Drop 56.15.185.in-addr.arpa zone [dns] - https://gerrit.wikimedia.org/r/684320 (owner: Arturo Borrero Gonzalez)
[10:30:05] jan_drewniak: (Dis)respected human, time to deploy Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210503T1030). Please do the needful.
[10:31:54] (PS1) Jbond: (WIP): add function to test if we are doing the initial puppet run [puppet] - https://gerrit.wikimedia.org/r/684321
[10:34:47] (CR) Majavah: [C: -1] "The current implementation does not work for most Cloud VPS projects where there is no per-project puppetmaster involved. I imagine most p" [puppet] - https://gerrit.wikimedia.org/r/684321 (owner: Jbond)
[10:34:56] (PS1) Jbond: hiera - cloud: add cert to test build process [puppet] - https://gerrit.wikimedia.org/r/684322
[10:35:25] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/684323 (https://phabricator.wikimedia.org/T128546)
[10:38:56] SRE, Commons, Tools, Wikimedia-Mailing-lists: daily-image-l stopped sending on 2020-10-11 - https://phabricator.wikimedia.org/T265568 (jcrespo) Open→Resolved I am going to assume this is resolved, due to old age. Reopen if this is still happening.
[10:38:58] (CR) Jbond: "> Patch Set 1: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/684321 (owner: Jbond)
[10:39:24] (CR) Jbond: [C: +2] hiera - cloud: add cert to test build process [puppet] - https://gerrit.wikimedia.org/r/684322 (owner: Jbond)
[10:39:41] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/684323 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:40:57] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/684323 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:42:24] (PS1) Gergő Tisza: GrowthExperiments: enable link recommendations backend on cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/684327 (https://phabricator.wikimedia.org/T278710)
[10:46:29] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:684302| Bumping portals to master (T128546)]] (duration: 00m 58s)
[10:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:38] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[10:46:40] (PS1) Gergő Tisza: GrowthExperiments: enable link recommendations frontend on cswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/684331 (https://phabricator.wikimedia.org/T278710)
[10:47:27] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:684302| Bumping portals to master (T128546)]] (duration: 00m 57s)
[10:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:45] (PS1) Volans: CHANGELOG: add changelogs for release v4.1.0 [software/cumin] - https://gerrit.wikimedia.org/r/684332
[10:54:23] (CR) Volans: [C: +2] CHANGELOG: add changelogs for release v4.1.0 [software/cumin] - https://gerrit.wikimedia.org/r/684332
(owner: Volans)
[10:55:56] (PS1) Hashar: [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - https://gerrit.wikimedia.org/r/684336
[10:56:55] (CR) Urbanecm: [C: +2] "backport window" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/684078 (https://phabricator.wikimedia.org/T281382) (owner: Gergő Tisza)
[10:57:17] (CR) Urbanecm: [C: +2] "backport window" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/684080 (owner: Gergő Tisza)
[10:59:17] !log installing avahi security updates on buster
[10:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210503T1100).
[11:00:04] Tonina_WMDE, CFisch_WMDE, and tgr: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:14] i can deploy today
[11:00:17] o/
[11:00:26] o/ I'll self-serve but only be around in the second half of the window.
[11:00:48] o/
[11:00:50] o/
[11:00:54] tgr_: ack.
[11:01:04] (CR) Urbanecm: [C: +2] wikidata: post edit constraint jobs on 70% of edits [mediawiki-config] - https://gerrit.wikimedia.org/r/682608 (https://phabricator.wikimedia.org/T204031) (owner: Tonina Zhelyazkova)
[11:01:08] (CR) Thiemo Kreuz (WMDE): [C: +1] Fix settings dialog offering ReferencePreviews when unavailable [extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/684079 (https://phabricator.wikimedia.org/T281352) (owner: WMDE-Fisch)
[11:01:21] (CR) Urbanecm: [C: +2] Fix settings dialog offering ReferencePreviews when unavailable [extensions/Popups] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/684079 (https://phabricator.wikimedia.org/T281352) (owner: WMDE-Fisch)
[11:01:28] (Merged) jenkins-bot: CHANGELOG: add changelogs for release v4.1.0 [software/cumin] - https://gerrit.wikimedia.org/r/684332 (owner: Volans)
[11:01:51] (Merged) jenkins-bot: wikidata: post edit constraint jobs on 70% of edits [mediawiki-config] - https://gerrit.wikimedia.org/r/682608 (https://phabricator.wikimedia.org/T204031) (owner: Tonina Zhelyazkova)
[11:02:53] Tonina_WMDE: I assume your patch cannot be actually tested, right?
[11:03:12] no, I don't think it can
[11:03:13] (it's on mwdebug1001 anyway)
[11:03:17] okay, syncing :)
[11:04:41] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f1a5ef0116c77b86b1abfb7bfa7d4ed363c69f61: wikidata: post edit constraint jobs on 70% of edits (T204031) (duration: 00m 57s)
[11:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:49] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031
[11:05:00] Tonina_WMDE: should be live :)
[11:05:20] SRE, Commons, Tools, Wikimedia-Mailing-lists: daily-image-l stopped sending on 2020-10-11 - https://phabricator.wikimedia.org/T265568 (RhinosF1) >>!
In T265568#7052296, @jcrespo wrote: > I am going to assume this is resolved, due to old age. Reopen if this is still happening. https://lists.wikime... [11:05:21] (03PS2) 10Urbanecm: Set wgGEMentorshipMigrationStage to SCHEMA_COMPAT_NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683430 (https://phabricator.wikimedia.org/T279853) [11:05:32] (03CR) 10Urbanecm: [C: 03+2] Set wgGEMentorshipMigrationStage to SCHEMA_COMPAT_NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683430 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [11:06:30] (03Merged) 10jenkins-bot: Set wgGEMentorshipMigrationStage to SCHEMA_COMPAT_NEW everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683430 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [11:06:35] (03PS1) 10Muehlenhoff: Add library hint for avahi [puppet] - 10https://gerrit.wikimedia.org/r/684337 [11:06:45] thanks Urbanecm :) [11:06:51] any time :) [11:08:25] (03Merged) 10jenkins-bot: Handle DB readonly errors [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/684078 (https://phabricator.wikimedia.org/T281382) (owner: 10Gergő Tisza) [11:08:28] (03Merged) 10jenkins-bot: refreshLinkRecommendations.php: Use per-wiki locks [extensions/GrowthExperiments] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/684080 (owner: 10Gergő Tisza) [11:08:31] (03Merged) 10jenkins-bot: Fix settings dialog offering ReferencePreviews when unavailable [extensions/Popups] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/684079 (https://phabricator.wikimedia.org/T281352) (owner: 10WMDE-Fisch) [11:09:26] (03PS1) 10Volans: Upstream release v4.1.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/684340 [11:10:05] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for avahi [puppet] - 10https://gerrit.wikimedia.org/r/684337 (owner: 10Muehlenhoff) [11:11:10] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 
c5a7c67b4daf33e0f9aaabec3f35ab6d4184894b: Set wgGEMentorshipMigrationStage to SCHEMA_COMPAT_NEW everywhere (T279853) (duration: 00m 57s) [11:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:19] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [11:11:54] CFisch_WMDE: your patch is on mwdebug1001 [11:12:08] * CFisch_WMDE testing [11:13:12] Urbanecm: All good, go on! [11:13:16] syncing [11:15:00] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/Popups/: a438b641c81fa16faba287407012beaff8b1f3ba: Fix settings dialog offering ReferencePreviews when unavailable (T281352) (duration: 00m 58s) [11:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:08] T281352: Broken settings dialogue for reference previews when in conflict with a skin/gadget - https://phabricator.wikimedia.org/T281352 [11:15:09] should be live CFisch_WMDE [11:16:04] so, unless someone has anything else, i think tgr|away can deploy his patches when available. [11:16:18] All good. Thanks again Urbanecm! 
:-) [11:16:24] any time :) [11:18:34] (03CR) 10Volans: [C: 03+2] Upstream release v4.1.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/684340 (owner: 10Volans) [11:23:44] Its a Tonina_WMDE =o [11:24:41] (03Merged) 10jenkins-bot: Upstream release v4.1.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/684340 (owner: 10Volans) [11:34:22] (03PS4) 10Giuseppe Lavagetto: safe-service-restart: only verify pooled services [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) [11:35:27] 10SRE: Integrate Buster 10.9 point update - https://phabricator.wikimedia.org/T279054 (10MoritzMuehlenhoff) [11:35:42] (03CR) 10jerkins-bot: [V: 04-1] safe-service-restart: only verify pooled services [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto) [11:35:58] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29351/console" [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto) [11:44:40] (03PS1) 10Arturo Borrero Gonzalez: wikimediacloud.org: add cloudsw addresses in vlan 1120 [dns] - 10https://gerrit.wikimedia.org/r/684353 [11:46:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/683551 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [11:51:33] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] [beta] Enable new search features for the template dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684284 (https://phabricator.wikimedia.org/T271802) (owner: 10WMDE-Fisch) [11:56:45] !log kharlan@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/GrowthExperiments: Backport: [[gerrit:684080|refreshLinkRecommendations.php: Use per-wiki locks]] [[gerrit:684078|Handle DB readonly errors (T281382)]] (duration: 00m 58s) [11:56:53] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:54] T281382: Make sure all GrowthExperiments DB writes handle readonly mode well - https://phabricator.wikimedia.org/T281382 [12:02:31] (03PS2) 10Kosta Harlan: GrowthExperiments: enable link recommendations backend on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684327 (https://phabricator.wikimedia.org/T278710) (owner: 10Gergő Tisza) [12:03:00] (03CR) 10Kosta Harlan: [C: 03+2] GrowthExperiments: enable link recommendations backend on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684327 (https://phabricator.wikimedia.org/T278710) (owner: 10Gergő Tisza) [12:03:43] (03Merged) 10jenkins-bot: GrowthExperiments: enable link recommendations backend on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684327 (https://phabricator.wikimedia.org/T278710) (owner: 10Gergő Tisza) [12:07:40] !log kharlan@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:684327|GrowthExperiments: enable link recommendations backend on cswiki (T278710)]] (duration: 00m 57s) [12:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:48] T278710: Add a link: production deployment - https://phabricator.wikimedia.org/T278710 [12:08:27] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:09:03] (03PS2) 10Kosta Harlan: GrowthExperiments: enable link recommendations frontend on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684331 (https://phabricator.wikimedia.org/T278710) (owner: 10Gergő Tisza) [12:09:17] (03CR) 10Kosta Harlan: [C: 03+2] GrowthExperiments: enable link recommendations frontend on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684331 (https://phabricator.wikimedia.org/T278710) (owner: 10Gergő Tisza) [12:10:19] (03Merged) 10jenkins-bot: GrowthExperiments: 
enable link recommendations frontend on cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684331 (https://phabricator.wikimedia.org/T278710) (owner: 10Gergő Tisza) [12:19:08] (03PS1) 10Gergő Tisza: GrowthExperiments: Set default variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684378 [12:19:42] (03PS2) 10Gergő Tisza: GrowthExperiments: Set default variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684378 (https://phabricator.wikimedia.org/T278123) [12:19:53] the deploy window is running over a bit [12:22:41] (03CR) 10Kosta Harlan: [C: 03+2] GrowthExperiments: Set default variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684378 (https://phabricator.wikimedia.org/T278123) (owner: 10Gergő Tisza) [12:23:31] (03Merged) 10jenkins-bot: GrowthExperiments: Set default variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684378 (https://phabricator.wikimedia.org/T278123) (owner: 10Gergő Tisza) [12:33:21] !log kharlan@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:684378|GrowthExperiments: Set default variant (T278123)]] [[gerrit:684331|GrowthExperiments: enable link recommendations frontend on cswiki (T278710)]] (duration: 00m 57s) [12:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:31] T278710: Add a link: production deployment - https://phabricator.wikimedia.org/T278710 [12:33:31] T278123: Provide capability for A/B testing task types - https://phabricator.wikimedia.org/T278123 [12:35:41] tgr|away and I are done with backporting for now [12:36:08] !log Backport window done [12:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:00] (03PS1) 10Majavah: beta: Use upload.wikimedia.beta.wmflabs.o for uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684381 (https://phabricator.wikimedia.org/T281650) [12:43:57] (03PS3) 10Majavah: Add grafana-cloud.{wm.o,d.wmnet} to replace labs [dns] - 
10https://gerrit.wikimedia.org/r/684099 [12:45:49] (03PS1) 10Majavah: beta: Switch to deployment-urldownloader03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684384 [12:47:51] (03PS9) 10Majavah: etcd: Use cfssl for peer-to-peer communication [puppet] - 10https://gerrit.wikimedia.org/r/674077 [12:53:42] (03PS16) 10DCausse: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [12:53:44] (03PS7) 10DCausse: rdf-streaming-updater: enable HA capability [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: 10Mstyles) [12:53:46] (03PS6) 10DCausse: rdf-streaming-updater: use session mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: 10Mstyles) [12:54:36] 10SRE, 10ops-codfw, 10DC-Ops, 10observability, 10User-fgiunchedi: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (10fgiunchedi) I did some work on this last week, there's temporary patches on netmon1002 to get things going at least minimally and collect voltage/current/power/e... 
[13:10:47] !log Run `User::newSystemUser( 'Maintenance script', [ 'steal' => true ] )` on cswiki to make the user a proper system user (T281703) [13:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:55] T281703: TypeError: Argument 1 passed to GrowthExperiments\NewcomerTasks\TaskSuggester\CacheDecorator::suggest() must implement interface MediaWiki\User\UserIdentity, null given, called in /srv/mediawiki/php-1.37.0-wmf.3/extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php on line 170 - https://phabricator.wikimedia.org/T281703 [13:14:54] (03PS1) 10Jdrewniak: Hotfix: loadRelatedArticles should consider existence of container element [extensions/RelatedArticles] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/684393 (https://phabricator.wikimedia.org/T281547) [13:20:07] (03PS1) 10Jbond: wmflib: add new fact to puppet_config [puppet] - 10https://gerrit.wikimedia.org/r/684394 [13:20:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29352/console" [puppet] - 10https://gerrit.wikimedia.org/r/684394 (owner: 10Jbond) [13:21:24] (03PS2) 10Jbond: (WIP): add function to test if we are doing the initial puppet run [puppet] - 10https://gerrit.wikimedia.org/r/684321 [13:22:02] (03PS3) 10Jbond: (WIP): add function to test if we are doing the initial puppet run [puppet] - 10https://gerrit.wikimedia.org/r/684321 [13:23:23] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib: add new fact to puppet_config [puppet] - 10https://gerrit.wikimedia.org/r/684394 (owner: 10Jbond) [13:39:42] (03PS1) 10Esanders: Make DT's source mode toolbar available as beta on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684404 (https://phabricator.wikimedia.org/T279124) [13:43:21] !log uploaded cumin_4.1.0 to apt.wikimedia.org buster-wikimedia,bullseye-wikimedia [13:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:15] 
(03PS1) 10Jbond: hiera - cloud: add none existing class to test rebuild [puppet] - 10https://gerrit.wikimedia.org/r/684406 [13:45:26] (03CR) 10Jbond: [C: 03+2] hiera - cloud: add none existing class to test rebuild [puppet] - 10https://gerrit.wikimedia.org/r/684406 (owner: 10Jbond) [13:49:23] (03CR) 10JMeybohm: "Thanks for your review!" [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) (owner: 10JMeybohm) [13:50:24] (03PS3) 10WMDE-Fisch: [beta] Enable new search features for the template dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684284 (https://phabricator.wikimedia.org/T271802) [13:51:38] (03CR) 10Jbond: [C: 04-1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/684321 (owner: 10Jbond) [13:52:11] FYI: Merging a labs only config patch. [13:52:30] (03CR) 10WMDE-Fisch: [C: 03+2] [beta] Enable new search features for the template dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684284 (https://phabricator.wikimedia.org/T271802) (owner: 10WMDE-Fisch) [13:53:13] (03Merged) 10jenkins-bot: [beta] Enable new search features for the template dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684284 (https://phabricator.wikimedia.org/T271802) (owner: 10WMDE-Fisch) [13:57:33] (03PS4) 10JMeybohm: kube-apiserver: Update admission controller config [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) [13:59:49] (03PS1) 10Hashar: [WMF] script to build our plugins [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684411 [14:01:14] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29353/console" [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) (owner: 10JMeybohm) [14:05:26] (03CR) 10Muehlenhoff: "Looks fine, but better fold his into 683551 from the start? If e.g. 
a revert is needed, then it all happens in one go and it's also cleare" [puppet] - 10https://gerrit.wikimedia.org/r/684315 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [14:06:33] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29354/console" [puppet] - 10https://gerrit.wikimedia.org/r/683916 (https://phabricator.wikimedia.org/T262847) (owner: 10Ottomata) [14:08:13] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29356/console" [puppet] - 10https://gerrit.wikimedia.org/r/683916 (https://phabricator.wikimedia.org/T262847) (owner: 10Ottomata) [14:09:13] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29355/console" [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) (owner: 10JMeybohm) [14:09:27] (03CR) 10JMeybohm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/684315 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [14:10:03] (03PS2) 10Ottomata: Remove SWAP / virtualenv based jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/683916 (https://phabricator.wikimedia.org/T262847) [14:10:06] (03PS5) 10JMeybohm: kube-apiserver: Update admission controller config [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) [14:10:12] (03CR) 10Muehlenhoff: Remove unused profile::etcd and related classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/684316 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [14:11:34] (03PS3) 10JMeybohm: Rename role configcluster_stretch to configcluster [puppet] - 10https://gerrit.wikimedia.org/r/683551 (https://phabricator.wikimedia.org/T271573) [14:11:36] (03PS3) 10JMeybohm: Remove unused profile::etcd and related classes [puppet] - 
10https://gerrit.wikimedia.org/r/684316 (https://phabricator.wikimedia.org/T271573) [14:12:25] (03PS2) 10Hashar: [WMF] register our plugins as submodules [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684336 [14:12:27] (03PS2) 10Hashar: [WMF] script to build our plugins [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684411 [14:12:49] (03Abandoned) 10JMeybohm: Rename configcluster_stretch to configcluster in hiera [puppet] - 10https://gerrit.wikimedia.org/r/684315 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [14:17:36] (03PS3) 10Hashar: [WMF] script to build our plugins [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684411 [14:18:58] (03PS6) 10JMeybohm: kube-apiserver: Update admission controller config [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) [14:23:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:25:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:27:21] !log uploaded conftool_1.3.1 to apt.wikimedia.org bullseye-wikimedia [14:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:43] (03CR) 10Ottomata: [C: 03+2] Remove SWAP / virtualenv based jupyterhub [puppet] - 10https://gerrit.wikimedia.org/r/683916 (https://phabricator.wikimedia.org/T262847) (owner: 10Ottomata) [14:29:06] (03PS5) 10Giuseppe Lavagetto: safe-service-restart: only verify pooled services [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) [14:29:36] (03PS1) 10Jbond: P:gitlab: install gitlab-ce [puppet] - 10https://gerrit.wikimedia.org/r/684418 [14:29:39] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-awight-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:38] (03CR) 10jerkins-bot: [V: 04-1] safe-service-restart: only verify pooled services [puppet] - 10https://gerrit.wikimedia.org/r/684287 (https://phabricator.wikimedia.org/T279100) (owner: 10Giuseppe Lavagetto) [14:31:56] (03CR) 10jerkins-bot: [V: 04-1] P:gitlab: install gitlab-ce [puppet] - 10https://gerrit.wikimedia.org/r/684418 (owner: 10Jbond) [14:31:56] (03PS2) 10Jbond: P:gitlab: install gitlab-ce [puppet] - 10https://gerrit.wikimedia.org/r/684418 (https://phabricator.wikimedia.org/T279545) [14:33:17] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-dsaez-singleuser.service,jupyter-ebernhardson-singleuser.service,jupyter-mneisler-singleuser.service,jupyter-neilpquinn-wmf-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:17] (03CR) 10Giuseppe Lavagetto: [C: 04-1] modules::conftool add safe-service-restart scap option (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/682141 (https://phabricator.wikimedia.org/T266055) (owner: 10Effie Mouzeli) [14:34:19] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/683551 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [14:34:25] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-dsaez-singleuser.service,jupyter-ebernhardson-singleuser.service,jupyter-zpapierski-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:33] (03CR) 10Jbond: [C: 03+2] P:gitlab: install gitlab-ce [puppet] - 10https://gerrit.wikimedia.org/r/684418 (https://phabricator.wikimedia.org/T279545) (owner: 10Jbond) [14:34:41] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-aarora-singleuser.service,jupyter-dcausse-singleuser.service,jupyter-fdans-singleuser.service,jupyter-mmiller-singleuser.service,jupyter-mstyles-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:09] ottomata: --^ [14:35:23] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-christinedk-singleuser.service,jupyter-joal-singleuser.service,jupyter-piccardi-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:24] strange, i stopped them. [14:35:30] why is that degraded? [14:35:35] they are ephemeral units [14:35:37] anyway in progress, sorry [14:35:44] am about to remove them too, but stopping each one took a while...
[14:36:22] ottomata: they are all listed as failed, I can do a quick pass and reset-fail them [14:36:42] i'm in there now, will do as soon as these other 3 finish stopping [14:37:59] ok all reset-failed [14:38:01] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:07] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:09] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:25] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:09] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:31] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Rename role configcluster_stretch to configcluster [puppet] - 10https://gerrit.wikimedia.org/r/683551 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [14:42:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove unused profile::etcd and related classes [puppet] - 10https://gerrit.wikimedia.org/r/684316 (https://phabricator.wikimedia.org/T271573) (owner: 10JMeybohm) [14:53:43] (03PS7) 10JMeybohm: kube-apiserver: Update admission controller config [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) [14:55:33] (03PS1) 10Mforns: analytics:refinery:job:test:data_purge: remove -skipTrash from drop_event [puppet] - 10https://gerrit.wikimedia.org/r/684427 (https://phabricator.wikimedia.org/T273789) [14:56:02] (03PS8) 10JMeybohm: kube-apiserver: Update admission controller config [puppet] - 
10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) [14:57:42] (03CR) 10Ahmon Dancy: "Thanks Dzahn and Joe!" [puppet] - 10https://gerrit.wikimedia.org/r/683989 (https://phabricator.wikimedia.org/T245184) (owner: 10Ahmon Dancy) [15:00:58] (03CR) 10Ottomata: [C: 03+2] analytics:refinery:job:test:data_purge: remove -skipTrash from drop_event [puppet] - 10https://gerrit.wikimedia.org/r/684427 (https://phabricator.wikimedia.org/T273789) (owner: 10Mforns) [15:01:21] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399 (10RobH) [15:09:31] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10RobH) [15:12:33] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:14:31] I'm upgrading a couple of mailing lists now [15:15:09] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. 
https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [15:27:26] !log upgrade group A to mailman3 (T280322) [15:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:34] T280322: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 [15:44:07] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [15:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:03] (03PS1) 10Jbond: gitlab_sshd_macs: Fix type [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/684434 [15:48:25] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:36] (03PS2) 10Jbond: gitlab_sshd_macs: Fix type [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/684434 [15:55:58] (03PS1) 10Ryan Kemper: wdqs: shift 1 codfw internal host to codfw public [puppet] - 10https://gerrit.wikimedia.org/r/684435 (https://phabricator.wikimedia.org/T281498) [15:58:28] PROBLEM - MariaDB memory on clouddb1013 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (6246) = 66.4% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:59:00] ^ andrewbogott bstorm [15:59:18] interesting... [15:59:33] I think there is a ticket about this, I was told by manuel [15:59:56] not necessarily an issue, but a monitoring issue, but do not know much really [16:00:06] Yeah. That went to "admins" [16:00:10] for one [16:00:19] and then there's the memory issue :) [16:01:14] It looks like it has a lot of free memory at the moment... [16:03:17] Ah, no it doesn't.
[16:03:27] That's what I get for reading stuff like that during a meeting [16:04:04] bstorm: o/ Razz*i and Manuel reviewed the alarm for clouddb1021, and IIRC we decided that it was not that useful for multi-instance, so we added hiera lookups to raise the thresholds [16:04:30] Thanks for the info :) I'll make a ticket to dig a bit deeper today [16:04:35] ack :) [16:05:24] The dbs aren't doing much of anything [16:05:53] bstorm, @ meeting, maybe I can talk to you when I finish, as I know some of the issues [16:06:13] 👍🏻 [16:07:28] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/ [16:09:01] Ah good. The alert did go to the wmcs thingy as well :) [16:09:01] (03PS1) 10Gergő Tisza: [beta] GrowthExperiments: make link recommendations default in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684436 [16:12:52] (03CR) 10CDanis: [C: 03+1] wdqs: shift 1 codfw internal host to codfw public [puppet] - 10https://gerrit.wikimedia.org/r/684435 (https://phabricator.wikimedia.org/T281498) (owner: 10Ryan Kemper) [16:14:24] (03PS1) 10Jbond: C:gitlab::ssh: add new gilab::ssh class [puppet] - 10https://gerrit.wikimedia.org/r/684437 [16:14:27] (03PS1) 10Jbond: P:gitlab: add ability to manage gitlab sshd instance [puppet] - 10https://gerrit.wikimedia.org/r/684438 (https://phabricator.wikimedia.org/T276148) [16:15:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29360/console" [puppet] - 10https://gerrit.wikimedia.org/r/684438 (https://phabricator.wikimedia.org/T276148) (owner: 10Jbond) [16:15:56] (03PS1) 10Jbond: O:gitlab: manage sshd config [puppet] - 10https://gerrit.wikimedia.org/r/684439 (https://phabricator.wikimedia.org/T276148) [16:16:52] (03CR) 10Jbond: [V: 03+1] 
"PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29361/console" [puppet] - 10https://gerrit.wikimedia.org/r/684439 (https://phabricator.wikimedia.org/T276148) (owner: 10Jbond) [16:19:27] (03CR) 10Herron: "> Patch Set 4:" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/677002 (https://phabricator.wikimedia.org/T224565) (owner: 10Herron) [16:19:37] !log legoktm@lists1001:~$ sudo apt install default-mysql-client # for temporary debugging [16:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:07] (03PS1) 10Hashar: [WMF] Add XDG_CACHE_HOME to tools/download_file.py [software/gerrit] (wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/684440 [16:23:27] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: shift 1 codfw internal host to codfw public [puppet] - 10https://gerrit.wikimedia.org/r/684435 (https://phabricator.wikimedia.org/T281498) (owner: 10Ryan Kemper) [16:24:11] (03PS6) 10Jbond: P:gitlab: Deploy acme chief certificate [puppet] - 10https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) [16:26:17] (03CR) 10Jbond: [C: 03+2] P:gitlab: Deploy acme chief certificate [puppet] - 10https://gerrit.wikimedia.org/r/670427 (https://phabricator.wikimedia.org/T276673) (owner: 10Jbond) [16:27:32] !log ryankemper@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=wdqs2004.codfw.wmnet [16:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:13] !log T281498 `sudo confctl select 'name=wdqs2004.codfw.wmnet' set/pooled=yes:weight=10` after merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/684435 [16:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:21] T281498: Transfer one codfw wdqs-internal host over to codfw wdqs (public) - https://phabricator.wikimedia.org/T281498 [16:30:11] !log ryankemper@cumin1001 START - Cookbook 
sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 [16:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:19] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563 [16:40:21] (03PS1) 10Jbond: sshd review: do not merge [gitlab-ansible] - 10https://gerrit.wikimedia.org/r/684443 (https://phabricator.wikimedia.org/T276148) [16:43:10] PROBLEM - Check systemd state on elastic1060 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:16] (finished meeting) bstorm, so talk to manuel on ticket [16:43:25] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install phab2002 - https://phabricator.wikimedia.org/T280544 (10Papaul) [16:43:36] but the model that was used for old labsdbs may need some tweaks [16:43:38] PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:52] either on how much memory is being used per instance [16:43:59] or the monitoring limits of it [16:44:06] Yeah.
Will do :) [16:44:22] tuning is very workload dependent, so what works for production won't work for clouddbs [16:44:45] of both resources and monitoring [16:45:11] (03PS1) 10Ladsgroup: mailman3: Copy the config file before disabling the list [puppet] - 10https://gerrit.wikimedia.org/r/684444 (https://phabricator.wikimedia.org/T280322) [16:45:12] RECOVERY - Check systemd state on elastic1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:14] (03PS2) 10Ladsgroup: mailman3: Copy the config file before disabling the list [puppet] - 10https://gerrit.wikimedia.org/r/684444 (https://phabricator.wikimedia.org/T280322) [16:55:03] (03PS2) 10Jbond: C:gitlab::ssh: add new gilab::ssh class [puppet] - 10https://gerrit.wikimedia.org/r/684437 [16:55:33] (03CR) 10Legoktm: [C: 03+2] mailman3: Copy the config file before disabling the list [puppet] - 10https://gerrit.wikimedia.org/r/684444 (https://phabricator.wikimedia.org/T280322) (owner: 10Ladsgroup) [16:57:41] 10SRE, 10GitLab (Initialization), 10Patch-For-Review, 10Release-Engineering-Team (Doing), 10User-brennen: SSH Access of Git data in GitLab - https://phabricator.wikimedia.org/T276148 (10jbond) >>! In T276148#7050178, @Sergey.Trofimovsky.SF wrote: > Here it is, requesting settings review: > > https://ger... [16:58:30] (03PS9) 10JMeybohm: kube-apiserver: Update admission controller config [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) [17:00:05] ryankemper: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210503T1700).
[17:11:24] RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:35] (03PS3) 10Jdlrobson: Enable new language button for all logged in users outside test projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526) [17:12:51] (03CR) 10Jdlrobson: "Jan will deploy this at 11am UTC tomorrow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682758 (https://phabricator.wikimedia.org/T280526) (owner: 10Jdlrobson) [17:14:19] (03CR) 10JMeybohm: "Think I finally managed. PCC with ml clusters not explicitly enabling any admission plugins (for me to test the struct type etc.) at: http" [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) (owner: 10JMeybohm) [17:14:47] (03PS10) 10JMeybohm: kube-apiserver: Update admission controller config [puppet] - 10https://gerrit.wikimedia.org/r/677922 (https://phabricator.wikimedia.org/T270063) [17:17:04] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) [17:18:26] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) @MMandere Just subscribed you to 2 new mailing lists called "ops". That is the name we had before we became SRE. also see: https://lists.wikimedia.org/mailman/listinfo/ops https://lists.wikime... 
[17:20:09] !log Restarting CI Jenkins due to "Gearman worker contint2001.wikimedia.org_manager" thread dieing unexpectedly # T281737 [17:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:17] T281737: Zuul can't stop jobs or set the build description - https://phabricator.wikimedia.org/T281737 [17:21:13] (03PS1) 10Volans: setup.py: relax elasticsearch dependencies [software/spicerack] - 10https://gerrit.wikimedia.org/r/684476 [17:25:33] (03PS2) 10Jdlrobson: Prepare for new configuration option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683720 (https://phabricator.wikimedia.org/T277951) [17:29:49] (03CR) 10Ryan Kemper: [C: 03+1] "LGTM! Thanks for the detailed context in the commit" [software/spicerack] - 10https://gerrit.wikimedia.org/r/684476 (owner: 10Volans) [17:30:41] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [software/spicerack] - 10https://gerrit.wikimedia.org/r/684476 (owner: 10Volans) [17:37:55] (03CR) 10Bstorm: "I see you found the traffic quirks in this over at If3e7a29b5c17a012cdd2" [dns] - 10https://gerrit.wikimedia.org/r/684099 (owner: 10Majavah) [17:39:42] (03CR) 10Bstorm: "Adding some traffic team folks in case there are gotchas around that." 
[puppet] - 10https://gerrit.wikimedia.org/r/684100 (owner: 10Majavah) [17:44:34] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 [17:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:43] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563 [17:48:13] (03CR) 10Bstorm: "Fun question that I don't have the answer to just yet: Is TLS is done by envoy, can we simply add the new name to hieradata/role/common/wm" [puppet] - 10https://gerrit.wikimedia.org/r/684100 (owner: 10Majavah) [17:48:19] (03CR) 10Volans: [C: 03+2] setup.py: relax elasticsearch dependencies [software/spicerack] - 10https://gerrit.wikimedia.org/r/684476 (owner: 10Volans) [17:55:26] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) [17:55:52] 10SRE, 10SRE-Access-Requests: SRE Onboarding for Marc Mandere - https://phabricator.wikimedia.org/T281344 (10Dzahn) - added to private exim aliases incl. root and dns-admin [17:56:16] (03Merged) 10jenkins-bot: setup.py: relax elasticsearch dependencies [software/spicerack] - 10https://gerrit.wikimedia.org/r/684476 (owner: 10Volans) [18:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210503T1800). Please do the needful. [18:00:05] jan_drewniak, tgr, and Majavah: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:09] here [18:00:15] o/ [18:00:16] mine are both beta only [18:00:34] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) [18:00:45] mine is also beta only [18:01:24] I can deploy today [18:01:40] 10SRE, 10SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (10Dzahn) @xSavitar Hi, ticket looks good. I'll handle it as the "clinic duty" person this week. @thcipriani Let's start with your approval. Do you approve? [18:02:23] (03CR) 10Urbanecm: [C: 03+2] Hotfix: loadRelatedArticles should consider existence of container element [extensions/RelatedArticles] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/684393 (https://phabricator.wikimedia.org/T281547) (owner: 10Jdrewniak) [18:02:27] 10SRE, 10SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (10thcipriani) [18:02:43] (03CR) 10Urbanecm: [C: 03+2] [beta] GrowthExperiments: make link recommendations default in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684436 (owner: 10Gergő Tisza) [18:02:57] tgr_: +2'ed, will be live within ~30 minutes (but i bet you know it :)) [18:04:01] (03PS2) 10Urbanecm: beta: Use upload.wikimedia.beta.wmflabs.o for uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684381 (https://phabricator.wikimedia.org/T281650) (owner: 10Majavah) [18:04:06] (03CR) 10Urbanecm: [C: 03+2] beta: Use upload.wikimedia.beta.wmflabs.o for uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684381 (https://phabricator.wikimedia.org/T281650) (owner: 10Majavah) [18:04:08] 10SRE, 10SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (10thcipriani) >>! In T281564#7054274, @Dzahn wrote: > @xSavitar Hi, ticket looks good. I'll handle it as the "clinic duty" person this week. 
> > @thcipriani Let's start with your approva... [18:05:01] (03CR) 10Urbanecm: [C: 03+2] beta: Switch to deployment-urldownloader03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684384 (owner: 10Majavah) [18:05:30] Majavah: +2'ed, will get deployed soon :). I'll sync CS.php changes to prod too, although they should be no-op. [18:05:38] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: make link recommendations default in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684436 (owner: 10Gergő Tisza) [18:05:41] 10SRE, 10Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup) Group A is done. It was really messy. - There are wildcard bans being overridden by userlist, please don't do that. - Some mailing lists simply don't have an owner (!) [18:06:05] ty! [18:07:07] (03Merged) 10jenkins-bot: beta: Switch to deployment-urldownloader03 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684384 (owner: 10Majavah) [18:09:49] (03PS3) 10Urbanecm: beta: Use upload.wikimedia.beta.wmflabs.o for uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684381 (https://phabricator.wikimedia.org/T281650) (owner: 10Majavah) [18:09:55] (03CR) 10Urbanecm: [C: 03+2] beta: Use upload.wikimedia.beta.wmflabs.o for uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684381 (https://phabricator.wikimedia.org/T281650) (owner: 10Majavah) [18:11:54] (03Merged) 10jenkins-bot: beta: Use upload.wikimedia.beta.wmflabs.o for uploads [mediawiki-config] - 10https://gerrit.wikimedia.org/r/684381 (https://phabricator.wikimedia.org/T281650) (owner: 10Majavah) [18:14:39] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: bc1bc903169e4982c0c5a930094bed9f22616293: NOOP: beta: Use upload.wikimedia.beta.wmflabs.o for uploads (T281650; 1/2) (duration: 00m 58s) [18:14:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:47] T281650: Move 
upload.beta.wmflabs.org to upload.wikimedia.beta.wmflabs.org - https://phabricator.wikimedia.org/T281650 [18:15:54] (03Merged) 10jenkins-bot: Hotfix: loadRelatedArticles should consider existence of container element [extensions/RelatedArticles] (wmf/1.37.0-wmf.3) - 10https://gerrit.wikimedia.org/r/684393 (https://phabricator.wikimedia.org/T281547) (owner: 10Jdrewniak) [18:15:57] !log urbanecm@deploy1002 Synchronized wmf-config/filebackend.php: bc1bc903169e4982c0c5a930094bed9f22616293: NOOP: beta: Use upload.wikimedia.beta.wmflabs.o for uploads (T281650; 2/2) (duration: 00m 57s) [18:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:41] jan_drewniak: pulled onto mwdebug1001, can you test it there, please? [18:16:52] Urbanecm: sure thing [18:17:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:18:10] Urbanecm: ok, looks good [18:18:14] thanks, syncing [18:18:31] (03PS1) 10Ottomata: refine - Remove webproxy for eventlogging_analytics job [puppet] - 10https://gerrit.wikimedia.org/r/684482 (https://phabricator.wikimedia.org/T247510) [18:19:49] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.3/extensions/RelatedArticles/resources/ext.relatedArticles.readMore.bootstrap/index.js: cf9d9da3bf272d33c2d9b29d9172b1c81bfd8beb: Hotfix: loadRelatedArticles should consider existence of container element (T281547) (duration: 00m 57s) [18:19:55] jan_drewniak: should be live [18:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:57] T281547: TypeError: Cannot read property 'top' of undefined - https://phabricator.wikimedia.org/T281547 [18:19:58] anything else, anyone? :) [18:20:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:20:07] Urbanecm: thanks! [18:20:10] any time [18:20:55] !log Morning B&C window done [18:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:10] (03CR) 10Ottomata: [C: 03+2] refine - Remove webproxy for eventlogging_analytics job [puppet] - 10https://gerrit.wikimedia.org/r/684482 (https://phabricator.wikimedia.org/T247510) (owner: 10Ottomata) [18:29:34] Amir1: hi, you there? [18:30:57] tabbycat: is it mailing list related? [18:31:13] legoktm: yup [18:31:26] it's a bit sensitive too, if you don't mind me PMing? [18:31:31] go for it [18:31:37] ok thanks [18:32:09] tabbycat: I'm [18:32:28] tabbycat: let me know if there is anything I can do [18:33:19] (03PS1) 10Bstorm: maintain_dbusers: add new multi-instance analytics dedicated host [puppet] - 10https://gerrit.wikimedia.org/r/684485 (https://phabricator.wikimedia.org/T281287) [18:33:26] I found a spelling error on the cu list description [18:33:49] Last 3 letters are missing [18:33:54] So maybe cut off? [18:34:17] RhinosF1: yeah... I should add that as well [18:34:49] I was being nosey to see what got moved [18:36:53] 10SRE, 10SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (10Dzahn) Thanks @thcipriani I confirm L3 has already been signed as well. 
[18:38:36] Amir1: thanks, I'm talking to legoktm about it :) [18:38:43] 10SRE, 10SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (10Dzahn) [18:48:42] (03PS3) 10Jbond: C:gitlab::ssh: add new gilab::ssh class [puppet] - 10https://gerrit.wikimedia.org/r/684437 [18:48:44] (03PS1) 10Jbond: P:gitlab: add basic gitlab class [puppet] - 10https://gerrit.wikimedia.org/r/684486 [18:48:46] (03PS1) 10Jbond: P:gitlab: manage gitlab with gitlab module [puppet] - 10https://gerrit.wikimedia.org/r/684487 [18:51:10] (03CR) 10jerkins-bot: [V: 04-1] P:gitlab: add basic gitlab class [puppet] - 10https://gerrit.wikimedia.org/r/684486 (owner: 10Jbond) [19:01:45] 10SRE, 10SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (10Dzahn) @xSavitar All boxes are checked except the "sign valid NDA with legal". Assuming you haven't already done this, I will add @KFrancis to get this going. @KFrancis Hello, this... 
[19:01:53] 10SRE, 10SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (10Dzahn) p:05Triage→03Medium [19:02:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:03:00] 10SRE, 10Dumps-Generation, 10SRE-Access-Requests: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (10Dzahn) 05Open→03Stalled [19:03:23] 10SRE, 10SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (10Dzahn) a:03Dzahn [19:04:27] 10SRE, 10DBA, 10Wikimedia-Mailing-lists: Delete lists-next.wikimedia.org - https://phabricator.wikimedia.org/T281548 (10Dzahn) p:05Triage→03Low [19:04:38] 10SRE, 10Services, 10Service-deployment-requests: New Service Request miscweb - https://phabricator.wikimedia.org/T281538 (10Dzahn) p:05Triage→03Medium [19:04:52] 10SRE, 10Datacenter-Switchover, 10Performance-Team (Radar): June 2021 Datacenter switchover - https://phabricator.wikimedia.org/T281515 (10Dzahn) p:05Triage→03Medium [19:05:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:07:12] 10SRE, 10ops-codfw: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T281504 (10Dzahn) p:05Triage→03Medium [19:07:29] 10SRE, 10ops-codfw, 10Wikidata-Query-Service: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T281504 (10Dzahn) [19:13:34] 10SRE, 10ops-codfw, 10Wikidata-Query-Service: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T281504 (10RKemper) a:03RKemper [19:14:42] 10SRE, 10ops-codfw, 10Wikidata-Query-Service: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T281504 (10Dzahn) [19:14:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (10Dzahn) [19:21:03] 10SRE, 10Wikimedia-Mailing-lists: Rename mailinglists eliso, and eliso-anoncoj - https://phabricator.wikimedia.org/T281686 (10Ladsgroup) Unfortunately, renaming a mailing list is not that easy in mailman3: https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/thread/Q3YHKZKUALBWIESNOQLRBFRNJ6F3O7... 
[19:21:29] !log T280382 [WDQS] `sudo confctl select 'name=wdqs1004.eqiad.wmnet' set/pooled=no` (`wdqs1004` failed re-image [not sure why yet] and won't let me ssh in to depool so using conftool instead) [19:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:41] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [19:21:46] !log ryankemper@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1004.eqiad.wmnet [19:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:35] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:57] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --without-lvs --source wdqs1003.eqiad.wmnet --dest wdqs1010.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` [19:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:36] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Mailman3 non-members interface is confusing - https://phabricator.wikimedia.org/T281746 (10Dzahn) p:05Triage→03Medium [19:32:03] 10SRE, 10Wikimedia-Mailing-lists: Rename mailinglists eliso, and eliso-anoncoj - https://phabricator.wikimedia.org/T281686 (10Dzahn) p:05Triage→03Low [19:36:25] (03PS1) 10Andrew Bogott: wmcs-policy-tests.py: add Trove policy tests [puppet] - 10https://gerrit.wikimedia.org/r/684494 (https://phabricator.wikimedia.org/T279845) [19:39:03] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2007.codfw.wmnet with reason: REIMAGE [19:39:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:43] (03PS2) 10Jbond: P:gitlab: add basic gitlab class [puppet] - 10https://gerrit.wikimedia.org/r/684486 [19:40:51] (03CR) 10jerkins-bot: [V: 
04-1] P:gitlab: add basic gitlab class [puppet] - 10https://gerrit.wikimedia.org/r/684486 (owner: 10Jbond) [19:40:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2007.codfw.wmnet with reason: REIMAGE [19:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:27] (03CR) 10Herron: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/683695 (https://phabricator.wikimedia.org/T233134) (owner: 10Herron) [20:00:04] chrisalbon and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210503T2000). [20:22:05] (03CR) 10Herron: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/683047 (https://phabricator.wikimedia.org/T279342) (owner: 10Herron) [20:24:01] (03CR) 10Herron: "> Patch Set 2:" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/683706 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [20:25:57] (03CR) 10Herron: "Ok, I think this is ready for another look" [puppet] - 10https://gerrit.wikimedia.org/r/683044 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [20:35:47] 10SRE, 10ops-codfw, 10Wikidata-Query-Service: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T281504 (10RKemper) 05Open→03Resolved [20:35:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: ssh unreachable for wdqs2007.codfw.wmnet - https://phabricator.wikimedia.org/T281437 (10RKemper) [20:36:18] 10SRE, 10ops-codfw, 10Wikidata-Query-Service: Degraded RAID on wdqs2007 - https://phabricator.wikimedia.org/T281504 (10RKemper) Re-image of `wdqs2007` was completed successfully. 
[20:37:05] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:30] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage` [20:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:39] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [20:42:31] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:19] 10SRE, 10Mail, 10Wikimedia-Mailing-lists: In Mailman3 if a list has no owners, mail goes to root@ - https://phabricator.wikimedia.org/T281753 (10Legoktm) [20:44:01] 10SRE, 10Mail, 10Wikimedia-Mailing-lists: In Mailman3 if a list has no owners, mail goes to root@ - https://phabricator.wikimedia.org/T281753 (10Legoktm) [20:44:46] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [20:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:24] RECOVERY - Check systemd state on wdqs2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:46:36] PROBLEM - WDQS high update lag on wdqs1003 is CRITICAL: 4879 ge 3600 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [20:48:24] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: mailman3: Let users choose the UI language - https://phabricator.wikimedia.org/T281747 (10MarcoAurelio) Thanks for your comments. 
Indeed there's no dropdown to select in which language you'd like to see the UI in. In my case, the translations shown to me are some... [20:53:17] (03PS5) 10Jdlrobson: Replace $wgRelatedArticlesFooterWhitelistedSkins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680814 (owner: 10Reedy) [20:53:34] (03CR) 10Jdlrobson: [C: 03+1] "Reedy can you remove your -2 here? I'll get this deployed later." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680814 (owner: 10Reedy) [20:54:25] (03PS3) 10Jdlrobson: Prepare for new configuration option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/683720 (https://phabricator.wikimedia.org/T277951) [20:54:33] (03CR) 10Jdlrobson: [C: 03+1] "Thanks! :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680814 (owner: 10Reedy) [20:56:31] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: mailman3: Let users choose the UI language - https://phabricator.wikimedia.org/T281747 (10Legoktm) [20:56:49] !log T280382 [WDQS] `ryankemper@wdqs2001:~$ sudo run-puppet-agent --force` [20:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:57] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [20:57:44] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-aarora-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:04] Reedy and sbassett: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210503T2100). 
[21:01:01] (03PS2) 10Krinkle: Move ExternalStore log group from debug to error [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: 10Reedy) [21:01:35] (03PS3) 10Krinkle: logging: Raise ExternalStore min level from debug to warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: 10Reedy) [21:02:45] !log T280382 `wdqs1010.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.6T 975G 1.5T 39% /srv` [21:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:54] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [21:04:44] 10SRE, 10Wikimedia-Mailing-lists: Rename mailinglists eliso, and eliso-anoncoj - https://phabricator.wikimedia.org/T281686 (10Legoktm) >>! In T281686#7054623, @Ladsgroup wrote: > Unfortunately, renaming a mailing list is not that easy in mailman3: > https://lists.mailman3.org/archives/list/mailman-users@mailma... 
[21:05:59] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:07] !log T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` [21:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:32] (03PS4) 10Krinkle: logging: Raise ExternalStore min level from debug to warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: 10Reedy) [21:09:40] !log T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1011.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` [21:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:48] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [21:10:03] (03CR) 10Krinkle: [C: 03+2] logging: Raise ExternalStore min level from debug to warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: 10Reedy) [21:11:10] (03Merged) 10jenkins-bot: logging: Raise ExternalStore min level from debug to warning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/682322 (https://phabricator.wikimedia.org/T281048) (owner: 10Reedy) [21:14:44] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:15:12] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1011.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:17:28] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: 
CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1011.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [21:18:23] ^ looking [21:19:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:19:38] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1011.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [21:19:50] !log ryankemper@puppetmaster1001 conftool action : set/pooled=no; selector: name=wdqs1011.eqiad.wmnet [21:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:20] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:20:27] !log T280382 [WDQS] `ryankemper@puppetmaster1001:~$ sudo confctl select 'name=wdqs1011.eqiad.wmnet' set/pooled=no` [21:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:35] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [21:22:03] !log [WDQS] `ryankemper@wdqs1003:~$ sudo pool` [21:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:34] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:22:36] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [21:22:36] Forcing a re-check to clear these alerts [21:22:40] (done) [21:23:26] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:25] !log ryankemper@cumin1001 START - Cookbook 
sre.hosts.downtime for 2:00:00 on wdqs1011.eqiad.wmnet with reason: REIMAGE [21:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:27] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1011.eqiad.wmnet with reason: REIMAGE [21:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:31] !log krinkle@deploy1002 Synchronized wmf-config/InitialiseSettings.php: d95b91648 (duration: 00m 58s) [21:32:32] 10SRE, 10MediaWiki-Revision-backend, 10observability, 10MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), 10Performance-Team (Radar): mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (10Krinkle) [21:32:35] 10SRE, 10Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (10Dzahn) >>! In T280989#7050524, @jcrespo wrote: > There is in fact 5 categories, with different meanings and alerting levels, from "Fresh" to "All failures", as seen at: https://wikitech.wikimedia.org/wiki/Ba... [21:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:44] 10SRE, 10MediaWiki-Revision-backend, 10observability, 10MW-1.37-notes (1.37.0-wmf.3; 2021-04-27), 10Performance-Team (Radar): mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (10Krinkle) 05Open→03Resolved a:03Krinkle [21:32:54] 10SRE, 10MediaWiki-Revision-backend, 10Performance-Team, 10observability, 10MW-1.37-notes (1.37.0-wmf.3; 2021-04-27): mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (10Krinkle) [21:36:34] (03PS1) 10Dzahn: Revert "bacula: add people1003 job to monitoring ignorelist" [puppet] - 10https://gerrit.wikimedia.org/r/684463 [21:40:33] (03CR) 10Bstorm: "This looks like a good idea. 
The only thing that gives me pause is whether or not this file is actually used for anything and if we should" [puppet] - 10https://gerrit.wikimedia.org/r/684115 (https://phabricator.wikimedia.org/T198673) (owner: 10Krinkle) [21:42:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:42:54] 10SRE, 10SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (10KFrancis) @Dzahn I am confirming Alangi Derick has a signed NDA on file with legal. Thanks! [21:46:00] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 [21:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:09] T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563 [21:47:25] !log T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` [21:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:12] (03CR) 10Dzahn: [C: 03+2] Revert "bacula: add people1003 job to monitoring ignorelist" [puppet] - 10https://gerrit.wikimedia.org/r/684463 (owner: 10Dzahn) [21:52:43] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 [21:52:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:51] 
T280563: Reboot elasticsearch* and relforge* to apply kernel security updates - https://phabricator.wikimedia.org/T280563 [21:54:24] SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (Dzahn) I reverted the addition to the ignore list. Setup is done, there is no reason why it should fail. Let's see what happens. I am refresh Icinga etc. https://gerrit.wikimedia.org/r/c/operations/puppe... [21:54:48] !log T280563 eqiad reboot failed with: `curator.exceptions.FailedExecution: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='search.svc.eqiad.wmnet', port=9243): Read timed out. (read timeout=10))` [21:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:02] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 [21:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:04] !log T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` [21:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:44] ACKNOWLEDGEMENT - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service daniel_zahn https://phabricator.wikimedia.org/T280744 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:10] PROBLEM - Hadoop NodeManager on an-worker1131 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:02:32] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:02:58] PROBLEM - Hadoop NodeManager on an-worker1123 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:08:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:10:22] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:14:28] [backup1001:~] $ sudo check_bacula.py --icinga [22:14:31] !log [backup1001:~] $ sudo check_bacula.py --icinga [22:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:54] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: All failures: 1 (people1003), Fresh: 102 jobs daniel_zahn https://phabricator.wikimedia.org/T280989#7055122 https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [22:16:34] PROBLEM - Check systemd state on elastic1067 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:34] PROBLEM - Check systemd state on elastic1045 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:12] PROBLEM - Check systemd state on an-worker1123 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:22] !log ran disable_list for: iegcom wikien-l fundraiser spcommittee-private-l spcommittee-l mediation-en-l test-second wikifr-colloque-l [22:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:58] PROBLEM - Check systemd state on elastic1034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:17:59] SRE, SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (Dzahn) Thank you @KFrancis , perfect. Will go ahead. [22:18:16] SRE, SRE-Access-Requests: Requesting access to Deployment shell for derick - https://phabricator.wikimedia.org/T281564 (Dzahn) [22:24:42] SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (Dzahn) ` == jobs_with_all_failures (1) == people1003.eqiad.wmnet-Monthly-1st-Sun-production-home ` ` [backup1001:~] $ sudo check_bacula.py people1003.eqiad.wmnet-Monthly-1st-Sun-production-home 2021-04-2...
[22:25:00] RECOVERY - Hadoop NodeManager on an-worker1131 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:25:10] RECOVERY - Check systemd state on elastic1067 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:26] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:52] RECOVERY - Check systemd state on an-worker1123 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:25:58] RECOVERY - Hadoop NodeManager on an-worker1123 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [22:26:44] PROBLEM - Check systemd state on elastic1032 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:26] RECOVERY - Check systemd state on elastic1045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:29:50] RECOVERY - WDQS high update lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1154 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:35:24] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 44 probes of 718 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:40:26] RECOVERY - Check systemd state on elastic1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:42:42] RECOVERY - Check systemd state on elastic1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:06] PROBLEM - Check systemd state on elastic1033 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:24] PROBLEM - Check systemd state on elastic1048 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:53:42] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 40 probes of 721 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:56:46] RECOVERY - Check systemd state on elastic1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210503T2300). Please do the needful. [23:00:04] Jdlrobson: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:00:13] I can deploy today [23:00:28] Jdlrobson: around? 🙂 [23:00:52] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 39 probes of 716 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:01:36] RECOVERY - Check systemd state on elastic1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:42] hello [23:01:46] im here :) [23:01:52] RECOVERY - Check systemd state on snapshot1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:03:45] cool :) [23:04:16] could someone from sre confirm the IPv4 ping alerts are not a reason to worry please? [23:05:04] SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (jcrespo) I am answering from mail- apologies for any formatting errors. I can have a deeper look tomorrow. But first..., one important thing I forgot to communicate: please do not ack/downtime the bacula... [23:08:41] cdanis @elukey are these a concern? ^ [23:08:49] Urbanecm: not sure exactly what's going on there, but you're fine to deploy [23:09:03] thanks a lot. Let's start then :). [23:09:06] Thanks rzl [23:09:24] thanks for checking! [23:09:56] (PS6) Urbanecm: Replace $wgRelatedArticlesFooterWhitelistedSkins [mediawiki-config] - https://gerrit.wikimedia.org/r/680814 (owner: Reedy) [23:10:09] (CR) Urbanecm: [C: +2] Replace $wgRelatedArticlesFooterWhitelistedSkins [mediawiki-config] - https://gerrit.wikimedia.org/r/680814 (owner: Reedy) [23:10:56] (Merged) jenkins-bot: Replace $wgRelatedArticlesFooterWhitelistedSkins [mediawiki-config] - https://gerrit.wikimedia.org/r/680814 (owner: Reedy) [23:12:17] Jdlrobson: pulled onto mwdebug1001, can you check?
[23:12:21] ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (RobH) [23:12:22] looking [23:12:30] ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install an-web1001 - https://phabricator.wikimedia.org/T281787 (RobH) [23:13:21] (CR) Bstorm: [C: +2] wikireplicas: redirect all database CNAMEs to the new system [puppet] - https://gerrit.wikimedia.org/r/683929 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm) [23:13:23] Urbanecm: LGTM, that one can be synced. [23:13:33] thanks, syncing [23:14:03] !log urbanecm@deploy1002 sync-file aborted: 7c47ee17b3936fb1f79590187a9e0028276e4a9d: Replace $wgRelatedArticlesFooterWhitelistedSkins (T277958)¨ (duration: 00m 01s) [23:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:12] (PS4) Urbanecm: Prepare for new configuration option [mediawiki-config] - https://gerrit.wikimedia.org/r/683720 (https://phabricator.wikimedia.org/T277951) (owner: Jdlrobson) [23:14:12] T277958: Address Voice and Tone issues in RelatedArticles - https://phabricator.wikimedia.org/T277958 [23:14:17] (CR) Urbanecm: [C: +2] Prepare for new configuration option [mediawiki-config] - https://gerrit.wikimedia.org/r/683720 (https://phabricator.wikimedia.org/T277951) (owner: Jdlrobson) [23:15:03] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 7c47ee17b3936fb1f79590187a9e0028276e4a9d: Replace $wgRelatedArticlesFooterWhitelistedSkins (T277958) (duration: 00m 57s) [23:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:13] should be live Jdlrobson [23:15:27] Urbanecm: yay [23:15:31] will watch the logs [23:15:36] (Merged) jenkins-bot: Prepare for new configuration option [mediawiki-config] - https://gerrit.wikimedia.org/r/683720 (https://phabricator.wikimedia.org/T277951) (owner: Jdlrobson) [23:15:41] Jdlrobson: i appreciate that [23:16:03] Jdlrobson: second
one is on mwdebug1001, please test [23:16:09] Urbanecm: on it.. [23:17:18] Urbanecm: this one is also good to go. [23:17:24] excellent, syncing [23:18:51] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 230ef5716b34ca83348667f289180313b76ce8a3: Prepare for new configuration option (T277951) (duration: 00m 57s) [23:18:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:18:57] and also live. [23:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:00] anything else Jdlrobson ? [23:19:01] T277951: Address Voice and Tone issues in MobileFrontend - https://phabricator.wikimedia.org/T277951 [23:19:49] Urbanecm: nope that's it (provided no log spikes in next 10 mins) [23:19:57] I'll keep an eye on things but risk is low [23:20:02] thanks for your help! [23:20:05] glad it was a quick one! [23:20:18] okay. I should be reachable for next half an hour or so should you need me :) [23:21:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:21:47] good to know :) [23:22:44] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:25:54] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 35 probes of 716 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:29:12] PROBLEM - WDQS SPARQL on wdqs1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 532 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [23:35:40] Urbanecm: i think we're safe :) Hope you have a good evening/night/morning! [23:35:48] you too! [23:37:04] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 40 probes of 716 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:40:35] (CR) Bstorm: "I was a little worried about using a heredoc, but puppet compiler says this is totally legit!
https://puppet-compiler.wmflabs.org/compiler" [puppet] - https://gerrit.wikimedia.org/r/684485 (https://phabricator.wikimedia.org/T281287) (owner: Bstorm) [23:42:41] (CR) Bstorm: [C: +2] maintain_dbusers: add new multi-instance analytics dedicated host [puppet] - https://gerrit.wikimedia.org/r/684485 (https://phabricator.wikimedia.org/T281287) (owner: Bstorm) [23:57:26] PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:54] PROBLEM - Check systemd state on elastic1035 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:58:48] PROBLEM - Check systemd state on elastic1044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state