[00:31:22] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:18] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:34] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:32] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:04] RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:08] PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:06] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:52:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:30:10] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:37:06] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:28:08] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:31:56] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:38:32] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:43:30] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:31:14] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:38:22] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:21:31] (CR) Ladsgroup: "Yeah :( Django sucks. Flask is much better."
[puppet] - https://gerrit.wikimedia.org/r/657952 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[05:30:20] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:14] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:35] (PS1) Urbanecm: Enable SandboxLink at viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/658073 (https://phabricator.wikimedia.org/T272796)
[05:54:25] (PS1) Urbanecm: frwiki: Change back to normal logo [mediawiki-config] - https://gerrit.wikimedia.org/r/658075 (https://phabricator.wikimedia.org/T272700)
[06:03:45] (PS1) Urbanecm: [beta] Initial configuration for votewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/658076 (https://phabricator.wikimedia.org/T272608)
[06:21:30] (PS1) Ladsgroup: cache: Migrate hiera() to lookup() and set datatypes in frontend [puppet] - https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953)
[06:26:51] SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) Could we have an ETA on when this server can be worked on?
We were aiming to open a new infrastructure this server is part of to users 1st of Feb
[06:26:57] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/27637/" [puppet] - https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[06:28:03] (CR) Ladsgroup: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/657560 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[06:30:52] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:48] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:43:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Populate x2 eqiad hosts into dbctl T269324', diff saved to https://phabricator.wikimedia.org/P13938 and previous config saved to /var/cache/conftool/dbconfig/20210125-064305-marostegui.json
[06:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:10] T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324
[06:44:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add x2 eqiad to dbctl T269324', diff saved to https://phabricator.wikimedia.org/P13939 and previous config saved to /var/cache/conftool/dbconfig/20210125-064419-marostegui.json
[06:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:26] (PS1) Marostegui: x2 hosts: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/658081 (https://phabricator.wikimedia.org/T269324)
[06:47:31] (CR) Marostegui: [C: +2] x2 hosts: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/658081 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[06:48:36] SRE,
MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui)
[07:08:08] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={rsyslog-notice,rsyslog-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consum
[07:08:08] h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[07:08:08] (CR) Urbanecm: [C: -1] Create patroller user group for thwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149) (owner: Patsagorn Y.)
[07:08:45] (CR) Urbanecm: [C: +1] Set $wgCategoryCollation = uca-tr on trwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/657997 (https://phabricator.wikimedia.org/T272783) (owner: Evrifaessa)
[07:10:37] (CR) Urbanecm: [C: +1] "Should work" [mediawiki-config] - https://gerrit.wikimedia.org/r/657998 (https://phabricator.wikimedia.org/T272784) (owner: Evrifaessa)
[07:15:57] (CR) Urbanecm: [C: +1] "Lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/657955 (owner: Legoktm)
[07:24:41] SRE, Machine Learning Platform, SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (elukey) >>! In T272687#6769596, @calbon wrote: > kbazira needs access too Yep Tobias added Kevin as well, all covered!
[07:25:09] SRE, Machine Learning Platform, SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (elukey) Resolved→Open
[07:25:31] SRE, Machine Learning Platform, SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (elukey) p:High→Medium
[07:25:50] SRE, Machine Learning Platform, SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (elukey)
[07:26:32] SRE, Machine Learning Platform, SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (elukey) Reopening to get this request validated by the SRE team during today's team meeting :)
[07:30:04] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:33:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1142 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P13940 and previous config saved to /var/cache/conftool/dbconfig/20210125-073322-marostegui.json
[07:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:54] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:47] (PS1) Elukey: Remove old analytics decommed nodes [puppet] - https://gerrit.wikimedia.org/r/658087 (https://phabricator.wikimedia.org/T267932)
[07:46:30] (CR) Elukey: [C: +2] Remove old analytics decommed nodes [puppet] - https://gerrit.wikimedia.org/r/658087 (https://phabricator.wikimedia.org/T267932) (owner: Elukey)
[07:51:46] (PS1) Muehlenhoff: Remove access for risler [puppet] - https://gerrit.wikimedia.org/r/658089
[07:52:15] (PS1) Elukey: Add the Hadoop worker profile to master/standby in Backup [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411)
[07:54:00] (CR) Muehlenhoff: [C: +2] Remove access for risler [puppet] - https://gerrit.wikimedia.org/r/658089 (owner: Muehlenhoff)
[07:54:16] (PS2) Elukey: Add the Hadoop worker profile to master/standby in Backup [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411)
[07:59:16] (PS3) Elukey: Add the Hadoop worker profile to master/standby in Backup [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411)
[07:59:18] (PS1) Elukey: profile::hadoop::monitoring::resourcemanager: fix jmx resource name [puppet] - https://gerrit.wikimedia.org/r/658210
[08:01:00] (CR) Elukey: [C: +2] profile::hadoop::monitoring::resourcemanager: fix jmx resource name [puppet] - https://gerrit.wikimedia.org/r/658210 (owner: Elukey)
[08:01:46] (PS4) Elukey: Add the Hadoop worker profile to master/standby in Backup [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411)
[08:02:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 25%: After upgrading its kernel', diff saved to https://phabricator.wikimedia.org/P13941 and previous config saved to /var/cache/conftool/dbconfig/20210125-080204-root.json
[08:02:06] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:21] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27639/console" [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[08:04:51] SRE, DBA: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:05:51] (CR) Elukey: [V: +1 C: +2] Add the Hadoop worker profile to master/standby in Backup [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[08:12:16] (PS1) Marostegui: mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427)
[08:12:32] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427) (owner: Marostegui)
[08:12:48] (PS1) Muehlenhoff: Extend access for agaduran [puppet] - https://gerrit.wikimedia.org/r/658212
[08:13:44] (CR) Muehlenhoff: [C: +2] Extend access for agaduran [puppet] - https://gerrit.wikimedia.org/r/658212 (owner: Muehlenhoff)
[08:13:46] (PS1) Marostegui: wmnet: Update s4-master CNAME [dns] - https://gerrit.wikimedia.org/r/658213 (https://phabricator.wikimedia.org/T271427)
[08:14:15] (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/658213 (https://phabricator.wikimedia.org/T271427) (owner: Marostegui)
[08:17:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 50%: After upgrading its kernel', diff saved to https://phabricator.wikimedia.org/P13942 and previous config saved to /var/cache/conftool/dbconfig/20210125-081708-root.json
[08:17:10] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:20] (PS1) Elukey: Add the HDFS balancer to the Master node in Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/658215 (https://phabricator.wikimedia.org/T260411)
[08:25:55] (CR) Elukey: [C: +2] Add the HDFS balancer to the Master node in Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/658215 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[08:30:44] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 75%: After upgrading its kernel', diff saved to https://phabricator.wikimedia.org/P13943 and previous config saved to /var/cache/conftool/dbconfig/20210125-083211-root.json
[08:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:16] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:33:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:33:57] (CR) Marostegui: "I guess we need to review those users on the new hosts to make sure they have the correct max_connection settings?"
[puppet] - https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399) (owner: Bstorm)
[08:35:55] (CR) Marostegui: [C: +1] mariadb: Remove mariadb module mylvmbackup [puppet] - https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559) (owner: Jcrespo)
[08:36:52] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:24] (CR) Marostegui: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929) (owner: Jcrespo)
[08:40:00] SRE, Wikimedia-Mailing-lists: wikipedia-mai & wikiur-l lists do not seem to have active list admins (mail archives empty after August 2018 & January 2019) - https://phabricator.wikimedia.org/T270837 (Aklapper) >>! In T270837#6735597, @Aklapper wrote: > If this is a request to have active mailing list mod...
[08:46:42] PROBLEM - Logstash rate of ingestion percent change compared to yesterday #o11y on alert1001 is CRITICAL: 1244 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[08:47:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 100%: After upgrading its kernel', diff saved to https://phabricator.wikimedia.org/P13944 and previous config saved to /var/cache/conftool/dbconfig/20210125-084715-root.json
[08:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:21] so there are a ton of messages on logstash
[08:54:49] going to move the discussion to #sre
[09:01:52] SRE, Prod-Kubernetes, Pybal, Traffic, serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (JMeybohm) >>!
In T238909#6769149, @akosiaris wrote: > Adding https://metallb.universe.tf/ as a potential solution as well. Wou...
[09:06:50] !log cp3054: install varnish 6.0.1-1wm2 -- 6.0.1 without https://github.com/varnishcache/varnish-cache/pull/2705 T264398
[09:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:55] T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398
[09:10:35] (PS1) Marostegui: etcd.php: Add x2 mapping [mediawiki-config] - https://gerrit.wikimedia.org/r/658218 (https://phabricator.wikimedia.org/T269324)
[09:11:54] (CR) Jcrespo: [C: +1] etcd.php: Add x2 mapping [mediawiki-config] - https://gerrit.wikimedia.org/r/658218 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[09:12:36] (CR) Marostegui: [C: +2] etcd.php: Add x2 mapping [mediawiki-config] - https://gerrit.wikimedia.org/r/658218 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[09:13:29] (Merged) jenkins-bot: etcd.php: Add x2 mapping [mediawiki-config] - https://gerrit.wikimedia.org/r/658218 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[09:14:30] (PS1) Elukey: Add a more restrictive default umask to Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/658219 (https://phabricator.wikimedia.org/T260411)
[09:15:10] (CR) Elukey: [C: +2] Add a more restrictive default umask to Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/658219 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[09:15:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Add x2 to the mapping array T269324 (duration: 01m 01s)
[09:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:18] T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324
[09:17:40] !log installing samba security updates on stretch
[09:17:42] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:19:50] win 11
[09:19:52] ufff
[09:21:27] !log marostegui@deploy1001 Synchronized wmf-config/etcd.php: Add x2 to the mapping array T269324 (duration: 00m 58s)
[09:21:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:36] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:21:39] T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324
[09:21:45] (CR) JMeybohm: "> Patch Set 1:" [deployment-charts] - https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: Alexandros Kosiaris)
[09:26:01] (CR) Muehlenhoff: [C: +2] Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/657786 (owner: Muehlenhoff)
[09:28:38] (CR) DCausse: [C: +1] "I'm tentatively scheduling the deploy to jan 29 as T267175 problems appear to be fixed." [mediawiki-config] - https://gerrit.wikimedia.org/r/657131 (https://phabricator.wikimedia.org/T270252) (owner: Lucas Werkmeister (WMDE))
[09:30:10] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:49] (PS1) David Caro: config: allow using tilde `~` to specify config paths [software/cumin] - https://gerrit.wikimedia.org/r/658223
[09:36:34] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:40:26] !log bounce apache2 on logstash1024, stuck on high cpu
[09:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:24] SRE, MediaWiki-Containers, serviceops, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (JMeybohm) The job fails on registry2002, leading to icinga alerts ` Jan 25 09:30:01 registry2002 systemd[1]: Started Build docker-registry home...
[09:47:27] (PS1) Urbanecm: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/658225 (https://phabricator.wikimedia.org/T272202)
[09:48:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader2002.wikimedia.org
[09:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:46] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2002.wikimedia.org
[09:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:23] (PS4) Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929)
[09:51:25] (PS2) Jcrespo: mariadb: Remove mariadb module mysqld_safe [puppet] - https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559)
[09:51:27] (PS2) Jcrespo: mariadb: Remove mariadb module mylvmbackup [puppet] - https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559)
[09:51:29] (PS1) Jcrespo: admin: Update ssh key for dvrandecic for production cluster access [puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470)
[09:51:36] (PS2) Jcrespo: admin: Update ssh key for dvrandecic for production cluster access
[puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470)
[09:52:23] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader1002.wikimedia.org
[09:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:29] (CR) Jcrespo: "This will fix partially root@ alerts spamming." [puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470) (owner: Jcrespo)
[09:54:00] (CR) Muehlenhoff: [C: +1] "Looks good. The other issues found by the account check are WIP, one already extended and one pinged for possible extension." [puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470) (owner: Jcrespo)
[09:55:28] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1002.wikimedia.org
[09:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:15] (CR) Jcrespo: "> Patch Set 2: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470) (owner: Jcrespo)
[09:57:23] (CR) Jcrespo: [C: +2] admin: Update ssh key for dvrandecic for production cluster access [puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470) (owner: Jcrespo)
[10:00:21] !log installing imagemagick security updates on stretch
[10:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:04] SRE, Patch-For-Review, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (jcrespo)
[10:05:33] (PS1) Elukey: superset: disable old druid datasouce panels [puppet] - https://gerrit.wikimedia.org/r/658228 (https://phabricator.wikimedia.org/T263972)
[10:06:23] (CR) Muehlenhoff: [C: +1] superset: Switch require_package -> ensure_packages [puppet] -
https://gerrit.wikimedia.org/r/657917 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[10:07:17] (PS2) Ladsgroup: lvs: Migrate hiera() to lookup() and set datatypes [puppet] - https://gerrit.wikimedia.org/r/657958 (https://phabricator.wikimedia.org/T209953)
[10:07:50] (PS2) Ladsgroup: cache: Migrate hiera() to lookup() and set datatypes in frontend [puppet] - https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953)
[10:07:58] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/657958 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[10:08:07] (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[10:09:00] (CR) Vgutierrez: [C: +1] lvs: Migrate hiera() to lookup() and set datatypes [puppet] - https://gerrit.wikimedia.org/r/657958 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[10:16:58] (CR) Ema: [C: +1] cache: Migrate hiera() to lookup() and set datatypes in frontend [puppet] - https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[10:17:37] (CR) Elukey: [C: +2] superset: disable old druid datasouce panels [puppet] - https://gerrit.wikimedia.org/r/658228 (https://phabricator.wikimedia.org/T263972) (owner: Elukey)
[10:24:05] (CR) Vgutierrez: [C: +2] lvs: Migrate hiera() to lookup() and set datatypes [puppet] - https://gerrit.wikimedia.org/r/657958 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[10:30:46] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:22] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:31] SRE, LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (jcrespo) @JKatzWMF thank you, could you help @JTannerWMF complete all required data to process the request, as per https://phabricator.wikimedia.org/tag/ldap-access-requests/ ?
[10:38:59] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (jcrespo) a:amy_rc→jcrespo Thank you again, will check on our shared docs and proceed with the access request.
[10:40:22] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (Lea_WMDE) @jcrespo sorry for the delay, this slipped through :/ Yes, Amy is interning with us until March 31 2021, good to know that that can be setup in advance!
[10:44:13] !log swift decrease weight for ms-be20[16,18,20,22] - T272837
[10:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:23] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837
[10:48:13] (CR) Elukey: [C: +2] profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[10:50:56] (PS3) Lucas Werkmeister (WMDE): Do weekly dumps of Wikidata Lexeme [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: Hoo man)
[10:52:47] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host scandium.eqiad.wmnet
[10:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:18] (PS5) Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - https://gerrit.wikimedia.org/r/657801
(https://phabricator.wikimedia.org/T111929)
[10:53:20] (PS3) Jcrespo: mariadb: Remove mariadb module mysqld_safe [puppet] - https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559)
[10:53:22] (PS3) Jcrespo: mariadb: Remove mariadb module mylvmbackup [puppet] - https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559)
[10:53:24] (PS1) Jcrespo: admin: Provide Superset access to Amrutha (WMDE intern) [puppet] - https://gerrit.wikimedia.org/r/658233 (https://phabricator.wikimedia.org/T271725)
[10:58:05] (PS1) Elukey: role::analytics_test_cluster::client: Upgrade to Bigtop [puppet] - https://gerrit.wikimedia.org/r/658234
[10:58:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host scandium.eqiad.wmnet
[10:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:38] (PS6) Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929)
[10:58:45] (PS2) Jcrespo: admin: Provide Superset access to Amrutha (WMDE intern) [puppet] - https://gerrit.wikimedia.org/r/658233 (https://phabricator.wikimedia.org/T271725)
[10:59:53] (CR) Volans: [C: +1] "LGTM, thanks." [software/cumin] - https://gerrit.wikimedia.org/r/658223 (owner: David Caro)
[10:59:56] (CR) Elukey: [C: +2] role::analytics_test_cluster::client: Upgrade to Bigtop [puppet] - https://gerrit.wikimedia.org/r/658234 (owner: Elukey)
[11:00:05] (CR) Jcrespo: "I think this can new be abandoned due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/643346/ (and ticket)."
[puppet] - https://gerrit.wikimedia.org/r/641508 (https://phabricator.wikimedia.org/T267744) (owner: Herron)
[11:01:32] (CR) David Caro: [C: +2] config: allow using tilde `~` to specify config paths [software/cumin] - https://gerrit.wikimedia.org/r/658223 (owner: David Caro)
[11:04:24] PROBLEM - Disk space on an-worker1137 is CRITICAL: DISK CRITICAL - free space: / 455 MB (0% inode=91%): /tmp 455 MB (0% inode=91%): /var/tmp 455 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1137&var-datasource=eqiad+prometheus/ops
[11:06:07] (PS1) ArielGlenn: reduce number of xml/sql dumps kept on dumpsdata hosts by one [puppet] - https://gerrit.wikimedia.org/r/658237
[11:06:36] ahahha checking the an-worker1137 space, we are backupping data
[11:07:31] (Merged) jenkins-bot: config: allow using tilde `~` to specify config paths [software/cumin] - https://gerrit.wikimedia.org/r/658223 (owner: David Caro)
[11:08:22] (CR) ArielGlenn: [C: +2] reduce number of xml/sql dumps kept on dumpsdata hosts by one [puppet] - https://gerrit.wikimedia.org/r/658237 (owner: ArielGlenn)
[11:11:18] !log thanos delete old orphaned blocks with replica=unset label
[11:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:41] (PS4) ArielGlenn: Do weekly dumps of Wikidata Lexeme [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: Hoo man)
[11:11:59] (PS1) DCausse: Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493)
[11:12:16] SRE, Wikimedia-Logstash, observability, Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (akosiaris) >>!
In T272238#6771873, @Nemo_bis wrote: >>>! In T272238#6764574, @akosiaris wrote: >> I 've marked T272111 a... [11:13:04] (03CR) 10ArielGlenn: [C: 03+2] Do weekly dumps of Wikidata Lexeme [puppet] - 10https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: 10Hoo man) [11:19:16] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 13.27 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [11:21:45] (03PS1) 10DCausse: [cirrus] set 50kb limit on file text indexing for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658240 (https://phabricator.wikimedia.org/T271493) [11:22:10] PROBLEM - SSH on ms-be2041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:22:12] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:04] RECOVERY - SSH on ms-be2041 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:23:36] (03PS1) 10Elukey: sre.hadoop.change-distro-from-cdh-clients: fix arg selection [cookbooks] - 10https://gerrit.wikimedia.org/r/658241 [11:24:24] RECOVERY - Disk space on an-worker1137 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1137&var-datasource=eqiad+prometheus/ops [11:25:54] PROBLEM - SSH on ms-be2044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:26:58] RECOVERY - SSH on ms-be2044 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:28:28] 10SRE, 10Analytics-Radar, 10Wikimedia-Logstash, 10observability, 10Performance-Team (Radar): Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10fgiunchedi) [11:29:28] (03CR) 10Vgutierrez: [C: 03+2] cache: Migrate hiera() to lookup() and set datatypes in frontend [puppet] - 10https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [11:30:05] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T1130). 
[11:30:20] (03CR) 10Elukey: [C: 03+2] sre.hadoop.change-distro-from-cdh-clients: fix arg selection [cookbooks] - 10https://gerrit.wikimedia.org/r/658241 (owner: 10Elukey) [11:30:26] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:47] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658244 (https://phabricator.wikimedia.org/T128546) [11:33:02] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [11:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:20] 10SRE, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) >>! In T238909#6772562, @JMeybohm wrote: >>>! In T238909#6769149, @akosiaris wrote: >> Adding https://metallb.univer... 
[11:34:55] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658244 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:35:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [11:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:15] (03CR) 10ZPapierski: [C: 03+1] [cirrus] set 50kb limit on file text indexing for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658240 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse) [11:35:18] PROBLEM - SSH on ms-be2033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:35:19] (03CR) 10ZPapierski: Add an option to limit the size of the file_text field (031 comment) [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse) [11:35:40] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658244 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [11:36:24] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:38:22] RECOVERY - SSH on ms-be2033 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:39:23] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:658242| Bumping portals to master (T128546)]] (duration: 00m 58s) [11:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:27] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:40:18] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:658242| Bumping portals to master (T128546)]] (duration: 00m 55s) [11:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:06] (03CR) 10DCausse: Add an option to limit the size of the file_text field (031 comment) [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse) [11:42:26] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:00] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:47:58] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:49:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:53:17] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10aborrero) [11:53:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:55:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:57:10] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:00] PROBLEM - SSH on ms-be2034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T1200). [12:00:04] dcausse: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:00:15] o/ [12:00:17] \o/ [12:00:26] dcausse: will you self-deploy, or should I? 
[12:00:39] Urbanecm: I can deploy [12:00:51] please do, and let me know when over, so i can do some more stuff too :) [12:01:00] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 12.91 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [12:01:18] (03CR) 10DCausse: [C: 03+2] Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse) [12:01:51] Urbanecm: I have to wait on jenkins for this^ [12:02:01] ah, so i can go with configs now dcausse ? [12:02:07] sure [12:02:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:04:09] (03CR) 10Urbanecm: [C: 03+2] Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658225 (https://phabricator.wikimedia.org/T272202) (owner: 10Urbanecm) [12:04:26] RECOVERY - SSH on ms-be2034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:05:06] (03Merged) 10jenkins-bot: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658225 (https://phabricator.wikimedia.org/T272202) (owner: 10Urbanecm) [12:05:17] (03PS2) 10Urbanecm: Enable SandboxLink at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658073 (https://phabricator.wikimedia.org/T272796) [12:05:23] (03CR) 10Urbanecm: [C: 03+2] Enable SandboxLink at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658073 (https://phabricator.wikimedia.org/T272796) 
(owner: 10Urbanecm) [12:06:14] (03Merged) 10jenkins-bot: Enable SandboxLink at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658073 (https://phabricator.wikimedia.org/T272796) (owner: 10Urbanecm) [12:07:00] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 693eaec20a24620c2a709c8bac707c0d7af3436b: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T272202) (duration: 01m 01s) [12:07:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:06] T272202: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T272202 [12:07:27] (03CR) 10Urbanecm: [C: 03+2] frwiki: Change back to normal logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658075 (https://phabricator.wikimedia.org/T272700) (owner: 10Urbanecm) [12:07:39] (03PS2) 10Urbanecm: frwiki: Change back to normal logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658075 (https://phabricator.wikimedia.org/T272700) [12:07:43] (03CR) 10Urbanecm: [C: 03+2] frwiki: Change back to normal logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658075 (https://phabricator.wikimedia.org/T272700) (owner: 10Urbanecm) [12:08:30] (03Merged) 10jenkins-bot: frwiki: Change back to normal logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658075 (https://phabricator.wikimedia.org/T272700) (owner: 10Urbanecm) [12:10:38] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:34] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:21] (03PS1) 10Urbanecm: Revert "Enable SandboxLink at viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658266 (https://phabricator.wikimedia.org/T272796) [12:13:51] (03PS2) 10Urbanecm: Revert "Enable SandboxLink at viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658266 (https://phabricator.wikimedia.org/T272796) [12:13:54] (03CR) 10Urbanecm: [C: 03+2] Revert "Enable SandboxLink at viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658266 (https://phabricator.wikimedia.org/T272796) (owner: 10Urbanecm) [12:15:06] (03Merged) 10jenkins-bot: Revert "Enable SandboxLink at viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658266 (https://phabricator.wikimedia.org/T272796) (owner: 10Urbanecm) [12:15:06] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:40] (03PS2) 10Urbanecm: Enable SandboxLink on Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657993 (https://phabricator.wikimedia.org/T272780) (owner: 10Evrifaessa) [12:15:45] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 75aa32fd5aee1feebe8a97360068da55cbcf06d8: frwiki: Change back to normal logo (T272700) (duration: 01m 07s) [12:15:46] (03CR) 10Urbanecm: [C: 03+2] Enable SandboxLink on Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657993 (https://phabricator.wikimedia.org/T272780) (owner: 10Evrifaessa) [12:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:56] T272700: Remove birthday logo on French Wikipedia - https://phabricator.wikimedia.org/T272700 [12:17:06] (03Merged) 10jenkins-bot: Enable SandboxLink on Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657993 (https://phabricator.wikimedia.org/T272780) (owner: 10Evrifaessa) [12:19:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 89d072378e16b0410d963deca2fd766c1406b5b6: Enable SandboxLink on Turkish Wikivoyage (T272780) (duration: 01m 05s) [12:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:10] T272780: Enable SandboxLink on Turkish Wikivoyage - https://phabricator.wikimedia.org/T272780 [12:20:06] (03PS2) 10Urbanecm: Defining wgSitename for trwikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657992 (https://phabricator.wikimedia.org/T272779) (owner: 10Evrifaessa) [12:20:10] (03CR) 10Urbanecm: [C: 03+2] Defining wgSitename for trwikivoyage. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/657992 (https://phabricator.wikimedia.org/T272779) (owner: 10Evrifaessa) [12:20:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:21:56] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:22:01] (03Merged) 10jenkins-bot: Defining wgSitename for trwikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657992 (https://phabricator.wikimedia.org/T272779) (owner: 10Evrifaessa) [12:22:34] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:23:45] (03PS1) 10Muehlenhoff: Remove obsolete tmpreaper Puppet classes [puppet] - 10https://gerrit.wikimedia.org/r/658271 (https://phabricator.wikimedia.org/T272559) [12:23:48] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 177339d96616b5941dbeb2c90ca6aa0be90e3b5a: Defining wgSitename for trwikivoyage (T272779) (duration: 01m 00s) [12:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:52] T272779: In some places, trwikivoyage displays the project's default name "Wikivoyage" instead of the localized "Vikigezgin" - https://phabricator.wikimedia.org/T272779 [12:23:56] (03PS2) 10Urbanecm: Resize the logo of Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657998 (https://phabricator.wikimedia.org/T272784) (owner: 10Evrifaessa) [12:24:04] (03CR) 10Urbanecm: [C: 03+2] Resize the logo of Turkish Wikivoyage. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/657998 (https://phabricator.wikimedia.org/T272784) (owner: 10Evrifaessa) [12:24:07] (03PS2) 10Urbanecm: Set $wgCategoryCollation = uca-tr on trwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657997 (https://phabricator.wikimedia.org/T272783) (owner: 10Evrifaessa) [12:24:10] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:24:13] (03CR) 10Urbanecm: [C: 03+2] Set $wgCategoryCollation = uca-tr on trwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657997 (https://phabricator.wikimedia.org/T272783) (owner: 10Evrifaessa) [12:25:18] (03Merged) 10jenkins-bot: Resize the logo of Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657998 (https://phabricator.wikimedia.org/T272784) (owner: 10Evrifaessa) [12:25:38] (03CR) 10Urbanecm: [C: 04-1] Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657971 (https://phabricator.wikimedia.org/T272776) (owner: 10Evrifaessa) [12:25:46] (03Merged) 10jenkins-bot: Set $wgCategoryCollation = uca-tr on trwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657997 (https://phabricator.wikimedia.org/T272783) (owner: 10Evrifaessa) [12:27:37] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: d34cb3205a58d5ac50800f2f218af6213f74f5e7: Resize the logo of Turkish Wikivoyage (T272784) (duration: 00m 54s) [12:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:44] T272784: Resize the logo of Turkish Wikivoyage - https://phabricator.wikimedia.org/T272784 [12:29:13] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: bcc7ad7acf721a5e0521bbecfe6df8671ac1822c: Set $wgCategoryCollation = uca-tr on trwikivoyage (T272783) 
(duration: 00m 57s) [12:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:18] T272783: Set $wgCategoryCollation = uca-tr on trwikivoyage - https://phabricator.wikimedia.org/T272783 [12:30:28] !log [urbanecm@mwmaint1002 ~]$ mwscript updateCollation.php --wiki=trwikivoyage --previous-collation=uppercase # T272783 [12:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:00] (03PS2) 10Urbanecm: Adding namespace aliases on arbcom-ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656964 (https://phabricator.wikimedia.org/T272292) (owner: 10Luke081515) [12:31:06] (03CR) 10Urbanecm: [C: 03+2] Adding namespace aliases on arbcom-ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656964 (https://phabricator.wikimedia.org/T272292) (owner: 10Luke081515) [12:31:15] (03CR) 10jerkins-bot: [V: 04-1] Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse) [12:32:00] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:32:18] dcausse: ^^^ [12:32:23] sigh... 
[12:32:32] failure is unrelated [12:32:39] (03CR) 10Marostegui: [C: 03+1] mariadb: Remove mariadb module mysqld_safe [puppet] - 10https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo) [12:32:51] so re+2 and wait :/ [12:33:02] (03CR) 10DCausse: Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse) [12:33:04] (03CR) 10Urbanecm: [C: 03+2] Create Contact page for Ombuds commission at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) (owner: 10Luke081515) [12:33:07] (03CR) 10DCausse: [C: 03+2] Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse) [12:33:36] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 148 probes of 592 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:34:11] (03CR) 10Jcrespo: [C: 03+2] admin: Provide Superset access to Amrutha (WMDE intern) [puppet] - 10https://gerrit.wikimedia.org/r/658233 (https://phabricator.wikimedia.org/T271725) (owner: 10Jcrespo) [12:35:37] (03Merged) 10jenkins-bot: Adding namespace aliases on arbcom-ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656964 (https://phabricator.wikimedia.org/T272292) (owner: 10Luke081515) [12:35:49] (03Merged) 10jenkins-bot: Create Contact page for Ombuds commission at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) (owner: 10Luke081515) [12:37:12] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 
833833385f1cf02a4578edb9b5108d173bdf30bd: Adding namespace aliases on arbcom-ruwiki (T272292) (duration: 00m 57s) [12:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:18] T272292: arbcom-ru.wikipedia.org: Adding an aliases for namespaces - https://phabricator.wikimedia.org/T272292 [12:37:35] jynus: still on duty, or should we update the topic? :) [12:37:54] Urbanecm, it changes today, although we can keep it until later [12:38:01] do you have any needs? [12:38:19] jynus: not really, just wondering :) [12:38:38] don't worry, SREs have a meeting later so we will update then :-) [12:38:46] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:56] cool :) [12:39:10] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 104 probes of 590 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:41:08] !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=arbcom_ruwiki --fix # T272292 [12:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:27] !log urbanecm@deploy1001 Synchronized wmf-config/MetaContactPages.php: 7a6a60fcaa635a8f891a6d09f3611f8620490497: Create Contact page for Ombuds commission at Meta-Wiki (T271828) (duration: 01m 00s) [12:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:32] T271828: Create Contact page for Ombuds commission at Meta-Wiki - https://phabricator.wikimedia.org/T271828 [12:43:06] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Patch-For-Review: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) 05Open→03Resolved Extra needed privileges have 
been provided: https://ldap.toolforge.org/user/amy-wmde, closing as resolved. You c... [12:44:02] RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:44:12] (03PS2) 10Urbanecm: Revert "Add fiwiki 500k temporary logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655282 (owner: 10Majavah) [12:44:37] (03CR) 10Urbanecm: [C: 03+2] "let's assume people use the new logos now; besides, the URL should stay in cache for some time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655282 (owner: 10Majavah) [12:44:50] ah thanks Urbanecm, totally forgot those [12:44:56] np :) [12:45:03] I think 14 days is more than enough for this purpose [12:45:31] (03Merged) 10jenkins-bot: Revert "Add fiwiki 500k temporary logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655282 (owner: 10Majavah) [12:47:08] !log urbanecm@deploy1001 Synchronized static/images/project-logos/: 6a4cbe662655edaa4f6c36e69877766a6a48d828: Revert "Switch fiwiki to their 500k temporary logo!": delete temporary logo files (duration: 00m 57s) [12:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:51] Majavah: I'm intentionally going to not purge the removed URIs, and I'm leaving them at the mercy of our caching infra [12:48:48] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [12:49:21] dcausse: it seems i finally exhausted the long list of things to deploy i had, so once it merges, feel free to deploy it :) [12:49:41] Urbanecm: ok, thanks! 
:) [12:51:08] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:03:08] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:04:04] (03PS11) 10Jbond: diffscan: pyhotnify [puppet] - 10https://gerrit.wikimedia.org/r/634572 [13:04:22] PROBLEM - MariaDB Replica Lag: pc1 on pc2007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 731.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:06:57] marostegui, all pc* hosts seems to be lagging on codfw [13:07:12] let's see.. 
[13:07:32] qps halved [13:07:40] (03Merged) 10jenkins-bot: Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse) [13:07:42] (03PS4) 10A2569875: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) [13:08:09] https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=2&orgId=1&from=now-12h&to=now&var-server=pc1008&var-port=9104 [13:08:24] (03PS5) 10A2569875: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612) [13:08:58] PROBLEM - MariaDB Replica Lag: pc3 on pc2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 509.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:09:09] I am wondering if it could be something with the deployment [13:09:23] marostegui, write pc traffic seems to have doubled on eqiad [13:09:31] yeah, see my paste above [13:09:36] let me check the binlogs [13:09:40] to see if there's something obvious [13:09:52] Urbanecm: anything on the deployment that can hit parsercache? [13:11:07] marostegui: I was about to ship something, should I wait? [13:11:13] dcausse: please wait [13:11:15] sure [13:12:23] marostegui: I updated category collation (wgCategoryCollation) at trwikivoyage. That updates categorylinks, but it MIGHT cause re-render. 
However, trwikivoyage is a very tiny wiki, so I doubt that's the cause, even if it does re-render pages automagically [13:13:23] (might=I'm not sure) [13:13:27] anomaly started around 12:12, but peaked at 12:42 [13:14:25] it is almost impossible to get anything interesting from pc binlogs :( [13:14:49] marostegui: but I did only pretty standard changes, and i never observed them to hurt parsercache in any way [13:16:40] (03Abandoned) 10Matthias Mullie: Guard against this file being included twice [extensions/WikibaseMediaInfo] (wmf/1.35.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655927 (https://phabricator.wikimedia.org/T271933) (owner: 10Cparle) [13:17:26] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=6&orgId=1&refresh=5m&var-server=pc1008&var-datasource=thanos&var-cluster=mysql&from=now-24h&to=now [13:17:31] something new is being stored on pc? [13:17:53] I definitely didn't enable any new feature [13:17:57] ah, nevermind that graph, that's not the partition [13:19:51] (03CR) 10Jbond: "Sorry for the delay have made some updates but needs testing" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634572 (owner: 10Jbond) [13:20:31] What I am seeing on the last binlogs are mostly deletes, so maybe some invalidation? [13:24:10] are you sure about deletes? 
on graph it seems to be REPLACEs [13:24:36] From what I can see on the binlogs (there are so many) on the last ones there are mostly deletes, and before mostly replaces [13:24:51] still scanning the "recent" ones [13:25:40] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/657770 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff) [13:25:46] the ones starting at .12 have replaces [13:26:09] I am basing on https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=2&orgId=1&var-server=pc1007&var-port=9104&from=1611408361871&to=1611581161871 [13:26:13] but not in a particular wiki [13:26:28] deletes until 19 h yesterday [13:26:30] yes, it is the same on all of them [13:26:36] RECOVERY - Logstash rate of ingestion percent change compared to yesterday #o11y on alert1001 is OK: (C)210 ge (W)150 ge 103.9 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [13:26:38] but spikes seems to be lately replaces [13:26:47] enwiki, enwiktionary, commonswiki [13:27:14] (as in, with same pattern as normal traffic) [13:27:45] the only thing that matches on SAL are: [13:27:46] 12:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 75aa32f: frwiki: Change back to normal logo (T272700) (duration: 01m 07s) [13:27:46] 12:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 693eaec: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T272202) (duration: 01m 01s) [13:27:47] T272700: Remove birthday logo on French Wikipedia - https://phabricator.wikimedia.org/T272700 [13:27:47] T272202: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T272202 [13:27:52] but not sure if it can be related in anyway [13:28:16] marostegui: I'm 99% convinced it can't [13:28:24] I don't think it should be that, I am going to check traffic patterns [13:28:36] Urbanecm: yeah, I 
agree [13:29:20] marostegui, almost definitely traffic, see: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1 [13:29:27] traffic as in, web traffic [13:29:37] (03CR) 10Alexandros Kosiaris: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris) [13:29:46] jynus: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=26&orgId=1 [13:29:48] this is pretty crazy [13:29:50] there is a small increase in throughput, and a high increase in latency [13:31:02] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:48] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:34:26] (03PS1) 10Alexandros Kosiaris: Remove linkrecommendation-external [dns] - 10https://gerrit.wikimedia.org/r/658303 (https://phabricator.wikimedia.org/T258978) [13:37:58] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:39:42] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:40:54] (03PS1) 10ArielGlenn: handle backwards searches for bbz2 blocks in tiny files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658305 [13:40:56] (03PS1) 10ArielGlenn: update tests for different distros and for split-bz2 using local binaries [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658306 [13:44:26] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:47] (03PS1) 10Filippo Giunchedi: alertmanager: add JSON logging of all notifications [puppet] - 10https://gerrit.wikimedia.org/r/658307 (https://phabricator.wikimedia.org/T272474) [13:57:51] (03PS1) 10Filippo Giunchedi: rsyslog: send AM notifications logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/658308 (https://phabricator.wikimedia.org/T272474) [13:58:58] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Marostegui) >>! In T272559#6766714, @jcrespo wrote: > Comments for persistence-related modules: but please @Marostegui @Kormat comment too. > > * profile::proxysql > I wrote this for... 
[14:01:02] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 54 probes of 590 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:10:41] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/657560 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [14:18:51] (03CR) 10Jbond: "This look fine however (not in scope if this change) i think ultimately we should drop nginx altogether and updated to use envoy for tls t" [puppet] - 10https://gerrit.wikimedia.org/r/657782 (owner: 10Muehlenhoff) [14:18:58] (03CR) 10Jbond: [C: 03+1] Adapt proxy setting in debmonitor nginx site for CAS [puppet] - 10https://gerrit.wikimedia.org/r/657782 (owner: 10Muehlenhoff) [14:20:18] PROBLEM - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 60202 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops [14:20:22] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:24] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/657782 (owner: 10Muehlenhoff) [14:22:59] (03CR) 10Jbond: "Its unclear why this is required" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657793 (owner: 10Muehlenhoff) [14:24:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove linkrecommendation-external [dns] - 10https://gerrit.wikimedia.org/r/658303 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris) [14:24:30] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:30] (03CR) 10Jbond: debmonitor: Don't include debmonitor_static for the internal listener (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657795 (owner: 10Muehlenhoff) [14:25:49] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [14:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:53] (03CR) 10Ottomata: [C: 03+2] eventgate-main - bump to 2021-01-22-173634-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/657885 (https://phabricator.wikimedia.org/T262226) (owner: 10Ottomata) [14:26:22] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:12] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 592 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:28:41] (03PS2) 10Ottomata: Remove migrated EventLoggingSchemas overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657688 (https://phabricator.wikimedia.org/T259163) [14:28:44] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:28:44] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[14:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:26] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:48] !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:55] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [14:33:56] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [14:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:38] (03CR) 10Muehlenhoff: debmonitor: Don't include debmonitor_static for the internal listener (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657795 (owner: 10Muehlenhoff) [14:35:42] !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox [14:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:00] (03CR) 10Ottomata: [C: 03+2] Remove migrated EventLoggingSchemas overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657688 (https://phabricator.wikimedia.org/T259163) (owner: 10Ottomata) [14:37:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:37:46] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove 2 Remove migrated EventLoggingSchemas overrides - T259163, T267352 (duration: 00m 56s) [14:37:48] PROBLEM - Check systemd state on registry2002 is 
CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:52] T259163: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 [14:37:52] T267352: UniversalLanguageSelector Event Platform Migration - https://phabricator.wikimedia.org/T267352 [14:39:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:41:03] (03PS1) 10Ottomata: Migrate SpecialMuteSubmit to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658314 (https://phabricator.wikimedia.org/T268517) [14:43:58] (03CR) 10Ottomata: [C: 03+2] Migrate SpecialMuteSubmit to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658314 (https://phabricator.wikimedia.org/T268517) (owner: 10Ottomata) [14:47:34] (03PS1) 10Ottomata: Revert "Migrate SpecialMuteSubmit to Event Platform on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658343 [14:49:52] (03CR) 10Ottomata: [C: 03+2] Revert "Migrate SpecialMuteSubmit to Event Platform on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658343 (owner: 10Ottomata) [14:53:10] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399) (owner: 10Bstorm) [14:53:18] (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305) [14:53:20] (03PS8) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 
(https://phabricator.wikimedia.org/T272305) [14:53:22] (03PS1) 10Giuseppe Lavagetto: httpbb: Add test for gzipping of static css files. [puppet] - 10https://gerrit.wikimedia.org/r/658317 (https://phabricator.wikimedia.org/T272305) [14:57:18] (03CR) 10Volans: "First pass" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657877 (owner: 10Jbond) [14:58:24] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:53] (03CR) 10Volans: sre.misc-clusters.thumbor: create batch action cook book for thumbor (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [15:02:36] (03PS1) 10Elukey: Revert "role::analytics_test_cluster::client: Upgrade to Bigtop" [puppet] - 10https://gerrit.wikimedia.org/r/658344 [15:03:33] (03CR) 10Elukey: [C: 03+2] Revert "role::analytics_test_cluster::client: Upgrade to Bigtop" [puppet] - 10https://gerrit.wikimedia.org/r/658344 (owner: 10Elukey) [15:04:52] (03CR) 10Volans: "Apparently the iface label is now nullable and our script are setting it to empty string, see https://netbox.wikimedia.org/extras/changelo" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov) [15:08:42] (03PS1) 10Rosalie Perside (WMDE): Remove Wikibase.NewItemIdFormatter log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658321 (https://phabricator.wikimedia.org/T268870) [15:09:00] (03Abandoned) 10Awight: Lower maxHighlightLineLength limit to 5000 [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657308 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch) [15:09:57] !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [15:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:39] !log elukey@cumin1001 
END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [15:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:13] very nice [15:13:40] 10SRE, 10observability, 10CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (10fgiunchedi) I can confirm I'm able to reproduce this, AFAICS the problematic case is an XHR from thanos UI with an expired SSO session. In this case the XHR will get a 302 to... [15:15:08] jouncebot: now [15:15:08] No deployments scheduled for the next 2 hour(s) and 44 minute(s) [15:16:03] !log re-opening EU Backport window to ship pending patches [15:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:51] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline." (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond) [15:18:03] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: send w3creportingapi logs to indexes with custom schema [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) (owner: 10Cwhite) [15:20:22] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27640/console" [puppet] - 10https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:20:49] !log dcausse@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/CirrusSearch/: Add an option to limit the size of the file_text field: T271493 (duration: 00m 58s) [15:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:53] T271493: Implement 50kb limit on file text indexing for to reduce increasing commonswiki_file on-disk size - https://phabricator.wikimedia.org/T271493 [15:21:34] (03PS1) 10DCausse: Revert "Add an option to limit the size of the 
file_text field" [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658345 [15:21:42] (03CR) 10DCausse: [V: 03+2 C: 03+2] Revert "Add an option to limit the size of the file_text field" [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658345 (owner: 10DCausse) [15:22:41] sigh [15:23:43] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] logstash: enable curator to accept custom age filters [puppet] - 10https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:23:43] !log dcausse@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/CirrusSearch/: revert: Add an option to limit the size of the file_text field: T271493 (duration: 01m 05s) [15:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:43] (03CR) 10Hnowlan: start using imposm as OSM sync tool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [15:29:09] 10SRE, 10Cloud-VPS, 10cloud-services-team (Kanban): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10fgiunchedi) [15:29:25] 10SRE, 10Platform Engineering (Icebox), 10User-Eevans: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (10fgiunchedi) [15:30:11] 10Puppet, 10SRE, 10observability, 10User-jbond: PuppetDB grafana graphs not matching logs - https://phabricator.wikimedia.org/T265649 (10fgiunchedi) [15:31:13] 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 6 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10fgiunchedi) [15:31:23] (03PS1) 10Volans: tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347 [15:31:38] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[15:33:36] (03CR) 10jerkins-bot: [V: 04-1] tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347 (owner: 10Volans) [15:34:01] what? Riccardo got -1 from jenkins? [15:34:07] * elukey braces for impact [15:34:18] * elukey runs away [15:34:20] :D [15:34:49] usually it's the other way round [15:39:06] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:54] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild [15:42:55] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild [15:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:18] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild [15:44:18] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild [15:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:01] (03PS2) 10Hnowlan: maps: reimage maps1009 with buster. [puppet] - 10https://gerrit.wikimedia.org/r/656404 (https://phabricator.wikimedia.org/T238753) [15:48:03] (03PS2) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) [15:48:45] (03CR) 10jerkins-bot: [V: 04-1] maps: make maps1009 a new, independent buster master. 
[puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [15:54:16] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:58:04] (03CR) 10DCausse: bump memory for flink processes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657941 (owner: 10Mstyles) [15:58:48] the netbox DNS might be me, I've set some ms-be2* hosts to active [16:01:19] ah no, linkrecommendation [16:05:38] (03PS1) 10Bstorm: Revert "toolforge-k8s: AdmissionsConfiguration is GA after 1.17" [puppet] - 10https://gerrit.wikimedia.org/r/658366 [16:06:00] RECOVERY - MariaDB Replica Lag: pc1 on pc2007 is OK: OK slave_sql_lag Replication lag: 57.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:06:48] (03PS2) 10Volans: tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347 [16:06:59] (03CR) 10Bstorm: [C: 03+2] Revert "toolforge-k8s: AdmissionsConfiguration is GA after 1.17" [puppet] - 10https://gerrit.wikimedia.org/r/658366 (owner: 10Bstorm) [16:07:01] (03CR) 10David Caro: [C: 03+2] Revert "toolforge-k8s: AdmissionsConfiguration is GA after 1.17" [puppet] - 10https://gerrit.wikimedia.org/r/658366 (owner: 10Bstorm) [16:07:06] PROBLEM - tileratorui on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [16:07:31] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild [16:07:31] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild [16:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:14] godog: are you about to run the cookbook to fix the above uncommitted changes? [16:08:45] (03PS3) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) [16:08:47] (03CR) 10jerkins-bot: [V: 04-1] tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347 (owner: 10Volans) [16:08:50] I see a start from akosiaris but never an end... maybe you forgot to confirm akoopal ? [16:09:09] *akosiaris ^^^ (sorry for the ping akoo.pal ) [16:09:49] (03PS3) 10Volans: tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347 [16:11:44] 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10JFishback_WMF) Untagging #security-team for now, but please feel free to add back if there is something else needed. [16:13:41] volans: no I ran the test cookbook to double check, in a meeting now [16:13:51] k [16:17:31] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (10lmata) Hi Joe, Let us know if there is any support you'd like from our team on this task, otherwise moving to Radar for now. [16:20:27] 10SRE, 10netops, 10observability: Add Icinga check for SRX cluster status - https://phabricator.wikimedia.org/T271298 (10lmata) Hi Arzhel, Please let me know if there is any specific support you need for this task, moving to Radar meanwhile. Thanks! 
[16:22:11] (03PS1) 10Bstorm: toolforge-k8s: update the api version [puppet] - 10https://gerrit.wikimedia.org/r/658350 [16:23:20] (03CR) 10David Caro: [C: 03+2] toolforge-k8s: update the api version [puppet] - 10https://gerrit.wikimedia.org/r/658350 (owner: 10Bstorm) [16:24:34] RECOVERY - MariaDB Replica Lag: pc3 on pc2009 is OK: OK slave_sql_lag Replication lag: 55.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:27:30] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.438 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:28:02] RECOVERY - Disk space on maps1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops [16:29:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:31:20] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:36:06] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:37:03] (03CR) 10CRusnov: "> Patch Set 4:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov) [16:38:18] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:41] 10SRE, 10SRE-Access-Requests, 10Machine Learning Platform (Current): Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (10klausman) [16:38:51] (03CR) 10Bstorm: [C: 03+1] "I believe Toolforge just installs the package and configures it on a quick check, so this shouldn't impact cloud. I also don't see it in i" [puppet] - 10https://gerrit.wikimedia.org/r/658271 (https://phabricator.wikimedia.org/T272559) (owner: 10Muehlenhoff) [16:39:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:40:30] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:48] volans: indeed. I just typed "go" at the prompt right now. 
I guess it went (wherever it was meant to "go") :-) [16:41:12] akosiaris: rotfl [16:41:12] (03CR) 10Jbond: [C: 03+1] "thanks lgtm then 😊" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657795 (owner: 10Muehlenhoff) [16:41:19] 10SRE, 10Analytics-Clusters, 10vm-requests: Eq: new Druid test VM for analytics - https://phabricator.wikimedia.org/T266771 (10Ottomata) [16:41:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:43:39] (03PS14) 10Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 [16:50:18] 10SRE, 10observability, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520 (10lmata) 05Open→03Resolved a:03lmata Hello, 3M delay seems like... [16:50:58] (03PS1) 10Jason Linehan: Enables MediaWiki client error instrument on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585) [16:52:50] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:53:38] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) I am working on it, I am dependant on Dell. I do need to update all the f/w and idrac today. [16:54:14] (03CR) 10Jbond: "Thanks for the review, however i'm not sure if cookbooks, cumin is currently the right tool do do what i want. 
which is basic a way to ru" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657877 (owner: 10Jbond) [16:56:11] (03PS5) 10Jbond: WIP) sre.apt.audit: produce a report of manually packages [cookbooks] - 10https://gerrit.wikimedia.org/r/657877 [16:57:57] (03CR) 10Muehlenhoff: "LGTM, but one thing I noticed is that profile::maps::osm_master needs to be adapted for Buster as well; we need to update the $pgversion c" [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [16:58:48] (03CR) 10jerkins-bot: [V: 04-1] WIP) sre.apt.audit: produce a report of manually packages [cookbooks] - 10https://gerrit.wikimedia.org/r/657877 (owner: 10Jbond) [17:00:06] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:05] (03CR) 10Jason Linehan: "Playing it safe by just knocking out the enwiki off switch for this test run." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [17:01:22] (03CR) 10Jbond: "thanks for the review comments inline" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [17:01:46] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:02:10] (03PS1) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 [17:02:47] (03CR) 10David Caro: wmcs: first try on creating a new etcd for toolforge (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro) [17:04:03] (03PS2) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 [17:07:12] (03CR) 10jerkins-bot: [V: 04-1] wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro) [17:07:49] (03PS3) 10David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 [17:10:01] (03CR) 10jerkins-bot: [V: 04-1] wmcs: first try on creating a new etcd for toolforge [cookbooks] - 10https://gerrit.wikimedia.org/r/658357 (owner: 10David Caro) [17:10:03] (03CR) 10Jdlrobson: [C: 03+1] Enables MediaWiki client error instrument on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [17:19:50] (03PS5) 10Jbond: dns: update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 [17:20:05] (03CR) 10Jbond: "updated thanks" (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond) [17:22:18] (03CR) 10Jbond: [C: 03+1] "lgtm and thanks 😊" 
[software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347 (owner: 10Volans) [17:22:37] (03CR) 10jerkins-bot: [V: 04-1] dns: update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond) [17:22:55] (03CR) 10Volans: [C: 03+2] tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347 (owner: 10Volans) [17:25:14] (03Merged) 10jenkins-bot: tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347 (owner: 10Volans) [17:25:20] (03CR) 10Jbond: "resolve nits" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657102 (owner: 10Jbond) [17:26:19] (03PS6) 10Jbond: dns: update DNS to support multiple namservers [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 [17:29:45] (03PS1) 10Elukey: Add more users to the Hadoop Backup cluster (no ssh access) [puppet] - 10https://gerrit.wikimedia.org/r/658394 (https://phabricator.wikimedia.org/T260411) [17:31:48] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:57] 10SRE, 10SRE-Access-Requests, 10Machine Learning Platform (Current): Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (10jbond) @klausman (adding a comment here incase it was missed from the meeting) when this access is revoked and the hacking is over... [17:39:04] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:39:23] (03CR) 10Elukey: [C: 03+2] Add more users to the Hadoop Backup cluster (no ssh access) [puppet] - 10https://gerrit.wikimedia.org/r/658394 (https://phabricator.wikimedia.org/T260411) (owner: 10Elukey) [17:42:18] (03CR) 10Hnowlan: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [17:42:20] (03PS4) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) [17:42:26] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 240495320 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:44:52] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 524296 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:46:10] (03PS1) 10Ladsgroup: wmcs: Migrate hiera() to lookup() and set datatypes in nfs primary [puppet] - 10https://gerrit.wikimedia.org/r/658397 (https://phabricator.wikimedia.org/T209953) [17:48:09] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27644/console" [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [17:48:19] (03CR) 10Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27643/" [puppet] - 10https://gerrit.wikimedia.org/r/658397 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup) [17:48:52] PROBLEM - Ensure local MW versions match expected deployment on deploy2002 is CRITICAL: CRITICAL: Missing 1 sites from wikiversions. 
966 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [17:49:27] (03PS1) 10RobH: updating r740xd2 skus [software] - 10https://gerrit.wikimedia.org/r/658398 [17:49:28] PROBLEM - Ensure local MW versions match expected deployment on deploy1002 is CRITICAL: CRITICAL: Missing 1 sites from wikiversions. 966 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [17:49:53] (03CR) 10RobH: [C: 03+2] updating r740xd2 skus [software] - 10https://gerrit.wikimedia.org/r/658398 (owner: 10RobH) [17:50:36] (03Merged) 10jenkins-bot: updating r740xd2 skus [software] - 10https://gerrit.wikimedia.org/r/658398 (owner: 10RobH) [17:50:46] huh [17:50:50] it auto merged in that repo [17:50:53] i did... not expect that [17:51:19] its just the ser software repo for local use on laptops so not a big deal, just unexpected. [17:51:27] sre [17:53:40] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Remove Wikibase.NewItemIdFormatter log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658321 (https://phabricator.wikimedia.org/T268870) (owner: 10Rosalie Perside (WMDE)) [17:58:12] 10SRE, 10SRE-Access-Requests, 10Machine Learning Platform (Current): Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (10klausman) >>! In T272687#6774181, @jbond wrote: > @klausman (adding a comment here incase it was missed from the meeting) when thi... [18:00:04] ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T1800). [18:04:22] 10SRE, 10ops-eqiad: ms-be1046 stuck on reboot - https://phabricator.wikimedia.org/T272396 (10Cmjohnson) I also attempted to update bios, power f/w, idrac, and all were failed due to the server's inability to power up. A dell ticket has been created. 
SR1049823171 [18:05:20] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) I failed to re-connect the mgmt cable after getting it to power on and was not able to remotely access the server to get the logs for the Dell tech. I connected everything, updated the bios and... [18:06:03] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) Thanks Chris, any chances that we can get the host to boot up at least so MySQL replication can catch up a bit. Thank you! [18:06:47] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) @marostegui it should be accessible now [18:07:18] 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): Use lookup() instead of hiera() in Puppet - https://phabricator.wikimedia.org/T209953 (10Dzahn) [18:07:33] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) There is just let memory at the moment [18:08:25] 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): Use lookup() instead of hiera() in Puppet - https://phabricator.wikimedia.org/T209953 (10Dzahn) [18:16:56] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Cmjohnson) Dell ticket number SR1049824647 [18:18:11] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[18:19:10] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:21:53] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Legoktm) p:05Triage→03Lowest [18:22:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:22:44] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:25:40] 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Cmjohnson) The issue that Dell has with this is we cannot determine which DIMM is failed. The hardware logs all look good and do not indicate an... 
[18:25:45] (03CR) 10Jcrespo: [C: 03+1] "This was approved today:" [puppet] - 10https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: 10ArielGlenn) [18:25:53] (03PS7) 10Jcrespo: add platform engineering folks to snapshot and dumpsdata server access [puppet] - 10https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: 10ArielGlenn) [18:26:20] (03CR) 10jerkins-bot: [V: 04-1] add platform engineering folks to snapshot and dumpsdata server access [puppet] - 10https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: 10ArielGlenn) [18:27:34] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27645/console" [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan) [18:28:04] 10SRE, 10Wikimedia-Mailing-lists, 10I18n: Mailman password reminder mail (and other texts) has broken encoding in Czech - https://phabricator.wikimedia.org/T271123 (10Legoktm) >>! In T271123#6752321, @Mormegil wrote: > Well, yes, for Czech, the subscription confirmation e-mail seems to be sent correctly, now... [18:29:24] (03CR) 10Jcrespo: [C: 04-1] "Oh, we need to update GIDs on rebase, as changes have happened since then. :-( +1 otherwise." [puppet] - 10https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: 10ArielGlenn) [18:29:34] 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Cmjohnson) I am attaching the TSR report so you will see none of the h/w logs suggest there is an issue. 
{F34022892} [18:30:16] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:36] 10SRE, 10ops-eqiad, 10DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (10Marostegui) Thanks Chris - I can now access the server and will start mysql so it can catch up on replication!. Let's coordinate to install the new memory once it arrives. Thanks again [18:31:42] (03CR) 10Jcrespo: [C: 04-1] "```" [puppet] - 10https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: 10ArielGlenn) [18:33:03] (03PS3) 10Hnowlan: mtail: create separate metrics histogram for REST API requests [puppet] - 10https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) [18:33:34] (03PS1) 10Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - 10https://gerrit.wikimedia.org/r/658407 [18:33:52] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 317251272 and 82 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:34:07] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1411.eqiad.wmnet with reason: REIMAGE [18:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:16] (03CR) 10jerkins-bot: [V: 04-1] O:idp: update apero_cas::service so its a bit more intuitive [puppet] - 10https://gerrit.wikimedia.org/r/658407 (owner: 10Jbond) [18:35:02] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [18:36:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1412.eqiad.wmnet with reason: REIMAGE [18:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:11] !log dzahn@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1411.eqiad.wmnet with reason: REIMAGE [18:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:22] 10SRE, 10Dumps-Generation, 10SRE-Access-Requests, 10Patch-For-Review, 10Platform Team Workboards (Clinic Duty Team): Add all of CPT to snapshot/dumpsdata admins - https://phabricator.wikimedia.org/T271718 (10Legoktm) This was approved in today's SRE meeting. [18:38:18] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1412.eqiad.wmnet with reason: REIMAGE [18:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:25] (03CR) 10Legoktm: add platform engineering folks to snapshot and dumpsdata server access (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: 10ArielGlenn) [18:40:33] (03PS2) 10Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - 10https://gerrit.wikimedia.org/r/658407 [18:41:22] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2326.codfw.wmnet with reason: REIMAGE [18:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:58] PROBLEM - MariaDB Replica Lag: pc3 on pc2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.90 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:43:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2326.codfw.wmnet with reason: REIMAGE [18:43:24] I am going to silence all pc codfw replicas for a day [18:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:36] oh, is it happening again? 
[18:43:58] (03PS8) 10Legoktm: admin: Add platform engineering team to {snapshot,dumpsdata}-admins [puppet] - 10https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: 10ArielGlenn) [18:44:00] cool to me [18:44:16] (03PS3) 10Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - 10https://gerrit.wikimedia.org/r/658407 [18:44:24] It was a small spike, I guess it will be spiking maybe a few hours [18:44:27] I will downtime to avoid noise [18:44:33] +1, thanks [18:45:16] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2324.codfw.wmnet with reason: REIMAGE [18:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:32] (03PS4) 10Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - 10https://gerrit.wikimedia.org/r/658407 [18:47:43] alert2001? [18:48:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:49:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:50:14] 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10EBernhardson) As far as I understand it, it's not possible for the linux kernel to map a physical address back to a single dimm. It just doesn't... 
[18:50:24] 10Puppet, 10SRE, 10Patch-For-Review: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (10Legoktm) p:05Triage→03High a:03jbond [18:52:07] (03PS5) 10Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - 10https://gerrit.wikimedia.org/r/658407 [18:53:23] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27650/console" [puppet] - 10https://gerrit.wikimedia.org/r/658407 (owner: 10Jbond) [18:54:12] 10SRE, 10SRE-Access-Requests, 10Machine Learning Platform (Current): Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (10Legoktm) This was approved in today's SRE meeting with the following notes: * Approved due to the recognition of exceptional circu... [18:54:37] hi, anyone around bored and willing to help with a small thing? can you run `var_dump( ChangeTags::listDefinedTags() )` for me on production enwiki, and paste the output somewhere? [18:55:02] this is mostly the same data as on https://en.wikipedia.org/wiki/Special:Tags , but i want to know the internal order [18:55:47] (03CR) 10Jbond: [V: 03+1] "As far as i can tell PCC just shows some reordering" [puppet] - 10https://gerrit.wikimedia.org/r/658407 (owner: 10Jbond) [18:56:11] sure [18:56:29] (03PS1) 10Ottomata: eventgate-analytics-external - bump to 2021-01-25-183848-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/658410 (https://phabricator.wikimedia.org/T257237) [18:56:39] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:57:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1411.eqiad.wmnet'] ` an... [18:57:48] MatmaRex: https://phabricator.wikimedia.org/P13947 [18:58:03] thanks legoktm <3 [18:58:33] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1412.eqiad.wmnet'] ` an... [18:59:27] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics-external - bump to 2021-01-25-183848-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/658410 (https://phabricator.wikimedia.org/T257237) (owner: 10Ottomata) [19:00:05] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T1900). [19:00:05] tgr: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:14] o/ [19:00:21] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [19:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:10] tgr_: can i add a config change to the backport window? 
[19:01:18] sure [19:01:27] 10SRE, 10Analytics, 10Analytics-Kanban, 10Event-Platform: Reduce cache TTL of schema.wikimedia.org - https://phabricator.wikimedia.org/T267557 (10fdans) 05Open→03Resolved [19:01:42] tgr it's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/658356 [19:02:02] 10SRE, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (10fdans) 05Open→03Resolved [19:02:51] have added to the calendar [19:03:21] (03CR) 10Legoktm: [C: 03+2] admin: Add platform engineering team to {snapshot,dumpsdata}-admins [puppet] - 10https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: 10ArielGlenn) [19:07:03] (03PS2) 10Gergő Tisza: [beta] GrowthExperiments: set link recommendation feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657292 [19:07:14] 10SRE, 10Dumps-Generation, 10SRE-Access-Requests, 10Patch-For-Review, 10Platform Team Workboards (Clinic Duty Team): Add all of CPT to snapshot/dumpsdata admins - https://phabricator.wikimedia.org/T271718 (10Legoktm) 05Open→03Resolved This should rollout over the next 20-30 minutes. Please re-open if... [19:08:50] (03CR) 10Gergő Tisza: [C: 03+2] [beta] GrowthExperiments: set link recommendation feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657292 (owner: 10Gergő Tisza) [19:09:53] (03Merged) 10jenkins-bot: [beta] GrowthExperiments: set link recommendation feature flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657292 (owner: 10Gergő Tisza) [19:12:00] 10SRE, 10Wikimedia-Mailing-lists: Show listadmins (names or email addresses?) on main page of each mailing list - https://phabricator.wikimedia.org/T272778 (10Legoktm) p:05Triage→03Low >>! In T272778#6772072, @Ciell wrote: > We have several lists (our admin-list for instance) that do not allow open registr... 
[19:13:23] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:14:21] (03PS3) 10Ottomata: eventgate - Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237) [19:14:48] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2326.codfw.wmnet'] ` an... [19:15:45] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2324.codfw.wmnet'] ` an... 
[19:16:08] (03PS2) 10Gergő Tisza: Enables MediaWiki client error instrument on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [19:16:12] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:657292|[beta] GrowthExperiments: set link recommendation feature flags ()]] (duration: 01m 06s) [19:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:21] (03PS4) 10Ottomata: eventgate - Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237) [19:16:38] (03PS5) 10Ottomata: eventgate - Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237) [19:17:39] RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:18:11] (03CR) 10Ottomata: [C: 03+2] eventgate - Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237) (owner: 10Ottomata) [19:18:19] (03CR) 10Gergő Tisza: [C: 03+2] Enables MediaWiki client error instrument on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [19:18:23] 10SRE, 10DNS, 10Mail, 10Traffic: ITS request to update SPF & DNS Records for Trust & Safety - https://phabricator.wikimedia.org/T272750 (10Legoktm) From the Greenhouse task @Aklapper linked: >>! In T189065#4031997, faidon wrote: > This has been discussed in bigger requests a couple of times before (T10389... 
[19:18:33] 10SRE, 10DNS, 10Mail, 10Traffic: ITS request to update SPF & DNS Records for Trust & Safety - https://phabricator.wikimedia.org/T272750 (10Legoktm) p:05Triage→03Medium [19:18:47] 10SRE, 10serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10Legoktm) p:05Triage→03Low [19:19:29] (03Merged) 10jenkins-bot: Enables MediaWiki client error instrument on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [19:20:38] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [19:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:47] Jdlrobson: it's on mwdebug1001 [19:20:53] sweet on it [19:21:23] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 43552 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:22:49] tgr_: works! [19:24:11] tgr_:https://logstash.wikimedia.org/app/dashboards#/doc/logstash-*/logstash-2021.01.25?id=U3L_OncBXM-H9NFXjGWr [19:25:05] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:658356|Enables MediaWiki client error instrument on English Wikipedia (T255585)]] (duration: 01m 01s) [19:25:07] (03CR) 10CDanis: [C: 03+1] "Sigh, sorry for forgetting about another piece of this mess... retroactive LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658218 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui) [19:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:09] T255585: [EPIC] Extend client-side error logging coverage to include English Wikipedia - https://phabricator.wikimedia.org/T255585 [19:25:26] Jdlrobson: it's live. Thanks for pushing JS error logging forward! [19:25:46] sweet... 
now for the fun game of identifying broken gadgets :) [19:25:55] btw should we add the JS error channel to the things deployers should keep an eye on, or is it too noisy for that? [19:26:48] 10SRE, 10docker-pkg, 10serviceops, 10Technical-Debt: Get rid of the concept of "seed image" in docker-pkg - https://phabricator.wikimedia.org/T272154 (10Legoktm) p:05Triage→03Low [19:27:06] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (10Legoktm) p:05Triage→03Medium [19:29:46] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [19:29:47] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [19:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:11] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:31:55] 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10Legoktm) p:05Triage→03Medium Have they announced when the 7.11 release is happening? It's not clear to me how long w... 
[19:32:39] (03PS1) 10Ottomata: eventgate-* - bump to 2021-01-25-183848-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/658412 (https://phabricator.wikimedia.org/T257237) [19:32:42] (03PS1) 10Ottomata: eventgate-main - precache /mediawiki/revision/recommendation-create/1.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/658413 (https://phabricator.wikimedia.org/T262226) [19:34:47] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 47995128 and 29 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:35:03] 10SRE, 10Traffic, 10serviceops, 10Performance Issue: When logged in, loading the frwiki homepage takes a very long time - https://phabricator.wikimedia.org/T270631 (10Legoktm) 05Open→03Resolved a:03Legoktm >>! In T270631#6706316, @Legoktm wrote: > @Thibaut120094 I believe this requires editing https:... [19:35:55] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:36:17] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 522488 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:36:30] (03CR) 10Ottomata: [C: 03+2] eventgate-* - bump to 2021-01-25-183848-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/658412 (https://phabricator.wikimedia.org/T257237) (owner: 10Ottomata) [19:36:39] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . [19:36:39] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . 
[19:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:18] !log Morning deploys done [19:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:26] (03CR) 10Ottomata: [C: 03+2] eventgate-main - precache /mediawiki/revision/recommendation-create/1.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/658413 (https://phabricator.wikimedia.org/T262226) (owner: 10Ottomata) [19:37:47] > btw should we add the JS error channel to the things deployers should keep an eye on, or is it too noisy for that? [19:37:53] @tgr definitely [19:37:57] tgr_: definitely [19:38:12] I'm attending the deploy triage meeting for the train [19:38:26] but yeh I think when backporting JS changes, it's important for us to be checking the graphs post-change [19:40:08] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [19:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:16] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [19:41:16] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [19:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:39] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [19:41:39] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . 
[19:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:22] 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10EBernhardson) We have as long as we want to figure out what to do next, I don't think the day they drop 7.11 changes any... [19:44:00] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [19:44:00] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [19:44:01] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [19:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:52] (03PS1) 10Bstorm: metrics-server: upgrade to the latest version [puppet] - 10https://gerrit.wikimedia.org/r/658416 [19:48:06] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1411.eqiad.wmnet [19:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:23] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw14124.eqiad.wmnet [19:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:30] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . 
[19:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:54] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2326.codfw.wmnet [19:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:28] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2324.codfw.wmnet [19:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:26] 10SRE, 10DBA, 10Platform Engineering Roadmap Decision Making, 10Performance-Team (Radar), 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10Krinkle) [19:52:12] !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [19:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:39] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1411.eqiad.wmnet [19:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:16] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1412.eqiad.wmnet [19:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:55] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [19:54:55] !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . 
[19:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:43] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Jclark-ctr) replaced Dac cable for an-worker1119 and an-worker1131 @elukey confirmed both are seeing network [19:56:51] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:57:50] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:58:00] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2326.codfw.wmnet'] ` Of... [19:58:14] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:59:31] 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10AntiCompositeNumber) Elastic does not announce release dates in advance. 
CirrusSearch is still on ES 6, which will beco... [20:00:48] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2324.codfw.wmnet [20:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:40] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:01:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2326.codfw.wmnet'] ` Of... [20:02:25] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:04:02] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:12:37] (03PS1) 10Mforns: Refine SuggestedTagsAction schema using eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/658419 (https://phabricator.wikimedia.org/T267351) [20:12:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1410.eqiad.wmnet with reason: REIMAGE [20:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1410.eqiad.wmnet with reason: REIMAGE [20:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:50] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 65633552 and 42 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:16:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:17:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2326.codfw.wmnet with reason: REIMAGE [20:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:00] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 328304 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:19:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:19:22] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2326.codfw.wmnet with reason: REIMAGE [20:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:21] (03CR) 10Dzahn: [C: 04-1] "Yes, indeed. looks like a duplicate and this is already done" [puppet] - 10https://gerrit.wikimedia.org/r/641508 (https://phabricator.wikimedia.org/T267744) (owner: 10Herron) [20:21:04] (03Abandoned) 10Herron: admin: add ldap_only_user entry for tillmletzko-wmde [puppet] - 10https://gerrit.wikimedia.org/r/641508 (https://phabricator.wikimedia.org/T267744) (owner: 10Herron) [20:21:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2322.codfw.wmnet with reason: REIMAGE [20:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:06] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2323.codfw.wmnet with reason: REIMAGE [20:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2322.codfw.wmnet with reason: REIMAGE [20:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2323.codfw.wmnet with reason: REIMAGE [20:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:18] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:32:08] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) >>! 
In T272559#6766384, @jbond wrote: > Im not familiar with the frack set up, do they depend on our repo or do they have there own. My assumption has always been that they ah... [20:35:27] (03PS1) 10Papaul: DHCP: Add MAC address and partman for cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/658422 (https://phabricator.wikimedia.org/T271590) [20:35:28] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:35:39] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Dzahn) > Gateway Time-out for url: https://docker-registry.discovery.wmnet Gotta set HTTP_PROXY/HTTPS_PROXY env variable? [20:35:41] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [20:35:41] !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [20:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:50] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1410.eqiad.wmnet'] ` an... 
[20:37:12] (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address and partman for cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/658422 (https://phabricator.wikimedia.org/T271590) (owner: 10Papaul) [20:40:59] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1410.eqiad.wmnet [20:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:38] (03PS1) 10Papaul: Partman: Add cloudgw2002 [puppet] - 10https://gerrit.wikimedia.org/r/658423 (https://phabricator.wikimedia.org/T271590) [20:42:28] (03CR) 10Papaul: [C: 03+2] Partman: Add cloudgw2002 [puppet] - 10https://gerrit.wikimedia.org/r/658423 (https://phabricator.wikimedia.org/T271590) (owner: 10Papaul) [20:44:07] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1410.eqiad.wmnet [20:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:25] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:46:30] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2326.codfw.wmnet'] ` an... 
[20:46:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2326.codfw.wmnet [20:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:42] (03PS1) 10Papaul: Add clougw2002-dev to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/658424 (https://phabricator.wikimedia.org/T271590) [20:49:05] (03CR) 10Papaul: [C: 03+2] Add clougw2002-dev to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/658424 (https://phabricator.wikimedia.org/T271590) (owner: 10Papaul) [20:49:43] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2326.codfw.wmnet [20:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:10] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2322.codfw.wmnet'] ` an... [20:50:35] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10wiki_willy) Hi @Jgreen - it looks like we're running a bit tight on space in the Fundraising rack. In order for us to rack the servers for this install, do you have 1-2 existing se... [20:52:30] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cloudgw20... 
[20:52:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:52:59] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2323.codfw.wmnet'] ` an... [20:53:50] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:53:54] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jclark-ctr) [20:55:00] (03PS1) 10Mforns: Migrate WebUIActionsTracking schemas to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658426 (https://phabricator.wikimedia.org/T267347) [20:55:23] (03CR) 10Jcrespo: [C: 03+1] "Looking good: https://puppet-compiler.wmflabs.org/compiler1002/27642/" [puppet] - 10https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427) (owner: 10Marostegui) [20:57:12] (03CR) 10Jcrespo: [C: 03+1] wmnet: Update s4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/658213 (https://phabricator.wikimedia.org/T271427) (owner: 10Marostegui) [21:00:04] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T2100). 
[21:02:01] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) ` legoktm@registry2002:~$ time curl "https://docker-registry.discovery.wmnet/v2/_catalog?last=releng%2Fquibble-jessie-php55&n=100" {"re... [21:07:08] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 111119872 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:07:25] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1338.eqiad.wmnet with reason: REIMAGE [21:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:50] 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10Papaul) [21:08:03] !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE [21:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:05] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) ` legoktm@registry2002:~$ time curl "https://docker-registry.discovery.wmnet/v2/_catalog?last=releng%2Fquibble-jessie-php55&n=100" RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 565368 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:08:56] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Dzahn) Fwiw, i get the same timeout when doing that curl command from registry1002. 
[21:09:28] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1338.eqiad.wmnet with reason: REIMAGE [21:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:24] !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE [21:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:01] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 253265904 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:13:05] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:19:13] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudgw2002-dev.codfw.wmnet'] ` and were **ALL** successful. [21:19:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:20:13] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10Papaul) [21:21:45] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10Papaul) 05Open→03Resolved @aborrero this is complete. let me know if you have any questions. Thanks. 
[21:30:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:30:38] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:08] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:43:22] RECOVERY - MariaDB Replica Lag: pc3 on pc2009 is OK: OK slave_sql_lag Replication lag: 12.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [21:49:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:53:35] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1338.eqiad.wmnet'] ` an... [21:53:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:54:49] (03PS1) 10Legoktm: docker_registry_ha: Increase nginx proxy timeout to 120s [puppet] - 10https://gerrit.wikimedia.org/r/658436 (https://phabricator.wikimedia.org/T179696) [21:57:02] 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) In my testing of repeatedly issuing the same curl command over and over, it usually took ~35s to respond, but sometimes it took over 1m... [21:59:13] PROBLEM - Check systemd state on registry1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:00:04] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T2200). 
[22:02:29] (03CR) 10Bstorm: [C: 03+2] wikireplicas: Add new DNS names for multiinstance replicas [puppet] - 10https://gerrit.wikimedia.org/r/657155 (https://phabricator.wikimedia.org/T267376) (owner: 10Bstorm) [22:04:44] (03PS2) 10Legoktm: Drop obsolete requirements.txt and setup.py [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657954 [22:04:46] (03PS2) 10Legoktm: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657955 [22:04:48] (03PS5) 10Legoktm: Add script to mostly automate logo management [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) [22:12:23] (03CR) 10Krinkle: Add script to mostly automate logo management (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) (owner: 10Legoktm) [22:15:32] (03CR) 10Krinkle: Add script to mostly automate logo management (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) (owner: 10Legoktm) [22:17:05] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 0 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [22:17:55] (03PS6) 10Legoktm: Add script to mostly automate logo management [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) [22:18:06] (03CR) 10Legoktm: Add script to mostly automate logo management (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) (owner: 10Legoktm) [22:19:25] (03PS3) 10Cwhite: logstash: enable curator to accept custom age filters [puppet] - 10https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565) [22:23:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:23:35] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 18 down 2 https://wikitech.wikimedia.org/wiki/HAProxy [22:24:29] (03CR) 10Volans: "reply inline" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond) [22:28:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:29:16] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1338.eqiad.wmnet [22:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:50] (03CR) 10Cwhite: [C: 03+2] logstash: enable curator to accept custom age filters [puppet] - 10https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:29:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2323.codfw.wmnet [22:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:59] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2322.codfw.wmnet [22:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:51] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:31:53] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 225817080 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:34:13] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 58848 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:34:40] !log dzahn@cumin1001 
conftool action : set/pooled=yes; selector: name=mw2323.codfw.wmnet [22:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:50] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2322.codfw.wmnet [22:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:35:23] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2321.codfw.wmnet'] ` Of... [22:35:59] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:36:04] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] ` Of... [22:38:27] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[22:38:36] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] ` Of... [22:38:45] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:39:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin2001.codf... [22:39:34] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] ` Of... [22:40:24] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:41:18] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[22:41:22] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2321.codfw.wmnet'] ` Of... [22:41:39] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:43:53] (03PS1) 10Aaron Schulz: Enable "coalesceKeys" for global keys for WANCache (III) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658372 [22:44:07] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1338.eqiad.wmnet [22:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:12] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [22:47:13] RECOVERY - Check systemd state on registry1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:47:30] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[22:50:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Dzahn) [22:52:18] (03PS1) 10Andrew Bogott: cloud-vps instances: add a helper script to format & mount a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) [22:53:52] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps instances: add a helper script to format & mount a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [22:54:08] (03CR) 10Bstorm: "I haven't merged this yet partly because I'm a little fuzzy on the docs. If I merge this, is it basically just a repo update or do I have " [puppet] - 10https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) (owner: 10Bstorm) [22:55:52] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 43158016 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:56:16] (03PS2) 10Legoktm: docker_registry_ha: Increase nginx proxy timeout to 120s [puppet] - 10https://gerrit.wikimedia.org/r/658436 (https://phabricator.wikimedia.org/T179696) [22:56:39] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (10Aklapper) 05Open→03Stalled Hi @mewoph, thanks for taking the time to report this and welcome to Wikimedia Phabricator! Could you please follow https://phabricator.wikimedia.org/project/... 
[22:57:06] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7749336 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:57:28] (03PS2) 10Andrew Bogott: cloud-vps instances: add a helper script to format & mount a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) [22:58:57] (03CR) 10jerkins-bot: [V: 04-1] cloud-vps instances: add a helper script to format & mount a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: 10Andrew Bogott) [22:59:27] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2320.codfw.wmnet with reason: REIMAGE [22:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:18] (03PS3) 10Andrew Bogott: cloud-vps instances: add a helper script to format & mount a cinder volume [puppet] - 10https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) [23:00:42] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2331.codfw.wmnet with reason: REIMAGE [23:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2320.codfw.wmnet with reason: REIMAGE [23:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:03:10] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (10mewoph) @Aklapper I just linked my MediaWiki account w/Phabricator. 
I'm a full time employee so I didn't fill in the contractor-only fields. Thanks! [23:03:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2331.codfw.wmnet with reason: REIMAGE [23:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:05:18] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2319.codfw.wmnet with reason: REIMAGE [23:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:51] (03PS1) 10Legoktm: openldap: Prepare cross-validate-accounts for Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455 [23:06:18] (03CR) 10jerkins-bot: [V: 04-1] openldap: Prepare cross-validate-accounts for Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455 (owner: 10Legoktm) [23:06:34] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2318.codfw.wmnet with reason: REIMAGE [23:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:07] (03PS2) 10Legoktm: openldap: Prepare cross-validate-accounts for Python 3 [puppet] - 10https://gerrit.wikimedia.org/r/658455 [23:07:20] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2319.codfw.wmnet with reason: REIMAGE [23:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:30] (03PS2) 10Legoktm: snapshot: Switch require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/657916 (https://phabricator.wikimedia.org/T266479) [23:07:56] (03PS5) 10Dzahn: add deploy1002 and deploy2002 to deployment_hosts for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/635079 
(https://phabricator.wikimedia.org/T265963) [23:09:25] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2318.codfw.wmnet with reason: REIMAGE [23:09:27] (CR) Legoktm: [C: +2] snapshot: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657916 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm) [23:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:18] (PS1) Bstorm: wikireplicas: fix error in VM proxy config [puppet] - https://gerrit.wikimedia.org/r/658457 (https://phabricator.wikimedia.org/T271476) [23:19:28] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:21:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:22:10] (CR) Bstorm: [C: +2] wikireplicas: fix error in VM proxy config [puppet] - https://gerrit.wikimedia.org/r/658457 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm) [23:24:37] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2331.codfw.wmnet'] ` an... [23:25:42] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] ` an...
[23:30:04] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2319.codfw.wmnet'] ` an... [23:31:14] RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:43] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2318.codfw.wmnet'] ` an... [23:37:48] PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:16] (CR) Bstorm: "For what this does, I almost wish it was more of an interactive script rather than argument driven."
[puppet] - https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: Andrew Bogott) [23:47:28] (PS1) Tks4Fish: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/658461 (https://phabricator.wikimedia.org/T272920) [23:50:18] (Abandoned) Tks4Fish: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/658461 (https://phabricator.wikimedia.org/T272920) (owner: Tks4Fish) [23:55:28] (CR) Legoktm: [C: +1] "verified the IPs match what netbox has" [puppet] - https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn) [23:56:54] (PS1) Tks4Fish: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/658463 (https://phabricator.wikimedia.org/T272920) [23:57:33] (CR) Subramanya Sastry: [C: +1] parsoid::testing: switch db_host from m5-master to localhost [puppet] - https://gerrit.wikimedia.org/r/654565 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)