[00:31:22] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:18] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:34] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:38:32] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:04] <icinga-wm>	 RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:47:08] <icinga-wm>	 PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:50:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:52:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:30:10] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:37:06] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:28:08] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:31:56] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:38:32] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:43:30] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:31:14] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:38:22] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:21:31] <wikibugs>	 (CR) Ladsgroup: "Yeah :( Django sucks. Flask is much better." [puppet] - https://gerrit.wikimedia.org/r/657952 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[05:30:20] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:14] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:35] <wikibugs>	 (PS1) Urbanecm: Enable SandboxLink at viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/658073 (https://phabricator.wikimedia.org/T272796)
[05:54:25] <wikibugs>	 (PS1) Urbanecm: frwiki: Change back to normal logo [mediawiki-config] - https://gerrit.wikimedia.org/r/658075 (https://phabricator.wikimedia.org/T272700)
[06:03:45] <wikibugs>	 (PS1) Urbanecm: [beta] Initial configuration for votewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/658076 (https://phabricator.wikimedia.org/T272608)
[06:21:30] <wikibugs>	 (PS1) Ladsgroup: cache: Migrate hiera() to lookup() and set datatypes in frontend [puppet] - https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953)
[06:26:51] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) Could we have an ETA on when this server can be worked on? We were aiming to open a new infrastructure this server is part of to users 1st of Feb
[06:26:57] <wikibugs>	 (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/27637/" [puppet] - https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[06:28:03] <wikibugs>	 (CR) Ladsgroup: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/657560 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[06:30:52] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:37:48] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:43:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Populate x2 eqiad hosts into dbctl T269324', diff saved to https://phabricator.wikimedia.org/P13938 and previous config saved to /var/cache/conftool/dbconfig/20210125-064305-marostegui.json
[06:43:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:10] <stashbot>	 T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324
[06:44:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add x2 eqiad to dbctl T269324', diff saved to https://phabricator.wikimedia.org/P13939 and previous config saved to /var/cache/conftool/dbconfig/20210125-064419-marostegui.json
[06:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:26] <wikibugs>	 (PS1) Marostegui: x2 hosts: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/658081 (https://phabricator.wikimedia.org/T269324)
[06:47:31] <wikibugs>	 (CR) Marostegui: [C: +2] x2 hosts: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/658081 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[06:48:36] <wikibugs>	 SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Marostegui)
[07:08:08] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={rsyslog-notice,rsyslog-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consum
[07:08:08] <icinga-wm>	 h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[07:08:08] <wikibugs>	 (CR) Urbanecm: [C: -1] Create patroller user group for thwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/657225 (https://phabricator.wikimedia.org/T272149) (owner: Patsagorn Y.)
[07:08:45] <wikibugs>	 (CR) Urbanecm: [C: +1] Set $wgCategoryCollation = uca-tr on trwikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/657997 (https://phabricator.wikimedia.org/T272783) (owner: Evrifaessa)
[07:10:37] <wikibugs>	 (CR) Urbanecm: [C: +1] "Should work" [mediawiki-config] - https://gerrit.wikimedia.org/r/657998 (https://phabricator.wikimedia.org/T272784) (owner: Evrifaessa)
[07:15:57] <wikibugs>	 (CR) Urbanecm: [C: +1] "Lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/657955 (owner: Legoktm)
[07:24:41] <wikibugs>	 SRE, Machine Learning Platform, SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (elukey) >>! In T272687#6769596, @calbon wrote: > kbazira needs access too  Yep Tobias added Kevin as well, all covered!
[07:25:09] <wikibugs>	 SRE, Machine Learning Platform, SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (elukey) Resolved→Open
[07:25:31] <wikibugs>	 SRE, Machine Learning Platform, SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (elukey) p:High→Medium
[07:25:50] <wikibugs>	 SRE, Machine Learning Platform, SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (elukey)
[07:26:32] <wikibugs>	 SRE, Machine Learning Platform, SRE-Access-Requests: Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (elukey) Reopening to get this request validated by the SRE team during today's team meeting :)
[07:30:04] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:33:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1142 for kernel upgrade', diff saved to https://phabricator.wikimedia.org/P13940 and previous config saved to /var/cache/conftool/dbconfig/20210125-073322-marostegui.json
[07:33:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:54] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:47] <wikibugs>	 (PS1) Elukey: Remove old analytics decommed nodes [puppet] - https://gerrit.wikimedia.org/r/658087 (https://phabricator.wikimedia.org/T267932)
[07:46:30] <wikibugs>	 (CR) Elukey: [C: +2] Remove old analytics decommed nodes [puppet] - https://gerrit.wikimedia.org/r/658087 (https://phabricator.wikimedia.org/T267932) (owner: Elukey)
[07:51:46] <wikibugs>	 (PS1) Muehlenhoff: Remove access for risler [puppet] - https://gerrit.wikimedia.org/r/658089
[07:52:15] <wikibugs>	 (PS1) Elukey: Add the Hadoop worker profile to master/standby in Backup [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411)
[07:54:00] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for risler [puppet] - https://gerrit.wikimedia.org/r/658089 (owner: Muehlenhoff)
[07:54:16] <wikibugs>	 (PS2) Elukey: Add the Hadoop worker profile to master/standby in Backup [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411)
[07:59:16] <wikibugs>	 (PS3) Elukey: Add the Hadoop worker profile to master/standby in Backup [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411)
[07:59:18] <wikibugs>	 (PS1) Elukey: profile::hadoop::monitoring::resourcemanager: fix jmx resource name [puppet] - https://gerrit.wikimedia.org/r/658210
[08:01:00] <wikibugs>	 (CR) Elukey: [C: +2] profile::hadoop::monitoring::resourcemanager: fix jmx resource name [puppet] - https://gerrit.wikimedia.org/r/658210 (owner: Elukey)
[08:01:46] <wikibugs>	 (PS4) Elukey: Add the Hadoop worker profile to master/standby in Backup [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411)
[08:02:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 25%: After upgrading its kernel', diff saved to https://phabricator.wikimedia.org/P13941 and previous config saved to /var/cache/conftool/dbconfig/20210125-080204-root.json
[08:02:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:21] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27639/console" [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[08:04:51] <wikibugs>	 SRE, DBA: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[08:05:51] <wikibugs>	 (CR) Elukey: [V: +1 C: +2] Add the Hadoop worker profile to master/standby in Backup [puppet] - https://gerrit.wikimedia.org/r/658098 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[08:12:16] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1138 to s4 master [puppet] - https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427)
[08:12:32] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427) (owner: Marostegui)
[08:12:48] <wikibugs>	 (PS1) Muehlenhoff: Extend access for agaduran [puppet] - https://gerrit.wikimedia.org/r/658212
[08:13:44] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Extend access for agaduran [puppet] - https://gerrit.wikimedia.org/r/658212 (owner: Muehlenhoff)
[08:13:46] <wikibugs>	 (PS1) Marostegui: wmnet: Update s4-master CNAME [dns] - https://gerrit.wikimedia.org/r/658213 (https://phabricator.wikimedia.org/T271427)
[08:14:15] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover day" [dns] - https://gerrit.wikimedia.org/r/658213 (https://phabricator.wikimedia.org/T271427) (owner: Marostegui)
[08:17:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 50%: After upgrading its kernel', diff saved to https://phabricator.wikimedia.org/P13942 and previous config saved to /var/cache/conftool/dbconfig/20210125-081708-root.json
[08:17:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:20] <wikibugs>	 (PS1) Elukey: Add the HDFS balancer to the Master node in Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/658215 (https://phabricator.wikimedia.org/T260411)
[08:25:55] <wikibugs>	 (CR) Elukey: [C: +2] Add the HDFS balancer to the Master node in Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/658215 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[08:30:44] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 75%: After upgrading its kernel', diff saved to https://phabricator.wikimedia.org/P13943 and previous config saved to /var/cache/conftool/dbconfig/20210125-083211-root.json
[08:32:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:33:48] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:33:57] <wikibugs>	 (CR) Marostegui: "I guess we need to review those users on the new hosts to make sure they have the correct max_connection settings?" [puppet] - https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399) (owner: Bstorm)
[08:35:55] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: Remove mariadb module mylvmbackup [puppet] - https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559) (owner: Jcrespo)
[08:36:52] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:37:24] <wikibugs>	 (CR) Marostegui: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929) (owner: Jcrespo)
[08:40:00] <wikibugs>	 SRE, Wikimedia-Mailing-lists: wikipedia-mai & wikiur-l lists do not seem to have active list admins (mail archives empty after August 2018 & January 2019) - https://phabricator.wikimedia.org/T270837 (Aklapper) >>! In T270837#6735597, @Aklapper wrote: > If this is a request to have active mailing list mod...
[08:46:42] <icinga-wm>	 PROBLEM - Logstash rate of ingestion percent change compared to yesterday #o11y on alert1001 is CRITICAL: 1244 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[08:47:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1142 (re)pooling @ 100%: After upgrading its kernel', diff saved to https://phabricator.wikimedia.org/P13944 and previous config saved to /var/cache/conftool/dbconfig/20210125-084715-root.json
[08:47:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:21] <elukey>	 so there are a ton of messages on logstash
[08:54:49] <elukey>	 going to move the discussion to #sre
[09:01:52] <wikibugs>	 SRE, Prod-Kubernetes, Pybal, Traffic, serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (JMeybohm) >>! In T238909#6769149, @akosiaris wrote: > Adding https://metallb.universe.tf/ as a potential solution as well.  Wou...
[09:06:50] <ema>	 !log cp3054: install varnish 6.0.1-1wm2 -- 6.0.1 without https://github.com/varnishcache/varnish-cache/pull/2705 T264398
[09:06:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:55] <stashbot>	 T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398
[09:10:35] <wikibugs>	 (PS1) Marostegui: etcd.php: Add x2 mapping [mediawiki-config] - https://gerrit.wikimedia.org/r/658218 (https://phabricator.wikimedia.org/T269324)
[09:11:54] <wikibugs>	 (CR) Jcrespo: [C: +1] etcd.php: Add x2 mapping [mediawiki-config] - https://gerrit.wikimedia.org/r/658218 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[09:12:36] <wikibugs>	 (CR) Marostegui: [C: +2] etcd.php: Add x2 mapping [mediawiki-config] - https://gerrit.wikimedia.org/r/658218 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[09:13:29] <wikibugs>	 (Merged) jenkins-bot: etcd.php: Add x2 mapping [mediawiki-config] - https://gerrit.wikimedia.org/r/658218 (https://phabricator.wikimedia.org/T269324) (owner: Marostegui)
[09:14:30] <wikibugs>	 (PS1) Elukey: Add a more restrictive default umask to Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/658219 (https://phabricator.wikimedia.org/T260411)
[09:15:10] <wikibugs>	 (CR) Elukey: [C: +2] Add a more restrictive default umask to Hadoop backup [puppet] - https://gerrit.wikimedia.org/r/658219 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[09:15:14] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Add x2 to the mapping array T269324 (duration: 01m 01s)
[09:15:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:18] <stashbot>	 T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324
[09:17:40] <moritzm>	 !log installing samba security updates on stretch
[09:17:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:19:50] <elukey>	 win 11
[09:19:52] <elukey>	 ufff
[09:21:27] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/etcd.php: Add x2 to the mapping array T269324 (duration: 00m 58s)
[09:21:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:21:39] <stashbot>	 T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324
[09:21:45] <wikibugs>	 (CR) JMeybohm: "> Patch Set 1:" [deployment-charts] - https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: Alexandros Kosiaris)
[09:26:01] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Failover URL downloaders [dns] - https://gerrit.wikimedia.org/r/657786 (owner: Muehlenhoff)
[09:28:38] <wikibugs>	 (CR) DCausse: [C: +1] "I'm tentatively scheduling the deploy to jan 29 as T267175 problems appear to be fixed." [mediawiki-config] - https://gerrit.wikimedia.org/r/657131 (https://phabricator.wikimedia.org/T270252) (owner: Lucas Werkmeister (WMDE))
[09:30:10] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:49] <wikibugs>	 (PS1) David Caro: config: allow using tilde `~` to specify config paths [software/cumin] - https://gerrit.wikimedia.org/r/658223
[09:36:34] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:40:26] <godog>	 !log bounce apache2 on logstash1024, stuck on high cpu
[09:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:24] <wikibugs>	 SRE, MediaWiki-Containers, serviceops, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (JMeybohm) The job fails on registry2002, leading to icinga alerts  ` Jan 25 09:30:01 registry2002 systemd[1]: Started Build docker-registry home...
[09:47:27] <wikibugs>	 (PS1) Urbanecm: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - https://gerrit.wikimedia.org/r/658225 (https://phabricator.wikimedia.org/T272202)
[09:48:05] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader2002.wikimedia.org
[09:48:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:46] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2002.wikimedia.org
[09:49:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:23] <wikibugs>	 (PS4) Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929)
[09:51:25] <wikibugs>	 (PS2) Jcrespo: mariadb: Remove mariadb module mysqld_safe [puppet] - https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559)
[09:51:27] <wikibugs>	 (PS2) Jcrespo: mariadb: Remove mariadb module mylvmbackup [puppet] - https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559)
[09:51:29] <wikibugs>	 (PS1) Jcrespo: admin: Update ssh key for dvrandecic for production cluster access [puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470)
[09:51:36] <wikibugs>	 (PS2) Jcrespo: admin: Update ssh key for dvrandecic for production cluster access [puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470)
[09:52:23] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host urldownloader1002.wikimedia.org
[09:52:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:29] <wikibugs>	 (CR) Jcrespo: "This will fix partially root@ alerts spamming." [puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470) (owner: Jcrespo)
[09:54:00] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good. The other issues found by the account check are WIP, one already extended and one pinged for possible extension." [puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470) (owner: Jcrespo)
[09:55:28] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1002.wikimedia.org
[09:55:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:15] <wikibugs>	 (CR) Jcrespo: "> Patch Set 2: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470) (owner: Jcrespo)
[09:57:23] <wikibugs>	 (CR) Jcrespo: [C: +2] admin: Update ssh key for dvrandecic for production cluster access [puppet] - https://gerrit.wikimedia.org/r/658227 (https://phabricator.wikimedia.org/T272470) (owner: Jcrespo)
[10:00:21] <moritzm>	 !log installing imagemagick security updates on stretch
[10:00:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:04] <wikibugs>	 SRE, Patch-For-Review, Tracking-Neverending: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (jcrespo)
[10:05:33] <wikibugs>	 (PS1) Elukey: superset: disable old druid datasouce panels [puppet] - https://gerrit.wikimedia.org/r/658228 (https://phabricator.wikimedia.org/T263972)
[10:06:23] <wikibugs>	 (CR) Muehlenhoff: [C: +1] superset: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657917 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[10:07:17] <wikibugs>	 (PS2) Ladsgroup: lvs: Migrate hiera() to lookup() and set datatypes [puppet] - https://gerrit.wikimedia.org/r/657958 (https://phabricator.wikimedia.org/T209953)
[10:07:50] <wikibugs>	 (PS2) Ladsgroup: cache: Migrate hiera() to lookup() and set datatypes in frontend [puppet] - https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953)
[10:07:58] <wikibugs>	 (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/657958 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[10:08:07] <wikibugs>	 (CR) Ladsgroup: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[10:09:00] <wikibugs>	 (CR) Vgutierrez: [C: +1] lvs: Migrate hiera() to lookup() and set datatypes [puppet] - https://gerrit.wikimedia.org/r/657958 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[10:16:58] <wikibugs>	 (CR) Ema: [C: +1] cache: Migrate hiera() to lookup() and set datatypes in frontend [puppet] - https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[10:17:37] <wikibugs>	 (CR) Elukey: [C: +2] superset: disable old druid datasouce panels [puppet] - https://gerrit.wikimedia.org/r/658228 (https://phabricator.wikimedia.org/T263972) (owner: Elukey)
[10:24:05] <wikibugs>	 (CR) Vgutierrez: [C: +2] lvs: Migrate hiera() to lookup() and set datatypes [puppet] - https://gerrit.wikimedia.org/r/657958 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[10:30:46] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:22] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:37:31] <wikibugs>	 SRE, LDAP-Access-Requests: LDAP access to the wmf group for Jazmin Tanner - https://phabricator.wikimedia.org/T272522 (jcrespo) @JKatzWMF thank you, could you help @JTannerWMF complete all required data to process the request, as per https://phabricator.wikimedia.org/tag/ldap-access-requests/ ?
[10:38:59] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (jcrespo) a:amy_rc→jcrespo Thank you again, will check on our shared docs and proceed with the access request.
[10:40:22] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (Lea_WMDE) @jcrespo sorry for the delay, this slipped through :/ Yes, Amy is interning with us until March 31 2021, good to know that that can be setup in advance!
[10:44:13] <godog>	 !log swift decrease weight for ms-be20[16,18,20,22] - T272837
[10:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:23] <stashbot>	 T272837:  Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837
[10:48:13] <wikibugs>	 (CR) Elukey: [C: +2] profile::analytics::refinery::job::hdfs_cleaner Update [puppet] - https://gerrit.wikimedia.org/r/656376 (https://phabricator.wikimedia.org/T271560) (owner: Joal)
[10:50:56] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): Do weekly dumps of Wikidata Lexeme [puppet] - https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: Hoo man)
[10:52:47] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host scandium.eqiad.wmnet
[10:52:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:18] <wikibugs>	 (PS5) Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929)
[10:53:20] <wikibugs>	 (PS3) Jcrespo: mariadb: Remove mariadb module mysqld_safe [puppet] - https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559)
[10:53:22] <wikibugs>	 (PS3) Jcrespo: mariadb: Remove mariadb module mylvmbackup [puppet] - https://gerrit.wikimedia.org/r/657821 (https://phabricator.wikimedia.org/T272559)
[10:53:24] <wikibugs>	 (PS1) Jcrespo: admin: Provide Superset access to Amrutha (WMDE intern) [puppet] - https://gerrit.wikimedia.org/r/658233 (https://phabricator.wikimedia.org/T271725)
[10:58:05] <wikibugs>	 (PS1) Elukey: role::analytics_test_cluster::client: Upgrade to Bigtop [puppet] - https://gerrit.wikimedia.org/r/658234
[10:58:09] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host scandium.eqiad.wmnet
[10:58:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:38] <wikibugs>	 (03PS6) 10Jcrespo: mariadb-backups: Document logical backups grants throughout production dbs [puppet] - 10https://gerrit.wikimedia.org/r/657801 (https://phabricator.wikimedia.org/T111929)
[10:58:45] <wikibugs>	 (03PS2) 10Jcrespo: admin: Provide Superset access to Amrutha (WMDE intern) [puppet] - 10https://gerrit.wikimedia.org/r/658233 (https://phabricator.wikimedia.org/T271725)
[10:59:53] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks." [software/cumin] - 10https://gerrit.wikimedia.org/r/658223 (owner: 10David Caro)
[10:59:56] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::client: Upgrade to Bigtop [puppet] - 10https://gerrit.wikimedia.org/r/658234 (owner: 10Elukey)
[11:00:05] <wikibugs>	 (03CR) 10Jcrespo: "I think this can now be abandoned due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/643346/ (and ticket)." [puppet] - 10https://gerrit.wikimedia.org/r/641508 (https://phabricator.wikimedia.org/T267744) (owner: 10Herron)
[11:01:32] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] config: allow using tilde `~` to specify config paths [software/cumin] - 10https://gerrit.wikimedia.org/r/658223 (owner: 10David Caro)
[11:04:24] <icinga-wm>	 PROBLEM - Disk space on an-worker1137 is CRITICAL: DISK CRITICAL - free space: / 455 MB (0% inode=91%): /tmp 455 MB (0% inode=91%): /var/tmp 455 MB (0% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1137&var-datasource=eqiad+prometheus/ops
[11:06:07] <wikibugs>	 (03PS1) 10ArielGlenn: reduce number of xml/sql dumps kept on dumpsdata hosts by one [puppet] - 10https://gerrit.wikimedia.org/r/658237
[11:06:36] <elukey>	 ahahha checking the an-worker1137 space, we are backing up data
[11:07:31] <wikibugs>	 (03Merged) 10jenkins-bot: config: allow using tilde `~` to specify config paths [software/cumin] - 10https://gerrit.wikimedia.org/r/658223 (owner: 10David Caro)
[11:08:22] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] reduce number of xml/sql dumps kept on dumpsdata hosts by one [puppet] - 10https://gerrit.wikimedia.org/r/658237 (owner: 10ArielGlenn)
[11:11:18] <godog>	 !log thanos delete old orphaned blocks with replica=unset label
[11:11:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:11:41] <wikibugs>	 (03PS4) 10ArielGlenn: Do weekly dumps of Wikidata Lexeme [puppet] - 10https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: 10Hoo man)
[11:11:59] <wikibugs>	 (03PS1) 10DCausse: Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493)
[11:12:16] <wikibugs>	 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10akosiaris) >>! In T272238#6771873, @Nemo_bis wrote: >>>! In T272238#6764574, @akosiaris wrote: >> I 've marked T272111 a...
[11:13:04] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] Do weekly dumps of Wikidata Lexeme [puppet] - 10https://gerrit.wikimedia.org/r/637895 (https://phabricator.wikimedia.org/T264883) (owner: 10Hoo man)
[11:19:16] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 13.27 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[11:21:45] <wikibugs>	 (03PS1) 10DCausse: [cirrus] set 50kb limit on file text indexing for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658240 (https://phabricator.wikimedia.org/T271493)
[11:22:10] <icinga-wm>	 PROBLEM - SSH on ms-be2041 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:22:12] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:04] <icinga-wm>	 RECOVERY - SSH on ms-be2041 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:23:36] <wikibugs>	 (03PS1) 10Elukey: sre.hadoop.change-distro-from-cdh-clients: fix arg selection [cookbooks] - 10https://gerrit.wikimedia.org/r/658241
[11:24:24] <icinga-wm>	 RECOVERY - Disk space on an-worker1137 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1137&var-datasource=eqiad+prometheus/ops
[11:25:54] <icinga-wm>	 PROBLEM - SSH on ms-be2044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:26:58] <icinga-wm>	 RECOVERY - SSH on ms-be2044 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:28:28] <wikibugs>	 10SRE, 10Analytics-Radar, 10Wikimedia-Logstash, 10observability, 10Performance-Team (Radar): Retire udp2log: onboard its producers and consumers to the logging pipeline - https://phabricator.wikimedia.org/T205856 (10fgiunchedi)
[11:29:28] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] cache: Migrate hiera() to lookup() and set datatypes in frontend [puppet] - 10https://gerrit.wikimedia.org/r/658079 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[11:30:05] <jouncebot>	 jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T1130).
[11:30:20] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.hadoop.change-distro-from-cdh-clients: fix arg selection [cookbooks] - 10https://gerrit.wikimedia.org/r/658241 (owner: 10Elukey)
[11:30:26] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:47] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658244 (https://phabricator.wikimedia.org/T128546)
[11:33:02] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[11:33:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:20] <wikibugs>	 10SRE, 10Prod-Kubernetes, 10Pybal, 10Traffic, 10serviceops: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) >>! In T238909#6772562, @JMeybohm wrote: >>>! In T238909#6769149, @akosiaris wrote: >> Adding https://metallb.univer...
[11:34:55] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658244 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[11:35:10] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[11:35:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:35:15] <wikibugs>	 (03CR) 10ZPapierski: [C: 03+1] [cirrus] set 50kb limit on file text indexing for commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658240 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse)
[11:35:18] <icinga-wm>	 PROBLEM - SSH on ms-be2033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:35:19] <wikibugs>	 (03CR) 10ZPapierski: Add an option to limit the size of the file_text field (031 comment) [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse)
[11:35:40] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658244 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[11:36:24] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:38:22] <icinga-wm>	 RECOVERY - SSH on ms-be2033 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:39:23] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:658242| Bumping portals to master (T128546)]] (duration: 00m 58s)
[11:39:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:39:27] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[11:40:18] <logmsgbot>	 !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:658242| Bumping portals to master (T128546)]] (duration: 00m 55s)
[11:40:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:06] <wikibugs>	 (03CR) 10DCausse: Add an option to limit the size of the file_text field (031 comment) [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse)
[11:42:26] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:47:00] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:47:58] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:49:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:53:17] <wikibugs>	 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10aborrero)
[11:53:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:55:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:57:10] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:00] <icinga-wm>	 PROBLEM - SSH on ms-be2034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T1200).
[12:00:04] <jouncebot>	 dcausse: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[12:00:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:00:15] <dcausse>	 o/
[12:00:17] <Urbanecm>	 \o/
[12:00:26] <Urbanecm>	 dcausse: will you self-deploy, or should I?
[12:00:39] <dcausse>	 Urbanecm: I can deploy
[12:00:51] <Urbanecm>	 please do, and let me know when over, so i can do some more stuff too :)
[12:01:00] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 12.91 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[12:01:18] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse)
[12:01:51] <dcausse>	 Urbanecm: I have to wait on jenkins for this^
[12:02:01] <Urbanecm>	 ah, so i can go with configs now dcausse ?
[12:02:07] <dcausse>	 sure
[12:02:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:04:09] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658225 (https://phabricator.wikimedia.org/T272202) (owner: 10Urbanecm)
[12:04:26] <icinga-wm>	 RECOVERY - SSH on ms-be2034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:05:06] <wikibugs>	 (03Merged) 10jenkins-bot: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658225 (https://phabricator.wikimedia.org/T272202) (owner: 10Urbanecm)
[12:05:17] <wikibugs>	 (03PS2) 10Urbanecm: Enable SandboxLink at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658073 (https://phabricator.wikimedia.org/T272796)
[12:05:23] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable SandboxLink at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658073 (https://phabricator.wikimedia.org/T272796) (owner: 10Urbanecm)
[12:06:14] <wikibugs>	 (03Merged) 10jenkins-bot: Enable SandboxLink at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658073 (https://phabricator.wikimedia.org/T272796) (owner: 10Urbanecm)
[12:07:00] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 693eaec20a24620c2a709c8bac707c0d7af3436b: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T272202) (duration: 01m 01s)
[12:07:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:06] <stashbot>	 T272202: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T272202
[12:07:27] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] frwiki: Change back to normal logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658075 (https://phabricator.wikimedia.org/T272700) (owner: 10Urbanecm)
[12:07:39] <wikibugs>	 (03PS2) 10Urbanecm: frwiki: Change back to normal logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658075 (https://phabricator.wikimedia.org/T272700)
[12:07:43] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] frwiki: Change back to normal logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658075 (https://phabricator.wikimedia.org/T272700) (owner: 10Urbanecm)
[12:08:30] <wikibugs>	 (03Merged) 10jenkins-bot: frwiki: Change back to normal logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658075 (https://phabricator.wikimedia.org/T272700) (owner: 10Urbanecm)
[12:10:38] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:34] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:13:21] <wikibugs>	 (03PS1) 10Urbanecm: Revert "Enable SandboxLink at viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658266 (https://phabricator.wikimedia.org/T272796)
[12:13:51] <wikibugs>	 (03PS2) 10Urbanecm: Revert "Enable SandboxLink at viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658266 (https://phabricator.wikimedia.org/T272796)
[12:13:54] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Revert "Enable SandboxLink at viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658266 (https://phabricator.wikimedia.org/T272796) (owner: 10Urbanecm)
[12:15:06] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Enable SandboxLink at viwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658266 (https://phabricator.wikimedia.org/T272796) (owner: 10Urbanecm)
[12:15:06] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:15:40] <wikibugs>	 (03PS2) 10Urbanecm: Enable SandboxLink on Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657993 (https://phabricator.wikimedia.org/T272780) (owner: 10Evrifaessa)
[12:15:45] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 75aa32fd5aee1feebe8a97360068da55cbcf06d8: frwiki: Change back to normal logo (T272700) (duration: 01m 07s)
[12:15:46] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable SandboxLink on Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657993 (https://phabricator.wikimedia.org/T272780) (owner: 10Evrifaessa)
[12:15:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:56] <stashbot>	 T272700: Remove birthday logo on French Wikipedia - https://phabricator.wikimedia.org/T272700
[12:17:06] <wikibugs>	 (03Merged) 10jenkins-bot: Enable SandboxLink on Turkish Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657993 (https://phabricator.wikimedia.org/T272780) (owner: 10Evrifaessa)
[12:19:05] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 89d072378e16b0410d963deca2fd766c1406b5b6: Enable SandboxLink on Turkish Wikivoyage (T272780) (duration: 01m 05s)
[12:19:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:10] <stashbot>	 T272780: Enable SandboxLink on Turkish Wikivoyage - https://phabricator.wikimedia.org/T272780
[12:20:06] <wikibugs>	 (03PS2) 10Urbanecm: Defining wgSitename for trwikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657992 (https://phabricator.wikimedia.org/T272779) (owner: 10Evrifaessa)
[12:20:10] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Defining wgSitename for trwikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657992 (https://phabricator.wikimedia.org/T272779) (owner: 10Evrifaessa)
[12:20:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:21:56] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:22:01] <wikibugs>	 (03Merged) 10jenkins-bot: Defining wgSitename for trwikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657992 (https://phabricator.wikimedia.org/T272779) (owner: 10Evrifaessa)
[12:22:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:23:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete tmpreaper Puppet classes [puppet] - 10https://gerrit.wikimedia.org/r/658271 (https://phabricator.wikimedia.org/T272559)
[12:23:48] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 177339d96616b5941dbeb2c90ca6aa0be90e3b5a: Defining wgSitename for trwikivoyage (T272779) (duration: 01m 00s)
[12:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:52] <stashbot>	 T272779: In some places, trwikivoyage displays the project's default name "Wikivoyage" instead of the localized "Vikigezgin" - https://phabricator.wikimedia.org/T272779
[12:23:56] <wikibugs>	 (03PS2) 10Urbanecm: Resize the logo of Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657998 (https://phabricator.wikimedia.org/T272784) (owner: 10Evrifaessa)
[12:24:04] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Resize the logo of Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657998 (https://phabricator.wikimedia.org/T272784) (owner: 10Evrifaessa)
[12:24:07] <wikibugs>	 (03PS2) 10Urbanecm: Set $wgCategoryCollation = uca-tr on trwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657997 (https://phabricator.wikimedia.org/T272783) (owner: 10Evrifaessa)
[12:24:10] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:24:13] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Set $wgCategoryCollation = uca-tr on trwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657997 (https://phabricator.wikimedia.org/T272783) (owner: 10Evrifaessa)
[12:25:18] <wikibugs>	 (03Merged) 10jenkins-bot: Resize the logo of Turkish Wikivoyage. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657998 (https://phabricator.wikimedia.org/T272784) (owner: 10Evrifaessa)
[12:25:38] <wikibugs>	 (03CR) 10Urbanecm: [C: 04-1] Add localized Wikivoyage wordmark for the mobile view of Turkish Wikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657971 (https://phabricator.wikimedia.org/T272776) (owner: 10Evrifaessa)
[12:25:46] <wikibugs>	 (03Merged) 10jenkins-bot: Set $wgCategoryCollation = uca-tr on trwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657997 (https://phabricator.wikimedia.org/T272783) (owner: 10Evrifaessa)
[12:27:37] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized static/images/project-logos/: d34cb3205a58d5ac50800f2f218af6213f74f5e7: Resize the logo of Turkish Wikivoyage (T272784) (duration: 00m 54s)
[12:27:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:44] <stashbot>	 T272784: Resize the logo of Turkish Wikivoyage - https://phabricator.wikimedia.org/T272784
[12:29:13] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: bcc7ad7acf721a5e0521bbecfe6df8671ac1822c: Set $wgCategoryCollation = uca-tr on trwikivoyage (T272783) (duration: 00m 57s)
[12:29:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:29:18] <stashbot>	 T272783: Set $wgCategoryCollation = uca-tr on trwikivoyage - https://phabricator.wikimedia.org/T272783
[12:30:28] <Urbanecm>	 !log [urbanecm@mwmaint1002 ~]$ mwscript updateCollation.php --wiki=trwikivoyage --previous-collation=uppercase # T272783
[12:30:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:00] <wikibugs>	 (03PS2) 10Urbanecm: Adding namespace aliases on arbcom-ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656964 (https://phabricator.wikimedia.org/T272292) (owner: 10Luke081515)
[12:31:06] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Adding namespace aliases on arbcom-ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656964 (https://phabricator.wikimedia.org/T272292) (owner: 10Luke081515)
[12:31:15] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse)
[12:32:00] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:32:18] <Urbanecm>	 dcausse: ^^^
[12:32:23] <dcausse>	 sigh...
[12:32:32] <dcausse>	 failure is unrelated
[12:32:39] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Remove mariadb module mysqld_safe [puppet] - 10https://gerrit.wikimedia.org/r/657820 (https://phabricator.wikimedia.org/T272559) (owner: 10Jcrespo)
[12:32:51] <Urbanecm>	 so re+2 and wait :/
[12:33:02] <wikibugs>	 (03CR) 10DCausse: Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse)
[12:33:04] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Create Contact page for Ombuds commission at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) (owner: 10Luke081515)
[12:33:07] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse)
[12:33:36] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 148 probes of 592 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:34:11] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] admin: Provide Superset access to Amrutha (WMDE intern) [puppet] - 10https://gerrit.wikimedia.org/r/658233 (https://phabricator.wikimedia.org/T271725) (owner: 10Jcrespo)
[12:35:37] <wikibugs>	 (03Merged) 10jenkins-bot: Adding namespace aliases on arbcom-ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/656964 (https://phabricator.wikimedia.org/T272292) (owner: 10Luke081515)
[12:35:49] <wikibugs>	 (03Merged) 10jenkins-bot: Create Contact page for Ombuds commission at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) (owner: 10Luke081515)
[12:37:12] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 833833385f1cf02a4578edb9b5108d173bdf30bd: Adding namespace aliases on arbcom-ruwiki (T272292) (duration: 00m 57s)
[12:37:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:18] <stashbot>	 T272292: arbcom-ru.wikipedia.org: Adding an aliases for namespaces - https://phabricator.wikimedia.org/T272292
[12:37:35] <Urbanecm>	 jynus: still on duty, or should we update the topic? :)
[12:37:54] <jynus>	 Urbanecm, it changes today, although we can keep it until later
[12:38:01] <jynus>	 do you have any needs?
[12:38:19] <Urbanecm>	 jynus: not really, just wondering :)
[12:38:38] <jynus>	 don't worry, SREs have a meeting later so we will update then :-)
[12:38:46] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:38:56] <Urbanecm>	 cool :)
[12:39:10] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 104 probes of 590 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:41:08] <Urbanecm>	 !log [urbanecm@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki=arbcom_ruwiki --fix # T272292
[12:41:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:27] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/MetaContactPages.php: 7a6a60fcaa635a8f891a6d09f3611f8620490497: Create Contact page for Ombuds commission at Meta-Wiki (T271828) (duration: 01m 00s)
[12:41:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:32] <stashbot>	 T271828: Create Contact page for Ombuds commission at Meta-Wiki - https://phabricator.wikimedia.org/T271828
[12:43:06] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Patch-For-Review: Allow WMDE intern Amrutha to access Superset - https://phabricator.wikimedia.org/T271725 (10jcrespo) 05Open→03Resolved Extra needed privileges have been provided: https://ldap.toolforge.org/user/amy-wmde,  closing as resolved. You c...
[12:44:02] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:44:12] <wikibugs>	 (03PS2) 10Urbanecm: Revert "Add fiwiki 500k temporary logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655282 (owner: 10Majavah)
[12:44:37] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] "let's assume people use the new logos now; besides, the URL should stay in cache for some time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655282 (owner: 10Majavah)
[12:44:50] <Majavah>	 ah thanks Urbanecm, totally forgot those
[12:44:56] <Urbanecm>	 np :)
[12:45:03] <Urbanecm>	 I think 14 days is more than enough for this purpose
[12:45:31] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Add fiwiki 500k temporary logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655282 (owner: 10Majavah)
[12:47:08] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized static/images/project-logos/: 6a4cbe662655edaa4f6c36e69877766a6a48d828: Revert "Switch fiwiki to their 500k temporary logo!": delete temporary logo files (duration: 00m 57s)
[12:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:51] <Urbanecm>	 Majavah: I'm intentionally going to not purge the removed URIs, and I'm leaving them at the mercy of our caching infra
[12:48:48] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[12:49:21] <Urbanecm>	 dcausse: it seems i finally exhausted the long list of things to deploy i had, so once it merges, feel free to deploy it :)
[12:49:41] <dcausse>	 Urbanecm: ok, thanks! :)
[12:51:08] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:00:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:03:08] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:04:04] <wikibugs>	 (03PS11) 10Jbond: diffscan: pyhotnify [puppet] - 10https://gerrit.wikimedia.org/r/634572
[13:04:22] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc1 on pc2007 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 731.45 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:06:57] <jynus>	 marostegui, all pc* hosts seem to be lagging on codfw
[13:07:12] <marostegui>	 let's see..
[13:07:32] <jynus>	 qps halved
[13:07:40] <wikibugs>	 (03Merged) 10jenkins-bot: Add an option to limit the size of the file_text field [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658249 (https://phabricator.wikimedia.org/T271493) (owner: 10DCausse)
[13:07:42] <wikibugs>	 (03PS4) 10A2569875: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612)
[13:08:09] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=2&orgId=1&from=now-12h&to=now&var-server=pc1008&var-port=9104
[13:08:24] <wikibugs>	 (03PS5) 10A2569875: Add WikiProject and WikiProject_talk namespace and its aliases for zh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657572 (https://phabricator.wikimedia.org/T271612)
[13:08:58] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc3 on pc2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 509.64 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:09:09] <marostegui>	 I am wondering if it could be something with the deployment
[13:09:23] <jynus>	 marostegui, write pc traffic seems to have doubled on eqiad
[13:09:31] <marostegui>	 yeah, see my paste above
[13:09:36] <marostegui>	 let me check the binlogs
[13:09:40] <marostegui>	 to see if there's something obvious
[13:09:52] <marostegui>	 Urbanecm: anything on the deployment that can hit parsercache?
[13:11:07] <dcausse>	 marostegui: I was about to ship something, should I wait?
[13:11:13] <marostegui>	 dcausse: please wait
[13:11:15] <dcausse>	 sure
[13:12:23] <Urbanecm>	 marostegui: I updated category collation (wgCategoryCollation) at trwikivoyage. That updates categorylinks, but it MIGHT cause re-render. However, trwikivoyage is a very tiny wiki, so I doubt that's the cause, even if it does re-render pages automagically
[13:13:23] <Urbanecm>	 (might=I'm not sure)
[13:13:27] <jynus>	 anomaly started around 12:12, but peaked at 12:42
[13:14:25] <marostegui>	 it is almost impossible to get anything interesting from pc binlogs :(
[13:14:49] <Urbanecm>	 marostegui: but I did only pretty standard changes, and i never observed them to hurt parsercache in any way
[13:16:40] <wikibugs>	 (03Abandoned) 10Matthias Mullie: Guard against this file being included twice [extensions/WikibaseMediaInfo] (wmf/1.35.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655927 (https://phabricator.wikimedia.org/T271933) (owner: 10Cparle)
[13:17:26] <marostegui>	 https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=6&orgId=1&refresh=5m&var-server=pc1008&var-datasource=thanos&var-cluster=mysql&from=now-24h&to=now
[13:17:31] <marostegui>	 something new is being stored on pc?
[13:17:53] <Urbanecm>	 I definitely didn't enable any new feature
[13:17:57] <marostegui>	 ah, nevermind that graph, that's not the partition
[13:19:51] <wikibugs>	 (03CR) 10Jbond: "Sorry for the delay have made some updates but needs testing" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/634572 (owner: 10Jbond)
[13:20:31] <marostegui>	 What I am seeing on the last binlogs are mostly deletes, so maybe some invalidation?
[13:24:10] <jynus>	 are you sure about deletes? on graph it seems to be REPLACEs
[13:24:36] <marostegui>	 From what I can see on the binlogs (there are so many) on the last ones there are mostly deletes, and before mostly replaces
[13:24:51] <marostegui>	 still scanning the "recent" ones
[13:25:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/657770 (https://phabricator.wikimedia.org/T235162) (owner: 10Muehlenhoff)
[13:25:46] <marostegui>	 the ones starting at .12 have replaces
[13:26:09] <jynus>	 I am basing on https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=2&orgId=1&var-server=pc1007&var-port=9104&from=1611408361871&to=1611581161871
[13:26:13] <marostegui>	 but not in a particular wiki
[13:26:28] <jynus>	 deletes until 19 h yesterday
[13:26:30] <marostegui>	 yes, it is the same on all of them
[13:26:36] <icinga-wm>	 RECOVERY - Logstash rate of ingestion percent change compared to yesterday #o11y on alert1001 is OK: (C)210 ge (W)150 ge 103.9 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[13:26:38] <jynus>	 but spikes seem to be lately replaces
[13:26:47] <jynus>	 enwiki, enwiktionary, commonswiki
[13:27:14] <jynus>	 (as in, with same pattern as normal traffic)
[13:27:45] <marostegui>	 the only thing that matches on SAL are:
[13:27:46] <marostegui>	 12:15 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 75aa32f: frwiki: Change back to normal logo (T272700) (duration: 01m 07s)
[13:27:46] <marostegui>	 12:07 urbanecm@deploy1001: Synchronized wmf-config/InitialiseSettings.php: 693eaec: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T272202) (duration: 01m 01s)
[13:27:47] <stashbot>	 T272700: Remove birthday logo on French Wikipedia - https://phabricator.wikimedia.org/T272700
[13:27:47] <stashbot>	 T272202: Add bidgee.id.au to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T272202
[13:27:52] <marostegui>	 but not sure if it can be related in any way
[13:28:16] <Urbanecm>	 marostegui: I'm 99% convinced it can't
[13:28:24] <jynus>	 I don't think it should be that, I am going to check traffic patterns
[13:28:36] <marostegui>	 Urbanecm: yeah, I agree
[13:29:20] <jynus>	 marostegui, almost definitely traffic, see: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1
[13:29:27] <jynus>	 traffic as in, web traffic
[13:29:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "> Patch Set 1:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/657855 (https://phabricator.wikimedia.org/T265603) (owner: 10Alexandros Kosiaris)
[13:29:46] <marostegui>	 jynus: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=26&orgId=1
[13:29:48] <marostegui>	 this is pretty crazy
[13:29:50] <jynus>	 there is a small increase in throughput, and a high increase in latency
[13:31:02] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:34:26] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Remove linkrecommendation-external [dns] - 10https://gerrit.wikimedia.org/r/658303 (https://phabricator.wikimedia.org/T258978)
[13:37:58] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:39:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:40:54] <wikibugs>	 (03PS1) 10ArielGlenn: handle backwards searches for bbz2 blocks in tiny files [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658305
[13:40:56] <wikibugs>	 (03PS1) 10ArielGlenn: update tests for different distros and for split-bz2 using local binaries [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/658306
[13:44:26] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:57:47] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: add JSON logging of all notifications [puppet] - 10https://gerrit.wikimedia.org/r/658307 (https://phabricator.wikimedia.org/T272474)
[13:57:51] <wikibugs>	 (03PS1) 10Filippo Giunchedi: rsyslog: send AM notifications logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/658308 (https://phabricator.wikimedia.org/T272474)
[13:58:58] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Marostegui) >>! In T272559#6766714, @jcrespo wrote: > Comments for persistence-related modules: but please @Marostegui @Kormat comment too. >  > * profile::proxysql > I wrote this for...
[14:01:02] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 54 probes of 590 (alerts on 65) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:10:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/657560 (https://phabricator.wikimedia.org/T209953) (owner: 10Ladsgroup)
[14:18:51] <wikibugs>	 (03CR) 10Jbond: "This look fine however (not in scope if this change) i think ultimately we should drop nginx altogether and updated to use envoy for tls t" [puppet] - 10https://gerrit.wikimedia.org/r/657782 (owner: 10Muehlenhoff)
[14:18:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Adapt proxy setting in debmonitor nginx site for CAS [puppet] - 10https://gerrit.wikimedia.org/r/657782 (owner: 10Muehlenhoff)
[14:20:18] <icinga-wm>	 PROBLEM - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 60202 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops
[14:20:22] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:22:24] <wikibugs>	 (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/657782 (owner: 10Muehlenhoff)
[14:22:59] <wikibugs>	 (03CR) 10Jbond: "Its unclear why this is required" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657793 (owner: 10Muehlenhoff)
[14:24:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Remove linkrecommendation-external [dns] - 10https://gerrit.wikimedia.org/r/658303 (https://phabricator.wikimedia.org/T258978) (owner: 10Alexandros Kosiaris)
[14:24:30] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:25:30] <wikibugs>	 (03CR) 10Jbond: debmonitor: Don't include debmonitor_static for the internal listener (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/657795 (owner: 10Muehlenhoff)
[14:25:49] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox
[14:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:53] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate-main - bump to 2021-01-22-173634-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/657885 (https://phabricator.wikimedia.org/T262226) (owner: 10Ottomata)
[14:26:22] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[14:26:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:27:12] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 592 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:28:41] <wikibugs>	 (03PS2) 10Ottomata: Remove migrated EventLoggingSchemas overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657688 (https://phabricator.wikimedia.org/T259163)
[14:28:44] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[14:28:44] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[14:28:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:26] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:48] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[14:31:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:55] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[14:33:56] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[14:33:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:38] <wikibugs>	 (03CR) 10Muehlenhoff: debmonitor: Don't include debmonitor_static for the internal listener (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657795 (owner: 10Muehlenhoff)
[14:35:42] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.dns.netbox
[14:35:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:00] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Remove migrated EventLoggingSchemas overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/657688 (https://phabricator.wikimedia.org/T259163) (owner: 10Ottomata)
[14:37:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:37:46] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Remove 2 Remove migrated EventLoggingSchemas overrides - T259163, T267352 (duration: 00m 56s)
[14:37:48] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:52] <stashbot>	 T259163: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163
[14:37:52] <stashbot>	 T267352: UniversalLanguageSelector Event Platform Migration - https://phabricator.wikimedia.org/T267352
[14:39:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:41:03] <wikibugs>	 (03PS1) 10Ottomata: Migrate SpecialMuteSubmit to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658314 (https://phabricator.wikimedia.org/T268517)
[14:43:58] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Migrate SpecialMuteSubmit to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658314 (https://phabricator.wikimedia.org/T268517) (owner: 10Ottomata)
[14:47:34] <wikibugs>	 (03PS1) 10Ottomata: Revert "Migrate SpecialMuteSubmit to Event Platform on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658343
[14:49:52] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Revert "Migrate SpecialMuteSubmit to Event Platform on all wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658343 (owner: 10Ottomata)
[14:53:10] <wikibugs>	 (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/657890 (https://phabricator.wikimedia.org/T269399) (owner: 10Bstorm)
[14:53:18] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki::web::prod_sites: remove unused code from main.conf [puppet] - 10https://gerrit.wikimedia.org/r/657138 (https://phabricator.wikimedia.org/T272305)
[14:53:20] <wikibugs>	 (03PS8) 10Giuseppe Lavagetto: [WiP] mediawiki: use a data structure to define prod_sites [puppet] - 10https://gerrit.wikimedia.org/r/657139 (https://phabricator.wikimedia.org/T272305)
[14:53:22] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: httpbb: Add test for gzipping of static css files. [puppet] - 10https://gerrit.wikimedia.org/r/658317 (https://phabricator.wikimedia.org/T272305)
[14:57:18] <wikibugs>	 (03CR) 10Volans: "First pass" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657877 (owner: 10Jbond)
[14:58:24] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:00:53] <wikibugs>	 (03CR) 10Volans: sre.misc-clusters.thumbor: create batch action cook book for thumbor (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/657802 (owner: 10Jbond)
[15:02:36] <wikibugs>	 (03PS1) 10Elukey: Revert "role::analytics_test_cluster::client: Upgrade to Bigtop" [puppet] - 10https://gerrit.wikimedia.org/r/658344
[15:03:33] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert "role::analytics_test_cluster::client: Upgrade to Bigtop" [puppet] - 10https://gerrit.wikimedia.org/r/658344 (owner: 10Elukey)
[15:04:52] <wikibugs>	 (03CR) 10Volans: "Apparently the iface label is now nullable and our script are setting it to empty string, see https://netbox.wikimedia.org/extras/changelo" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/657199 (owner: 10CRusnov)
[15:08:42] <wikibugs>	 (03PS1) 10Rosalie Perside (WMDE): Remove Wikibase.NewItemIdFormatter log channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658321 (https://phabricator.wikimedia.org/T268870)
[15:09:00] <wikibugs>	 (03Abandoned) 10Awight: Lower maxHighlightLineLength limit to 5000 [extensions/CodeMirror] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/657308 (https://phabricator.wikimedia.org/T270238) (owner: 10WMDE-Fisch)
[15:09:57] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.change-distro-from-cdh-clients for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[15:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:39] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.change-distro-from-cdh-clients (exit_code=0) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001
[15:10:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:11:13] <elukey>	 very nice
[15:13:40] <wikibugs>	 10SRE, 10observability, 10CAS-SSO: thanos u/i gives errors if left idle for a few hours - https://phabricator.wikimedia.org/T268233 (10fgiunchedi) I can confirm I'm able to reproduce this, AFAICS the problematic case is an XHR from thanos UI with an expired SSO session. In this case the XHR will get a 302 to...
[15:15:08] <dcausse>	 jouncebot: now
[15:15:08] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 44 minute(s)
[15:16:03] <dcausse>	 !log re-opening EU Backport window to ship pending patches
[15:16:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:16:51] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, nit inline." (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/656141 (owner: 10Jbond)
[15:18:03] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] profile: send w3creportingapi logs to indexes with custom schema [puppet] - 10https://gerrit.wikimedia.org/r/657452 (https://phabricator.wikimedia.org/T265938) (owner: 10Cwhite)
[15:20:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27640/console" [puppet] - 10https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[15:20:49] <logmsgbot>	 !log dcausse@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/CirrusSearch/: Add an option to limit the size of the file_text field: T271493 (duration: 00m 58s)
[15:20:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:53] <stashbot>	 T271493: Implement 50kb limit on file text indexing for to reduce increasing commonswiki_file on-disk size - https://phabricator.wikimedia.org/T271493
[15:21:34] <wikibugs>	 (03PS1) 10DCausse: Revert "Add an option to limit the size of the file_text field" [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658345
[15:21:42] <wikibugs>	 (03CR) 10DCausse: [V: 03+2 C: 03+2] Revert "Add an option to limit the size of the file_text field" [extensions/CirrusSearch] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/658345 (owner: 10DCausse)
[15:22:41] <dcausse>	 sigh
[15:23:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+1] logstash: enable curator to accept custom age filters [puppet] - 10https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[15:23:43] <logmsgbot>	 !log dcausse@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/CirrusSearch/: revert: Add an option to limit the size of the file_text field: T271493 (duration: 01m 05s)
[15:23:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:43] <wikibugs>	 (03CR) 10Hnowlan: start using imposm as OSM sync tool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos)
[15:29:09] <wikibugs>	 10SRE, 10Cloud-VPS, 10cloud-services-team (Kanban): CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10fgiunchedi)
[15:29:25] <wikibugs>	 10SRE, 10Platform Engineering (Icebox), 10User-Eevans: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839 (10fgiunchedi)
[15:30:11] <wikibugs>	 10Puppet, 10SRE, 10observability, 10User-jbond: PuppetDB grafana graphs not matching logs - https://phabricator.wikimedia.org/T265649 (10fgiunchedi)
[15:31:13] <wikibugs>	 10SRE, 10Commons, 10MediaWiki-File-management, 10SRE-swift-storage, and 6 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10fgiunchedi)
[15:31:23] <wikibugs>	 (03PS1) 10Volans: tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347
[15:31:38] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347 (owner: 10Volans)
[15:34:01] <elukey>	 what? Riccardo got -1 from jenkins?
[15:34:07] * elukey braces for impact
[15:34:18] * elukey runs away
[15:34:20] <elukey>	 :D
[15:34:49] <ema>	 usually it's the other way round
[15:39:06] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:42:54] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild
[15:42:55] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild
[15:42:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:18] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild
[15:44:18] <logmsgbot>	 !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild
[15:44:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:48:01] <wikibugs>	 (03PS2) 10Hnowlan: maps: reimage maps1009 with buster. [puppet] - 10https://gerrit.wikimedia.org/r/656404 (https://phabricator.wikimedia.org/T238753)
[15:48:03] <wikibugs>	 (03PS2) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753)
[15:48:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: 10Hnowlan)
[15:54:16] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[15:58:04] <wikibugs>	 (03CR) 10DCausse: bump memory for flink processes (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/657941 (owner: 10Mstyles)
[15:58:48] <godog>	 the netbox DNS might be me, I've set some ms-be2* hosts to active
[16:01:19] <godog>	 ah no, linkrecommendation
[16:05:38] <wikibugs>	 (03PS1) 10Bstorm: Revert "toolforge-k8s: AdmissionsConfiguration is GA after 1.17" [puppet] - 10https://gerrit.wikimedia.org/r/658366
[16:06:00] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: pc1 on pc2007 is OK: OK slave_sql_lag Replication lag: 57.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:06:48] <wikibugs>	 (03PS2) 10Volans: tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347
[16:06:59] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] Revert "toolforge-k8s: AdmissionsConfiguration is GA after 1.17" [puppet] - 10https://gerrit.wikimedia.org/r/658366 (owner: 10Bstorm)
[16:07:01] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] Revert "toolforge-k8s: AdmissionsConfiguration is GA after 1.17" [puppet] - 10https://gerrit.wikimedia.org/r/658366 (owner: 10Bstorm)
[16:07:06] <icinga-wm>	 PROBLEM - tileratorui on maps1009 is CRITICAL: connect to address 10.64.32.8 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui
[16:07:31] <logmsgbot>	 !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild
[16:07:31] <logmsgbot>	 !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on maps1009.eqiad.wmnet with reason: Downtiming for rebuild
[16:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:14] <volans>	 godog: are you about to run the cookbook to fix the above uncommitted changes?
[16:08:45] <wikibugs>	 (03PS3) 10Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - 10https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753)
[16:08:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347 (owner: 10Volans)
[16:08:50] <volans>	 I see a start from akosiaris but never an end... maybe you forgot to confirm akoopal ?
[16:09:09] <volans>	 *akosiaris  ^^^ (sorry for the ping akoo.pal )
[16:09:49] <wikibugs>	 (03PS3) 10Volans: tests: cover untested property in the irc module [software/pywmflib] - 10https://gerrit.wikimedia.org/r/658347
[16:11:44] <wikibugs>	 10SRE, 10Inuka-Team, 10Privacy, 10Product-Analytics (Kanban), 10Security: Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10JFishback_WMF) Untagging #security-team for now, but please feel free to add back if there is something else needed.
[16:13:41] <godog>	 volans: no I ran the test cookbook to double check, in a meeting now
[16:13:51] <volans>	 k
[16:17:31] <wikibugs>	 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (10lmata) Hi Joe,  Let us know if there is any support you'd like from our team on this task, otherwise moving to Radar for now.
[16:20:27] <wikibugs>	 10SRE, 10netops, 10observability: Add Icinga check for SRX cluster status - https://phabricator.wikimedia.org/T271298 (10lmata) Hi Arzhel,  Please let me know if there is any specific support you need for this task, moving to Radar meanwhile. Thanks!
[16:22:11] <wikibugs>	 (03PS1) 10Bstorm: toolforge-k8s: update the api version [puppet] - 10https://gerrit.wikimedia.org/r/658350
[16:23:20] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] toolforge-k8s: update the api version [puppet] - 10https://gerrit.wikimedia.org/r/658350 (owner: 10Bstorm)
[16:24:34] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: pc3 on pc2009 is OK: OK slave_sql_lag Replication lag: 55.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:27:30] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.438 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[16:28:02] <icinga-wm>	 RECOVERY - Disk space on maps1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops
[16:29:56] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:31:20] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:36:06] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:37:03] <wikibugs>	 (CR) CRusnov: "> Patch Set 4:" [software/netbox-extras] - https://gerrit.wikimedia.org/r/657199 (owner: CRusnov)
[16:38:18] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:38:41] <wikibugs>	 SRE, SRE-Access-Requests, Machine Learning Platform (Current): Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (klausman)
[16:38:51] <wikibugs>	 (CR) Bstorm: [C: +1] "I believe Toolforge just installs the package and configures it on a quick check, so this shouldn't impact cloud. I also don't see it in i" [puppet] - https://gerrit.wikimedia.org/r/658271 (https://phabricator.wikimedia.org/T272559) (owner: Muehlenhoff)
[16:39:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:40:30] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:40:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:48] <akosiaris>	 volans: indeed. I just typed "go" at the prompt right now. I guess it went (wherever it was meant to "go") :-)
[16:41:12] <volans>	 akosiaris: rotfl
[16:41:12] <wikibugs>	 (CR) Jbond: [C: +1] "thanks lgtm then 😊" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/657795 (owner: Muehlenhoff)
[16:41:19] <wikibugs>	 SRE, Analytics-Clusters, vm-requests: Eq: new Druid test VM for analytics - https://phabricator.wikimedia.org/T266771 (Ottomata)
[16:41:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:43:39] <wikibugs>	 (PS14) Jbond: sre: convert the generic reboot functions to the cookbook class API [cookbooks] - https://gerrit.wikimedia.org/r/657102
[16:50:18] <wikibugs>	 SRE, observability, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) - https://phabricator.wikimedia.org/T141520 (lmata) Open→Resolved a:lmata Hello,  3M delay seems like...
[16:50:58] <wikibugs>	 (PS1) Jason Linehan: Enables MediaWiki client error instrument on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585)
[16:52:50] <icinga-wm>	 RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:38] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Cmjohnson) I am working on it, I am dependant on Dell. I do need to update all the f/w and idrac today.
[16:54:14] <wikibugs>	 (CR) Jbond: "Thanks for the review, however i'm not sure if cookbooks, cumin is currently the right tool do do what i want.  which is basic a way to ru" (4 comments) [cookbooks] - https://gerrit.wikimedia.org/r/657877 (owner: Jbond)
[16:56:11] <wikibugs>	 (PS5) Jbond: WIP) sre.apt.audit: produce a report of manually packages [cookbooks] - https://gerrit.wikimedia.org/r/657877
[16:57:57] <wikibugs>	 (CR) Muehlenhoff: "LGTM, but one thing I noticed is that profile::maps::osm_master needs to be adapted for Buster as well; we need to update the $pgversion c" [puppet] - https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: Hnowlan)
[16:58:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] WIP) sre.apt.audit: produce a report of manually packages [cookbooks] - https://gerrit.wikimedia.org/r/657877 (owner: Jbond)
[17:00:06] <icinga-wm>	 PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:01:05] <wikibugs>	 (CR) Jason Linehan: "Playing it safe by just knocking out the enwiki off switch for this test run." [mediawiki-config] - https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585) (owner: Jason Linehan)
[17:01:22] <wikibugs>	 (CR) Jbond: "thanks for the review comments inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/657802 (owner: Jbond)
[17:01:46] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[17:02:10] <wikibugs>	 (PS1) David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357
[17:02:47] <wikibugs>	 (CR) David Caro: wmcs: first try on creating a new etcd for toolforge (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/658357 (owner: David Caro)
[17:04:03] <wikibugs>	 (PS2) David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357
[17:07:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357 (owner: David Caro)
[17:07:49] <wikibugs>	 (PS3) David Caro: wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357
[17:10:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs: first try on creating a new etcd for toolforge [cookbooks] - https://gerrit.wikimedia.org/r/658357 (owner: David Caro)
[17:10:03] <wikibugs>	 (CR) Jdlrobson: [C: +1] Enables MediaWiki client error instrument on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585) (owner: Jason Linehan)
[17:19:50] <wikibugs>	 (PS5) Jbond: dns:  update DNS to support multiple namservers [software/pywmflib] - https://gerrit.wikimedia.org/r/656141
[17:20:05] <wikibugs>	 (CR) Jbond: "updated thanks" (1 comment) [software/pywmflib] - https://gerrit.wikimedia.org/r/656141 (owner: Jbond)
[17:22:18] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm and thanks 😊" [software/pywmflib] - https://gerrit.wikimedia.org/r/658347 (owner: Volans)
[17:22:37] <wikibugs>	 (CR) jerkins-bot: [V: -1] dns:  update DNS to support multiple namservers [software/pywmflib] - https://gerrit.wikimedia.org/r/656141 (owner: Jbond)
[17:22:55] <wikibugs>	 (CR) Volans: [C: +2] tests: cover untested property in the irc module [software/pywmflib] - https://gerrit.wikimedia.org/r/658347 (owner: Volans)
[17:25:14] <wikibugs>	 (Merged) jenkins-bot: tests: cover untested property in the irc module [software/pywmflib] - https://gerrit.wikimedia.org/r/658347 (owner: Volans)
[17:25:20] <wikibugs>	 (CR) Jbond: "resolve nits" (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/657102 (owner: Jbond)
[17:26:19] <wikibugs>	 (PS6) Jbond: dns:  update DNS to support multiple namservers [software/pywmflib] - https://gerrit.wikimedia.org/r/656141
[17:29:45] <wikibugs>	 (PS1) Elukey: Add more users to the Hadoop Backup cluster (no ssh access) [puppet] - https://gerrit.wikimedia.org/r/658394 (https://phabricator.wikimedia.org/T260411)
[17:31:48] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:38:57] <wikibugs>	 SRE, SRE-Access-Requests, Machine Learning Platform (Current): Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (jbond) @klausman (adding a comment here incase it was missed from the meeting) when this access is revoked and the hacking is over...
[17:39:04] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:39:23] <wikibugs>	 (CR) Elukey: [C: +2] Add more users to the Hadoop Backup cluster (no ssh access) [puppet] - https://gerrit.wikimedia.org/r/658394 (https://phabricator.wikimedia.org/T260411) (owner: Elukey)
[17:42:18] <wikibugs>	 (CR) Hnowlan: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: Hnowlan)
[17:42:20] <wikibugs>	 (PS4) Hnowlan: maps: make maps1009 a new, independent buster master. [puppet] - https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753)
[17:42:26] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 240495320 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:44:52] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 524296 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[17:46:10] <wikibugs>	 (PS1) Ladsgroup: wmcs: Migrate hiera() to lookup() and set datatypes in nfs primary [puppet] - https://gerrit.wikimedia.org/r/658397 (https://phabricator.wikimedia.org/T209953)
[17:48:09] <wikibugs>	 (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27644/console" [puppet] - https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: Hnowlan)
[17:48:19] <wikibugs>	 (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/27643/" [puppet] - https://gerrit.wikimedia.org/r/658397 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[17:48:52] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on deploy2002 is CRITICAL: CRITICAL: Missing 1 sites from wikiversions. 966 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[17:49:27] <wikibugs>	 (PS1) RobH: updating r740xd2 skus [software] - https://gerrit.wikimedia.org/r/658398
[17:49:28] <icinga-wm>	 PROBLEM - Ensure local MW versions match expected deployment on deploy1002 is CRITICAL: CRITICAL: Missing 1 sites from wikiversions. 966 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers
[17:49:53] <wikibugs>	 (CR) RobH: [C: +2] updating r740xd2 skus [software] - https://gerrit.wikimedia.org/r/658398 (owner: RobH)
[17:50:36] <wikibugs>	 (Merged) jenkins-bot: updating r740xd2 skus [software] - https://gerrit.wikimedia.org/r/658398 (owner: RobH)
[17:50:46] <robh>	 huh
[17:50:50] <robh>	 it auto merged in that repo
[17:50:53] <robh>	 i did... not expect that
[17:51:19] <robh>	 its just the ser software repo for local use on laptops so not a big deal, just unexpected.
[17:51:27] <robh>	 sre
[17:53:40] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] Remove Wikibase.NewItemIdFormatter log channel [mediawiki-config] - https://gerrit.wikimedia.org/r/658321 (https://phabricator.wikimedia.org/T268870) (owner: Rosalie Perside (WMDE))
[17:58:12] <wikibugs>	 SRE, SRE-Access-Requests, Machine Learning Platform (Current): Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (klausman) >>! In T272687#6774181, @jbond wrote: > @klausman (adding a comment here incase it was missed from the meeting) when thi...
[18:00:04] <jouncebot>	 ryankemper: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T1800).
[18:04:22] <wikibugs>	 SRE, ops-eqiad: ms-be1046 stuck on reboot - https://phabricator.wikimedia.org/T272396 (Cmjohnson) I also attempted to update bios, power f/w, idrac, and all were failed due to the server's inability to power up.   A dell ticket has been created.  SR1049823171
[18:05:20] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Cmjohnson) I failed to re-connect the mgmt cable after getting it to power on and was not able to remotely access the server to get the logs for the Dell tech.  I connected everything, updated the bios and...
[18:06:03] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) Thanks Chris, any chances that we can get the host to boot up at least so MySQL replication can catch up a bit. Thank you!
[18:06:47] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Cmjohnson) @marostegui it should be accessible now
[18:07:18] <wikibugs>	 SRE, Patch-For-Review, cloud-services-team (Kanban): Use lookup() instead of hiera() in Puppet - https://phabricator.wikimedia.org/T209953 (Dzahn)
[18:07:33] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Cmjohnson) There is just let memory at the moment
[18:08:25] <wikibugs>	 SRE, Patch-For-Review, cloud-services-team (Kanban): Use lookup() instead of hiera() in Puppet - https://phabricator.wikimedia.org/T209953 (Dzahn)
[18:16:56] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Cmjohnson) Dell ticket number SR1049824647
[18:18:11] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:19:10] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:21:53] <wikibugs>	 Puppet, SRE, Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (Legoktm) p:Triage→Lowest
[18:22:20] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:22:44] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[18:25:40] <wikibugs>	 SRE, ops-eqiad, Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (Cmjohnson) The issue that Dell has with this is we cannot determine which DIMM is failed. The hardware logs all look good and do not indicate an...
[18:25:45] <wikibugs>	 (CR) Jcrespo: [C: +1] "This was approved today:" [puppet] - https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: ArielGlenn)
[18:25:53] <wikibugs>	 (PS7) Jcrespo: add platform engineering folks to snapshot and dumpsdata server access [puppet] - https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: ArielGlenn)
[18:26:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] add platform engineering folks to snapshot and dumpsdata server access [puppet] - https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: ArielGlenn)
[18:27:34] <wikibugs>	 (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27645/console" [puppet] - https://gerrit.wikimedia.org/r/656460 (https://phabricator.wikimedia.org/T238753) (owner: Hnowlan)
[18:28:04] <wikibugs>	 SRE, Wikimedia-Mailing-lists, I18n: Mailman password reminder mail (and other texts) has broken encoding in Czech - https://phabricator.wikimedia.org/T271123 (Legoktm) >>! In T271123#6752321, @Mormegil wrote: > Well, yes, for Czech, the subscription confirmation e-mail seems to be sent correctly, now...
[18:29:24] <wikibugs>	 (CR) Jcrespo: [C: -1] "Oh, we need to update GIDs on rebase, as changes have happened since then. :-( +1 otherwise." [puppet] - https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: ArielGlenn)
[18:29:34] <wikibugs>	 SRE, ops-eqiad, Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (Cmjohnson) I am attaching the TSR report so you will see none of the h/w logs suggest there is an issue. {F34022892}
[18:30:16] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:31:36] <wikibugs>	 SRE, ops-eqiad, DBA: Memory errors on clouddb1019 - https://phabricator.wikimedia.org/T272125 (Marostegui) Thanks Chris - I can now access the server and will start mysql so it can catch up on replication!. Let's coordinate to install the new memory once it arrives. Thanks again
[18:31:42] <wikibugs>	 (CR) Jcrespo: [C: -1] "```" [puppet] - https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: ArielGlenn)
[18:33:03] <wikibugs>	 (PS3) Hnowlan: mtail: create separate metrics histogram for REST API requests [puppet] - https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727)
[18:33:34] <wikibugs>	 (PS1) Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - https://gerrit.wikimedia.org/r/658407
[18:33:52] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 317251272 and 82 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:34:07] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1411.eqiad.wmnet with reason: REIMAGE
[18:34:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] O:idp: update apero_cas::service so its a bit more intuitive [puppet] - https://gerrit.wikimedia.org/r/658407 (owner: Jbond)
[18:35:02] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[18:36:08] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1412.eqiad.wmnet with reason: REIMAGE
[18:36:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:11] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1411.eqiad.wmnet with reason: REIMAGE
[18:36:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:37:22] <wikibugs>	 SRE, Dumps-Generation, SRE-Access-Requests, Patch-For-Review, Platform Team Workboards (Clinic Duty Team): Add all of CPT to snapshot/dumpsdata admins - https://phabricator.wikimedia.org/T271718 (Legoktm) This was approved in today's SRE meeting.
[18:38:18] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1412.eqiad.wmnet with reason: REIMAGE
[18:38:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:25] <wikibugs>	 (CR) Legoktm: add platform engineering folks to snapshot and dumpsdata server access (1 comment) [puppet] - https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: ArielGlenn)
[18:40:33] <wikibugs>	 (PS2) Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - https://gerrit.wikimedia.org/r/658407
[18:41:22] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2326.codfw.wmnet with reason: REIMAGE
[18:41:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:58] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc3 on pc2009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 318.90 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:43:23] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2326.codfw.wmnet with reason: REIMAGE
[18:43:24] <marostegui>	 I am going to silence all pc codfw replicas for a day
[18:43:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:43:36] <jynus>	 oh, is it happening again?
[18:43:58] <wikibugs>	 (PS8) Legoktm: admin: Add platform engineering team to {snapshot,dumpsdata}-admins [puppet] - https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: ArielGlenn)
[18:44:00] <jynus>	 cool to me
[18:44:16] <wikibugs>	 (PS3) Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - https://gerrit.wikimedia.org/r/658407
[18:44:24] <marostegui>	 It was a small spike, I guess it will be spiking maybe a few hours
[18:44:27] <marostegui>	 I will downtime to avoid noise
[18:44:33] <jynus>	 +1, thanks
[18:45:16] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2324.codfw.wmnet with reason: REIMAGE
[18:45:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:32] <wikibugs>	 (PS4) Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - https://gerrit.wikimedia.org/r/658407
[18:47:43] <jynus>	 alert2001?
[18:48:05] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:49:05] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:50:14] <wikibugs>	 SRE, ops-eqiad, Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (EBernhardson) As far as I understand it, it's not possible for the linux kernel to map a physical address back to a single dimm. It just doesn't...
[18:50:24] <wikibugs>	 Puppet, SRE, Patch-For-Review: run-puppet-agent --enable flag is broken - https://phabricator.wikimedia.org/T272539 (Legoktm) p:Triage→High a:jbond
[18:52:07] <wikibugs>	 (PS5) Jbond: O:idp: update apero_cas::service so its a bit more intuitive [puppet] - https://gerrit.wikimedia.org/r/658407
[18:53:23] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27650/console" [puppet] - https://gerrit.wikimedia.org/r/658407 (owner: Jbond)
[18:54:12] <wikibugs>	 SRE, SRE-Access-Requests, Machine Learning Platform (Current): Give access to ml-serve* to the non-ops members of the ML team - https://phabricator.wikimedia.org/T272687 (Legoktm) This was approved in today's SRE meeting with the following notes: * Approved due to the recognition of exceptional circu...
[18:54:37] <MatmaRex>	 hi, anyone around bored and willing to help with a small thing? can you run `var_dump( ChangeTags::listDefinedTags() )` for me on production enwiki, and paste the output somewhere?
[18:55:02] <MatmaRex>	 this is mostly the same data as on https://en.wikipedia.org/wiki/Special:Tags , but i want to know the internal order
[18:55:47] <wikibugs>	 (CR) Jbond: [V: +1] "As far as i can tell PCC just shows some reordering" [puppet] - https://gerrit.wikimedia.org/r/658407 (owner: Jbond)
[18:56:11] <legoktm>	 sure
[18:56:29] <wikibugs>	 (PS1) Ottomata: eventgate-analytics-external - bump to 2021-01-25-183848-production [deployment-charts] - https://gerrit.wikimedia.org/r/658410 (https://phabricator.wikimedia.org/T257237)
[18:56:39] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:57:43] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1411.eqiad.wmnet'] `  an...
[18:57:48] <legoktm>	 MatmaRex: https://phabricator.wikimedia.org/P13947
[18:58:03] <MatmaRex>	 thanks legoktm <3
[18:58:33] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1412.eqiad.wmnet'] `  an...
[18:59:27] <wikibugs>	 (CR) Ottomata: [C: +2] eventgate-analytics-external - bump to 2021-01-25-183848-production [deployment-charts] - https://gerrit.wikimedia.org/r/658410 (https://phabricator.wikimedia.org/T257237) (owner: Ottomata)
[19:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T1900).
[19:00:05] <jouncebot>	 tgr: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:14] <tgr_>	 o/
[19:00:21] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[19:00:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:01:10] <Jdlrobson>	 tgr_: can i add a config change to the backport window?
[19:01:18] <tgr_>	 sure
[19:01:27] <wikibugs>	 SRE, Analytics, Analytics-Kanban, Event-Platform: Reduce cache TTL of schema.wikimedia.org - https://phabricator.wikimedia.org/T267557 (fdans) Open→Resolved
[19:01:42] <Jdlrobson>	 tgr it's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/658356
[19:02:02] <wikibugs>	 SRE, Analytics, Analytics-Kanban, netops, Patch-For-Review: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (fdans) Open→Resolved
[19:02:51] <Jdlrobson>	 have added to the calendar
[19:03:21] <wikibugs>	 (CR) Legoktm: [C: +2] admin: Add platform engineering team to {snapshot,dumpsdata}-admins [puppet] - https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: ArielGlenn)
[19:07:03] <wikibugs>	 (PS2) Gergő Tisza: [beta] GrowthExperiments: set link recommendation feature flags [mediawiki-config] - https://gerrit.wikimedia.org/r/657292
[19:07:14] <wikibugs>	 SRE, Dumps-Generation, SRE-Access-Requests, Patch-For-Review, Platform Team Workboards (Clinic Duty Team): Add all of CPT to snapshot/dumpsdata admins - https://phabricator.wikimedia.org/T271718 (Legoktm) Open→Resolved This should rollout over the next 20-30 minutes. Please re-open if...
[19:08:50] <wikibugs>	 (CR) Gergő Tisza: [C: +2] [beta] GrowthExperiments: set link recommendation feature flags [mediawiki-config] - https://gerrit.wikimedia.org/r/657292 (owner: Gergő Tisza)
[19:09:53] <wikibugs>	 (Merged) jenkins-bot: [beta] GrowthExperiments: set link recommendation feature flags [mediawiki-config] - https://gerrit.wikimedia.org/r/657292 (owner: Gergő Tisza)
[19:12:00] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Show listadmins (names or email addresses?) on main page of each mailing list - https://phabricator.wikimedia.org/T272778 (Legoktm) p:Triage→Low >>! In T272778#6772072, @Ciell wrote: > We have several lists (our admin-list for instance) that do not allow open registr...
[19:13:23] <icinga-wm>	 PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:14:21] <wikibugs>	 (PS3) Ottomata: eventgate -  Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237)
[19:14:48] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2326.codfw.wmnet'] `  an...
[19:15:45] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2324.codfw.wmnet'] `  an...
[19:16:08] <wikibugs>	 (PS2) Gergő Tisza: Enables MediaWiki client error instrument on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585) (owner: Jason Linehan)
[19:16:12] <logmsgbot>	 !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:657292|[beta] GrowthExperiments: set link recommendation feature flags ()]] (duration: 01m 06s)
[19:16:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:21] <wikibugs>	 (03PS4) 10Ottomata: eventgate -  Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237)
[19:16:38] <wikibugs>	 (03PS5) 10Ottomata: eventgate -  Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237)
[19:17:39] <icinga-wm>	 RECOVERY - PHP opcache health on mwdebug1001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[19:18:11] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate -  Map from eventgate event and error statsd metrics to prometheus [deployment-charts] - 10https://gerrit.wikimedia.org/r/657908 (https://phabricator.wikimedia.org/T257237) (owner: 10Ottomata)
[19:18:19] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Enables MediaWiki client error instrument on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan)
[19:18:23] <wikibugs>	 10SRE, 10DNS, 10Mail, 10Traffic: ITS request to update SPF & DNS Records for Trust & Safety - https://phabricator.wikimedia.org/T272750 (10Legoktm) From the Greenhouse task @Aklapper linked:  >>! In T189065#4031997, faidon wrote: > This has been discussed in bigger requests a couple of times before (T10389...
[19:18:33] <wikibugs>	 10SRE, 10DNS, 10Mail, 10Traffic: ITS request to update SPF & DNS Records for Trust & Safety - https://phabricator.wikimedia.org/T272750 (10Legoktm) p:05Triage→03Medium
[19:18:47] <wikibugs>	 10SRE, 10serviceops: Upgrade docker-registry servers to Debian Buster - https://phabricator.wikimedia.org/T272550 (10Legoktm) p:05Triage→03Low
[19:19:29] <wikibugs>	 (03Merged) 10jenkins-bot: Enables MediaWiki client error instrument on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658356 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan)
[19:20:38] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[19:20:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:47] <tgr_>	 Jdlrobson: it's on mwdebug1001
[19:20:53] <Jdlrobson>	 sweet on it
[19:21:23] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 43552 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:22:49] <Jdlrobson>	 tgr_: works!
[19:24:11] <Jdlrobson>	 tgr_: https://logstash.wikimedia.org/app/dashboards#/doc/logstash-*/logstash-2021.01.25?id=U3L_OncBXM-H9NFXjGWr
[19:25:05] <logmsgbot>	 !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:658356|Enables MediaWiki client error instrument on English Wikipedia (T255585)]] (duration: 01m 01s)
[19:25:07] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "Sigh, sorry for forgetting about another piece of this mess... retroactive LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658218 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui)
[19:25:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:09] <stashbot>	 T255585: [EPIC] Extend client-side error logging coverage to include English Wikipedia - https://phabricator.wikimedia.org/T255585
[19:25:26] <tgr_>	 Jdlrobson: it's live. Thanks for pushing JS error logging forward!
[19:25:46] <Jdlrobson>	 sweet... now for the fun game of identifying broken gadgets :)
[19:25:55] <tgr_>	 btw should we add the JS error channel to the things deployers should keep an eye on, or is it too noisy for that?
[19:26:48] <wikibugs>	 10SRE, 10docker-pkg, 10serviceops, 10Technical-Debt: Get rid of the concept of "seed image" in docker-pkg - https://phabricator.wikimedia.org/T272154 (10Legoktm) p:05Triage→03Low
[19:27:06] <wikibugs>	 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (10Legoktm) p:05Triage→03Medium
[19:29:46] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[19:29:47] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[19:29:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:11] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:31:55] <wikibugs>	 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10Legoktm) p:05Triage→03Medium Have they announced when the 7.11 release is happening? It's not clear to me how long w...
[19:32:39] <wikibugs>	 (03PS1) 10Ottomata: eventgate-* - bump to 2021-01-25-183848-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/658412 (https://phabricator.wikimedia.org/T257237)
[19:32:42] <wikibugs>	 (03PS1) 10Ottomata: eventgate-main - precache /mediawiki/revision/recommendation-create/1.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/658413 (https://phabricator.wikimedia.org/T262226)
[19:34:47] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 47995128 and 29 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:35:03] <wikibugs>	 10SRE, 10Traffic, 10serviceops, 10Performance Issue: When logged in, loading the frwiki homepage takes a very long time - https://phabricator.wikimedia.org/T270631 (10Legoktm) 05Open→03Resolved a:03Legoktm >>! In T270631#6706316, @Legoktm wrote: > @Thibaut120094 I believe this requires editing https:...
[19:35:55] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:36:17] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 522488 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:36:30] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate-* - bump to 2021-01-25-183848-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/658412 (https://phabricator.wikimedia.org/T257237) (owner: 10Ottomata)
[19:36:39] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' .
[19:36:39] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' .
[19:36:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:36:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:18] <tgr_>	 !log Morning deploys done
[19:37:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:26] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] eventgate-main - precache /mediawiki/revision/recommendation-create/1.0.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/658413 (https://phabricator.wikimedia.org/T262226) (owner: 10Ottomata)
[19:37:47] <Jdlrobson>	 > btw should we add the JS error channel to the things deployers should keep an eye on, or is it too noisy for that?
[19:37:53] <Jdlrobson>	 @tgr definitely
[19:37:57] <Jdlrobson>	 tgr_: definitely
[19:38:12] <Jdlrobson>	 I'm attending the deploy triage meeting for the train
[19:38:26] <Jdlrobson>	 but yeh I think when backporting JS changes, it's important for us to be checking the graphs post-change
[19:40:08] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[19:40:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:16] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[19:41:16] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[19:41:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:39] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[19:41:39] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' .
[19:41:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:41:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:43:22] <wikibugs>	 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10EBernhardson) We have as long as we want to figure out what to do next, I don't think the day they drop 7.11 changes any...
[19:44:00] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[19:44:00] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' .
[19:44:01] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[19:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:44:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:47:52] <wikibugs>	 (03PS1) 10Bstorm: metrics-server: upgrade to the latest version [puppet] - 10https://gerrit.wikimedia.org/r/658416
[19:48:06] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1411.eqiad.wmnet
[19:48:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:23] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1412.eqiad.wmnet
[19:48:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:30] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' .
[19:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:48:54] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2326.codfw.wmnet
[19:48:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:49:28] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2324.codfw.wmnet
[19:49:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:50:26] <wikibugs>	 10SRE, 10DBA, 10Platform Engineering Roadmap Decision Making, 10Performance-Team (Radar), 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10Krinkle)
[19:52:12] <logmsgbot>	 !log otto@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[19:52:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:53:39] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1411.eqiad.wmnet
[19:53:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:16] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1412.eqiad.wmnet
[19:54:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:55] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[19:54:55] <logmsgbot>	 !log otto@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[19:54:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:54:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:43] <wikibugs>	 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10Jclark-ctr) replaced Dac cable for  an-worker1119 and an-worker1131  @elukey confirmed both are seeing network
[19:56:51] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:57:50] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:58:00] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2326.codfw.wmnet'] `  Of...
[19:58:14] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[19:59:31] <wikibugs>	 10SRE, 10Wikimedia-Logstash, 10observability, 10Software-Licensing: Elasticsearch and Kibana are switching to non-OSI-approved SSPL licence - https://phabricator.wikimedia.org/T272238 (10AntiCompositeNumber) Elastic does not announce release dates in advance.  CirrusSearch is still on ES 6, which will beco...
[20:00:48] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2324.codfw.wmnet
[20:00:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:40] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:01:43] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2326.codfw.wmnet'] `  Of...
[20:02:25] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:04:02] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:12:37] <wikibugs>	 (03PS1) 10Mforns: Refine SuggestedTagsAction schema using eventlogging_legacy job [puppet] - 10https://gerrit.wikimedia.org/r/658419 (https://phabricator.wikimedia.org/T267351)
[20:12:43] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1410.eqiad.wmnet with reason: REIMAGE
[20:12:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:49] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1410.eqiad.wmnet with reason: REIMAGE
[20:14:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:15:50] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 65633552 and 42 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:16:56] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:17:14] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2326.codfw.wmnet with reason: REIMAGE
[20:17:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:00] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 328304 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:19:04] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:19:22] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2326.codfw.wmnet with reason: REIMAGE
[20:19:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:21] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "Yes, indeed. looks like a duplicate and this is already done" [puppet] - 10https://gerrit.wikimedia.org/r/641508 (https://phabricator.wikimedia.org/T267744) (owner: 10Herron)
[20:21:04] <wikibugs>	 (03Abandoned) 10Herron: admin: add ldap_only_user entry for tillmletzko-wmde [puppet] - 10https://gerrit.wikimedia.org/r/641508 (https://phabricator.wikimedia.org/T267744) (owner: 10Herron)
[20:21:29] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2322.codfw.wmnet with reason: REIMAGE
[20:21:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:06] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2323.codfw.wmnet with reason: REIMAGE
[20:23:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:33] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2322.codfw.wmnet with reason: REIMAGE
[20:23:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:36] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2323.codfw.wmnet with reason: REIMAGE
[20:25:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:30:18] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:32:08] <wikibugs>	 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10Dzahn) >>! In T272559#6766384, @jbond wrote: > Im not familiar with the frack set up, do they depend on our repo or do they have there own.  My assumption has always been that they ah...
[20:35:27] <wikibugs>	 (03PS1) 10Papaul: DHCP: Add MAC address and partman for cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/658422 (https://phabricator.wikimedia.org/T271590)
[20:35:28] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:35:39] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Dzahn) >  Gateway Time-out for url: https://docker-registry.discovery.wmnet  Gotta set HTTP_PROXY/HTTPS_PROXY env variable?
[20:35:41] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' .
[20:35:41] <logmsgbot>	 !log otto@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' .
[20:35:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:50] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1410.eqiad.wmnet'] `  an...
[20:37:12] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] DHCP: Add MAC address and partman for cloudgw2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/658422 (https://phabricator.wikimedia.org/T271590) (owner: 10Papaul)
[20:40:59] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1410.eqiad.wmnet
[20:41:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:38] <wikibugs>	 (03PS1) 10Papaul: Partman: Add cloudgw2002 [puppet] - 10https://gerrit.wikimedia.org/r/658423 (https://phabricator.wikimedia.org/T271590)
[20:42:28] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Partman: Add cloudgw2002 [puppet] - 10https://gerrit.wikimedia.org/r/658423 (https://phabricator.wikimedia.org/T271590) (owner: 10Papaul)
[20:44:07] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1410.eqiad.wmnet
[20:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:45:25] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[20:46:30] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2326.codfw.wmnet'] `  an...
[20:46:57] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2326.codfw.wmnet
[20:46:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:42] <wikibugs>	 (03PS1) 10Papaul: Add clougw2002-dev to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/658424 (https://phabricator.wikimedia.org/T271590)
[20:49:05] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] Add clougw2002-dev to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/658424 (https://phabricator.wikimedia.org/T271590) (owner: 10Papaul)
[20:49:43] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2326.codfw.wmnet
[20:49:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:10] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2322.codfw.wmnet'] `  an...
[20:50:35] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10wiki_willy) Hi @Jgreen - it looks like we're running a bit tight on space in the Fundraising rack.  In order for us to rack the servers for this install, do you have 1-2 existing se...
[20:52:30] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by pt1979 on cumin2001.codfw.wmnet for hosts: ` cloudgw20...
[20:52:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:52:59] <wikibugs>	 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2323.codfw.wmnet'] `  an...
[20:53:50] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:53:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jclark-ctr)
[20:55:00] <wikibugs>	 (03PS1) 10Mforns: Migrate WebUIActionsTracking schemas to Event Platform on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658426 (https://phabricator.wikimedia.org/T267347)
[20:55:23] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Looking good: https://puppet-compiler.wmflabs.org/compiler1002/27642/" [puppet] - 10https://gerrit.wikimedia.org/r/658211 (https://phabricator.wikimedia.org/T271427) (owner: 10Marostegui)
[20:57:12] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] wmnet: Update s4-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/658213 (https://phabricator.wikimedia.org/T271427) (owner: 10Marostegui)
[21:00:04] <jouncebot>	 chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T2100).
[21:02:01] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) ` legoktm@registry2002:~$ time curl "https://docker-registry.discovery.wmnet/v2/_catalog?last=releng%2Fquibble-jessie-php55&n=100" {"re...
[21:07:08] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 111119872 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:07:25] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1338.eqiad.wmnet with reason: REIMAGE
[21:07:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:07:50] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10Papaul)
[21:08:03] <logmsgbot>	 !log pt1979@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE
[21:08:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:05] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Legoktm) ` legoktm@registry2002:~$ time curl "https://docker-registry.discovery.wmnet/v2/_catalog?last=releng%2Fquibble-jessie-php55&n=100" <htm...
[21:08:42] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 565368 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:08:56] <wikibugs>	 10SRE, 10MediaWiki-Containers, 10serviceops, 10Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (10Dzahn) Fwiw, i get the same timeout when doing that curl command from registry1002.
[21:09:28] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1338.eqiad.wmnet with reason: REIMAGE
[21:09:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:24] <logmsgbot>	 !log pt1979@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE
[21:11:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:01] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 253265904 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:13:05] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:19:13] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudgw2002-dev.codfw.wmnet'] `  and were **ALL** successful.
[21:19:23] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:20:13] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10Papaul)
[21:21:45] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Kanban): (Need By: TBD) rack/setup/install cloudgw2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T271590 (10Papaul) 05Open→03Resolved @aborrero  this is complete.  let me know if you have any questions.  Thanks.
[21:30:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:30:38] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:08] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:43:22] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: pc3 on pc2009 is OK: OK slave_sql_lag Replication lag: 12.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[21:49:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:53:35] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1338.eqiad.wmnet'] `  an...
[21:53:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:54:49] <wikibugs>	 (PS1) Legoktm: docker_registry_ha: Increase nginx proxy timeout to 120s [puppet] - https://gerrit.wikimedia.org/r/658436 (https://phabricator.wikimedia.org/T179696)
[21:57:02] <wikibugs>	 SRE, MediaWiki-Containers, serviceops, Patch-For-Review: Homepage for https://docker-registry.wikimedia.org - https://phabricator.wikimedia.org/T179696 (Legoktm) In my testing of repeatedly issuing the same curl command over and over, it usually took ~35s to respond, but sometimes it took over 1m...
[21:59:13] <icinga-wm>	 PROBLEM - Check systemd state on registry1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:00:04] <jouncebot>	 Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210125T2200).
[22:02:29] <wikibugs>	 (CR) Bstorm: [C: +2] wikireplicas: Add new DNS names for multiinstance replicas [puppet] - https://gerrit.wikimedia.org/r/657155 (https://phabricator.wikimedia.org/T267376) (owner: Bstorm)
[22:04:44] <wikibugs>	 (PS2) Legoktm: Drop obsolete requirements.txt and setup.py [mediawiki-config] - https://gerrit.wikimedia.org/r/657954
[22:04:46] <wikibugs>	 (PS2) Legoktm: Split $wmgSiteLogo{1,1_5,2}x to a separate logos.php [mediawiki-config] - https://gerrit.wikimedia.org/r/657955
[22:04:48] <wikibugs>	 (PS5) Legoktm: Add script to mostly automate logo management [mediawiki-config] - https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640)
[22:12:23] <wikibugs>	 (CR) Krinkle: Add script to mostly automate logo management (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) (owner: Legoktm)
[22:15:32] <wikibugs>	 (CR) Krinkle: Add script to mostly automate logo management (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) (owner: Legoktm)
[22:17:05] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 0 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[22:17:55] <wikibugs>	 (PS6) Legoktm: Add script to mostly automate logo management [mediawiki-config] - https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640)
[22:18:06] <wikibugs>	 (CR) Legoktm: Add script to mostly automate logo management (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/657956 (https://phabricator.wikimedia.org/T98640) (owner: Legoktm)
[22:19:25] <wikibugs>	 (PS3) Cwhite: logstash: enable curator to accept custom age filters [puppet] - https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565)
[22:23:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:23:35] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 18 down 2 https://wikitech.wikimedia.org/wiki/HAProxy
[22:24:29] <wikibugs>	 (CR) Volans: "reply inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/657802 (owner: Jbond)
[22:28:09] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:29:16] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1338.eqiad.wmnet
[22:29:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:50] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: enable curator to accept custom age filters [puppet] - https://gerrit.wikimedia.org/r/657370 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[22:29:58] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2323.codfw.wmnet
[22:30:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:59] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw2322.codfw.wmnet
[22:31:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:51] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:53] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 225817080 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:34:13] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 58848 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:34:40] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2323.codfw.wmnet
[22:34:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:50] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2322.codfw.wmnet
[22:34:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:20] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:35:23] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2321.codfw.wmnet'] `  Of...
[22:35:59] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:36:04] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] `  Of...
[22:38:27] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:38:36] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] `  Of...
[22:38:45] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:39:29] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin2001.codf...
[22:39:34] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] `  Of...
[22:40:24] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:41:18] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:41:22] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2321.codfw.wmnet'] `  Of...
[22:41:39] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:43:53] <wikibugs>	 (PS1) Aaron Schulz: Enable "coalesceKeys" for global keys for WANCache (III) [mediawiki-config] - https://gerrit.wikimedia.org/r/658372
[22:44:07] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1338.eqiad.wmnet
[22:44:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:12] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:47:13] <icinga-wm>	 RECOVERY - Check systemd state on registry1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:47:30] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia...
[22:50:43] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[22:52:18] <wikibugs>	 (PS1) Andrew Bogott: cloud-vps instances: add a helper script to format & mount a cinder volume [puppet] - https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114)
[22:53:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] cloud-vps instances: add a helper script to format & mount a cinder volume [puppet] - https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: Andrew Bogott)
[22:54:08] <wikibugs>	 (CR) Bstorm: "I haven't merged this yet partly because I'm a little fuzzy on the docs. If I merge this, is it basically just a repo update or do I have " [puppet] - https://gerrit.wikimedia.org/r/639881 (https://phabricator.wikimedia.org/T263284) (owner: Bstorm)
[22:55:52] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 43158016 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:56:16] <wikibugs>	 (PS2) Legoktm: docker_registry_ha: Increase nginx proxy timeout to 120s [puppet] - https://gerrit.wikimedia.org/r/658436 (https://phabricator.wikimedia.org/T179696)
[22:56:39] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (Aklapper) Open→Stalled Hi @mewoph, thanks for taking the time to report this and welcome to Wikimedia Phabricator! Could you please follow https://phabricator.wikimedia.org/project/...
[22:57:06] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 7749336 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:57:28] <wikibugs>	 (PS2) Andrew Bogott: cloud-vps instances: add a helper script to format & mount a cinder volume [puppet] - https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114)
[22:58:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] cloud-vps instances: add a helper script to format & mount a cinder volume [puppet] - https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: Andrew Bogott)
[22:59:27] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2320.codfw.wmnet with reason: REIMAGE
[22:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:18] <wikibugs>	 (PS3) Andrew Bogott: cloud-vps instances: add a helper script to format & mount a cinder volume [puppet] - https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114)
[23:00:42] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2331.codfw.wmnet with reason: REIMAGE
[23:00:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:01:34] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2320.codfw.wmnet with reason: REIMAGE
[23:01:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:38] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:03:10] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for MewOphaswongse - https://phabricator.wikimedia.org/T272912 (mewoph)  @Aklapper I just linked my MediaWiki account w/Phabricator. I'm a full time employee so I didn't fill in the contractor-only fields. Thanks!
[23:03:29] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2331.codfw.wmnet with reason: REIMAGE
[23:03:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:03:38] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:05:18] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2319.codfw.wmnet with reason: REIMAGE
[23:05:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:51] <wikibugs>	 (PS1) Legoktm: openldap: Prepare cross-validate-accounts for Python 3 [puppet] - https://gerrit.wikimedia.org/r/658455
[23:06:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] openldap: Prepare cross-validate-accounts for Python 3 [puppet] - https://gerrit.wikimedia.org/r/658455 (owner: Legoktm)
[23:06:34] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2318.codfw.wmnet with reason: REIMAGE
[23:06:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:07] <wikibugs>	 (PS2) Legoktm: openldap: Prepare cross-validate-accounts for Python 3 [puppet] - https://gerrit.wikimedia.org/r/658455
[23:07:20] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2319.codfw.wmnet with reason: REIMAGE
[23:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:30] <wikibugs>	 (PS2) Legoktm: snapshot: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657916 (https://phabricator.wikimedia.org/T266479)
[23:07:56] <wikibugs>	 (PS5) Dzahn: add deploy1002 and deploy2002 to deployment_hosts for firewalls [puppet] - https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963)
[23:09:25] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2318.codfw.wmnet with reason: REIMAGE
[23:09:27] <wikibugs>	 (CR) Legoktm: [C: +2] snapshot: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/657916 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[23:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:17:18] <wikibugs>	 (PS1) Bstorm: wikireplicas: fix error in VM proxy config [puppet] - https://gerrit.wikimedia.org/r/658457 (https://phabricator.wikimedia.org/T271476)
[23:19:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:21:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:22:10] <wikibugs>	 (CR) Bstorm: [C: +2] wikireplicas: fix error in VM proxy config [puppet] - https://gerrit.wikimedia.org/r/658457 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[23:24:37] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2331.codfw.wmnet'] `  an...
[23:25:42] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2320.codfw.wmnet'] `  an...
[23:30:04] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2319.codfw.wmnet'] `  an...
[23:31:14] <icinga-wm>	 RECOVERY - Check systemd state on registry2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:31:43] <wikibugs>	 SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2318.codfw.wmnet'] `  an...
[23:37:48] <icinga-wm>	 PROBLEM - Check systemd state on registry2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:16] <wikibugs>	 (CR) Bstorm: "For what this does, I almost wish it was more of an interactive script rather than argument driven." [puppet] - https://gerrit.wikimedia.org/r/658452 (https://phabricator.wikimedia.org/T272114) (owner: Andrew Bogott)
[23:47:28] <wikibugs>	 (PS1) Tks4Fish: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/658461 (https://phabricator.wikimedia.org/T272920)
[23:50:18] <wikibugs>	 (Abandoned) Tks4Fish: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/658461 (https://phabricator.wikimedia.org/T272920) (owner: Tks4Fish)
[23:55:28] <wikibugs>	 (CR) Legoktm: [C: +1] "verified the IPs match what netbox has" [puppet] - https://gerrit.wikimedia.org/r/635079 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn)
[23:56:54] <wikibugs>	 (PS1) Tks4Fish: arbcom_enwiki: Change favicon to a renamed copy of arbcom_ruwiki.ico [mediawiki-config] - https://gerrit.wikimedia.org/r/658463 (https://phabricator.wikimedia.org/T272920)
[23:57:33] <wikibugs>	 (CR) Subramanya Sastry: [C: +1] parsoid::testing: switch db_host from m5-master to localhost [puppet] - https://gerrit.wikimedia.org/r/654565 (https://phabricator.wikimedia.org/T266509) (owner: Dzahn)