[00:08:54] <mutante>	 !log deploy1002 - rsyncing home dirs from deploy1001 
[00:09:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:37] <dwisehaupt>	 ac
[00:22:54] <wikibugs>	 (PS1) Zabe: component: Add 'autoreview' and 'interface-admin' protection level to hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076)
[00:22:56] <wikibugs>	 (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076) (owner: Zabe)
[00:26:09] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors #o11y on alert1001 is OK: (C)8 ge (W)1 ge 0.9083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[00:53:47] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[00:55:18] <wikibugs>	 (CR) Ottomata: [C: +2] Include profile::analytics::jupyterhub on an-test-client1001 [puppet] - https://gerrit.wikimedia.org/r/667276 (https://phabricator.wikimedia.org/T262847) (owner: Ottomata)
[00:58:47] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[01:00:30] <wikibugs>	 (PS1) Brennen Bearnes: WIP: logspam-watch: better recency indicators, helptext, and utf-8 [puppet] - https://gerrit.wikimedia.org/r/667310
[01:15:54] <wikibugs>	 (PS2) Zabe: component: Add 'autoreview' and 'interface-admin' protection level to hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076)
[01:18:06] <wikibugs>	 (PS3) Zabe: Add 'autoreview' and 'interface-admin' protection level to hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076)
[01:19:10] <wikibugs>	 (CR) Zppix: [C: +1] "LGTM, welcome!" [mediawiki-config] - https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076) (owner: Zabe)
[02:03:33] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2008 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.224 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[02:40:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:43:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:09:45] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:49] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:20:27] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:39:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:42:09] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:49:35] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:52:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:53:47] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[03:58:47] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[05:20:11] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:47] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[06:58:47] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[07:19:35] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:21:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:53:01] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[08:19:47] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:44:31] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:56:31] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:59:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:01:51] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:15:45] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:27:57] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:47:07] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:47:31] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:49:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:53:47] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[09:58:47] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[09:59:31] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:17] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:20:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:23:07] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:30:31] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:55] <wikibugs>	 (PS1) Zabe: Set local timezone for trwikivoyage to UTC [mediawiki-config] - https://gerrit.wikimedia.org/r/667320 (https://phabricator.wikimedia.org/T275598)
[10:50:27] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:50:53] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:51:45] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[10:58:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:02:45] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:15:13] <icinga-wm>	 RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:22:03] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:23:57] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[11:24:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:28:43] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[11:31:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:34:19] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:46:45] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:59:12] <wikibugs>	 (CR) Evrifaessa: [C: +1] Set local timezone for trwikivoyage to UTC [mediawiki-config] - https://gerrit.wikimedia.org/r/667320 (https://phabricator.wikimedia.org/T275598) (owner: Zabe)
[11:59:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:02:09] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:11:29] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:47] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:38:13] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:45:39] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:53:47] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[12:58:47] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[13:03:53] <Kizule>	 Hi, could someone do T241648?
[13:03:54] <stashbot>	 T241648: Special:BrokenRedirects shows displays an incorrect state on srwiki - https://phabricator.wikimedia.org/T241648
[13:09:36] <Majavah>	 Kizule: "The following data is cached, and was last updated 2021-02-25T20:16:16." most report specials aren't real time, I think they're updated every few days, but not sure on that
[13:09:54] <Majavah>	 they left :/
[13:09:55] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:34] <Majavah>	 ty Reedy, I was about to comment on that too
[13:11:01] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: m2 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1486.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:14:47] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:43] <Reedy>	 Majavah: Yeah, exactly. If the list was huge, and/or really stale, I might've been more inclined to do it
[13:15:51] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: m2 on db2078 is OK: OK slave_sql_lag Replication lag: 0.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:18:15] <Reedy>	 And if it's generating incorrect results... Well, that's a different bug, and running the script shouldn't help
[13:27:13] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:29:35] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:30:01] <p858snake>	 It's not always the clearest that they automatically update/how often
[13:38:46] <Reedy>	 I guess that's because of the detachment between MW doing it automatically, and it being run by a cronjob/similar on a server
[13:41:25] <icinga-wm>	 PROBLEM - puppet last run on wdqs1011 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:45:29] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 64884152 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:46:20] <wikibugs>	 (PS2) Urbanecm: rowiki: Update help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/666682 (https://phabricator.wikimedia.org/T275130)
[13:47:55] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 632152 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:48:12] <wikibugs>	 (PS3) Urbanecm: rowiki: Update help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/666682 (https://phabricator.wikimedia.org/T275130)
[13:48:20] <wikibugs>	 (CR) Urbanecm: rowiki: Update help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/666682 (https://phabricator.wikimedia.org/T275130) (owner: Urbanecm)
[14:17:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:19:49] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:40:13] <wikibugs>	 (PS1) Zabe: Enable babel categorize on thwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/667348 (https://phabricator.wikimedia.org/T275283)
[15:19:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:22:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:53:47] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[15:58:47] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[16:27:23] <icinga-wm>	 PROBLEM - SSH on wdqs1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:09:05] <elukey>	 gehel, ryankemper o/ wdqs1011 seems overloaded, no prometheus metrics, no ssh, and I can't get a root shell via mgmt console
[17:09:38] <elukey>	 I am still waiting for the root console and I get
[17:09:39] <elukey>	 [1556638.462659] systemd[1]: Failed to start Journal Service.
[17:09:43] <elukey>	 that is not a good sign :D
[17:10:00] <gehel>	 That's a test server, it can wait until Monday
[17:10:33] <gehel>	 Can you just downtime it and I'll have a look later today
[17:10:51] <gehel>	 And thanks for looking into it!
[17:11:07] <elukey>	 gehel: ah I was about to ask it, super, downtiming for two days :)
[17:11:44] <gehel>	 We had a suspicion of a bad disk on that server, but could not confirm it with anything. Maybe we'll have better data this time!
[17:50:11] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:54:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:59:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:02:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:35:47] <gehel>	 I misread that server name, wdqs1010 is a test server, wdqs1011 is production. Checking the status (cc elukey)
[18:37:12] <gehel>	 !log powercycling wdqs1011
[18:37:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Wikidata, and 3 others: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (Gehel) Note that wdqs1011 had a similar issue today (might not be related at all)
[18:42:09] <icinga-wm>	 RECOVERY - SSH on wdqs1011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:43:57] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[18:43:57] <icinga-wm>	 RECOVERY - puppet last run on wdqs1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[18:43:57] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:44:32] <gehel>	 !log depooled wdqs1011 to catch up on lag
[18:44:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:47] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[18:58:47] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[20:22:05] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Upstream: Fix the problem with gravatar and mailman3 - https://phabricator.wikimedia.org/T256541 (Ladsgroup) >>! In T256541#6860002, @Joe wrote: >>>! In T256541#6789243, @Ladsgroup wrote: >> So [[https://gitlab.com/mailman/hyperkitty/-/merge_requests/273|the fix is merge...
[21:49:31] <wikibugs>	 (PS1) Andrew Bogott: bootstrapvz: mount volumes with 'discard' [puppet] - https://gerrit.wikimedia.org/r/667362 (https://phabricator.wikimedia.org/T275893)
[21:53:46] <jinxer-wm>	 (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[21:58:21] <wikibugs>	 (CR) Andrew Bogott: [C: +2] bootstrapvz: mount volumes with 'discard' [puppet] - https://gerrit.wikimedia.org/r/667362 (https://phabricator.wikimedia.org/T275893) (owner: Andrew Bogott)
[21:58:46] <jinxer-wm>	 (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[22:17:36] <wikibugs>	 (PS1) Andrew Bogott: labs_bootstrapvz: use mountopts rather than mount_opts [puppet] - https://gerrit.wikimedia.org/r/667364
[22:18:23] <wikibugs>	 (CR) Andrew Bogott: [C: +2] labs_bootstrapvz: use mountopts rather than mount_opts [puppet] - https://gerrit.wikimedia.org/r/667364 (owner: Andrew Bogott)
[22:18:37] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:20:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:22:10] <wikibugs>	 (PS1) Andrew Bogott: bootstrapvz: further mountopt attempts [puppet] - https://gerrit.wikimedia.org/r/667365
[22:25:07] <wikibugs>	 (CR) Andrew Bogott: [C: +2] bootstrapvz: further mountopt attempts [puppet] - https://gerrit.wikimedia.org/r/667365 (owner: Andrew Bogott)
[22:26:08] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS high update lag on wdqs1011 is CRITICAL: 1.65e+04 ge 3600 Gehel catching up on lag after freeze https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[22:30:23] <wikibugs>	 (PS1) Andrew Bogott: bootstrap-vz: remove mountopts from Stretch manifest [puppet] - https://gerrit.wikimedia.org/r/667366
[22:30:55] <wikibugs>	 (CR) Andrew Bogott: [C: +2] bootstrap-vz: remove mountopts from Stretch manifest [puppet] - https://gerrit.wikimedia.org/r/667366 (owner: Andrew Bogott)
[22:55:36] <wikibugs>	 (PS1) Ladsgroup: mailman3: Add hyperkitty [puppet] - https://gerrit.wikimedia.org/r/667367 (https://phabricator.wikimedia.org/T256542)
[23:04:14] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Patch-For-Review: Puppetize mailman3 web and hyperkitty (mailman archiver) - https://phabricator.wikimedia.org/T256542 (Ladsgroup) With these two packages it works just fine: https://mailman-puppet.wmcloud.org/
[23:32:23] <icinga-wm>	 PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state