[00:08:54] !log deploy1002 - rsyncing home dirs from deploy1001 [00:09:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:37] ac [00:22:54] (PS1) Zabe: component: Add 'autoreview' and 'interface-admin' protection level to hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076) [00:22:56] (CR) Welcome, new contributor!: "Thank you for making your first contribution to Wikimedia! :) To learn how to get your code changes reviewed faster and more likely to get" [mediawiki-config] - https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076) (owner: Zabe) [00:26:09] RECOVERY - Logstash Elasticsearch indexing errors #o11y on alert1001 is OK: (C)8 ge (W)1 ge 0.9083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [00:53:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [00:55:18] (CR) Ottomata: [C: +2] Include profile::analytics::jupyterhub on an-test-client1001 [puppet] - https://gerrit.wikimedia.org/r/667276 (https://phabricator.wikimedia.org/T262847) (owner: Ottomata) [00:58:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [01:00:30] (PS1) Brennen Bearnes: WIP: logspam-watch: better recency indicators, helptext, and utf-8 [puppet] - https://gerrit.wikimedia.org/r/667310 [01:15:54] (PS2) Zabe: component: Add 'autoreview' and 'interface-admin' protection level to hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076) [01:18:06] (PS3) Zabe: Add 'autoreview' and 'interface-admin' protection level to hewikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076) [01:19:10] (CR) 
Zppix: [C: +1] "LGTM, welcome!" [mediawiki-config] - https://gerrit.wikimedia.org/r/667306 (https://phabricator.wikimedia.org/T275076) (owner: Zabe) [02:03:33] RECOVERY - WDQS SPARQL on wdqs2008 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.224 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [02:40:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:43:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:09:45] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:15:49] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:20:27] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.053 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:39:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:42:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:49:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:52:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:53:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [03:58:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [05:20:11] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:53:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [06:58:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [07:19:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:21:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:53:01] PROBLEM - WDQS SPARQL on wdqs1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:19:47] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:44:31] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:31] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:01:51] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:15:45] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:57] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:07] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:49:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:53:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [09:58:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [09:59:31] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:17:17] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:23:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:30:31] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:32:55] (PS1) Zabe: Set local timezone for trwikivoyage to UTC [mediawiki-config] - https://gerrit.wikimedia.org/r/667320 (https://phabricator.wikimedia.org/T275598) [10:50:27] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:51:45] PROBLEM - Query Service HTTP Port on wdqs1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [10:58:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:02:45] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:13] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:03] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:57] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [11:24:01] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:28:43] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [11:31:19] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:34:19] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:46:45] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:59:12] (CR) Evrifaessa: [C: +1] Set local timezone for trwikivoyage to UTC [mediawiki-config] - https://gerrit.wikimedia.org/r/667320 (https://phabricator.wikimedia.org/T275598) (owner: Zabe) [11:59:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:02:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:11:29] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:47] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:13] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:45:39] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:53:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [12:58:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [13:03:53] Hi, could someone do T241648? 
[13:03:54] T241648: Special:BrokenRedirects shows displays an incorrect state on srwiki - https://phabricator.wikimedia.org/T241648 [13:09:36] Kizule: "The following data is cached, and was last updated 2021-02-25T20:16:16." most report specials aren't real time, I think they're updated every few days, but not sure on that [13:09:54] they left :/ [13:09:55] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:34] ty Reedy, I was about to comment on that too [13:11:01] PROBLEM - MariaDB Replica Lag: m2 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1486.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:14:47] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:15:43] Majavah: Yeah, exactly. If the list was huge, and/or really stale, I might've been more inclined to do it [13:15:51] RECOVERY - MariaDB Replica Lag: m2 on db2078 is OK: OK slave_sql_lag Replication lag: 0.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:15] And if it's generating incorrect results... Well, that's a different bug, and running the script shouldn't help [13:27:13] PROBLEM - Check systemd state on wdqs1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:35] RECOVERY - Check systemd state on wdqs1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:30:01] It's not always the clearest that they automatically update/how often [13:38:46] I guess that's because of the detachment between MW doing it automatically, and it being run by a cronjob/similar on a server [13:41:25] PROBLEM - puppet last run on wdqs1011 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:45:29] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 64884152 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:46:20] (PS2) Urbanecm: rowiki: Update help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/666682 (https://phabricator.wikimedia.org/T275130) [13:47:55] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 632152 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:48:12] (PS3) Urbanecm: rowiki: Update help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/666682 (https://phabricator.wikimedia.org/T275130) [13:48:20] (CR) Urbanecm: rowiki: Update help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/666682 (https://phabricator.wikimedia.org/T275130) (owner: Urbanecm) [14:17:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:19:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:40:13] (PS1) Zabe: Enable babel categorize on thwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/667348 (https://phabricator.wikimedia.org/T275283) [15:19:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:22:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:53:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [15:58:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [16:27:23] PROBLEM - SSH on wdqs1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:09:05] gehel, ryankemper o/ wdqs1011 seems overloaded, no prometheus metrics, no ssh, and I can't get a root shell via mgmt console [17:09:38] I am still waiting for the root console and I get [17:09:39] [1556638.462659] systemd[1]: Failed to start Journal Service. [17:09:43] that is not a good sign :D [17:10:00] That's a test server, it can wait monday [17:10:33] Can you just downtime it and I'll have a look later today [17:10:51] And thanks for looking into it ! [17:11:07] gehel: ah I was about to ask it, super, downtiming for two days :) [17:11:44] We had a suspicion of a bad disk on that server, but could not confirm it with anything. Maybe we'll have better data this time ! 
[17:50:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:54:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:59:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:02:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:35:47] I misread that server name, wdqs1010 is a test server, wdqs1011 is production. 
Checking the status (cc elukey) [18:37:12] !log powercycling wdqs1011 [18:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:36] SRE, ops-eqiad, DC-Ops, Wikidata, and 3 others: Upgrade firmware on wdqs1009 - https://phabricator.wikimedia.org/T274751 (Gehel) Note that wdqs1011 had a similar issue today (might not be related at all) [18:42:09] RECOVERY - SSH on wdqs1011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:43:57] RECOVERY - Query Service HTTP Port on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [18:43:57] RECOVERY - puppet last run on wdqs1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [18:43:57] RECOVERY - WDQS SPARQL on wdqs1011 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.074 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:44:32] !log depooled wdqs1011 to catch up on lag [18:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [18:58:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [20:22:05] SRE, Wikimedia-Mailing-lists, Upstream: Fix the problem with gravatar and mailman3 - https://phabricator.wikimedia.org/T256541 (Ladsgroup) >>! In T256541#6860002, @Joe wrote: >>>! In T256541#6789243, @Ladsgroup wrote: >> So [[https://gitlab.com/mailman/hyperkitty/-/merge_requests/273|the fix is merge... 
[21:49:31] (PS1) Andrew Bogott: bootstrapvz: mount volumes with 'discard' [puppet] - https://gerrit.wikimedia.org/r/667362 (https://phabricator.wikimedia.org/T275893) [21:53:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org [21:58:21] (CR) Andrew Bogott: [C: +2] bootstrapvz: mount volumes with 'discard' [puppet] - https://gerrit.wikimedia.org/r/667362 (https://phabricator.wikimedia.org/T275893) (owner: Andrew Bogott) [21:58:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org [22:17:36] (PS1) Andrew Bogott: labs_bootstrapvz: use mountopts rather than mount_opts [puppet] - https://gerrit.wikimedia.org/r/667364 [22:18:23] (CR) Andrew Bogott: [C: +2] labs_bootstrapvz: use mountopts rather than mount_opts [puppet] - https://gerrit.wikimedia.org/r/667364 (owner: Andrew Bogott) [22:18:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:20:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:22:10] (PS1) Andrew Bogott: bootstrapvz: further mountopt attempts [puppet] - https://gerrit.wikimedia.org/r/667365 [22:25:07] (CR) Andrew Bogott: [C: +2] bootstrapvz: further mountopt attempts [puppet] - https://gerrit.wikimedia.org/r/667365 (owner: Andrew Bogott) [22:26:08] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1011 is CRITICAL: 1.65e+04 ge 3600 Gehel catching up on lag after freeze https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [22:30:23] (PS1) Andrew Bogott: bootstrap-vz: remove mountopts from Stretch manifest [puppet] - https://gerrit.wikimedia.org/r/667366 [22:30:55] (CR) Andrew Bogott: [C: +2] bootstrap-vz: remove mountopts from Stretch manifest [puppet] - https://gerrit.wikimedia.org/r/667366 (owner: Andrew Bogott) [22:55:36] (PS1) Ladsgroup: mailman3: Add hyperkitty [puppet] - https://gerrit.wikimedia.org/r/667367 (https://phabricator.wikimedia.org/T256542) [23:04:14] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Puppetize mailman3 web and hyperkitty (mailman archiver) - https://phabricator.wikimedia.org/T256542 (Ladsgroup) With these two packages it works just fine: https://mailman-puppet.wmcloud.org/ [23:32:23] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state