[00:47:25] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7995017240 and 391 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:25] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6888160384 and 326 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:13] PROBLEM - SSH on logstash1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:50:39] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4306494736 and 388 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:09] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4640154400 and 433 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:19] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2113948856 and 326 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:19] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8015141016 and 618 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:23] RECOVERY - SSH on logstash1008 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:51:39] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3979681800 and 432 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:43] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7815469056 and 633 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:39] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 68904 and 442 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:45] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 71104 and 510 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:47] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1332200 and 510 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:39] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 32992 and 563 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:17] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 33416 and 601 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:49] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 101464 and 633 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:59:39] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 38936 and 742 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:01] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 128792 and 765 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:24:01] RECOVERY - snapshot of s7 in eqiad on alert1001 is OK: Last snapshot for s7 at eqiad (db1116.eqiad.wmnet:3317) taken on 2020-12-28 01:57:28 (1013 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[04:21:25] RECOVERY - Wikitech and wt-static content in sync on labweb1002 is OK: wikitech-static OK - wikitech and wikitech-static in sync (6085 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:49:29] RECOVERY - Wikitech and wt-static content in sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (6085 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[05:16:19] RECOVERY - Wikitech and wt-static content in sync on cloudweb2001-dev is OK: wikitech-static OK - wikitech and wikitech-static in sync (6085 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:40:09] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 33%, RTA = 2320.05 ms
[08:40:47] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 66%, RTA = 3344.11 ms
[08:45:51] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 235.17 ms
[08:46:29] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 247.90 ms
[09:03:56] Operations, Product-Infrastructure-Team-Backlog, Traffic, JavaScript, Maps (Maps-data): Display map markers on Kartographer maps even in case of mapserver failures - https://phabricator.wikimedia.org/T270865 (RolandUnger)
[09:27:17] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=analytics file=device_smart.prom instance=an-coord1002 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[09:54:30] !log reboot an-coord1002 (puppet in D state after issues with broken disk - host in standby, no traffic)
[09:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:33] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:07:11] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 8 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:08:35] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[10:16:12] didn't go really well, it tried to pxe since it probably didn't recognize any hw to boot (I expected at least one disk to work in the raid)
[10:16:26] downtimed and stopped in d-i waiting for a new disk
[10:23:31] Operations, ops-eqiad, Analytics, Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (elukey) Puppet was stuck in `D` state, so I attempted a graceful reboot to see if the OS could boot on its remaining disks. During boot it seems that the disk/raid contro...
[10:25:15] RECOVERY - Device not healthy -SMART- on an-coord1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1002&var-datasource=eqiad+prometheus/ops
[12:25:20] (PS1) Arturo Borrero Gonzalez: cloud: drop dumps project backups [puppet] - https://gerrit.wikimedia.org/r/652182 (https://phabricator.wikimedia.org/T260692)
[12:28:39] (CR) Arturo Borrero Gonzalez: [C: +2] cloud: drop dumps project backups [puppet] - https://gerrit.wikimedia.org/r/652182 (https://phabricator.wikimedia.org/T260692) (owner: Arturo Borrero Gonzalez)
[13:52:21] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=- method=POST https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[13:54:01] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[15:19:43] Operations, Wikimedia-Mailing-lists: wikipedia-mai & wikiur-l mail archives are empty after August 2018 & January 2019 respectively - https://phabricator.wikimedia.org/T270837 (Dzahn) > Most of the mailing list admins are inactive and don't have any idea about the mailing list setting In that case you...
[16:19:25] Operations, Product-Infrastructure-Team-Backlog, Traffic, JavaScript, Maps (Maps-data): Display map markers on Kartographer maps even in case of mapserver failures - https://phabricator.wikimedia.org/T270865 (RolandUnger) See T267296, too
[16:24:05] PROBLEM - Host mr1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 40%, RTA = 2718.45 ms
[16:29:47] RECOVERY - Host mr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms
[16:43:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:45:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:48:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:50:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:00:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:02:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:34:19] RECOVERY - MariaDB Replica Lag: m1 on db2078 is OK: OK slave_sql_lag Replication lag: 0.06 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[19:00:39] PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1005 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:00:53] PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1006 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:01:05] PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1005 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:01:21] PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1005 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:01:27] PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1006 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:01:27] PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad on cloudelastic1005 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:01:37] PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1006 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:01:41] PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1005 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:01:41] PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1005 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:01:43] PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad on cloudelastic1006 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:01:55] PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1006 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:02:03] PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1006 is CRITICAL: SSL CRITICAL - Certificate cloudelastic.wikimedia.org valid until 2020-12-31 19:00:36 +0000 (expires in 2 days) https://wikitech.wikimedia.org/wiki/Search
[19:15:22] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Ipatrol) Assuming the problem is the CVE, Debian [[ https://www.debian.org/security/2020/dsa-4756 | b...
[20:12:09] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Reedy) >>! In T257066#6713124, @Ipatrol wrote: > Assuming the problem is the CVE, Debian [[ https://w...
[22:33:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={atlas_exporter,pdu_sentry4} site={eqiad,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:35:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets