[00:01:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:06:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:40:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:56:59] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 54576888 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:39] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 198384 and 97 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:03:42] (PS1) HitomiAkane: Move changetags right from users to sysob [trwiki] Bug: T264508 [mediawiki-config] - https://gerrit.wikimedia.org/r/631956 (https://phabricator.wikimedia.org/T264508)
[02:14:50] (PS2) HitomiAkane: Move changetags right from users to sysop [trwiki] Bug: T264508 [mediawiki-config] - https://gerrit.wikimedia.org/r/631956 (https://phabricator.wikimedia.org/T264508)
[02:32:23] (PS3) HitomiAkane: Move changetags right from users to sysop [trwiki] [mediawiki-config] - https://gerrit.wikimedia.org/r/631956 (https://phabricator.wikimedia.org/T264508)
[03:58:31] (CR) Krinkle: [C: +1] arclamp: serve SVGs, compressed logs from Swift (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[04:00:54] (CR) Krinkle: [C: +1] "I see it working at https://performance.wikimedia.beta.wmflabs.org/arclamp/svgs/daily/2020-10-03.excimer.all.svgz with x-object-meta-mtime" [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) (owner: Dave Pifke)
[04:16:21] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1749 MB (3% inode=86%): /var/lib/mysql 3352 MB (5% inode=99%): /tmp 1749 MB (3% inode=86%): /var/tmp 1749 MB (3% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201004T0700)
[07:49:49] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:02:03] PROBLEM - Disk space on an-master1002 is CRITICAL: DISK CRITICAL - free space: /srv 4933 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops
[09:37:07] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.4503 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[09:42:09] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[09:51:09] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[09:52:13] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.4674 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[10:07:23] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0.009143 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[10:15:49] PROBLEM - Packet loss ratio for UDP on logstash1009 is CRITICAL: 0.8117 ge 0.1 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[11:30:21] PROBLEM - Disk space on an-coord1001 is CRITICAL: DISK CRITICAL - free space: / 1678 MB (3% inode=86%): /var/lib/mysql 2887 MB (4% inode=99%): /tmp 1678 MB (3% inode=86%): /var/tmp 1678 MB (3% inode=86%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-coord1001&var-datasource=eqiad+prometheus/ops
[11:59:27] (CR) Urbanecm: [C: +1] Move changetags right from users to sysop [trwiki] [mediawiki-config] - https://gerrit.wikimedia.org/r/631956 (https://phabricator.wikimedia.org/T264508) (owner: HitomiAkane)
[13:07:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:08:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:51:09] RECOVERY - Packet loss ratio for UDP on logstash1009 is OK: (C)0.1 ge (W)0.05 ge 0 https://wikitech.wikimedia.org/wiki/Logstash https://grafana.wikimedia.org/dashboard/db/logstash
[15:58:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:59:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:07:55] (CR) Evrifaessa: [C: +1] Move changetags right from users to sysop [trwiki] [mediawiki-config] - https://gerrit.wikimedia.org/r/631956 (https://phabricator.wikimedia.org/T264508) (owner: HitomiAkane)
[19:01:29] PROBLEM - Disk space on an-master1002 is CRITICAL: DISK CRITICAL - free space: /srv 5322 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-master1002&var-datasource=eqiad+prometheus/ops
[20:27:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:29:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:40:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:41:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:50:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:51:35] (CR) Hazard-SJ: [C: +1] Require autoconfirmed status to edit Wikidata Properties [mediawiki-config] - https://gerrit.wikimedia.org/r/631809 (https://phabricator.wikimedia.org/T254280) (owner: Abián)
[21:53:31] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:56:21] (CR) Hazard-SJ: [C: +1] Move changetags right from users to sysop [trwiki] [mediawiki-config] - https://gerrit.wikimedia.org/r/631956 (https://phabricator.wikimedia.org/T264508) (owner: HitomiAkane)
[22:17:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:23:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:57:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:59:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:06:35] Operations, Machine Learning Platform, ORES, Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Krinkle)