[00:01:15] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[00:02:01] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[00:17:41] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[00:59:35] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 553161552 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:09] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2350092296 and 171 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:33] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5411409648 and 336 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:37] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1851303024 and 173 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:53] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2233167632 and 210 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:55] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1959130160 and 196 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:01:13] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2088610448 and 221 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:01:15] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 64784 and 123 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:01:49] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 29632 and 156 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:02:33] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 40520 and 201 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:03:55] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 138200 and 283 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:04:09] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 370600 and 298 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:04:33] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 548800 and 322 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
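The POSTGRES_HOT_STANDBY_DELAY lines above report two numbers per standby: the WAL byte lag behind the primary, and the seconds since the last replayed transaction. A minimal sketch of how such a check can be computed on the standby itself; this is not the production Icinga plugin (which is check_postgres-style), and it assumes PostgreSQL 10+ function names, psycopg2, and an illustrative byte threshold:

    import psycopg2  # assumption: psycopg2 available; host and threshold are illustrative

    BYTE_LAG_CRIT = 16 * 1024 * 1024  # hypothetical critical threshold in bytes

    conn = psycopg2.connect(host="localhost", dbname="template1")
    cur = conn.cursor()
    # Byte lag = WAL received from the primary minus WAL replayed locally;
    # seconds = age of the last replayed transaction. Both are NULL off-standby.
    cur.execute("""
        SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()),
               EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
    """)
    byte_lag, seconds = cur.fetchone()
    state = "CRITICAL" if byte_lag > BYTE_LAG_CRIT else "OK"
    print(f"POSTGRES_HOT_STANDBY_DELAY {state}: DB template1 (host:localhost) "
          f"{int(byte_lag)} and {int(seconds)} seconds")

Note how the alert/recovery pairs above resolve within minutes: the byte lag drops from the gigabyte range back to tens of kilobytes once the standby catches up on replay.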
[01:05:31] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 61888 and 379 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:58] (CR) Jforrester: "recheck" [software/censorship-monitoring] - https://gerrit.wikimedia.org/r/593240 (owner: Ssingh)
[01:17:09] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 387417344 and 19 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:17:19] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5782695664 and 327 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:17:25] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 638628552 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:17:25] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2785217320 and 156 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:17:25] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1177351528 and 58 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:20:29] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1168960 and 77 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:20:43] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 87832 and 92 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:20:43] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 87832 and 92 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:22:15] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 583864 and 184 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:22:23] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 284296 and 192 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:10:37] (CR) Jforrester: "recheck" [software/homer] - https://gerrit.wikimedia.org/r/644872 (owner: Ayounsi)
[05:29:51] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[05:31:25] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:32:55] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 4561.26 ms
[05:36:13] PROBLEM - Juniper alarms on mr1-eqsin is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 103.102.166.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[05:37:49] RECOVERY - Juniper alarms on mr1-eqsin is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
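The "Router interfaces" check above polls OID 1.3.6.1.2.1.2.2.1.8 (IF-MIB::ifOperStatus) over SNMP v2c and counts interface states, which is why the PROBLEM text names that exact OID when the host stops answering. A rough Python equivalent using pysnmp; the community string is a placeholder and the production check is an Icinga SNMP plugin, not this script:

    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, nextCmd)

    # IF-MIB::ifOperStatus column: 1=up, 2=down, 5=dormant
    OID = "1.3.6.1.2.1.2.2.1.8"
    counts = {}
    for err_ind, err_stat, _, var_binds in nextCmd(
            SnmpEngine(),
            CommunityData("public", mpModel=1),   # placeholder community; mpModel=1 is v2c
            UdpTransportTarget(("103.102.166.128", 161), timeout=2, retries=1),
            ContextData(),
            ObjectType(ObjectIdentity(OID)),
            lexicographicMode=False):             # stop at the end of the ifOperStatus column
        if err_ind or err_stat:
            print(f"CRITICAL: No response from remote host 103.102.166.128 for {OID}")
            break
        for _, value in var_binds:
            counts[int(value)] = counts.get(int(value), 0) + 1
    else:
        print(f"OK: interfaces up: {counts.get(1, 0)}, down: {counts.get(2, 0)}, "
              f"dormant: {counts.get(5, 0)}")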
[05:37:57] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 38, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:38:37] RECOVERY - Host mr1-eqsin IPv6 is UP: PING OK - Packet loss = 0%, RTA = 235.10 ms
[05:41:21] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 232.33 ms
[06:15:27] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:16:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:18:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:27:24] Operations, MediaWiki-General, serviceops, MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (Aklapper)
[11:45:21] (PS1) Majavah: Add fiwiki 500k temporary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/652687 (https://phabricator.wikimedia.org/T270974)
[11:45:23] (PS1) Majavah: Config for fiwiki 500k temporary logo [mediawiki-config] - https://gerrit.wikimedia.org/r/652688 (https://phabricator.wikimedia.org/T270974)
[14:44:46] Operations, Wikimedia-Mailing-lists: Publish statistics about number of held messages per mailing list (Jan 2021) - https://phabricator.wikimedia.org/T270977 (Aklapper) p:Triage→Low
[17:32:07] PROBLEM - Host ms-be2050 is DOWN: PING CRITICAL - Packet loss = 100%
[17:32:13] RECOVERY - Host ms-be2050 is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms
[21:44:29] (PS1) Andrew Bogott: Keystone: update otp auth code for Stein [puppet] - https://gerrit.wikimedia.org/r/652832 (https://phabricator.wikimedia.org/T261134)
[21:46:58] (CR) Andrew Bogott: [C: +2] Keystone: update otp auth code for Stein [puppet] - https://gerrit.wikimedia.org/r/652832 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[23:04:17] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 53.63 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:07:09] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[23:17:03] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
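The Zuul Gearman alert that opened at 00:17:41 clears here at 06:15:27; it tracks work requests queued in the Gearman server behind Zuul. The production alert reads the zuul-gearman Graphite series linked above, but Gearman's plain-text admin protocol yields the same count directly; a minimal sketch (the contint2001 host/port and the 150-job threshold mirror the alert text, not the real plugin):

    import socket

    # Gearman admin protocol: sending "status\n" returns one line per function,
    # "FUNCTION\tTOTAL\tRUNNING\tAVAILABLE_WORKERS", terminated by a "." line.
    sock = socket.create_connection(("contint2001.wikimedia.org", 4730), timeout=5)
    sock.sendall(b"status\n")
    data = b""
    while not data.endswith(b".\n"):
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk
    sock.close()

    waiting = 0
    for line in data.decode().splitlines():
        if line == ".":
            break
        _name, total, running, _workers = line.split("\t")
        waiting += int(total) - int(running)  # jobs queued but not yet running
    print(f"{'CRITICAL' if waiting > 150 else 'OK'}: {waiting} work requests waiting")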
[23:17:27] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 71.29 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
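Reading the "Varnish traffic drop" figures: the check expresses the current 30-minute request volume as a percentage of the previous 30-minute window, critical at or below 60 and warning at or below 70, hence "53.63 le 60" in the 23:04:17 PROBLEM and "(C)60 le (W)70 le 71.29" in this recovery. A sketch of that arithmetic against a Graphite-style render API; the metric path and exact query are assumptions, not the production check definition:

    import requests

    GRAPHITE = "https://graphite.wikimedia.org/render"
    TARGET = "varnish.esams.frontend.request.client.rate"  # hypothetical metric path

    def window_sum(frm, until):
        """Sum the series over [frm, until], e.g. frm='-30min', until='now'."""
        resp = requests.get(GRAPHITE, params={
            "target": TARGET, "from": frm, "until": until, "format": "json"})
        points = resp.json()[0]["datapoints"]  # list of [value, timestamp] pairs
        return sum(v for v, _ in points if v is not None)

    now_win = window_sum("-30min", "now")
    prev_win = window_sum("-60min", "-30min")
    pct = 100.0 * now_win / prev_win if prev_win else 100.0
    state = "CRITICAL" if pct <= 60 else "WARNING" if pct <= 70 else "OK"
    print(f"{state}: {pct:.2f} (current traffic as % of the previous 30min window)")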