[00:52:55] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4964575696 and 413 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:11] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7205717128 and 544 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:17] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 306555712 and 208 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:55] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 35648 and 299 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:09] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 263936 and 374 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:27] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 349168 and 390 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:35] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2066473208 and 136 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:05:43] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2702763376 and 179 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:06:23] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1554727344 and 93 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:07:21] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 79923072 and 49 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:49] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 71032 and 131 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:57] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 408 and 139 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:08:57] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 408 and 139 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:09:39] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3760 and 181 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:44:43] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[06:47:55] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:49:59] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:51:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:23:22] 10Operations, 10Puppet: Multiple puppet/apt errors - https://phabricator.wikimedia.org/T270940 (10jijiki)
[09:23:47] 10Operations, 10Puppet: Multiple puppet/apt errors - https://phabricator.wikimedia.org/T270940 (10jijiki) I have not debugged the issue any further, but it appears to be affecting a small number of hosts
[09:24:19] 10Operations, 10Puppet: Multiple puppet/apt errors - https://phabricator.wikimedia.org/T270940 (10jijiki)
[09:43:14] 10Operations, 10Puppet: Multiple puppet/apt errors - https://phabricator.wikimedia.org/T270940 (10MoritzMuehlenhoff) This is a long standing race condition in apt updating itself, which is now exposed by the tzdata update which was released for Stretch (and we keep this one updated via package=>latest). Runnin...
[10:19:59] PROBLEM - Host ms-be2050 is DOWN: PING CRITICAL - Packet loss = 100%
[10:28:57] RECOVERY - Host ms-be2050 is UP: PING OK - Packet loss = 0%, RTA = 33.37 ms
[11:36:42] (03CR) 10Jbond: [C: 04-1] "see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651448 (https://phabricator.wikimedia.org/T269150) (owner: 10Elukey)
[12:45:16] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10JavaScript: Display map markers on Kartographer maps even in case of mapserver failures - https://phabricator.wikimedia.org/T270865 (10jbond) p:05Triage→03Medium
[12:48:28] 10Operations, 10Wikimedia-Mailing-lists: wikipedia-mai & wikiur-l mail archives are empty after August 2018 & January 2019 respectively - https://phabricator.wikimedia.org/T270837 (10jbond) p:05Triage→03Medium
[12:48:48] 10Operations, 10Maps, 10Product-Infrastructure-Team-Backlog, 10JavaScript: Display map markers on Kartographer maps even in case of mapserver failures - https://phabricator.wikimedia.org/T270865 (10jbond) @RKemper can you double check i have tagged this correctly?
[12:52:34] 10Operations, 10WVUI: Import npm 6.14.8 to buster dist. on apt.wikimedia.org - https://phabricator.wikimedia.org/T270321 (10jbond) p:05Triage→03Medium
[12:55:32] 10Operations, 10Puppet: Multiple puppet/apt errors - https://phabricator.wikimedia.org/T270940 (10jbond) 05Open→03Resolved a:03jbond > sudo dpkg --configure -a This wasn't enough, i had to run `apt-get install -f`, either way fixed now
[12:56:40] 10Operations, 10Performance-Team, 10Traffic: Enable webp thumbnails on all images for non-Commons wikis - https://phabricator.wikimedia.org/T269946 (10jbond)
[12:57:21] 10Operations, 10Performance-Team, 10Traffic: Enable webp thumbnails on all images for non-Commons wikis - https://phabricator.wikimedia.org/T269946 (10jbond) It seems this is being tracked by performance team so i have removed the operations tag but please add back if you feel this was an error.
[13:29:07] (03PS1) 10Jbond: pki: add default date to cloud [puppet] - 10https://gerrit.wikimedia.org/r/652551
[13:30:48] (03CR) 10Jbond: [C: 03+2] pki: add default date to cloud [puppet] - 10https://gerrit.wikimedia.org/r/652551 (owner: 10Jbond)
[13:34:23] (03PS1) 10Jbond: pki: add default vhost for cloud [puppet] - 10https://gerrit.wikimedia.org/r/652554
[13:39:49] (03CR) 10Jbond: [C: 03+2] pki: add default vhost for cloud [puppet] - 10https://gerrit.wikimedia.org/r/652554 (owner: 10Jbond)
[14:20:45] (03CR) 10Elukey: admin: deprecate the analytics-users posix group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651448 (https://phabricator.wikimedia.org/T269150) (owner: 10Elukey)
[14:30:33] 10Operations, 10Performance-Team, 10Traffic: Enable webp thumbnails on all images for non-Commons wikis - https://phabricator.wikimedia.org/T269946 (10Peachey88) This is potentially a dup of {T27611} which is effectively stalled on {T211661}
[14:32:10] (03CR) 10Jbond: [C: 03+1] admin: deprecate the analytics-users posix group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651448 (https://phabricator.wikimedia.org/T269150) (owner: 10Elukey)
[14:33:15] (03CR) 10Jbond: [C: 03+1] admin: deprecate the analytics-users posix group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651448 (https://phabricator.wikimedia.org/T269150) (owner: 10Elukey)
[18:31:07] PROBLEM - Host ms-be2050 is DOWN: PING CRITICAL - Packet loss = 100%
[18:31:31] RECOVERY - Host ms-be2050 is UP: PING OK - Packet loss = 0%, RTA = 33.35 ms
[19:01:21] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[19:01:57] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[20:20:47] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[20:21:25] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[21:11:39] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[21:20:55] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[22:11:47] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:40:51] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[22:41:43] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[22:59:41] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:30:07] PROBLEM - Host ms-be2050 is DOWN: PING CRITICAL - Packet loss = 100%
[23:30:15] RECOVERY - Host ms-be2050 is UP: PING OK - Packet loss = 0%, RTA = 35.56 ms