[00:04:38] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:14:47] (PS1) Ladsgroup: Add IntelliJ files to .gitignore [debs/pybal] - https://gerrit.wikimedia.org/r/644036
[00:30:16] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36694688 and 342 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:32:02] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 927120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:37:16] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 45172632 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:37:32] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 59533832 and 446 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:37:38] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 120543416 and 452 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:37:46] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 96234296 and 460 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:39:02] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 693688 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:39:16] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 652512 and 550 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:39:22] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1737288 and 556 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:39:30] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 820216 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:49:23] (PS1) Ladsgroup: [WIP] Start migrating pybal to python3 [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319)
[02:22:23] (PS1) Reedy: Remove REL1_34 from $wgExtDistSnapshotRefs [mediawiki-config] - https://gerrit.wikimedia.org/r/644045 (https://phabricator.wikimedia.org/T268931)
[02:26:41] (PS1) Ladsgroup: [DNM] Test if tests are being ran [debs/pybal] - https://gerrit.wikimedia.org/r/644046
[02:27:39] (CR) jerkins-bot: [V: -1] [DNM] Test if tests are being ran [debs/pybal] - https://gerrit.wikimedia.org/r/644046 (owner: Ladsgroup)
[02:28:34] (Abandoned) Ladsgroup: [DNM] Test if tests are being ran [debs/pybal] - https://gerrit.wikimedia.org/r/644046 (owner: Ladsgroup)
[02:33:36] (PS2) Ladsgroup: [WIP] Start migrating pybal to python3 [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319)
[02:33:54] (CR) jerkins-bot: [V: -1] [WIP] Start migrating pybal to python3 [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: Ladsgroup)
[02:40:08] (PS3) Ladsgroup: [WIP] Start migrating pybal to python3 [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319)
[03:00:32] (PS4) Ladsgroup: [WIP] Start migrating pybal to python3 [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319)
[03:01:03] (CR) jerkins-bot: [V: -1] [WIP] Start migrating pybal to python3 [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: Ladsgroup)
[03:16:48] PROBLEM - Number of messages locally queued by purged for processing on cp3058 is CRITICAL: cluster=cache_text instance=cp3058 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3058
[03:16:52] PROBLEM - Number of messages locally queued by purged for processing on cp3064 is CRITICAL: cluster=cache_text instance=cp3064 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3064
[03:17:14] PROBLEM - Number of messages locally queued by purged for processing on cp2029 is CRITICAL: cluster=cache_text instance=cp2029 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2029
[03:17:14] PROBLEM - Number of messages locally queued by purged for processing on cp5007 is CRITICAL: cluster=cache_text instance=cp5007 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007
[03:17:20] PROBLEM - Number of messages locally queued by purged for processing on cp3062 is CRITICAL: cluster=cache_text instance=cp3062 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3062
[03:17:30] PROBLEM - Number of messages locally queued by purged for processing on cp3060 is CRITICAL: cluster=cache_text instance=cp3060 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060
[03:17:32] PROBLEM - Number of messages locally queued by purged for processing on cp5009 is CRITICAL: cluster=cache_text instance=cp5009 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009
[03:17:34] PROBLEM - Number of messages locally queued by purged for processing on cp3056 is CRITICAL: cluster=cache_text instance=cp3056 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056
[03:17:38] PROBLEM - Number of messages locally queued by purged for processing on cp1075 is CRITICAL: cluster=cache_text instance=cp1075 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1075 [03:17:38] PROBLEM - Number of messages locally queued by purged for processing on cp4030 is CRITICAL: cluster=cache_text instance=cp4030 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4030 [03:17:42] PROBLEM - Number of messages locally queued by purged for processing on cp4028 is CRITICAL: cluster=cache_text instance=cp4028 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [03:17:46] PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [03:17:48] PROBLEM - Number of messages locally queued by purged for processing on cp2027 is CRITICAL: cluster=cache_text instance=cp2027 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027 [03:17:56] PROBLEM - Number of messages locally queued by purged for processing on cp4032 is CRITICAL: cluster=cache_text instance=cp4032 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4032 [03:17:56] PROBLEM - Number of messages locally queued by purged for processing on cp4029 is CRITICAL: cluster=cache_text instance=cp4029 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4029 [03:17:56] PROBLEM - Number of messages locally queued by purged for processing on cp1089 is CRITICAL: cluster=cache_text instance=cp1089 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1089 [03:17:58] PROBLEM - Number of messages locally queued by purged for processing on cp4027 is CRITICAL: cluster=cache_text instance=cp4027 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4027 [03:18:08] PROBLEM - Number of messages locally queued by purged for processing on cp5011 is CRITICAL: cluster=cache_text instance=cp5011 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [03:18:18] PROBLEM - Number of messages locally queued by purged for processing on cp2033 is CRITICAL: cluster=cache_text instance=cp2033 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033 [03:18:22] PROBLEM - Number of messages locally queued by purged for processing on cp5008 is CRITICAL: cluster=cache_text 
instance=cp5008 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008
[03:18:32] PROBLEM - Number of messages locally queued by purged for processing on cp1077 is CRITICAL: cluster=cache_text instance=cp1077 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077
[03:18:34] PROBLEM - Number of messages locally queued by purged for processing on cp4031 is CRITICAL: cluster=cache_text instance=cp4031 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031
[03:18:38] PROBLEM - Number of messages locally queued by purged for processing on cp3054 is CRITICAL: cluster=cache_text instance=cp3054 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054
[03:18:42] PROBLEM - Number of messages locally queued by purged for processing on cp2035 is CRITICAL: cluster=cache_text instance=cp2035 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035
[03:18:46] PROBLEM - Number of messages locally queued by purged for processing on cp1083 is CRITICAL: cluster=cache_text instance=cp1083 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1083
[03:18:56] PROBLEM - Number of messages locally queued by purged for processing on cp1079 is CRITICAL: cluster=cache_text instance=cp1079 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079
[03:29:24] PROBLEM - Number of messages locally queued by purged for processing on cp5007 is CRITICAL: cluster=cache_text instance=cp5007 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007
[03:32:04] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Ycrusoe) Hi, I'm one among a presumably significant group of people around the world trying to learn...
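Editor's note: the purged alerts in the lines that follow ("Time elapsed since the last kafka event processed by purged") all use the same two-threshold pattern, visible in the alert text itself: critical when the elapsed time (presumably seconds) exceeds 5000, warning above 3000, which is why recoveries read like "(C)5000 gt (W)3000 gt 2749". The sketch below is only an illustration of how such a check classifies a value; the names and sample data are made up and do not come from the actual Icinga/Prometheus check definitions.

    # Hypothetical illustration of the two-threshold pattern seen in the purged
    # alerts; only the 3000/5000 thresholds are taken from the log lines.
    WARN = 3000   # (W) threshold
    CRIT = 5000   # (C) threshold

    def classify(seconds_since_last_event: float) -> str:
        """Map a staleness value onto the Icinga-style state shown in the log."""
        if seconds_since_last_event > CRIT:
            return "CRITICAL"
        if seconds_since_last_event > WARN:
            return "WARNING"
        return "OK"

    # Sample values echoing the log: 5.681e+04 fires CRITICAL, 2749 is a recovery.
    for value in (5.681e04, 2749):
        print(f"{value} -> {classify(value)}")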
[03:41:00] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1075 is CRITICAL: 5.681e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1075 [03:43:22] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1077 is CRITICAL: 5.95e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077 [03:43:50] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1083 is CRITICAL: 5.374e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1083 [03:45:34] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1087 is CRITICAL: 4.973e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [03:46:40] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1089 is CRITICAL: 5.035e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1089 [03:48:12] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3064 is CRITICAL: 5.308e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3064 [03:49:08] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2027 is CRITICAL: 5.201e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027 [03:50:02] PROBLEM - Time elapsed since the last kafka event processed by purged on cp1079 is CRITICAL: 3.857e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [03:52:12] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3060 is CRITICAL: 4.998e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [03:52:16] PROBLEM - Number of messages locally queued by purged for processing on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [03:55:02] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3054 is CRITICAL: 5.143e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [03:57:21] wonder what's going on [03:57:24] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3062 is CRITICAL: 4.999e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3062 [03:57:26] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5008 is CRITICAL: 
5.352e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [03:57:54] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3058 is CRITICAL: 5.08e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3058 [03:59:04] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2029 is CRITICAL: 5.408e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2029 [03:59:08] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5009 is CRITICAL: 5.268e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [03:59:22] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2033 is CRITICAL: 5.539e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033 [04:01:18] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4029 is CRITICAL: 7.253e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4029 [04:02:18] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2035 is CRITICAL: 4.853e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035 [04:03:22] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5011 is CRITICAL: 4.87e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [04:10:44] PROBLEM - Time elapsed since the last kafka event processed by purged on cp3056 is CRITICAL: 4.414e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [04:13:08] RECOVERY - Number of messages locally queued by purged for processing on cp3050 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [04:17:38] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4030 is CRITICAL: 4.505e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4030 [04:23:58] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4032 is CRITICAL: 8998 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4032 [04:24:34] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4031 is CRITICAL: 1.82e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [04:27:02] PROBLEM - Number of messages locally queued by purged for processing on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [04:34:22] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4027 is CRITICAL: 7.055e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4027 [04:35:30] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4028 is CRITICAL: 1.055e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [04:36:24] PROBLEM - Time elapsed since the last kafka event processed by purged on cp5007 is CRITICAL: 5.581e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [04:40:54] RECOVERY - Number of messages locally queued by purged for processing on cp3050 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [04:46:50] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5007 is OK: (C)5000 gt (W)3000 gt 2749 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [04:49:59] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4027 is OK: (C)5000 gt (W)3000 gt 70.04 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4027 [04:50:36] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4031 is OK: (C)5000 gt (W)3000 gt 61.34 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [04:51:08] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4028 is OK: (C)5000 gt (W)3000 gt 69.37 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [04:51:46] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4032 is OK: (C)5000 gt (W)3000 gt 1480 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4032 [04:52:22] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4030 is OK: (C)5000 gt (W)3000 gt 73.15 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4030 [04:52:28] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3056 is OK: (C)5000 gt (W)3000 gt 261.6 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [04:53:00] RECOVERY - Number of messages locally queued by purged for processing on cp5007 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [04:53:16] RECOVERY - Number of messages locally queued by purged for processing on cp4028 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [04:53:28] RECOVERY - Number of messages locally queued by purged for processing on cp4032 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4032 [04:53:28] RECOVERY - Number of messages locally queued by purged for processing on cp4027 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4027 [04:54:04] RECOVERY - Number of messages locally queued by purged for processing on cp4031 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [04:54:54] RECOVERY - Number of messages locally queued by purged for processing on cp4030 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4030 [04:58:18] RECOVERY - Number of messages locally queued by purged for processing on cp3056 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [05:02:32] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5011 is OK: (C)5000 gt (W)3000 gt 406.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [05:03:08] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2035 is OK: (C)5000 gt (W)3000 gt 34.73 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035 [05:03:52] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4029 is OK: (C)5000 gt (W)3000 gt 87.45 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4029 [05:04:16] RECOVERY - Number of messages locally queued by purged for processing on cp5011 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5011 [05:04:36] RECOVERY - Number of messages locally queued by purged for processing on cp2035 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035 [05:05:06] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2029 is OK: (C)5000 gt (W)3000 gt 33.06 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2029 [05:05:36] RECOVERY - Number of messages locally queued by purged for processing on cp4029 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4029 [05:07:02] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5009 is OK: (C)5000 gt (W)3000 gt 283.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [05:07:08] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2033 is OK: (C)5000 gt (W)3000 gt 41.33 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033 [05:08:18] RECOVERY - Number of messages locally queued by purged for processing on cp2029 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2029 [05:08:38] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3062 is OK: (C)5000 gt (W)3000 gt 224.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3062 [05:08:52] RECOVERY - Time elapsed since the last kafka event processed by purged on cp5008 is OK: (C)5000 gt (W)3000 gt 236.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [05:08:52] RECOVERY - Number of messages locally queued by purged for processing on cp5009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [05:09:08] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3058 is OK: (C)5000 gt (W)3000 gt 298 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3058 [05:09:24] RECOVERY - Number of messages locally queued by purged for processing on cp2033 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033 [05:09:42] RECOVERY - Number of messages locally queued by purged for processing on cp5008 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [05:10:14] RECOVERY - Number of messages locally queued by purged for processing on cp3062 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3062 [05:11:26] RECOVERY - Number of messages locally queued by purged for processing on cp3058 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3058 [05:11:32] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3054 is OK: (C)5000 gt (W)3000 gt 207.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [05:13:16] RECOVERY - Number of messages locally queued by purged for processing on cp3054 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [05:15:06] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1079 is OK: (C)5000 gt (W)3000 gt 236.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [05:15:34] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3060 is OK: (C)5000 gt (W)3000 gt 233.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [05:15:56] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2027 is OK: (C)5000 gt (W)3000 gt 152.9 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027 [05:16:52] RECOVERY - Number of messages locally queued by purged for processing on cp1079 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [05:17:18] RECOVERY - Number of messages locally queued by purged for processing on cp3060 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [05:19:18] RECOVERY - Number of messages locally queued by purged for processing on cp2027 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2027 [05:20:16] RECOVERY - Time elapsed since the last kafka event processed by purged on cp3064 is OK: (C)5000 gt (W)3000 gt 200.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3064 [05:22:04] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1089 is OK: (C)5000 gt (W)3000 gt 78.26 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1089 [05:22:44] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1087 is OK: (C)5000 gt (W)3000 gt 117.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [05:23:38] RECOVERY - Number of messages locally queued by purged for processing on cp3064 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3064 [05:24:24] RECOVERY - Number of messages locally queued by purged for processing on cp1087 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [05:24:34] RECOVERY - Number of messages locally queued by purged for processing on cp1089 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1089
[05:27:24] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1077 is OK: (C)5000 gt (W)3000 gt 112.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077
[05:27:54] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1083 is OK: (C)5000 gt (W)3000 gt 187.3 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1083
[05:30:20] RECOVERY - Number of messages locally queued by purged for processing on cp1077 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077
[05:30:36] RECOVERY - Number of messages locally queued by purged for processing on cp1083 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1083
[05:32:02] RECOVERY - Time elapsed since the last kafka event processed by purged on cp1075 is OK: (C)5000 gt (W)3000 gt 92.57 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1075
[05:34:40] RECOVERY - Number of messages locally queued by purged for processing on cp1075 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1075
[06:49:05] (CR) Xqt: [WIP] Start migrating pybal to python3 (3 comments) [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: Ladsgroup)
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201129T0800)
[09:18:40] PROBLEM - snapshot of x1 in codfw on alert1001 is CRITICAL: snapshot for x1 at codfw taken more than 3 days ago: Most recent backup 2020-11-26 08:45:26 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[09:39:46] Operations, Analytics: Backport kafkacat 1.6.0 from bullseye to buster-backports or buster-wikimedia - https://phabricator.wikimedia.org/T268936 (elukey)
[09:48:24] PROBLEM - Host an-presto1004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:49:24] this is me --^ Checking what's happening, the host was in the d-i for some reason
[09:54:00] RECOVERY - Host an-presto1004 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[09:56:57] I'll check tomorrow, the host doesn't recognize the disks, now it is in d-i (so no real recovery)
[10:01:39] (PS5) Ladsgroup: [WIP] Start migrating pybal to python3 [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319)
[10:02:08] (CR) jerkins-bot: [V: -1] [WIP] Start migrating pybal to python3 [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: Ladsgroup)
[10:25:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:27:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:40:59] (PS1) Ladsgroup: Move tests to a proper directory structure [debs/pybal] - https://gerrit.wikimedia.org/r/644050
[10:44:10] (PS2) Ladsgroup: Move tests to a proper directory structure [debs/pybal] - https://gerrit.wikimedia.org/r/644050
[11:16:22] (PS3) Ladsgroup: Move tests to a proper directory structure [debs/pybal] - https://gerrit.wikimedia.org/r/644050
[11:17:15] (CR) jerkins-bot: [V: -1] Move tests to a proper directory structure [debs/pybal] - https://gerrit.wikimedia.org/r/644050 (owner: Ladsgroup)
[11:18:38] (PS4) Ladsgroup: Move tests to a proper directory structure [debs/pybal] - https://gerrit.wikimedia.org/r/644050
[11:25:59] (CR) Ladsgroup: [WIP] Start migrating pybal to python3 (3 comments) [debs/pybal] - https://gerrit.wikimedia.org/r/644041 (https://phabricator.wikimedia.org/T200319) (owner: Ladsgroup)
[12:12:35] Operations, SRE-Access-Requests, Patch-For-Review: Publish Wikibase tarball releases on releases.wikimedia.org - https://phabricator.wikimedia.org/T268818 (Aklapper)
[12:18:42] PROBLEM - snapshot of s1 in codfw on alert1001 is CRITICAL: snapshot for s1 at codfw taken more than 3 days ago: Most recent backup 2020-11-26 11:57:44 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[12:27:44] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 127649904 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:27:56] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 139030440 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:28:02] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1340240576 and 292 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:28:28] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 188441152 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:32:50] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 835593184 and 47 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:34:56] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 84000 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:35:02] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:35:26] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5304 and 58 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:36:20] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 74632 and 113 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:36:28] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 26488 and 121 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:49:42] PROBLEM - snapshot of s8 in codfw on alert1001 is CRITICAL: snapshot for s8 at codfw taken more than 3 days ago: Most recent backup 2020-11-26 12:15:28 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[14:52:20] PROBLEM - snapshot of s2 in codfw on alert1001 is CRITICAL: snapshot for s2 at codfw taken more than 3 days ago: Most recent backup 2020-11-26 14:34:21 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[15:15:14] (PS1) Vlad.shapik: Expiration date: OAuth 2.0 access tokens have effectively infinite expiration date [mediawiki-config] - https://gerrit.wikimedia.org/r/644056 (https://phabricator.wikimedia.org/T265075)
[15:18:03] (CR) Vlad.shapik: "Have a look, please." [mediawiki-config] - https://gerrit.wikimedia.org/r/644056 (https://phabricator.wikimedia.org/T265075) (owner: Vlad.shapik)
[15:52:02] PROBLEM - tilerator on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[15:52:32] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:24:36] PROBLEM - snapshot of s4 in codfw on alert1001 is CRITICAL: snapshot for s4 at codfw taken more than 3 days ago: Most recent backup 2020-11-26 15:59:28 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[16:45:51] Operations, Gerrit-Privilege-Requests, LDAP-Access-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (WMDE-leszek)
[16:52:52] (PS1) Andrew Bogott: Keystone: turn off INFO-level logging [puppet] - https://gerrit.wikimedia.org/r/644063 (https://phabricator.wikimedia.org/T268175)
[16:53:57] (CR) Andrew Bogott: [C: +2] Keystone: turn off INFO-level logging [puppet] - https://gerrit.wikimedia.org/r/644063 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[17:00:34] (PS1) Andrew Bogott: designate: set log levels to recommended upstream defaults [puppet] - https://gerrit.wikimedia.org/r/644064 (https://phabricator.wikimedia.org/T268175)
[17:01:20] (CR) Andrew Bogott: [C: +2] designate: set log levels to recommended upstream defaults [puppet] - https://gerrit.wikimedia.org/r/644064 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[17:30:20] PROBLEM - tilerator on maps2010 is CRITICAL: connect to address 10.192.48.166 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator
[17:31:28] PROBLEM - Check systemd state on maps2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:56:50] PROBLEM - snapshot of s7 in codfw on alert1001 is CRITICAL: snapshot for s7 at codfw taken more than 3 days ago: Most recent backup 2020-11-26 17:28:34 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[18:09:48] (PS1) Andrew Bogott: OpenStack Glance: further attempt to quiet down logging a bit [puppet] - https://gerrit.wikimedia.org/r/644066 (https://phabricator.wikimedia.org/T268175)
[18:11:05] (CR) Andrew Bogott: [C: +2] OpenStack Glance: further attempt to quiet down logging a bit [puppet] - https://gerrit.wikimedia.org/r/644066 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[18:20:34] PROBLEM - Host an-presto1004 is DOWN: PING CRITICAL - Packet loss = 100%
[18:25:32] this is me --^
[18:28:56] Operations, ops-eqiad: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (elukey)
[18:29:12] Operations, ops-eqiad, Analytics: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (elukey)
[18:29:57] ACKNOWLEDGEMENT - SSH on an-presto1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Elukey T268951 https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:29:57] ACKNOWLEDGEMENT - Host an-presto1004 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T268951
[18:58:16] PROBLEM - snapshot of s5 in codfw on alert1001 is CRITICAL: snapshot for s5 at codfw taken more than 3 days ago: Most recent backup 2020-11-26 18:46:32 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[19:00:50] Operations, Gerrit-Privilege-Requests, LDAP-Access-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (Aklapper) Thanks for filing this! I archived also https://phabricator.wikimedia.org/tag/user-pablo-wmde/ , wondering what to do with open tasks having no othe...
[19:01:02] Operations, Gerrit-Privilege-Requests, LDAP-Access-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (Aklapper)
[19:47:16] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[20:14:46] (PS1) Urbanecm: Enable RelatedArticles on ptwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/644070 (https://phabricator.wikimedia.org/T268945)
[20:23:00] Operations, MediaWiki-extensions-Score, Security-Team, Wikimedia-General-or-Unknown, and 2 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (Ankry) >>! In T257066#6654054, @Ycrusoe wrote: > Hi, > > I'm one among a presumably significant grou...
[20:30:22] PROBLEM - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2020-11-26 20:24:11 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:01:06] PROBLEM - snapshot of s3 in codfw on alert1001 is CRITICAL: snapshot for s3 at codfw taken more than 3 days ago: Most recent backup 2020-11-26 20:39:27 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
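Editor's note: the recurring "snapshot of sN in codfw" alerts throughout this log all encode the same freshness rule: the newest backup for a database section must be less than three days old. The sketch below only recomputes that age check using the timestamp from the s3 alert above and the log date of 2020-11-29 (taken from the deploy-calendar link earlier); the real check lives in the WMF database-backup tooling and is not reproduced here.

    # Hedged illustration of the "taken more than 3 days ago" condition; the
    # dates come from the s3 alert line above, nothing else is from the tooling.
    from datetime import datetime, timedelta

    MAX_AGE = timedelta(days=3)

    last_backup = datetime(2020, 11, 26, 20, 39, 27)  # "Most recent backup 2020-11-26 20:39:27"
    now = datetime(2020, 11, 29, 21, 1, 6)            # alert fired at [21:01:06] on 2020-11-29

    age = now - last_backup
    state = "CRITICAL" if age > MAX_AGE else "OK"
    print(f"backup age: {age} -> {state}")  # just over 3 days, hence the alert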