[00:04:26] <wikibugs>	 (PS4) Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138)
[00:05:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] wikistats: replace all cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[00:07:26] <wikibugs>	 (PS5) Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138)
[00:08:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] wikistats: replace all cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[00:09:52] <ryankemper>	 !log T269204 reimaging the following instances to debian buster: `wdqs1004`, `wdqs2001`, `wdqs1003`
[00:09:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:00] <stashbot>	 T269204: Some wdqs metrics changed when switching to python3 - https://phabricator.wikimedia.org/T269204
[00:10:57] <wikibugs>	 Operations, Wikimedia-Logstash, observability, Patch-For-Review: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (colewhite) a:colewhite
[00:13:25] <wikibugs>	 (CR) Urbanecm: [C: +1] Assign urlshortener-create-url permission [mediawiki-config] - https://gerrit.wikimedia.org/r/645309 (https://phabricator.wikimedia.org/T229633) (owner: Ammarpad)
[00:14:40] <wikibugs>	 (CR) Urbanecm: [C: +2] "no op for prod" [mediawiki-config] - https://gerrit.wikimedia.org/r/645153 (https://phabricator.wikimedia.org/T201491) (owner: DannyS712)
[00:15:30] <wikibugs>	 (Merged) jenkins-bot: static.php - fix a typo (guruanteed -> guaranteed) [mediawiki-config] - https://gerrit.wikimedia.org/r/645153 (https://phabricator.wikimedia.org/T201491) (owner: DannyS712)
[00:16:20] <wikibugs>	 (PS1) Cwhite: profile: add dot_expander filter script [puppet] - https://gerrit.wikimedia.org/r/645459 (https://phabricator.wikimedia.org/T234565)
[00:17:38] <Urbanecm>	 !log deploy1001 staging dir is DIRTY: /srv/mediawiki-staging (master u+1): last commit bce412514eadaa47dbede56c4b4918da492443ce, author Mukunda Modell (cc twentyafterfour)
[00:17:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile: add dot_expander filter script [puppet] - https://gerrit.wikimedia.org/r/645459 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[00:18:42] <twentyafterfour>	 Urbanecm: that commit is merged just not pulled to the deploy server
[00:19:45] <Urbanecm>	 twentyafterfour: it was at that server, but it wasn't fetched. I just ran git fetch to fetch the no-op patch I merged a few lines above, but wanted to log this (unexpected) state just in case.
[00:21:13] <wikibugs>	 (PS2) Ryan Kemper: [wdqs] proper selector for machines running the streaming-updater [puppet] - https://gerrit.wikimedia.org/r/643941 (https://phabricator.wikimedia.org/T266986) (owner: DCausse)
[00:23:13] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26969/console" [puppet] - https://gerrit.wikimedia.org/r/643941 (https://phabricator.wikimedia.org/T266986) (owner: DCausse)
[00:24:56] <wikibugs>	 (CR) Ryan Kemper: [V: +1 C: +2] [wdqs] proper selector for machines running the streaming-updater [puppet] - https://gerrit.wikimedia.org/r/643941 (https://phabricator.wikimedia.org/T266986) (owner: DCausse)
[00:26:15] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime
[00:26:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:01] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 67101440 and 111 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:27:18] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime
[00:27:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:27:24] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime
[00:27:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:19] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:28:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:54] <wikibugs>	 (PS6) Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138)
[00:29:46] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 74561024 and 274 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:30:09] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:30:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:22] <wikibugs>	 (CR) jerkins-bot: [V: -1] wikistats: replace all cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[00:32:11] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[00:32:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:37] <wikibugs>	 (PS7) Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138)
[00:34:54] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:35:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:03] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[00:35:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:16] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1881992 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:35:28] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1353244736 and 104 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:35:43] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[00:35:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:37:00] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 249120184 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:14] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1271936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:00] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 520245040 and 47 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:16] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 62602952 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:18] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 267941768 and 314 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:18] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 118540680 and 314 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:33] <Urbanecm>	 !log End of mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 (wiki=eswiki; T246539)
[00:40:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:41] <stashbot>	 T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539
[00:40:52] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 137900728 and 349 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:40] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 222144480 and 457 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:50] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 45891928 and 468 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:20] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2170325048 and 96 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:12] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 855184568 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:14] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5733760008 and 402 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:38] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 370428400 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:56] <wikibugs>	 (PS2) Cwhite: profile: add dot_expander filter script [puppet] - https://gerrit.wikimedia.org/r/645459 (https://phabricator.wikimedia.org/T234565)
[00:45:14] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:45:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:22] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 98 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:29] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[00:45:34] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[00:45:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5032 and 113 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 70568 and 113 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:38] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 181371112 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:02] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3008 and 139 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:46:04] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 191208 and 141 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:04] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 26248 and 200 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:14] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 410047368 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:33] <wikibugs>	 (PS8) Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138)
[00:50:52] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 831032 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:52] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1689232 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:36] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1226232 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:40] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1100120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:58] <wikibugs>	 (PS9) Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138)
[00:56:13] <wikibugs>	 (PS1) Cwhite: Revert "[wdqs] proper selector for machines running the streaming-updater" [puppet] - https://gerrit.wikimedia.org/r/645356
[00:56:18] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 893430128 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:22] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 724534008 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:30] <wikibugs>	 (CR) Cwhite: [C: +2] Revert "[wdqs] proper selector for machines running the streaming-updater" [puppet] - https://gerrit.wikimedia.org/r/645356 (owner: Cwhite)
[00:56:40] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 71286456 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:40] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1643938336 and 76 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:28] <shdubsh>	 ryankemper: FYI, I'm reverting the recent wdqs jmx-exporter changes as puppet is failing to apply on the prometheus hosts.
[00:57:42] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "[wdqs] proper selector for machines running the streaming-updater" [puppet] - https://gerrit.wikimedia.org/r/645356 (owner: Cwhite)
[00:57:44] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 72680 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:50] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 126128 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:04] <ryankemper>	 shdubsh: ack, thanks for catching that
[00:58:10] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 85408 and 91 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:10] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 85408 and 91 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:48] <ryankemper>	 My fault there, looks like I didn't target the correct instances with PCC
[00:59:43] <wikibugs>	 (PS2) Cwhite: Revert "[wdqs] proper selector for machines running the streaming-updater" [puppet] - https://gerrit.wikimedia.org/r/645356
[00:59:53] <wikibugs>	 (CR) Cwhite: [V: +2 C: +2] Revert "[wdqs] proper selector for machines running the streaming-updater" [puppet] - https://gerrit.wikimedia.org/r/645356 (owner: Cwhite)
[01:04:21] <wikibugs>	 (CR) Dzahn: [V: +1] "finally compiles https://puppet-compiler.wmflabs.org/compiler1003/26974/wikistats-dancing-goat.wikistats.eqiad.wmflabs/index.html" [puppet] - https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[01:13:08] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 165160 and 988 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:21:52] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 74488 and 1513 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:47:43] <wikibugs>	 Operations, Analytics, SRE-Access-Requests: Kerberos Password - https://phabricator.wikimedia.org/T269472 (Reedy) a:Miriam→None
[01:50:58] <wikibugs>	 Operations, Release-Engineering-Team-TODO, serviceops, Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Jdforrester-WMF)
[01:58:53] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[01:58:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:31] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0)
[02:06:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:12:23] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)
[02:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:13:42] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[02:13:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:42:51] <wikibugs>	 (CR) Jforrester: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: Jforrester)
[03:43:51] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[03:43:57] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[03:43:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:44:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:44:09] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[03:44:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:49:10] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:49:12] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:49:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:49:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:49:27] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[03:49:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:05:33] <ryankemper>	 !log T269204 reimaging the following instances to debian buster (one each from `[public, internal] x [eqiad, codfw]`):  `wdqs1005`, `wdqs2002`, `wdqs1008`, `wdqs2005`
[04:05:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:05:42] <stashbot>	 T269204: Some wdqs metrics changed when switching to python3 - https://phabricator.wikimedia.org/T269204
[04:13:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:21:35] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime
[04:21:36] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime
[04:21:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:21:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:22:52] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime
[04:22:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:22:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:31] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[04:23:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:23:37] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime
[04:23:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:30] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[04:25:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:26:36] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[04:26:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:28:37] <logmsgbot>	 !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[04:28:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:30:02] <wikibugs>	 (CR) Reedy: [C: +1] httpd: make it possible to configure server admin email address [puppet] - https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: Reedy)
[04:44:12] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[04:44:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:22] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer
[04:44:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:27] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[04:44:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:34] <logmsgbot>	 !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer
[04:44:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:26] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.41:80]) https://wikitech.wikimedia.org/wiki/PyBal
[04:53:30] <icinga-wm>	 PROBLEM - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.41 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[04:54:56] <rzl>	 ryankemper: just falling asleep -- need a hand with anything?
[04:56:20] <ryankemper>	 rzl: this is the last round of nodes for tonight, let me get that alert acked and take a look
[04:56:32] <ryankemper>	 rzl: I should be fine
[04:57:30] <rzl>	 okay -- I'm going to check back out then
[04:58:04] <ryankemper>	 rzl: please do, sorry for missing the page :/
[04:58:20] <rzl>	 no worries, it happens :)
[04:59:07] <icinga-wm>	 ACKNOWLEDGEMENT - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.41 and port 80: Connection refused Ryan Kemper phabricator.wikimedia.org/T269204 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[05:00:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_updater site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:02:36] <icinga-wm>	 ACKNOWLEDGEMENT - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_updater site=codfw Ryan Kemper related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/643941 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:11:06] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.41:80]) Ryan Kemper http://phabricator.wikimedia.org/T269204 https://wikitech.wikimedia.org/wiki/PyBal
[05:11:06] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.41:80]) Ryan Kemper http://phabricator.wikimedia.org/T269204 https://wikitech.wikimedia.org/wiki/PyBal
[05:39:58] <ryankemper>	 !log restarted pybal on `lvs1016` per the instructions in https://wikitech.wikimedia.org/wiki/PyBal#Services_known_to_PyBal_but_not_to_IPVS
[05:40:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:43:21] <icinga-wm>	 ACKNOWLEDGEMENT - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1003.eqiad.wmnet are marked down but pooled Ryan Kemper needs to be depooled https://wikitech.wikimedia.org/wiki/PyBal
[05:46:37] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:49:17] <ryankemper>	 !log restarted pybal on `lvs1015` per the instructions in https://wikitech.wikimedia.org/wiki/PyBal#Services_known_to_PyBal_but_not_to_IPVS
[05:49:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:20] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[06:01:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:37] <mutante>	 was asleep when it paged but I see it now and the backlog. ACKing it
[06:09:35] <logmsgbot>	 !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[06:09:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:11:54] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[06:12:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:07] <icinga-wm>	 RECOVERY - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[06:12:42] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0)
[06:12:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:21] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[06:14:37] <mutante>	 ryankemper: looking good now. thanks and good weekend then
[06:15:35] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:48:39] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1054 is CRITICAL: CRITICAL - load average: 108.14, 101.52, 87.93 https://wikitech.wikimedia.org/wiki/Swift
[11:05:31] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1054 is OK: OK - load average: 64.72, 70.14, 78.46 https://wikitech.wikimedia.org/wiki/Swift
[12:16:21] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 237400624 and 111 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:16:21] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 251832000 and 111 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:20:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 460944152 and 32 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:20:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18702336 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:20:23] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 218595336 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:21:13] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 39330624 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:22:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1779896 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:23:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 136313008 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:23:51] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 55447568 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:25:15] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 38 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:25:31] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 54 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:26:03] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 48 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:26:03] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 48 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:26:03] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 48 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:26:15] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 26264 and 98 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:26:25] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 67676536 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:26:29] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 121549688 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:26:31] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 113740496 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:27:37] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 178 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:27:57] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 153994328 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:30:23] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 80712 and 75 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:31:17] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 32088 and 130 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:31:19] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 56 and 132 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:31:21] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 56 and 134 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:30:48] <wikibugs>	 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Seppl2013) For the extensions https://www.mediawiki.org/wiki/Extension:Diagrams and https://www.mediawiki.org/wiki/Extension:Piwo the...
[14:48:35] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1054 is CRITICAL: CRITICAL - load average: 101.78, 100.24, 85.51 https://wikitech.wikimedia.org/wiki/Swift
[15:07:51] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1054 is OK: OK - load average: 51.78, 65.70, 77.29 https://wikitech.wikimedia.org/wiki/Swift
[15:48:25] <icinga-wm>	 PROBLEM - very high load average likely xfs on ms-be1054 is CRITICAL: CRITICAL - load average: 105.67, 100.94, 88.96 https://wikitech.wikimedia.org/wiki/Swift
[16:12:13] <icinga-wm>	 RECOVERY - very high load average likely xfs on ms-be1054 is OK: OK - load average: 30.73, 53.67, 74.06 https://wikitech.wikimedia.org/wiki/Swift
[16:25:53] <godog>	 !log swift disable sdg1 on ms-be1054
[16:26:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:27] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:54:43] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:05] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:21] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:44:49] <icinga-wm>	 PROBLEM - Check systemd state on cp1089 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:03:41] <icinga-wm>	 RECOVERY - Check systemd state on cp1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:57] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 9115 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:48:17] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 5698 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:51:55] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f1c52112518: Failed to establish a new connection: [Errno 111] Connection
[23:51:55] <icinga-wm>	 ://wikitech.wikimedia.org/wiki/Search%23Administration
[23:52:29] <icinga-wm>	 PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:53:07] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 3 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops