[00:04:26] (03PS4) 10Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) [00:05:52] (03CR) 10jerkins-bot: [V: 04-1] wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [00:07:26] (03PS5) 10Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) [00:08:52] (03CR) 10jerkins-bot: [V: 04-1] wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [00:09:52] !log T269204 reimaging the following instances to debian buster: `wdqs1004`, `wdqs2001`, `wdqs1003` [00:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:00] T269204: Some wdqs metrics changed when switching to python3 - https://phabricator.wikimedia.org/T269204 [00:10:57] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (10colewhite) a:03colewhite [00:13:25] (03CR) 10Urbanecm: [C: 03+1] Assign urlshortener-create-url permission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645309 (https://phabricator.wikimedia.org/T229633) (owner: 10Ammarpad) [00:14:40] (03CR) 10Urbanecm: [C: 03+2] "no op for prod" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645153 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [00:15:30] (03Merged) 10jenkins-bot: static.php - fix a typo (guruanteed -> guaranteed) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645153 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [00:16:20] (03PS1) 10Cwhite: profile: add dot_expander filter script [puppet] - 10https://gerrit.wikimedia.org/r/645459 
(https://phabricator.wikimedia.org/T234565) [00:17:38] !log deploy1001 staging dir is DIRTY: /srv/mediawiki-staging (master u+1): last commit bce412514eadaa47dbede56c4b4918da492443ce, author Mukunda Modell (cc twentyafterfour) [00:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:03] (03CR) 10jerkins-bot: [V: 04-1] profile: add dot_expander filter script [puppet] - 10https://gerrit.wikimedia.org/r/645459 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [00:18:42] Urbanecm: that commit is merged, just not pulled to the deploy server [00:19:45] twentyafterfour: it was at that server, but it wasn't fetched. I just ran git fetch to fetch the no-op patch I merged a few lines above, but wanted to log this (unexpected) state just in case. [00:21:13] (03PS2) 10Ryan Kemper: [wdqs] proper selector for machines running the streaming-updater [puppet] - 10https://gerrit.wikimedia.org/r/643941 (https://phabricator.wikimedia.org/T266986) (owner: 10DCausse) [00:23:13] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26969/console" [puppet] - 10https://gerrit.wikimedia.org/r/643941 (https://phabricator.wikimedia.org/T266986) (owner: 10DCausse) [00:24:56] (03CR) 10Ryan Kemper: [V: 03+1 C: 03+2] [wdqs] proper selector for machines running the streaming-updater [puppet] - 10https://gerrit.wikimedia.org/r/643941 (https://phabricator.wikimedia.org/T266986) (owner: 10DCausse) [00:26:15] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime [00:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:01] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 67101440 and 111 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:27:18] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime [00:27:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:24] !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime [00:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:19] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:28:54] (03PS6) 10Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) [00:29:46] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 74561024 and 274 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:30:09] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:22] (03CR) 10jerkins-bot: [V: 04-1] wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [00:32:11] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:37] (03PS7) 10Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) [00:34:54] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [00:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:03] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [00:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:16] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 
1881992 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:35:28] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1353244736 and 104 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:35:43] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload [00:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:00] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 249120184 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:38:14] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1271936 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:40:00] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 520245040 and 47 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:40:16] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 62602952 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:40:18] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 267941768 and 314 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:40:18] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 118540680 and 314 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:40:33] !log End of mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log in a tmux at mwmaint1002 
(wiki=eswiki; T246539) [00:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:41] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [00:40:52] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 137900728 and 349 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:42:40] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 222144480 and 457 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:42:50] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 45891928 and 468 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:43:20] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2170325048 and 96 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:44:12] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 855184568 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:44:14] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5733760008 and 402 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:44:38] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 370428400 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:44:56] (03PS2) 10Cwhite: profile: add dot_expander filter script [puppet] - 10https://gerrit.wikimedia.org/r/645459 (https://phabricator.wikimedia.org/T234565) [00:45:14] !log ryankemper@cumin1001 START - Cookbook 
sre.wdqs.data-transfer [00:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:22] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 98 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:29] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [00:45:34] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [00:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:36] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5032 and 113 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:36] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 70568 and 113 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:38] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 181371112 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:02] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3008 and 139 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:04] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 191208 and 141 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:04] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 26248 and 200 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:48:14] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: 
DB template1 (host:localhost) 410047368 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:48:33] (03PS8) 10Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) [00:50:52] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 831032 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:52] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1689232 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:36] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1226232 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:51:40] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1100120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:53:58] (03PS9) 10Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) [00:56:13] (03PS1) 10Cwhite: Revert "[wdqs] proper selector for machines running the streaming-updater" [puppet] - 10https://gerrit.wikimedia.org/r/645356 [00:56:18] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 893430128 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:22] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 724534008 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:30] (03CR) 10Cwhite: [C: 03+2] Revert "[wdqs] proper selector for machines running the 
streaming-updater" [puppet] - 10https://gerrit.wikimedia.org/r/645356 (owner: 10Cwhite) [00:56:40] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 71286456 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:40] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1643938336 and 76 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:57:28] ryankemper: FYI, I'm reverting the recent wdqs jmx-exporter changes as puppet is failing to apply on the prometheus hosts. [00:57:42] (03CR) 10jerkins-bot: [V: 04-1] Revert "[wdqs] proper selector for machines running the streaming-updater" [puppet] - 10https://gerrit.wikimedia.org/r/645356 (owner: 10Cwhite) [00:57:44] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 72680 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:57:50] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 126128 and 70 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:58:04] shdubsh: ack, thanks for catching that [00:58:10] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 85408 and 91 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:58:10] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 85408 and 91 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:58:48] My fault there, looks like I didn't target the correct instances with PCC [00:59:43] (03PS2) 10Cwhite: Revert "[wdqs] proper selector for machines running the streaming-updater" [puppet] - 10https://gerrit.wikimedia.org/r/645356 [00:59:53] (03CR) 10Cwhite: [V: 
03+2 C: 03+2] Revert "[wdqs] proper selector for machines running the streaming-updater" [puppet] - 10https://gerrit.wikimedia.org/r/645356 (owner: 10Cwhite) [01:04:21] (03CR) 10Dzahn: [V: 03+1] "finally compiles https://puppet-compiler.wmflabs.org/compiler1003/26974/wikistats-dancing-goat.wikistats.eqiad.wmflabs/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [01:13:08] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 165160 and 988 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:21:52] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 74488 and 1513 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:47:43] 10Operations, 10Analytics, 10SRE-Access-Requests: Kerberos Password - https://phabricator.wikimedia.org/T269472 (10Reedy) a:05Miriam→03None [01:50:58] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade MediaWiki appservers to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10Jdforrester-WMF) [01:58:53] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [01:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:31] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) [02:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:12:23] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [02:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:13:42] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [02:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:51] (03CR) 10Jforrester: "> 
Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [03:43:51] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [03:43:57] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [03:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:44:09] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [03:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:10] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [03:49:12] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [03:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:27] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [03:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:33] !log T269204 reimaging the following instances to debian buster (one each from `[public, internal] x [eqiad, codfw]`): `wdqs1005`, `wdqs2002`, `wdqs1008`, `wdqs2005` [04:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:05:42] T269204: Some wdqs metrics changed when switching to python3 - https://phabricator.wikimedia.org/T269204 [04:13:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=blazegraph site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:21:35] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime [04:21:36] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime [04:21:40] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:22:52] !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime [04:22:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:23:31] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [04:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:23:37] !log ryankemper@cumin2001 START - Cookbook sre.hosts.downtime [04:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:25:30] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [04:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:26:36] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [04:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:28:37] !log ryankemper@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [04:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:30:02] (03CR) 10Reedy: [C: 03+1] httpd: make it possible to configure server admin email address [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [04:44:12] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [04:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:22] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-transfer [04:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:27] !log ryankemper@cumin2001 START 
- Cookbook sre.wdqs.data-transfer [04:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:34] !log ryankemper@cumin2001 START - Cookbook sre.wdqs.data-transfer [04:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:49:26] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.41:80]) https://wikitech.wikimedia.org/wiki/PyBal [04:53:30] PROBLEM - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.41 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [04:54:56] ryankemper: just falling asleep -- need a hand with anything? [04:56:20] rzl: this is the last round of nodes for tonight, let me get that alert acked and take a look [04:56:32] rzl: I should be fine [04:57:30] okay -- I'm going to check back out then [04:58:04] rzl: please do, sorry for missing the page :/ [04:58:20] no worries, it happens :) [04:59:07] ACKNOWLEDGEMENT - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.41 and port 80: Connection refused Ryan Kemper phabricator.wikimedia.org/T269204 https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [05:00:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_updater site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:02:36] ACKNOWLEDGEMENT - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=jmx_wdqs_updater site=codfw Ryan Kemper related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/643941 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:11:06] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.41:80]) Ryan Kemper http://phabricator.wikimedia.org/T269204 https://wikitech.wikimedia.org/wiki/PyBal [05:11:06] ACKNOWLEDGEMENT - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.41:80]) Ryan Kemper http://phabricator.wikimedia.org/T269204 https://wikitech.wikimedia.org/wiki/PyBal [05:39:58] !log restarted pybal on `lvs1016` per the instructions in https://wikitech.wikimedia.org/wiki/PyBal#Services_known_to_PyBal_but_not_to_IPVS [05:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:13] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:43:21] ACKNOWLEDGEMENT - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-internal_80: Servers wdqs1003.eqiad.wmnet are marked down but pooled Ryan Kemper needs to be depooled https://wikitech.wikimedia.org/wiki/PyBal [05:46:37] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:49:17] !log restarted pybal on `lvs1015` per the instructions in https://wikitech.wikimedia.org/wiki/PyBal#Services_known_to_PyBal_but_not_to_IPVS [05:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:20] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [06:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:37] was asleep when it paged but I see it now and the backlog. 
ACKing it [06:09:35] !log ryankemper@cumin2001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [06:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [06:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:07] RECOVERY - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.034 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [06:12:42] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [06:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:21] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [06:14:37] ryankemper: looking good now. thanks and good weekend then [06:15:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:48:39] PROBLEM - very high load average likely xfs on ms-be1054 is CRITICAL: CRITICAL - load average: 108.14, 101.52, 87.93 https://wikitech.wikimedia.org/wiki/Swift [11:05:31] RECOVERY - very high load average likely xfs on ms-be1054 is OK: OK - load average: 64.72, 70.14, 78.46 https://wikitech.wikimedia.org/wiki/Swift [12:16:21] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 237400624 and 111 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:16:21] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 251832000 and 111 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:20:19] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 460944152 and 32 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:20:19] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18702336 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:20:23] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 218595336 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:21:13] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 39330624 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:22:45] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1779896 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring 
[12:23:05] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 136313008 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:23:51] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 55447568 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:25:15] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 38 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:25:31] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 54 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:03] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 48 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:03] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 48 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:03] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 48 and 85 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:15] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 26264 and 98 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:25] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 67676536 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:29] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 
(host:localhost) 121549688 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:26:31] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 113740496 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:27:37] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 178 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:27:57] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 153994328 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:30:23] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 80712 and 75 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:17] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 32088 and 130 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:19] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 56 and 132 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:21] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 56 and 134 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:30:48] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, and 2 others: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10Seppl2013) For the extensions https://www.mediawiki.org/wiki/Extension:Diagrams and https://www.mediawiki.org/wiki/Extension:Piwo the... 
[14:48:35] PROBLEM - very high load average likely xfs on ms-be1054 is CRITICAL: CRITICAL - load average: 101.78, 100.24, 85.51 https://wikitech.wikimedia.org/wiki/Swift [15:07:51] RECOVERY - very high load average likely xfs on ms-be1054 is OK: OK - load average: 51.78, 65.70, 77.29 https://wikitech.wikimedia.org/wiki/Swift [15:48:25] PROBLEM - very high load average likely xfs on ms-be1054 is CRITICAL: CRITICAL - load average: 105.67, 100.94, 88.96 https://wikitech.wikimedia.org/wiki/Swift [16:12:13] RECOVERY - very high load average likely xfs on ms-be1054 is OK: OK - load average: 30.73, 53.67, 74.06 https://wikitech.wikimedia.org/wiki/Swift [16:25:53] !log swift disable sdg1 on ms-be1054 [16:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:27] PROBLEM - Check systemd state on ms-be1051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:54:43] PROBLEM - Check systemd state on ms-be1053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:05] RECOVERY - Check systemd state on ms-be1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:21] RECOVERY - Check systemd state on ms-be1051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:49] PROBLEM - Check systemd state on cp1089 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:03:41] RECOVERY - Check systemd state on cp1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:38:57] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 9115 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:48:17] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 5698 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:51:55] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f1c52112518: Failed to establish a new connection: [Errno 111] Connection [23:51:55] ://wikitech.wikimedia.org/wiki/Search%23Administration [23:52:29] PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:07] RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 3 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops