[00:47:36] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3978274640 and 225 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:36] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 11737699120 and 688 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:47:38] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5374144472 and 300 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:04] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 252298512 and 72 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:16] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2306370376 and 232 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:30] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 227368784 and 148 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:46] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 13576 and 154 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:18] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4963093216 and 431 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:50:58] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 432640 and 227 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:51:10] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 104976 and 239 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:38] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 105560 and 326 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:52:40] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 221840 and 329 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:20] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 158440 and 489 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:00] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 104136 and 529 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:08] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 310323104 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:08] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 662386672 and 43 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:58] PROBLEM - Host mw2220.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[01:01:48] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 14024 and 104 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:01:48] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 14024 and 104 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:23:05] PROBLEM - snapshot of s7 in eqiad on alert1001 is CRITICAL: snapshot for s7 at eqiad taken more than 3 days ago: Most recent backup 2020-12-24 02:00:21 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[03:07:57] PROBLEM - Wikitech and wt-static content in sync on cloudweb2001-dev is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (202844s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:13:59] PROBLEM - Wikitech and wt-static content in sync on labweb1002 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (202844s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[04:41:41] PROBLEM - Wikitech and wt-static content in sync on labweb1001 is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (202844s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[08:58:05] PROBLEM - Check systemd state on ms-be1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:17:01] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1023 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:25:41] RECOVERY - Check systemd state on ms-be1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:47:37] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1023 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[17:58:49] (PS1) David Caro: wmcs.backup: move all but dumps to cloudvirt1025 [puppet] - https://gerrit.wikimedia.org/r/651992
[18:00:02] (CR) David Caro: [C: +2] wmcs.backup: move all but dumps to cloudvirt1025 [puppet] - https://gerrit.wikimedia.org/r/651992 (owner: David Caro)
[18:27:25] (CR) Andrew Bogott: "It's probably worth checking in with the dumps project people about backups -- if it's really just dumps then backing their stuff up at al" [puppet] - https://gerrit.wikimedia.org/r/651992 (owner: David Caro)
[18:57:04] (PS3) David Caro: [wmcs][backup] Add command to remove/print dangling snapshots [puppet] - https://gerrit.wikimedia.org/r/650535 (https://phabricator.wikimedia.org/T270478)
[18:57:06] (PS3) David Caro: [wmcs][backup] Remove all temp files after usage [puppet] - https://gerrit.wikimedia.org/r/650542 (https://phabricator.wikimedia.org/T270478)
[18:57:08] (PS3) David Caro: wmcs.backups: Add a images summary command [puppet] - https://gerrit.wikimedia.org/r/651166
[18:57:10] (PS2) David Caro: wmcs.backup: Add a method to create a vm backup [puppet] - https://gerrit.wikimedia.org/r/651507
[18:57:12] (PS3) David Caro: wmcs.backup: Remove all dangling snapshots [puppet] - https://gerrit.wikimedia.org/r/651537
[18:57:14] (PS2) David Caro: wmcs.backup: Add a way to remove old backups and snapshots [puppet] - https://gerrit.wikimedia.org/r/651550
[18:57:16] (PS2) David Caro: wmcs.backup: Add command to backup all assigned vms [puppet] - https://gerrit.wikimedia.org/r/651761
[18:57:18] (PS2) David Caro: wmcs.backup: add a command to remove non-handled backups [puppet] - https://gerrit.wikimedia.org/r/651776
[19:00:41] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state