[00:04:03] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01198 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:13:53] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[00:15:27] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[00:17:33] <wikibugs>	 (PS3) CRusnov: base/phase.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364)
[00:19:19] <wikibugs>	 (CR) CRusnov: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:21:10] <wikibugs>	 (PS2) CRusnov: scripts/interface_automation.py: Clarify statusoverride flag [software/netbox-extras] - https://gerrit.wikimedia.org/r/627567
[00:26:01] <wikibugs>	 (PS2) CRusnov: modules/tcpircbot: Port to Python3 [puppet] - https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364)
[00:30:19] <wikibugs>	 (PS2) CRusnov: base/firewall/check_conntrack.py: Port to Python3 [puppet] - https://gerrit.wikimedia.org/r/630690 (https://phabricator.wikimedia.org/T247364)
[00:30:42] <wikibugs>	 (CR) CRusnov: base/firewall/check_conntrack.py: Port to Python3 (3 comments) [puppet] - https://gerrit.wikimedia.org/r/630690 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:35:25] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 40867408 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:37:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 216345960 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:13] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19699416 and 380 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:53] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1640416 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:57] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 269875032 and 424 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:39:01] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19307328 and 428 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:45] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1634216 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:49] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 834551032 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:05] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 47994184 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:31] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 102 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:25] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 991264 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:49] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1893496 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:07] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1974848 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:55] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 89868648 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:33] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 788867160 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:59] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1413630360 and 77 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:17] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 44021048 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:41] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 111464 and 63 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:07] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 102232 and 88 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:25] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 920 and 107 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:29] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4736 and 110 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:59:57] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:41] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:33] <icinga-wm>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:33] <wikibugs>	 (PS1) Andrew Bogott: OpenStack central logging: use LOG_LOCAL0 for everything [puppet] - https://gerrit.wikimedia.org/r/643995 (https://phabricator.wikimedia.org/T268175)
[02:05:45] <icinga-wm>	 PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:59:09] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:29:52] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T268907 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform
[03:29:57] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268907 (ops-monitoring-bot)
[03:32:48] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[03:47:36] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:59] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268907 (Peachey88)
[04:12:03] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (Peachey88)
[04:12:32] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:39:31] <wikibugs>	 Puppet, Beta-Cluster-Infrastructure, Developer Productivity, Patch-For-Review: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (Krenair) ` alex@alex-laptop:~$ ssh deployment-puppetdb03 Linux deployment-puppetdb03 4.19.0-11-amd64 #1 SMP Debian...
[04:58:36] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:48:04] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:29:52] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:20] <wikibugs>	 (PS1) Ladsgroup: thumbor: Migrate hiera to lookup [puppet] - https://gerrit.wikimedia.org/r/644001 (https://phabricator.wikimedia.org/T209953)
[07:47:09] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201128T0800)
[08:02:34] <wikibugs>	 (PS1) Ladsgroup: kafka: Migrate hiera() to lookup() and setting datatype in monitoring [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953)
[08:09:04] <wikibugs>	 (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26752/" [puppet] - https://gerrit.wikimedia.org/r/644001 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:12:59] <wikibugs>	 (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26751/" [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:28:34] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:25] <wikibugs>	 (CR) Ayounsi: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/643995 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[08:33:08] <wikibugs>	 (CR) Ayounsi: GeoDNS: Update entry for Wikia (1 comment) [dns] - https://gerrit.wikimedia.org/r/643983 (owner: TK-999)
[08:47:38] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:02:50] <wikibugs>	 (CR) Elukey: "Very weird, running PCC yields to:" [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[09:06:16] <wikibugs>	 (CR) Elukey: kafka: Migrate hiera() to lookup() and setting datatype in monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[09:17:03] <wikibugs>	 (PS2) Ladsgroup: kafka: Migrate hiera() to lookup() and setting datatype in monitoring [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953)
[09:17:09] <wikibugs>	 (CR) Ladsgroup: kafka: Migrate hiera() to lookup() and setting datatype in monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[09:49:26] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:18] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:49:38] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:44] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:58] <icinga-wm>	 PROBLEM - Host mw1304.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:33:26] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 297308104 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:34:02] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 184143704 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:35:46] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1836552 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:36:56] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1399640 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:42:24] <wikibugs>	 (CR) Volans: [C: +1] "Sure, why not" [software/netbox-extras] - https://gerrit.wikimedia.org/r/627567 (owner: CRusnov)
[15:04:10] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002782 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:30:37] <wikibugs>	 (CR) Andrew Bogott: [C: +2] OpenStack central logging: use LOG_LOCAL0 for everything [puppet] - https://gerrit.wikimedia.org/r/643995 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[20:20:16] <icinga-wm>	 PROBLEM - Host an-presto1004 is DOWN: PING CRITICAL - Packet loss = 100%
[20:22:18] <icinga-wm>	 RECOVERY - Host an-presto1004 is UP: PING WARNING - Packet loss = 33%, RTA = 0.20 ms
[22:46:34] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet