[00:04:03] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01198 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[00:13:53] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[00:15:27] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[00:17:33] (PS3) CRusnov: base/phase.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364)
[00:19:19] (CR) CRusnov: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:21:10] (PS2) CRusnov: scripts/interface_automation.py: Clarify statusoverride flag [software/netbox-extras] - https://gerrit.wikimedia.org/r/627567
[00:26:01] (PS2) CRusnov: modules/tcpircbot: Port to Python3 [puppet] - https://gerrit.wikimedia.org/r/628436 (https://phabricator.wikimedia.org/T247364)
[00:30:19] (PS2) CRusnov: base/firewall/check_conntrack.py: Port to Python3 [puppet] - https://gerrit.wikimedia.org/r/630690 (https://phabricator.wikimedia.org/T247364)
[00:30:42] (CR) CRusnov: base/firewall/check_conntrack.py: Port to Python3 (3 comments) [puppet] - https://gerrit.wikimedia.org/r/630690 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:35:25] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 40867408 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:37:19] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 216345960 and 12 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:13] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19699416 and 380 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:53] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1640416 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:38:57] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 269875032 and 424 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:39:01] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19307328 and 428 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:45] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1634216 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:40:49] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 834551032 and 39 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:05] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 47994184 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:42:31] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 102 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:25] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 991264 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:43:49] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1893496 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:44:07] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1974848 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:45:55] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 89868648 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:33] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 788867160 and 45 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:48:59] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1413630360 and 77 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:49:17] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 44021048 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:41] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 111464 and 63 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:07] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 102232 and 88 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:25] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 920 and 107 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:29] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4736 and 110 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:59:57] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:41] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:33] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:00:33] (PS1) Andrew Bogott: OpenStack central logging: use LOG_LOCAL0 for everything [puppet] - https://gerrit.wikimedia.org/r/643995 (https://phabricator.wikimedia.org/T268175)
[02:05:45] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:59:09] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:29:52] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T268907 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform
[03:29:57] Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268907 (ops-monitoring-bot)
[03:32:48] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[03:47:36] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:59] Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268907 (Peachey88)
[04:12:03] Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (Peachey88)
[04:12:32] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:39:31] Puppet, Beta-Cluster-Infrastructure, Developer Productivity, Patch-For-Review: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (Krenair) ` alex@alex-laptop:~$ ssh deployment-puppetdb03 Linux deployment-puppetdb03 4.19.0-11-amd64 #1 SMP Debian...
[04:58:36] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:48:04] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:29:52] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:45:20] (PS1) Ladsgroup: thumbor: Migrate hiera to lookup [puppet] - https://gerrit.wikimedia.org/r/644001 (https://phabricator.wikimedia.org/T209953)
[07:47:09] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201128T0800)
[08:02:34] (PS1) Ladsgroup: kafka: Migrate hiera() to lookup() and setting datatype in monitoring [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953)
[08:09:04] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26752/" [puppet] - https://gerrit.wikimedia.org/r/644001 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:12:59] (CR) Ladsgroup: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26751/" [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[08:28:34] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:25] (CR) Ayounsi: [C: +1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/643995 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[08:33:08] (CR) Ayounsi: GeoDNS: Update entry for Wikia (1 comment) [dns] - https://gerrit.wikimedia.org/r/643983 (owner: TK-999)
[08:47:38] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:02:50] (CR) Elukey: "Very weird, running PCC yields to:" [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[09:06:16] (CR) Elukey: kafka: Migrate hiera() to lookup() and setting datatype in monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[09:17:03] (PS2) Ladsgroup: kafka: Migrate hiera() to lookup() and setting datatype in monitoring [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953)
[09:17:09] (CR) Ladsgroup: kafka: Migrate hiera() to lookup() and setting datatype in monitoring (1 comment) [puppet] - https://gerrit.wikimedia.org/r/644002 (https://phabricator.wikimedia.org/T209953) (owner: Ladsgroup)
[09:49:26] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:18] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:49:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:44] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:58] PROBLEM - Host mw1304.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:33:26] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 297308104 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:34:02] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 184143704 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:35:46] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1836552 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:36:56] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1399640 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:42:24] (CR) Volans: [C: +1] "Sure, why not" [software/netbox-extras] - https://gerrit.wikimedia.org/r/627567 (owner: CRusnov)
[15:04:10] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002782 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:30:37] (CR) Andrew Bogott: [C: +2] OpenStack central logging: use LOG_LOCAL0 for everything [puppet] - https://gerrit.wikimedia.org/r/643995 (https://phabricator.wikimedia.org/T268175) (owner: Andrew Bogott)
[20:20:16] PROBLEM - Host an-presto1004 is DOWN: PING CRITICAL - Packet loss = 100%
[20:22:18] RECOVERY - Host an-presto1004 is UP: PING WARNING - Packet loss = 33%, RTA = 0.20 ms
[22:46:34] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet