[16:00:33] Test logging
[16:03:26] Operations, WM-Bot, User-MacFan4000: No wm-bot logs for #wikimedia-operations since 2020-11-18 - https://phabricator.wikimedia.org/T268889 (MacFan4000) Open→Resolved a:MacFan4000 After using @system-rejoin-all for all wm-bot instances, it seems to be working now.
[16:09:47] (PS9) Elukey: Enable kerberos in kerberos::systemd_timer by default [puppet] - https://gerrit.wikimedia.org/r/642446 (https://phabricator.wikimedia.org/T268220)
[16:13:15] (PS10) Elukey: Enable kerberos in kerberos::systemd_timer by default [puppet] - https://gerrit.wikimedia.org/r/642446 (https://phabricator.wikimedia.org/T268220)
[16:17:03] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26749/console" [puppet] - https://gerrit.wikimedia.org/r/642446 (https://phabricator.wikimedia.org/T268220) (owner: Elukey)
[16:26:23] (PS11) Elukey: Enable kerberos in kerberos::systemd_timer by default [puppet] - https://gerrit.wikimedia.org/r/642446 (https://phabricator.wikimedia.org/T268220)
[16:30:09] (CR) Elukey: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26750/console" [puppet] - https://gerrit.wikimedia.org/r/642446 (https://phabricator.wikimedia.org/T268220) (owner: Elukey)
[17:08:11] (PS1) JMeybohm: Add charts for calico and calico-crds [deployment-charts] - https://gerrit.wikimedia.org/r/643974 (https://phabricator.wikimedia.org/T267653)
[17:30:02] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime
[17:30:03] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[17:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:48:42] (PS1) Ayounsi: Run Homer during the decom cookbook [cookbooks] - https://gerrit.wikimedia.org/r/643979
[17:49:56] (CR) jerkins-bot: [V: -1] Run Homer during the decom cookbook [cookbooks] - https://gerrit.wikimedia.org/r/643979 (owner: Ayounsi)
[17:52:11] (PS1) Jcrespo: [WIP] We continue with swift listing tests for media backups [software/wmfbackups] - https://gerrit.wikimedia.org/r/643980
[17:52:40] (CR) jerkins-bot: [V: -1] [WIP] We continue with swift listing tests for media backups [software/wmfbackups] - https://gerrit.wikimedia.org/r/643980 (owner: Jcrespo)
[17:59:57] Operations, Data-Persistence-Backup, SRE-swift-storage, Goal, Patch-For-Review: Prepare a proof of concept of the minimum setup capable of backup and recover testwiki media files - https://phabricator.wikimedia.org/T264189 (jcrespo) There are 78950900 files on swift from commons wiki, countin...
[18:24:29] (CR) Volans: [C: -1] "Small typo, optional question inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/643945 (owner: Jbond)
[18:28:56] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01009 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[18:31:36] (CR) Volans: "Some questions inline." (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/643979 (owner: Ayounsi)
[18:48:40] (PS1) Volans: CHANGELOG: add changelogs for release v0.0.45 [software/spicerack] - https://gerrit.wikimedia.org/r/643982
[18:54:10] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:19:26] (PS1) TK-999: GeoDNS: Update entry for Wikia [dns] - https://gerrit.wikimedia.org/r/643983
[21:00:40] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:04:27] ACKNOWLEDGEMENT - HP RAID on ms-be1030 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T268896 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform
[21:04:31] Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268896 (ops-monitoring-bot)
[21:47:21] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:37:51] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:59:41] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:45:05] (CR) DannyS712: [C: -1] Add log channel Wikibase.IdGenerator (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/643874 (https://phabricator.wikimedia.org/T268625) (owner: Lucas Werkmeister (WMDE))
[23:47:45] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:57:10] Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268896 (Peachey88)
[23:57:13] Operations, ops-eqiad: Degraded RAID on ms-be1030 - https://phabricator.wikimedia.org/T268036 (Peachey88)