[00:06:55] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:09:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:12:33] <icinga-wm>	 RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:19:43] <icinga-wm>	 PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:22:49] <icinga-wm>	 PROBLEM - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100%
[03:43:57] <icinga-wm>	 RECOVERY - Host ms-be2028 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms
[03:44:27] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:19] <icinga-wm>	 RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:46:47] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:55:03] <icinga-wm>	 PROBLEM - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[03:55:06] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T279245 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[03:55:10] <wikibugs>	 SRE, ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279245 (ops-monitoring-bot)
[04:15:35] <wikibugs>	 (PS1) Andrew Bogott: Ussury policy.yaml files: remove redundant rules [puppet] - https://gerrit.wikimedia.org/r/676778 (https://phabricator.wikimedia.org/T261136)
[04:19:53] <wikibugs>	 (PS1) Andrew Bogott: Horizon: define a limited set of OPENSTACK_IMAGE_BACKENDs [puppet] - https://gerrit.wikimedia.org/r/676779 (https://phabricator.wikimedia.org/T261138)
[04:20:49] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Ussury policy.yaml files: remove redundant rules [puppet] - https://gerrit.wikimedia.org/r/676778 (https://phabricator.wikimedia.org/T261136) (owner: Andrew Bogott)
[04:20:59] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Horizon: define a limited set of OPENSTACK_IMAGE_BACKENDs [puppet] - https://gerrit.wikimedia.org/r/676779 (https://phabricator.wikimedia.org/T261138) (owner: Andrew Bogott)
[05:47:42] <icinga-wm>	 PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 10.21 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[06:06:54] <icinga-wm>	 RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 3.023 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[07:03:48] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:19:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:21:40] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:06:20] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:45:00] <wikibugs>	 SRE, LDAP-Access-Requests: CAS SSO for reedy - https://phabricator.wikimedia.org/T279244 (Urbanecm) >  was trying to access racktables [...] I need U2F enabling?  According to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/3ca366ff6ba1ed57a19de43ec8001193eeeb4ce6/hieradata/role/common...
[11:49:15] <wikibugs>	 (CR) Urbanecm: [C: -2] "blocked on community consensus, placing a blocking -2 for now" [mediawiki-config] - https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226) (owner: Zabe)
[14:45:31] <logmsgbot>	 !log andrew@deploy1002 Started deploy [horizon/deploy@df2b0b4]: upgrade labtesthorizon to the Wallaby branch
[14:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:07] <logmsgbot>	 !log andrew@deploy1002 Finished deploy [horizon/deploy@df2b0b4]: upgrade labtesthorizon to the Wallaby branch (duration: 01m 36s)
[14:47:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:49] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "I've confirmed that 'which python3' succeeds on all cloud-vps hosts (other than the ones that are so broken that Cumin can't reach them)." [puppet] - https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[17:32:30] <wikibugs>	 (PS1) Andrew Bogott: eqiad1 designate -> Ussuri [puppet] - https://gerrit.wikimedia.org/r/676815 (https://phabricator.wikimedia.org/T261136)
[17:34:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:36:26] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:36:47] <wikibugs>	 (CR) Andrew Bogott: [C: +2] eqiad1 designate -> Ussuri [puppet] - https://gerrit.wikimedia.org/r/676815 (https://phabricator.wikimedia.org/T261136) (owner: Andrew Bogott)
[17:53:12] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=cloud_dev_pdns site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:57:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets