[00:06:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:09:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:12:33] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:19:43] PROBLEM - SSH on ms-be2028 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[01:22:49] PROBLEM - Host ms-be2028 is DOWN: PING CRITICAL - Packet loss = 100%
[03:43:57] RECOVERY - Host ms-be2028 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms
[03:44:27] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:19] RECOVERY - SSH on ms-be2028 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:46:47] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:55:03] PROBLEM - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[03:55:06] ACKNOWLEDGEMENT - HP RAID on ms-be2028 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:2 - OK: 1I:1:1, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Temporarily Disabled - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T279245 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardwa
[03:55:06] on_Gathering
[03:55:10] SRE, ops-codfw: Degraded RAID on ms-be2028 - https://phabricator.wikimedia.org/T279245 (ops-monitoring-bot)
[04:15:35] (PS1) Andrew Bogott: Ussury policy.yaml files: remove redundant rules [puppet] - https://gerrit.wikimedia.org/r/676778 (https://phabricator.wikimedia.org/T261136)
[04:19:53] (PS1) Andrew Bogott: Horizon: define a limited set of OPENSTACK_IMAGE_BACKENDs [puppet] - https://gerrit.wikimedia.org/r/676779 (https://phabricator.wikimedia.org/T261138)
[04:20:49] (CR) Andrew Bogott: [C: +2] Ussury policy.yaml files: remove redundant rules [puppet] - https://gerrit.wikimedia.org/r/676778 (https://phabricator.wikimedia.org/T261136) (owner: Andrew Bogott)
[04:20:59] (CR) Andrew Bogott: [C: +2] Horizon: define a limited set of OPENSTACK_IMAGE_BACKENDs [puppet] - https://gerrit.wikimedia.org/r/676779 (https://phabricator.wikimedia.org/T261138) (owner: Andrew Bogott)
[05:47:42] PROBLEM - Persistent high iowait on labstore1006 is CRITICAL: 10.21 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[06:06:54] RECOVERY - Persistent high iowait on labstore1006 is OK: (C)10 ge (W)5 ge 3.023 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[07:03:48] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[07:19:20] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:21:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:06:20] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:45:00] SRE, LDAP-Access-Requests: CAS SSO for reedy - https://phabricator.wikimedia.org/T279244 (Urbanecm) > was trying to access racktables [...] I need U2F enabling? According to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/3ca366ff6ba1ed57a19de43ec8001193eeeb4ce6/hieradata/role/common...
[11:49:15] (CR) Urbanecm: [C: -2] "blocked on community consensus, placing a blocking -2 for now" [mediawiki-config] - https://gerrit.wikimedia.org/r/676716 (https://phabricator.wikimedia.org/T279226) (owner: Zabe)
[14:45:31] !log andrew@deploy1002 Started deploy [horizon/deploy@df2b0b4]: upgrade labtesthorizon to the Wallaby branch
[14:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:07] !log andrew@deploy1002 Finished deploy [horizon/deploy@df2b0b4]: upgrade labtesthorizon to the Wallaby branch (duration: 01m 36s)
[14:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:49] (CR) Andrew Bogott: [C: +1] "I've confirmed that 'which python3' succeeds on all cloud-vps hosts (other than the ones that are so broken that Cumin can't reach them)." [puppet] - https://gerrit.wikimedia.org/r/670933 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[17:32:30] (PS1) Andrew Bogott: eqiad1 designate -> Ussuri [puppet] - https://gerrit.wikimedia.org/r/676815 (https://phabricator.wikimedia.org/T261136)
[17:34:04] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:36:26] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:36:47] (CR) Andrew Bogott: [C: +2] eqiad1 designate -> Ussuri [puppet] - https://gerrit.wikimedia.org/r/676815 (https://phabricator.wikimedia.org/T261136) (owner: Andrew Bogott)
[17:53:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=cloud_dev_pdns site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:57:56] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets