[00:00:38] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:32] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:04:40] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:08:10] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:54] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:12:00] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:16:16] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:42:38] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 166192216 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:47:42] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 19520 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:17:56] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:38:28] SRE, serviceops, User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (jijiki) Stalled→Resolved The reasons have been identified in https://phabricator.wikimedia.org/T253673#6569013
[05:45:40] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:45:50] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:46] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:22] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:38] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:40] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210424T0700)
[08:37:59] (CR) Ladsgroup: [C: +1] lists: Fix mailman3 apache config [puppet] - https://gerrit.wikimedia.org/r/681785 (https://phabricator.wikimedia.org/T278612) (owner: Legoktm)
[09:57:40] PROBLEM - Check for expired certificates debmonitor_discovery_wmnet on pki2001 is CRITICAL: CRITICAL - 1593 certs expiry in 2 days, 104 certs expiry in 1 days https://wikitech.wikimedia.org/wiki/PKI/Debugging
[09:58:16] PROBLEM - Check for expired certificates debmonitor_discovery_wmnet on pki1001 is CRITICAL: CRITICAL - 1540 certs expiry in 2 days, 149 certs expiry in 1 days https://wikitech.wikimedia.org/wiki/PKI/Debugging
[10:01:30] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:15:25] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/682235
[13:49:22] RECOVERY - Disk space on mwlog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops
[13:53:08] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:57:10] SRE, Wikimedia-SVG-rendering, serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (JoKalliauer) >>! In T280718#7023966, @akosiaris wrote: > [...] I propose we remove that file from `mediawik-config...
[14:54:26] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:57:31] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[16:04:58] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[16:05:54] RECOVERY - WDQS high update lag on wdqs1004 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.158e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:07:28] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 3 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[17:09:13] (PS1) Southparkfan: Add WMCS specific cloud role for syslog server [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717)
[17:12:08] SRE, Wikimedia-Mailing-lists, Upstream: Mailman mailing list messages should link to the web archive version in the footer - https://phabricator.wikimedia.org/T64949 (Ladsgroup) Agreed. Let's close it once migrated to mailman3 and people can reopen if they feel strongly otherwise.
[17:14:13] (CR) Southparkfan: Add WMCS specific cloud role for syslog server (2 comments) [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[17:34:01] (PS1) Ladsgroup: snapshot: Migrate cronjobs in commonsdumps to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682260 (https://phabricator.wikimedia.org/T273673)
[17:34:02] (PS1) Ladsgroup: snapshot: Migrate cronjobs in wikidatadumps to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682261 (https://phabricator.wikimedia.org/T273673)
[17:40:50] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:31:31] SRE, Thumbor, serviceops, Performance-Team (Radar), User-jijiki: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (AntiCompositeNumber) The current version of Thumbor in https://apt-browser.toolforge.org/stretch-wiki...
[18:38:46] SRE, Wikimedia-SVG-rendering, serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (AntiCompositeNumber) >>! In T280718#7031176, @JoKalliauer wrote: > >>>! In T280718#7023966, @akosiaris wrote: >> [...
[18:42:10] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:01:44] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:19:01] (PS1) Andrew Bogott: Turn on debug logging for nova-fullstack [puppet] - https://gerrit.wikimedia.org/r/682265 (https://phabricator.wikimedia.org/T280514)
[20:20:28] (CR) Andrew Bogott: [C: +2] Turn on debug logging for nova-fullstack [puppet] - https://gerrit.wikimedia.org/r/682265 (https://phabricator.wikimedia.org/T280514) (owner: Andrew Bogott)
[20:49:26] PROBLEM - snapshot of s1 in codfw on alert1001 is CRITICAL: snapshot for s1 at codfw taken more than 3 days ago: Most recent backup 2021-04-21 20:39:26 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:03:10] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:20:50] PROBLEM - snapshot of s8 in codfw on alert1001 is CRITICAL: snapshot for s8 at codfw taken more than 3 days ago: Most recent backup 2021-04-21 20:58:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[22:12:06] PROBLEM - Host labstore1007 is DOWN: PING CRITICAL - Packet loss = 100%
[22:16:16] ACKNOWLEDGEMENT - SSH on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Bstorm investigating error https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:16:16] ACKNOWLEDGEMENT - NFS on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Bstorm investigating error https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore
[22:16:17] ACKNOWLEDGEMENT - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Bstorm investigating error https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29
[22:16:18] ACKNOWLEDGEMENT - Host labstore1007 is DOWN: PING CRITICAL - Packet loss = 100% Bstorm investigating error
[22:24:18] !log Rebooting labstore1007 from ilo after crash
[22:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:06] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:27:40] RECOVERY - Host labstore1007 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[22:29:46] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:04] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:44:40] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:14:22] (PS1) Bstorm: dumps-distribution: fail over cloud NFS primary to labstore1006 [puppet] - https://gerrit.wikimedia.org/r/682273 (https://phabricator.wikimedia.org/T281045)
[23:16:55] (CR) Bstorm: [C: +2] dumps-distribution: fail over cloud NFS primary to labstore1006 [puppet] - https://gerrit.wikimedia.org/r/682273 (https://phabricator.wikimedia.org/T281045) (owner: Bstorm)
[23:19:50] PROBLEM - Disk space on mwlog1001 is CRITICAL: DISK CRITICAL - free space: /srv 274062 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops
[23:29:16] PROBLEM - snapshot of s2 in codfw on alert1001 is CRITICAL: snapshot for s2 at codfw taken more than 3 days ago: Most recent backup 2021-04-21 23:09:05 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting