[00:00:38] <icinga-wm>	 RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:32] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:04:40] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:08:10] <icinga-wm>	 PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:54] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:12:00] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:16:16] <icinga-wm>	 RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:42:38] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 166192216 and 14 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:47:42] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 19520 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:17:56] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:38:28] <wikibugs>	 SRE, serviceops, User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (jijiki) Stalled→Resolved The reasons have been identified in https://phabricator.wikimedia.org/T253673#6569013
[05:45:40] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:45:50] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:46] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:50] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:22] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:24] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:38] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:50:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210424T0700)
[08:37:59] <wikibugs>	 (CR) Ladsgroup: [C: +1] lists: Fix mailman3 apache config [puppet] - https://gerrit.wikimedia.org/r/681785 (https://phabricator.wikimedia.org/T278612) (owner: Legoktm)
[09:57:40] <icinga-wm>	 PROBLEM - Check for expired certificates debmonitor_discovery_wmnet on pki2001 is CRITICAL: CRITICAL - 1593 certs expiry in 2 days, 104 certs expiry in 1 days https://wikitech.wikimedia.org/wiki/PKI/Debugging
[09:58:16] <icinga-wm>	 PROBLEM - Check for expired certificates debmonitor_discovery_wmnet on pki1001 is CRITICAL: CRITICAL - 1540 certs expiry in 2 days, 149 certs expiry in 1 days https://wikitech.wikimedia.org/wiki/PKI/Debugging
[10:01:30] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:15:25] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/682235
[13:49:22] <icinga-wm>	 RECOVERY - Disk space on mwlog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops
[13:53:08] <icinga-wm>	 PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:57:10] <wikibugs>	 SRE, Wikimedia-SVG-rendering, serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (JoKalliauer)  >>! In T280718#7023966, @akosiaris wrote: > [...] I propose we remove that file from `mediawik-config...
[14:54:26] <icinga-wm>	 RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:57:31] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[16:04:58] <icinga-wm>	 PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[16:05:54] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs1004 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.158e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:07:28] <icinga-wm>	 RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 3 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test
[17:09:13] <wikibugs>	 (PS1) Southparkfan: Add WMCS specific cloud role for syslog server [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717)
[17:12:08] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Upstream: Mailman mailing list messages should link to the web archive version in the footer - https://phabricator.wikimedia.org/T64949 (Ladsgroup) Agreed. Let's close it once migrated to mailman3 and people can reopen if they feel strongly otherwise.
[17:14:13] <wikibugs>	 (CR) Southparkfan: Add WMCS specific cloud role for syslog server (2 comments) [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan)
[17:34:01] <wikibugs>	 (PS1) Ladsgroup: snapshot: Migrate cronjobs in commonsdumps to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682260 (https://phabricator.wikimedia.org/T273673)
[17:34:02] <wikibugs>	 (PS1) Ladsgroup: snapshot: Migrate cronjobs in wikidatadumps to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682261 (https://phabricator.wikimedia.org/T273673)
[17:40:50] <icinga-wm>	 PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:31:31] <wikibugs>	 SRE, Thumbor, serviceops, Performance-Team (Radar), User-jijiki: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (AntiCompositeNumber) The current version of Thumbor in https://apt-browser.toolforge.org/stretch-wiki...
[18:38:46] <wikibugs>	 SRE, Wikimedia-SVG-rendering, serviceops-radar: Re-evaluate whether keeping around https://noc.wikimedia.org/conf/fc-list is a good practive - https://phabricator.wikimedia.org/T280718 (AntiCompositeNumber) >>! In T280718#7031176, @JoKalliauer wrote: >  >>>! In T280718#7023966, @akosiaris wrote: >> [...
[18:42:10] <icinga-wm>	 RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:01:44] <icinga-wm>	 PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:19:01] <wikibugs>	 (PS1) Andrew Bogott: Turn on debug logging for nova-fullstack [puppet] - https://gerrit.wikimedia.org/r/682265 (https://phabricator.wikimedia.org/T280514)
[20:20:28] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Turn on debug logging for nova-fullstack [puppet] - https://gerrit.wikimedia.org/r/682265 (https://phabricator.wikimedia.org/T280514) (owner: Andrew Bogott)
[20:49:26] <icinga-wm>	 PROBLEM - snapshot of s1 in codfw on alert1001 is CRITICAL: snapshot for s1 at codfw taken more than 3 days ago: Most recent backup 2021-04-21 20:39:26 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[21:03:10] <icinga-wm>	 RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:20:50] <icinga-wm>	 PROBLEM - snapshot of s8 in codfw on alert1001 is CRITICAL: snapshot for s8 at codfw taken more than 3 days ago: Most recent backup 2021-04-21 20:58:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[22:12:06] <icinga-wm>	 PROBLEM - Host labstore1007 is DOWN: PING CRITICAL - Packet loss = 100%
[22:16:16] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Bstorm investigating error https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:16:16] <icinga-wm>	 ACKNOWLEDGEMENT - NFS on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Bstorm investigating error https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore
[22:16:17] <icinga-wm>	 ACKNOWLEDGEMENT - HTTP on labstore1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Bstorm investigating error https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29
[22:16:18] <icinga-wm>	 ACKNOWLEDGEMENT - Host labstore1007 is DOWN: PING CRITICAL - Packet loss = 100% Bstorm investigating error
[22:24:18] <bstorm>	 !log Rebooting labstore1007 from ilo after crash
[22:24:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:06] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:27:40] <icinga-wm>	 RECOVERY - Host labstore1007 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[22:29:46] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:31:04] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:44:40] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:14:22] <wikibugs>	 (PS1) Bstorm: dumps-distribution: fail over cloud NFS primary to labstore1006 [puppet] - https://gerrit.wikimedia.org/r/682273 (https://phabricator.wikimedia.org/T281045)
[23:16:55] <wikibugs>	 (CR) Bstorm: [C: +2] dumps-distribution: fail over cloud NFS primary to labstore1006 [puppet] - https://gerrit.wikimedia.org/r/682273 (https://phabricator.wikimedia.org/T281045) (owner: Bstorm)
[23:19:50] <icinga-wm>	 PROBLEM - Disk space on mwlog1001 is CRITICAL: DISK CRITICAL - free space: /srv 274062 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops
[23:29:16] <icinga-wm>	 PROBLEM - snapshot of s2 in codfw on alert1001 is CRITICAL: snapshot for s2 at codfw taken more than 3 days ago: Most recent backup 2021-04-21 23:09:05 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting