[00:05:02] PROBLEM - Check systemd state on clouddb1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:18] PROBLEM - Check systemd state on clouddb1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:36] PROBLEM - Check systemd state on clouddb1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:48] PROBLEM - Check systemd state on clouddb1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:02] PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:20] PROBLEM - Check systemd state on clouddb1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:54] PROBLEM - Check systemd state on clouddb1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:02] PROBLEM - Check systemd state on clouddb1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:16] ACKNOWLEDGEMENT - Check systemd state on clouddb1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:16] ACKNOWLEDGEMENT - Check systemd state on clouddb1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:16] ACKNOWLEDGEMENT - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:17] ACKNOWLEDGEMENT - Check systemd state on clouddb1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:18] ACKNOWLEDGEMENT - Check systemd state on clouddb1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:19] ACKNOWLEDGEMENT - Check systemd state on clouddb1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:20] ACKNOWLEDGEMENT - Check systemd state on clouddb1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:21] ACKNOWLEDGEMENT - Check systemd state on clouddb1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:10] PROBLEM - Varnish frontend child restarted on cp1087 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1087&var-datasource=eqiad+prometheus/ops
[02:20:22] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[02:29:54] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is OK: HTTP OK: HTTP/1.0 200 OK - 23620 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:20:20] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:33:12] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.174 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:57:50] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:00:04] Deploy window: No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210207T0800)
[08:42:50] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[08:52:48] SRE, Discovery-Search, CAS-SSO, Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (Gehel)
[10:03:10] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:18:03] SRE, MediaWiki-Site-system, Patch-For-Review, SEO: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402 (Techspit) Canonical url has to provide the real and original post or page. For example, https://techspit.com/bl...
[12:23:02] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
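The check_systemd_state alerts and acknowledgements earlier in the log fire whenever systemd reports the overall system state as "degraded", i.e. at least one unit has failed (here, per the acknowledgement, a flapping logrotate unit tracked in T274044). A minimal sketch of what such a probe boils down to, assuming a host with systemctl on the PATH; the actual check_systemd_state script used by Wikimedia monitoring may work differently:

    #!/usr/bin/env python3
    # Sketch only: illustrates the "degraded" state behind the alerts above,
    # not the production check_systemd_state implementation.
    import subprocess
    import sys

    # `systemctl is-system-running` prints a single word: running, degraded, maintenance, ...
    state = subprocess.run(
        ["systemctl", "is-system-running"],
        capture_output=True, text=True
    ).stdout.strip()

    if state == "running":
        print("OK - running: no failed units")
        sys.exit(0)

    # List the failed units so an acknowledgement (e.g. a flapping logrotate.service)
    # can name the culprit.
    failed = subprocess.run(
        ["systemctl", "--failed", "--no-legend", "--plain"],
        capture_output=True, text=True
    ).stdout.strip()
    print(f"CRITICAL - {state}: The system is operational but one or more units failed")
    print(failed)
    sys.exit(2)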
[14:36:03] (PS1) Elukey: cumin: make the hadoop-ui alias more granular [puppet] - https://gerrit.wikimedia.org/r/662117
[14:54:50] (PS2) Elukey: cumin: make some hadoop aliases more granular [puppet] - https://gerrit.wikimedia.org/r/662117
[14:55:40] (PS1) Elukey: sre.hadoop.init-hadoop-worker: fix wipe argument [cookbooks] - https://gerrit.wikimedia.org/r/662118
[14:55:42] (PS1) Elukey: sre.hadoop: add more hadoop cumin aliases [cookbooks] - https://gerrit.wikimedia.org/r/662119
[15:05:57] (PS3) Elukey: cumin: make some hadoop aliases more granular [puppet] - https://gerrit.wikimedia.org/r/662117
[15:07:11] (PS4) Elukey: cumin: make some hadoop aliases more granular [puppet] - https://gerrit.wikimedia.org/r/662117
[15:08:14] (PS2) Elukey: sre.hadoop: add more hadoop cumin aliases [cookbooks] - https://gerrit.wikimedia.org/r/662119
[16:13:45] PROBLEM - Host ncredir-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:15:23] RECOVERY - Host ncredir-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.62 ms
[16:16:56] Got a report in Discord from a user in Europe that Wikimedia sites were not loading two minutes ago, and are now very slow
[16:17:36] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 50.1 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:19:19] AntiComposite: can confirm
[16:19:21] AntiComposite: we are on it
[16:19:49] Can even get static to load. marostegui, happy to do trace routes etc
[16:20:00] * RhinosF1|NotHere can never remember the commands
[16:20:28] RhinosF1|NotHere: No worries, we believe we know the nature of the incident. Thanks though, much appreciated
[16:20:30] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 76.54 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:21:02] If you need anything, do shout marostegui
[16:21:11] RhinosF1|NotHere: thanks
[16:28:06] (PS1) Ayounsi: Add prepending to esams/knams transits [homer/public] - https://gerrit.wikimedia.org/r/662123
[16:32:34] hey RhinosF1|NotHere we believe things should be working now, could you check/ask?
[16:32:54] jynus: it looks fine to me
[16:33:10] great :-D
[16:34:01] (CR) Ayounsi: [C: +2] Add prepending to esams/knams transits [homer/public] - https://gerrit.wikimedia.org/r/662123 (owner: Ayounsi)
[16:34:06] * RhinosF1|NotHere had to try and guess the name of a file to test InstantCommons when it was down and it seems to actually behave now rather than cause crashes
[16:34:11] (CR) Ayounsi: [C: +2] "Pushed from laptop." [homer/public] - https://gerrit.wikimedia.org/r/662123 (owner: Ayounsi)
[16:34:32] (Merged) jenkins-bot: Add prepending to esams/knams transits [homer/public] - https://gerrit.wikimedia.org/r/662123 (owner: Ayounsi)
[16:34:46] users in Discord also report that everything appears fine now
[16:35:02] great!
[16:35:29] thanks, it helps to confirm what we see on our metrics to be 100% sure
[16:35:31] Thanks for the quick response
[16:36:08] Also thanks to whoever, whenever, made InstantCommons not crash your whole wiki during a wikimedia outage.
[16:37:34] Please enjoy the rest of the weekend if everything is calm. You deserve it.
[16:43:09] can someone update the topic back?
[16:57:13] (PS1) Cwhite: mw_rc_irc: add metrics endpoint to udpmxircecho [puppet] - https://gerrit.wikimedia.org/r/662124 (https://phabricator.wikimedia.org/T216611)
[16:57:15] (PS1) Cwhite: profile: add prometheus job for udpmxircecho [puppet] - https://gerrit.wikimedia.org/r/662125 (https://phabricator.wikimedia.org/T216611)
[16:58:58] (CR) jerkins-bot: [V: -1] profile: add prometheus job for udpmxircecho [puppet] - https://gerrit.wikimedia.org/r/662125 (https://phabricator.wikimedia.org/T216611) (owner: Cwhite)
[17:00:57] (PS2) Cwhite: profile: add prometheus job for udpmxircecho [puppet] - https://gerrit.wikimedia.org/r/662125 (https://phabricator.wikimedia.org/T216611)
[17:22:14] (CR) Cwhite: logstash: add ulogd ecs filter + tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: Filippo Giunchedi)
[18:49:04] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[18:51:40] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[22:12:40] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 272199024 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:15:18] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 953032 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:44:32] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[22:47:04] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[22:58:52] !log Reset password for TheresNoTime (T274087)
[22:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:50] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[23:38:08] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
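The two figures in the Postgres Replication Lag output above ("272199024 and 15 seconds") appear to be the standby's replay delay, first in bytes of WAL and then in seconds. A rough sketch of how a hot-standby delay probe can derive comparable numbers from the replica alone, assuming PostgreSQL 10+ and the psycopg2 driver; the production POSTGRES_HOT_STANDBY_DELAY check may compute these differently (it typically also consults the primary's current WAL position):

    # Sketch only: illustrative connection parameters, not the production check.
    import psycopg2

    conn = psycopg2.connect("dbname=puppetdb host=localhost")
    with conn, conn.cursor() as cur:
        # Byte lag: WAL received but not yet replayed on the standby.
        # Time lag: age of the last replayed transaction.
        cur.execute("""
            SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()),
                   EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
        """)
        bytes_behind, seconds_behind = cur.fetchone()
    conn.close()

    # pg_last_xact_replay_timestamp() can be NULL if nothing has been replayed yet.
    seconds_behind = seconds_behind or 0
    print(f"{int(bytes_behind)} and {int(seconds_behind)} seconds")  # cf. "272199024 and 15 seconds"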