[00:05:02] PROBLEM - Check systemd state on clouddb1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:18] PROBLEM - Check systemd state on clouddb1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:36] PROBLEM - Check systemd state on clouddb1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:48] PROBLEM - Check systemd state on clouddb1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:02] PROBLEM - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:20] PROBLEM - Check systemd state on clouddb1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:54] PROBLEM - Check systemd state on clouddb1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:02] PROBLEM - Check systemd state on clouddb1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:16] ACKNOWLEDGEMENT - Check systemd state on clouddb1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:16] ACKNOWLEDGEMENT - Check systemd state on clouddb1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:16] ACKNOWLEDGEMENT - Check systemd state on clouddb1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:17] ACKNOWLEDGEMENT - Check systemd state on clouddb1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:18] ACKNOWLEDGEMENT - Check systemd state on clouddb1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:19] ACKNOWLEDGEMENT - Check systemd state on clouddb1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:20] ACKNOWLEDGEMENT - Check systemd state on clouddb1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:21:21] ACKNOWLEDGEMENT - Check systemd state on clouddb1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Bstorm logrotate has a flutter and needs puppet adjustments T274044 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:13:10] PROBLEM - Varnish frontend child restarted on cp1087 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp1087&var-datasource=eqiad+prometheus/ops
[02:20:22] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[02:29:54] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1081 is OK: HTTP OK: HTTP/1.0 200 OK - 23620 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[03:20:20] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:33:12] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 8.174 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[05:57:50] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 101 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:00:04] Deploy window: No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210207T0800)
[08:42:50] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[08:52:48] SRE, Discovery-Search, CAS-SSO, Patch-For-Review: Investigate CAS Session logout - https://phabricator.wikimedia.org/T273867 (Gehel)
[10:03:10] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 102 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[11:18:03] SRE, MediaWiki-Site-system, Patch-For-Review, SEO: URLs for the same title without extra query parameters should have the same canonical link - https://phabricator.wikimedia.org/T67402 (Techspit) Canonical url has to provide the real and original post or page. For example, https://techspit.com/bl...
[12:23:02] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
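The check_systemd_state alerts and acknowledgements earlier in the log fire whenever systemd reports the overall system state as "degraded", i.e. at least one unit has failed (here, per the acknowledgement, a flapping logrotate unit tracked in T274044). A minimal sketch of what such a probe boils down to, assuming a host with systemctl on the PATH; the actual check_systemd_state script used by Wikimedia monitoring may work differently:

    #!/usr/bin/env python3
    # Sketch only: illustrates the "degraded" state behind the alerts above,
    # not the production check_systemd_state implementation.
    import subprocess
    import sys

    # `systemctl is-system-running` prints a single word: running, degraded, maintenance, ...
    state = subprocess.run(
        ["systemctl", "is-system-running"],
        capture_output=True, text=True
    ).stdout.strip()

    if state == "running":
        print("OK - running: no failed units")
        sys.exit(0)

    # List the failed units so an acknowledgement (e.g. a flapping logrotate.service)
    # can name the culprit.
    failed = subprocess.run(
        ["systemctl", "--failed", "--no-legend", "--plain"],
        capture_output=True, text=True
    ).stdout.strip()
    print(f"CRITICAL - {state}: The system is operational but one or more units failed")
    print(failed)
    sys.exit(2)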
[14:36:03] (PS1) Elukey: cumin: make the hadoop-ui alias more granular [puppet] - https://gerrit.wikimedia.org/r/662117
[14:54:50] (PS2) Elukey: cumin: make some hadoop aliases more granular [puppet] - https://gerrit.wikimedia.org/r/662117
[14:55:40] (PS1) Elukey: sre.hadoop.init-hadoop-worker: fix wipe argument [cookbooks] - https://gerrit.wikimedia.org/r/662118
[14:55:42] (PS1) Elukey: sre.hadoop: add more hadoop cumin aliases [cookbooks] - https://gerrit.wikimedia.org/r/662119
[15:05:57] (PS3) Elukey: cumin: make some hadoop aliases more granular [puppet] - https://gerrit.wikimedia.org/r/662117
[15:07:11] (PS4) Elukey: cumin: make some hadoop aliases more granular [puppet] - https://gerrit.wikimedia.org/r/662117
[15:08:14] (PS2) Elukey: sre.hadoop: add more hadoop cumin aliases [cookbooks] - https://gerrit.wikimedia.org/r/662119
[16:13:45] PROBLEM - Host ncredir-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[16:15:23] RECOVERY - Host ncredir-lb.esams.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 83.62 ms
[16:16:56] Got a report in Discord from a user in Europe that Wikimedia sites were not loading two minutes ago, and are now very slow
[16:17:36] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 50.1 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:19:19] AntiComposite: can confirm
[16:19:21] AntiComposite: we are on it
[16:19:49] Can even get static to load. marostegui, happy to do trace routes etc
[16:20:00] * RhinosF1|NotHere can never remember the commands
[16:20:28] RhinosF1|NotHere: No worries, we believe we know the nature of the incident. Thanks though, much appreciated
[16:20:30] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 76.54 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:21:02] If you need anything, do shout marostegui
[16:21:11] RhinosF1|NotHere: thanks
[16:28:06] (PS1) Ayounsi: Add prepending to esams/knams transits [homer/public] - https://gerrit.wikimedia.org/r/662123
[16:32:34] hey RhinosF1|NotHere we believe things should be working now, could you check/ask?
[16:32:54] jynus: it looks fine to me
[16:33:10] great :-D
[16:34:01] (CR) Ayounsi: [C: +2] Add prepending to esams/knams transits [homer/public] - https://gerrit.wikimedia.org/r/662123 (owner: Ayounsi)
[16:34:06] * RhinosF1|NotHere had to try and guess the name of a file to test InstantCommons when it was down and it seems to actually behave now rather than cause crashes
[16:34:11] (CR) Ayounsi: [C: +2] "Pushed from laptop." [homer/public] - https://gerrit.wikimedia.org/r/662123 (owner: Ayounsi)
[16:34:32] (Merged) jenkins-bot: Add prepending to esams/knams transits [homer/public] - https://gerrit.wikimedia.org/r/662123 (owner: Ayounsi)
[16:34:46] users in Discord also report that everything appears fine now
[16:35:02] great!
[16:35:29] thanks, it helps to confirm what we see on our metrics to be 100% sure
[16:35:31] Thanks for the quick response
[16:36:08] Also thanks to whoever, whenever, made InstantCommons not crash your whole wiki during a wikimedia outage.
[16:37:34] Please enjoy the rest of the weekend if everything is calm. You deserve it.
[16:43:09] can someone update the topic back?
[16:57:13] (PS1) Cwhite: mw_rc_irc: add metrics endpoint to udpmxircecho [puppet] - https://gerrit.wikimedia.org/r/662124 (https://phabricator.wikimedia.org/T216611)
[16:57:15] (PS1) Cwhite: profile: add prometheus job for udpmxircecho [puppet] - https://gerrit.wikimedia.org/r/662125 (https://phabricator.wikimedia.org/T216611)
[16:58:58] (CR) jerkins-bot: [V: -1] profile: add prometheus job for udpmxircecho [puppet] - https://gerrit.wikimedia.org/r/662125 (https://phabricator.wikimedia.org/T216611) (owner: Cwhite)
[17:00:57] (PS2) Cwhite: profile: add prometheus job for udpmxircecho [puppet] - https://gerrit.wikimedia.org/r/662125 (https://phabricator.wikimedia.org/T216611)
[17:22:14] (CR) Cwhite: logstash: add ulogd ecs filter + tests (1 comment) [puppet] - https://gerrit.wikimedia.org/r/647265 (https://phabricator.wikimedia.org/T234565) (owner: Filippo Giunchedi)
[18:49:04] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[18:51:40] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[22:12:40] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 272199024 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:15:18] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 953032 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[22:44:32] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[22:47:04] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[22:58:52] !log Reset password for TheresNoTime (T274087)
[22:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:50] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[23:38:08] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
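The two figures in the Postgres Replication Lag output above ("272199024 and 15 seconds") appear to be the standby's replay delay, first in bytes of WAL and then in seconds. A rough sketch of how a hot-standby delay probe can derive comparable numbers from the replica alone, assuming PostgreSQL 10+ and the psycopg2 driver; the production POSTGRES_HOT_STANDBY_DELAY check may compute these differently (it typically also consults the primary's current WAL position):

    # Sketch only: illustrative connection parameters, not the production check.
    import psycopg2

    conn = psycopg2.connect("dbname=puppetdb host=localhost")
    with conn, conn.cursor() as cur:
        # Byte lag: WAL received but not yet replayed on the standby.
        # Time lag: age of the last replayed transaction.
        cur.execute("""
            SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()),
                   EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
        """)
        bytes_behind, seconds_behind = cur.fetchone()
    conn.close()

    # pg_last_xact_replay_timestamp() can be NULL if nothing has been replayed yet.
    seconds_behind = seconds_behind or 0
    print(f"{int(bytes_behind)} and {int(seconds_behind)} seconds")  # cf. "272199024 and 15 seconds"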