[01:13:13] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:19] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:10] SRE, Commons, Traffic, Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (1233thehongkonger) Quick fly-by, but this image corresponds to an internet meme in Hong Kong, though seems unlikely it aff...
[03:54:27] SRE, Commons, Traffic, Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Krinkle) The UA in question used an empty user agent string. See also [User-Agent policy](https://meta.wikimedia.org/wiki/...
[04:07:39] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:25] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:47] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:42:31] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:55] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:05] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:42:36] SRE, Traffic, Patch-For-Review, Services (watching), Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (DannyS712)
[07:43:23] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:48:29] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
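
relforge1003 and relforge1004 keep flapping between OK and degraded throughout the day, which usually points at a oneshot or timer unit that fails on every run. A minimal triage sketch on the affected host, assuming shell access; the failing unit's name is not visible in this log, so <unit> is a placeholder:

    # List the failed unit(s) behind the "degraded" system state.
    systemctl --failed
    # Read the recent journal for the failing unit (placeholder name).
    journalctl -u <unit> --since "1 hour ago" --no-pager
    # Once the underlying problem is fixed, clear the failed state so the
    # Icinga check recovers instead of flapping.
    sudo systemctl reset-failed <unit>
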
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210214T0800)
[09:14:10] !log joal@deploy1001 Started deploy [analytics/refinery@dd5f947]: Hotfix analytics deployment [analytics/refinery@dd5f947]
[09:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:00] !log joal@deploy1001 Finished deploy [analytics/refinery@dd5f947]: Hotfix analytics deployment [analytics/refinery@dd5f947] (duration: 12m 52s)
[09:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:22] !log joal@deploy1001 Started deploy [analytics/refinery@dd5f947] (thin): Hotfix analytics deployment - THIN [analytics/refinery@dd5f947]
[09:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:28] !log joal@deploy1001 Finished deploy [analytics/refinery@dd5f947] (thin): Hotfix analytics deployment - THIN [analytics/refinery@dd5f947] (duration: 00m 06s)
[09:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:47] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:29:29] RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:12:49] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:59] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:07:39] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:12:47] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:37] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:43] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:44] (PS1) Urbanecm: ukwikisource: Finish removal of NS Translations [mediawiki-config] - https://gerrit.wikimedia.org/r/664053 (https://phabricator.wikimedia.org/T270628)
[12:22:04] (Abandoned) MarcoAurelio: admin: Update `GenSysadminTable.py` [puppet] - https://gerrit.wikimedia.org/r/663074 (owner: MarcoAurelio)
[13:00:51] PROBLEM - Number of messages locally queued by purged for processing on cp5002 is CRITICAL: cluster=cache_upload instance=cp5002 job=purged layer=frontend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002
[13:01:23] PROBLEM - Number of messages locally queued by purged for processing on cp5006 is CRITICAL: cluster=cache_upload instance=cp5006 job=purged layer=frontend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006
[13:01:35] PROBLEM - Number of messages locally queued by purged for processing on cp5005 is CRITICAL: cluster=cache_upload instance=cp5005 job=purged layer=frontend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005
[13:02:33] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:03:05] RECOVERY - Number of messages locally queued by purged for processing on cp5006 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006
[13:03:58] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[13:04:44] PROBLEM - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1961 bytes in 3.919 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:05:43] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:06:19] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:07:03] PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5004.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5001.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:07:27] * akosiaris around
[13:07:29] <_joe_> o/
[13:07:37] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:07:43] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb_443: Servers cp5005.eqsin.wmnet, cp5006.eqsin.wmnet, cp5002.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
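
PyBal's "marked down but pooled" wording is its depool-threshold safety net: when too many backends in a pool fail their health checks at once, PyBal keeps them pooled rather than empty the service entirely. The checks themselves are plain HTTP probes of the varnish frontend; a minimal manual version of the port-3120 probe, assuming that port answers HTTP directly on the cache host:

    # Probe a cache frontend the way the Icinga check does: a GET with a
    # 10-second timeout. A healthy frontend answers well under a second;
    # the hosts above time out entirely.
    curl -s -o /dev/null -m 10 \
        -w 'HTTP %{http_code} in %{time_total}s\n' \
        http://cp5006.eqsin.wmnet:3120/
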
[13:07:45] <_joe_> akosiaris: I bet we need to rolling restart eqsin upload varnishes
[13:07:51] I can ack the alarms
[13:07:55] <_joe_> I'll start with 5001
[13:07:58] same thing as last weekend?
[13:08:08] <_joe_> yes
[13:08:10] RECOVERY - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 1038 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:08:18] PROBLEM - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv4 #page on upload-lb.eqsin.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1961 bytes in 4.226 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:08:23] <_joe_> I can't ssh to cp5001
[13:08:24] Acked
[13:08:36] I did, I'll restart varnish
[13:09:03] should I jump in? I am near a laptop but not on one
[13:09:05] <_joe_> yeah I can't ssh into any eqsin server, moving to my desktop in a minute
[13:09:07] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:09:09] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) timed out before a response was received: /private-info/info.json (private tile service info for osm-intl) is CRITICAL: Test private tile service info for osm-intl returned the unexpected status 502 (expecting: 400): /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) is CRITICAL: Test scaled pushpin marker with an icon returned the unexpected status 502 (expecting: 200): /v4/marker/pin-m+ffffff.png (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 502 (expecting: 200): /_info (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 502 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:09:15] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:09:22] !log restart varnish-fe on cp5001
[13:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:31] RECOVERY - Number of messages locally queued by purged for processing on cp5002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002
[13:10:00] RECOVERY - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv4 #page on upload-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 1025 bytes in 3.889 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:10:01] I'm on mobile, I can be at my laptop in ~20m if needed
[13:10:13] <_joe_> akosiaris: I'll do 5004
[13:10:21] do you need me? I'm very groggy but can jump in if necessary
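
The mitigation, as logged at 13:09:22 and 13:10:44, is a per-host restart of the frontend. A sketch of that step, assuming the varnish-frontend.service unit name taken from the cumin command later in the log; the batched form of the same step appears at 13:13:52 below:

    # On one affected cache host: restart the varnish frontend, then
    # confirm the unit is healthy before touching the next host.
    sudo systemctl restart varnish-frontend.service
    systemctl is-active varnish-frontend.service
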
[13:10:28] effie: rzl: volans: No need for now, I am restarting varnish-fes already
[13:10:38] okay good, thanks Alex
[13:10:42] thank you <3
[13:10:44] <_joe_> !log restarted varnish-fe on cp5004
[13:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:50] going back to sleep but call if you need more hands
[13:10:52] akosiaris: ack, feel free to ping for help anytime
[13:10:52] is there a secondary action?
[13:11:07] <_joe_> effie: can you look at traffic in eqsin?
[13:11:09] something we need to do after the restarts?
[13:11:12] <_joe_> did we have a traffic surge?
[13:11:20] we need to know why this keeps happening
[13:11:29] <_joe_> yes
[13:11:32] ok, I am bringing my laptop to have a look at turnilo
[13:11:48] Do we need an incident doc for this?
[13:12:13] <_joe_> akosiaris: are you restarting other servers?
[13:12:15] RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:12:20] _joe_: yes
[13:12:31] I got 2 3 5 6 already
[13:12:31] <_joe_> please log it :)
[13:12:39] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5004 is OK: HTTP OK: HTTP/1.1 200 OK - 411 bytes in 0.439 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:12:42] <_joe_> that's all of the upload ones
[13:12:47] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5001 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.439 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:12:53] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:13:10] <_joe_> we usually wait a few minutes between restarts
[13:13:22] sobanski: there should be one from the last few times
[13:13:23] yup, -b 1 -s 120
[13:13:33] <_joe_> oh ok it's still running
[13:13:52] !log sudo cumin -b 1 -s 120 'cp500[2,3,5,6].eqsin.wmnet' 'systemctl restart varnish-frontend.service'
[13:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:18] ok, let's look at traffic now
[13:14:22] https://docs.google.com/document/d/1n4-a9RWohbfT7yu0Ti5-ObvsC4IlCKrcYZr2qw-Jrkw/edit
[13:14:29] <_joe_> the usual profile https://grafana.wikimedia.org/d/000000304/prometheus-varnish-dc-stats?viewPanel=18&orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cluster=cache_upload&var-layer=frontend
[13:14:36] <_joe_> traffic is recovering
[13:15:27] * jbond42 seems things are coming back but is around if needed
[13:16:21] <_joe_> I still see 5003, 5005 and 5006 not healthy
[13:16:41] yeah it's at 25%
[13:17:07] <_joe_> 5003 restarting just now
[13:17:07] RECOVERY - Number of messages locally queued by purged for processing on cp5005 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005
[13:17:17] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5002 is OK: HTTP OK: HTTP/1.1 200 OK - 412 bytes in 0.439 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:17:35] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:17:41] cdanis: I'm not sure klaxon works at showing actual pages. I get a big red Internal Server Error on https://klaxon.wikimedia.org/ under Recent Alerts. (I was being nosey to see how it looked in a real alert.)
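
The rolling restart logged at 13:13:52 is the scripted version of the per-host step, run through cumin. Annotated for reference; restarting one host at a time with a two-minute pause keeps most of the upload pool serving while each frontend comes back and repools:

    # The rolling restart logged at 13:13:52, annotated:
    #   -b 1    batch size 1, i.e. restart a single host at a time
    #   -s 120  sleep 120 seconds between batches so each frontend can
    #           come back up and repool before the next one goes down
    sudo cumin -b 1 -s 120 'cp500[2,3,5,6].eqsin.wmnet' \
        'systemctl restart varnish-frontend.service'
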
[13:17:49] <_joe_> traffic is back to normal
[13:18:49] <_joe_> conjecturing a bit (sorry): I think the traffic surge hits one server, sends it belly up, then moves to the other ones progressively
[13:19:01] <_joe_> and in the span of 2-3 minutes, they're all in that state
[13:19:27] I will take a look after some coffee, Rhinos
[13:19:56] it has worked in the past; not sure what’s up
[13:20:00] Cool cdanis
[13:20:23] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5003 is OK: HTTP OK: HTTP/1.1 200 OK - 411 bytes in 0.576 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:22:17] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5005 is OK: HTTP OK: HTTP/1.1 200 OK - 411 bytes in 0.595 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:22:33] 100% done
[13:23:44] <_joe_> yup, all traffic recovered, no more errors in lvs
[13:23:49] <_joe_> we can go back to our sunday
[13:24:41] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 412 bytes in 0.439 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:24:59] +1
[13:25:01] Who's responsible for grafana?
[13:26:03] observability team
[13:26:06] o11y
[13:26:20] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[13:26:23] RhinosF1|NotHere: tag the task with "observability" in phab and they'll pick it up. Unless it's urgent, in which case, how can we help?
[13:26:39] akosiaris: can I drop you a quick PM?
[13:26:45] sure
[13:42:55] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:05] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:47] (PS1) CDanis: Fix cases of incidents missing the 'service' field [software/klaxon] - https://gerrit.wikimedia.org/r/664056 (https://phabricator.wikimedia.org/T274734)
[13:55:31] (CR) CDanis: [C: +2] Fix cases of incidents missing the 'service' field [software/klaxon] - https://gerrit.wikimedia.org/r/664056 (https://phabricator.wikimedia.org/T274734) (owner: CDanis)
[13:56:29] (Merged) jenkins-bot: Fix cases of incidents missing the 'service' field [software/klaxon] - https://gerrit.wikimedia.org/r/664056 (https://phabricator.wikimedia.org/T274734) (owner: CDanis)
[13:58:30] RhinosF1|NotHere: klaxon fixed
[14:01:56] cdanis: fantastic.
[14:02:10] Have a good whatever is left of Sunday now
[15:33:59] akosiaris: I sent you an update
[15:44:44] Can someone please look into T274736
[15:44:55] cdanis: ^
[15:55:56] noted
[15:57:52] apergos: thanks. Please pm if you have questions or ask on task.
[15:58:06] 👍
[16:13:47] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:57] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:23:31] SRE, Graphoid, Projects-Cleanup, serviceops, and 3 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jdforrester-WMF) >>! In T242855#6826152, @hashar wrote: > Tagging #cleanup for the repositories archival. > > I guess we can empty up `mediawiki/service/graphoid.git` with a...
[19:23:46] (PS21) Kosta Harlan: linkrecommendation: Cron job to load datasets [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893)
[19:23:56] (CR) Kosta Harlan: linkrecommendation: Cron job to load datasets (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan)
[19:23:58] SRE, Graphoid, Projects-Cleanup, serviceops, and 3 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jdforrester-WMF)
[19:24:14] SRE, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jdforrester-WMF)
[19:38:41] SRE, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jdforrester-WMF)
[19:44:22] SRE, Graphoid, Services (watching): Graphoid returns a 400 on MW API time-out - https://phabricator.wikimedia.org/T134237 (Jdforrester-WMF) Open→Declined The Graphoid service has been undeployed and the repo is being archived, per {T274738}.
[20:13:39] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:18:49] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:37:53] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:43:01] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:23:43] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 263 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:25:27] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 19 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:43:17] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:48:27] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:28:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[23:33:23] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
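
These late-night alerts, like the 22:23 exceptions alert above, are threshold checks against Prometheus data ("263 gt 100" at 22:23 means the measured per-minute rate crossed the critical threshold of 100, with 50 as the warning level). A sketch of spot-checking such a metric by hand via the Prometheus HTTP query API; the endpoint host and the metric name are illustrative assumptions, not the real rule behind these alerts:

    # Hypothetical spot-check: per-minute rate of MediaWiki exceptions
    # and fatals. Host and metric name are placeholders.
    curl -sG 'http://prometheus.example.wmnet/api/v1/query' \
        --data-urlencode 'query=sum(rate(mediawiki_exceptions_total[5m])) * 60'
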