[01:13:13] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:18:19] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:10] SRE, Commons, Traffic, Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (1233thehongkonger) Quick fly-by, but this image corresponds to an internet meme in Hong Kong, though seems unlikely it aff...
[03:54:27] SRE, Commons, Traffic, Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (Krinkle) The UA in question used an empty user agent string. See also [User-Agent policy](https://meta.wikimedia.org/wiki/...
[04:07:39] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:12:25] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:47] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:42:31] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:55] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:05] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:42:36] SRE, Traffic, Patch-For-Review, Services (watching), Sustainability (MediaWiki-MultiDC): Create HTTP verb and sticky cookie DC routing in VCL - https://phabricator.wikimedia.org/T91820 (DannyS712)
[07:43:23] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:48:29] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
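
relforge1003 and relforge1004 keep flapping between OK and degraded throughout the day, which usually points at a oneshot or timer unit that fails on every run. A minimal triage sketch on the affected host, assuming shell access; the failing unit's name is not visible in this log, so <unit> is a placeholder:

    # List the failed unit(s) behind the "degraded" system state.
    systemctl --failed
    # Read the recent journal for the failing unit (placeholder name).
    journalctl -u <unit> --since "1 hour ago" --no-pager
    # Once the underlying problem is fixed, clear the failed state so the
    # Icinga check recovers instead of flapping.
    sudo systemctl reset-failed <unit>
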
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210214T0800)
[09:14:10] !log joal@deploy1001 Started deploy [analytics/refinery@dd5f947]: Hotfix analytics deployment [analytics/refinery@dd5f947]
[09:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:00] !log joal@deploy1001 Finished deploy [analytics/refinery@dd5f947]: Hotfix analytics deployment [analytics/refinery@dd5f947] (duration: 12m 52s)
[09:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:22] !log joal@deploy1001 Started deploy [analytics/refinery@dd5f947] (thin): Hotfix analytics deployment - THIN [analytics/refinery@dd5f947]
[09:27:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:28] !log joal@deploy1001 Finished deploy [analytics/refinery@dd5f947] (thin): Hotfix analytics deployment - THIN [analytics/refinery@dd5f947] (duration: 00m 06s)
[09:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:47] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:29:29] RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[10:12:49] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:59] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:07:39] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:12:47] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:37] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:48:43] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:00:44] (PS1) Urbanecm: ukwikisource: Finish removal of NS Translations [mediawiki-config] - https://gerrit.wikimedia.org/r/664053 (https://phabricator.wikimedia.org/T270628)
[12:22:04] (Abandoned) MarcoAurelio: admin: Update `GenSysadminTable.py` [puppet] - https://gerrit.wikimedia.org/r/663074 (owner: MarcoAurelio)
[13:00:51] PROBLEM - Number of messages locally queued by purged for processing on cp5002 is CRITICAL: cluster=cache_upload instance=cp5002 job=purged layer=frontend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002
[13:01:23] PROBLEM - Number of messages locally queued by purged for processing on cp5006 is CRITICAL: cluster=cache_upload instance=cp5006 job=purged layer=frontend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006
[13:01:35] PROBLEM - Number of messages locally queued by purged for processing on cp5005 is CRITICAL: cluster=cache_upload instance=cp5005 job=purged layer=frontend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005
[13:02:33] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:03:05] RECOVERY - Number of messages locally queued by purged for processing on cp5006 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5006
[13:03:58] PROBLEM - ATS TLS has reduced HTTP availability #page on alert1001 is CRITICAL: cluster=cache_upload layer=tls https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[13:04:44] PROBLEM - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1961 bytes in 3.919 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:05:43] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:06:19] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:07:03] PROBLEM - PyBal backends health check on lvs5002 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb6_443: Servers cp5004.eqsin.wmnet are marked down but pooled: uploadlb_443: Servers cp5001.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:07:27] * akosiaris around
[13:07:29] <_joe_> o/
[13:07:37] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:07:43] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - uploadlb_443: Servers cp5005.eqsin.wmnet, cp5006.eqsin.wmnet, cp5002.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
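
PyBal's "marked down but pooled" wording is its depool-threshold safety net: when too many backends in a pool fail their health checks at once, PyBal keeps them pooled rather than empty the service entirely. The checks themselves are plain HTTP probes of the varnish frontend; a minimal manual version of the port-3120 probe, assuming that port answers HTTP directly on the cache host:

    # Probe a cache frontend the way the Icinga check does: a GET with a
    # 10-second timeout. A healthy frontend answers well under a second;
    # the hosts above time out entirely.
    curl -s -o /dev/null -m 10 \
        -w 'HTTP %{http_code} in %{time_total}s\n' \
        http://cp5006.eqsin.wmnet:3120/
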
[13:07:45] <_joe_> akosiaris: I bet we need to rolling restart eqsin upload varnishes
[13:07:51] I can ack the alarms
[13:07:55] <_joe_> I'll start with 5001
[13:07:58] same thing as last weekend?
[13:08:08] <_joe_> yes
[13:08:10] RECOVERY - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv6 #page on upload-lb.eqsin.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 1038 bytes in 1.177 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:08:18] PROBLEM - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv4 #page on upload-lb.eqsin.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1961 bytes in 4.226 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:08:23] <_joe_> I can't ssh to cp5001
[13:08:24] Acked
[13:08:36] I did, I'll restart varnish
[13:09:03] should I jump in? I am near a laptop but not on one
[13:09:05] <_joe_> yeah I can't ssh into any eqsin server, moving to my desktop in a minute
[13:09:07] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:09:09] PROBLEM - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) timed out before a response was received: /private-info/info.json (private tile service info for osm-intl) is CRITICAL: Test private tile service info for osm-intl returned the unexpected status 502 (expecting: 400): /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) is CRITICAL: Test scaled pushpin marker with an icon returned the unexpected status 502 (expecting: 200): /v4/marker/pin-m+ffffff.png (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 502 (expecting: 200): /_info (Untitled test) is CRITICAL: Test Untitled test returned the unexpected status 502 (expecting: 200) https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:09:15] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp5001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:09:22] !log restart varnish-fe on cp5001
[13:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:31] RECOVERY - Number of messages locally queued by purged for processing on cp5002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5002
[13:10:00] RECOVERY - LVS upload-https eqsin port 443/tcp - Images and other media- upload.eqiad.wikimedia.org IPv4 #page on upload-lb.eqsin.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 1025 bytes in 3.889 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[13:10:01] I'm on mobile, I can be at my laptop in ~20m if needed
[13:10:13] <_joe_> akosiaris: I'll do 5004
[13:10:21] do you need me? I'm very groggy but can jump in if necessary
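
The mitigation, as logged at 13:09:22 and 13:10:44, is a per-host restart of the frontend. A sketch of that step, assuming the varnish-frontend.service unit name taken from the cumin command later in the log; the batched form of the same step appears at 13:13:52 below:

    # On one affected cache host: restart the varnish frontend, then
    # confirm the unit is healthy before touching the next host.
    sudo systemctl restart varnish-frontend.service
    systemctl is-active varnish-frontend.service
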
[13:10:28] effie: rzl: volans: No need for now, I am restarting varnish-fes already
[13:10:38] okay good, thanks Alex
[13:10:42] thank you <3
[13:10:44] <_joe_> !log restarted varnish-fe on cp5004
[13:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:50] going back to sleep but call if you need more hands
[13:10:52] akosiaris: ack, feel free to ping for help anytime
[13:10:52] is there a secondary action?
[13:11:07] <_joe_> effie: can you look at traffic in eqsin?
[13:11:09] something we need to do after the restarts?
[13:11:12] <_joe_> did we have a traffic surge?
[13:11:20] we need to know why this keeps happening
[13:11:29] <_joe_> yes
[13:11:32] ok, I am bringing my laptop to have a look at turnilo
[13:11:48] Do we need an incident doc for this?
[13:12:13] <_joe_> akosiaris: are you restarting other servers?
[13:12:15] RECOVERY - PyBal backends health check on lvs5002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:12:20] _joe_: yes
[13:12:31] I got 2 3 5 6 already
[13:12:31] <_joe_> please log it :)
[13:12:39] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5004 is OK: HTTP OK: HTTP/1.1 200 OK - 411 bytes in 0.439 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:12:42] <_joe_> that's all of the upload ones
[13:12:47] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5001 is OK: HTTP OK: HTTP/1.1 200 OK - 410 bytes in 0.439 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:12:53] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:13:10] <_joe_> we usually wait a few minutes between restarts
[13:13:22] sobanski: there should be one from the last few times
[13:13:23] yup, -b 1 -s 120
[13:13:33] <_joe_> oh ok it's still running
[13:13:52] !log sudo cumin -b 1 -s 120 'cp500[2,3,5,6].eqsin.wmnet' 'systemctl restart varnish-frontend.service'
[13:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:18] ok, let's look at traffic now
[13:14:22] https://docs.google.com/document/d/1n4-a9RWohbfT7yu0Ti5-ObvsC4IlCKrcYZr2qw-Jrkw/edit
[13:14:29] <_joe_> the usual profile https://grafana.wikimedia.org/d/000000304/prometheus-varnish-dc-stats?viewPanel=18&orgId=1&var-datasource=eqsin%20prometheus%2Fops&var-cluster=cache_upload&var-layer=frontend
[13:14:36] <_joe_> traffic is recovering
[13:15:27] * jbond42 seems things are coming back but is around if needed
[13:16:21] <_joe_> I still see 5003, 5005 and 5006 not healthy
[13:16:41] yeah it's at 25%
[13:17:07] <_joe_> 5003 restarting just now
[13:17:07] RECOVERY - Number of messages locally queued by purged for processing on cp5005 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5005
[13:17:17] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5002 is OK: HTTP OK: HTTP/1.1 200 OK - 412 bytes in 0.439 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:17:35] RECOVERY - Maps edge eqsin on upload-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook
[13:17:41] cdanis: I'm not sure klaxon works at showing actual pages. I get a big red Internal Server Error on https://klaxon.wikimedia.org/ under Recent Alerts. (I was being nosey to see how it looked in a real alert.)
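
The rolling restart logged at 13:13:52 is the scripted version of the per-host step, run through cumin. Annotated for reference; restarting one host at a time with a two-minute pause keeps most of the upload pool serving while each frontend comes back and repools:

    # The rolling restart logged at 13:13:52, annotated:
    #   -b 1    batch size 1, i.e. restart a single host at a time
    #   -s 120  sleep 120 seconds between batches so each frontend can
    #           come back up and repool before the next one goes down
    sudo cumin -b 1 -s 120 'cp500[2,3,5,6].eqsin.wmnet' \
        'systemctl restart varnish-frontend.service'
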
[13:17:49] <_joe_> traffic is back to normal
[13:18:49] <_joe_> conjecturing a bit (sorry): I think the traffic surge hits one server, sends it belly up, then moves to the other ones progressively
[13:19:01] <_joe_> and in the span of 2-3 minutes, they're all in that state
[13:19:27] I will take a look after some coffee, Rhinos
[13:19:56] it has worked in the past; not sure what’s up
[13:20:00] Cool cdanis
[13:20:23] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5003 is OK: HTTP OK: HTTP/1.1 200 OK - 411 bytes in 0.576 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:22:17] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5005 is OK: HTTP OK: HTTP/1.1 200 OK - 411 bytes in 0.595 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:22:33] 100% done
[13:23:44] <_joe_> yup, all traffic recovered, no more errors in lvs
[13:23:49] <_joe_> we can go back to our sunday
[13:24:41] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp5006 is OK: HTTP OK: HTTP/1.1 200 OK - 412 bytes in 0.439 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:24:59] +1
[13:25:01] Who's responsible for grafana?
[13:26:03] observability team
[13:26:06] o11y
[13:26:20] RECOVERY - ATS TLS has reduced HTTP availability #page on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
[13:26:23] RhinosF1|NotHere: tag the task with "observability" in phab and they'll pick it up. Unless it's urgent, in which case, how can we help?
[13:26:39] akosiaris: can I drop you a quick PM?
[13:26:45] sure
[13:42:55] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:05] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:53:47] (PS1) CDanis: Fix cases of incidents missing the 'service' field [software/klaxon] - https://gerrit.wikimedia.org/r/664056 (https://phabricator.wikimedia.org/T274734)
[13:55:31] (CR) CDanis: [C: +2] Fix cases of incidents missing the 'service' field [software/klaxon] - https://gerrit.wikimedia.org/r/664056 (https://phabricator.wikimedia.org/T274734) (owner: CDanis)
[13:56:29] (Merged) jenkins-bot: Fix cases of incidents missing the 'service' field [software/klaxon] - https://gerrit.wikimedia.org/r/664056 (https://phabricator.wikimedia.org/T274734) (owner: CDanis)
[13:58:30] RhinosF1|NotHere: klaxon fixed
[14:01:56] cdanis: fantastic.
[14:02:10] Have a good whatever is left of Sunday now
[15:33:59] akosiaris: I sent you an update
[15:44:44] Can someone please look into T274736
[15:44:55] cdanis: ^
[15:55:56] noted
[15:57:52] apergos: thanks. Please pm if you have questions or ask on task.
[15:58:06] 👍
[16:13:47] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:57] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:23:31] SRE, Graphoid, Projects-Cleanup, serviceops, and 3 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jdforrester-WMF) >>! In T242855#6826152, @hashar wrote: > Tagging #cleanup for the repositories archival. > > I guess we can empty up `mediawiki/service/graphoid.git` with a...
[19:23:46] (PS21) Kosta Harlan: linkrecommendation: Cron job to load datasets [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893)
[19:23:56] (CR) Kosta Harlan: linkrecommendation: Cron job to load datasets (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan)
[19:23:58] SRE, Graphoid, Projects-Cleanup, serviceops, and 3 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jdforrester-WMF)
[19:24:14] SRE, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jdforrester-WMF)
[19:38:41] SRE, Graphoid, serviceops, MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jdforrester-WMF)
[19:44:22] SRE, Graphoid, Services (watching): Graphoid returns a 400 on MW API time-out - https://phabricator.wikimedia.org/T134237 (Jdforrester-WMF) Open→Declined The Graphoid service has been undeployed and the repo is being archived, per {T274738}.
[20:13:39] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:18:49] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:37:53] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:43:01] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:23:43] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 263 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:25:27] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 19 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:43:17] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:48:27] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:28:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[23:33:23] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
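
These late-night alerts, like the 22:23 exceptions alert above, are threshold checks against Prometheus data ("263 gt 100" at 22:23 means the measured per-minute rate crossed the critical threshold of 100, with 50 as the warning level). A sketch of spot-checking such a metric by hand via the Prometheus HTTP query API; the endpoint host and the metric name are illustrative assumptions, not the real rule behind these alerts:

    # Hypothetical spot-check: per-minute rate of MediaWiki exceptions
    # and fatals. Host and metric name are placeholders.
    curl -sG 'http://prometheus.example.wmnet/api/v1/query' \
        --data-urlencode 'query=sum(rate(mediawiki_exceptions_total[5m])) * 60'
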