[00:03:14] (PS5) Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301
[00:03:46] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/628460 (owner: Dzahn)
[00:03:59] (CR) jerkins-bot: [V: -1] mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 (owner: Dzahn)
[00:04:31] (CR) Dzahn: [C: +2] add testvm5001 to test install5001 [dns] - https://gerrit.wikimedia.org/r/630313 (https://phabricator.wikimedia.org/T252526) (owner: Dzahn)
[00:04:37] (PS2) Dzahn: add testvm5001 to test install5001 [dns] - https://gerrit.wikimedia.org/r/630313 (https://phabricator.wikimedia.org/T252526)
[00:07:02] (Abandoned) Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 (owner: Dzahn)
[00:09:02] (PS2) Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - https://gerrit.wikimedia.org/r/630317
[00:10:03] (CR) jerkins-bot: [V: -1] mariadb::core_test: convert role to profile, add data types [puppet] - https://gerrit.wikimedia.org/r/630317 (owner: Dzahn)
[00:18:19] (PS2) Dzahn: cache::ssl::unified: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/628460
[00:20:53] (PS3) Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - https://gerrit.wikimedia.org/r/630317
[00:21:52] (CR) jerkins-bot: [V: -1] mariadb::core_test: convert role to profile, add data types [puppet] - https://gerrit.wikimedia.org/r/630317 (owner: Dzahn)
[00:33:29] (PS4) Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - https://gerrit.wikimedia.org/r/630317
[00:40:40] (PS2) Dzahn: start DHCP service on install5001, stop it on bast5001 [puppet] - https://gerrit.wikimedia.org/r/629849 (https://phabricator.wikimedia.org/T252526)
[00:43:22] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[00:43:22] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[00:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:41] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[00:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:53] (PS1) Dzahn: Revert "add testvm5001 to test install5001" [dns] - https://gerrit.wikimedia.org/r/630232
[00:46:30] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[00:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:26] (CR) Dzahn: [C: +2] Revert "add testvm5001 to test install5001" [dns] - https://gerrit.wikimedia.org/r/630232 (owner: Dzahn)
[00:50:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
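
[aside] For a "Check systemd state ... degraded" alert like the one above, the usual first step is to see which unit actually failed. A minimal sketch, assuming shell access to the host named in the alert (an-launcher1002 here); the unit name below is a placeholder:

    # list the failed unit(s) behind the "degraded" system state
    sudo systemctl list-units --state=failed
    # inspect that unit's recent logs (substitute the unit name printed above)
    sudo journalctl -u <failed-unit> --since '1 hour ago'
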
[00:50:26] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[00:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:19] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[00:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:51] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[00:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:23] (PS4) HitomiAkane: Creation of patroller group on arz.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218)
[01:06:16] So, ORES is throwing a bunch of errors. It isn't down but it isn't happy. I've reached out to Aaron about possible causes. Course, if anyone here has ideas I am all ears
[01:06:40] https://logstash.wikimedia.org/app/kibana#/dashboard/ORES?_g=h@44136fa&_a=h@0002e3b
[01:11:53] !log ✔️ cdanis@ores2001.codfw.wmnet ~ 🕘🍺 sudo systemctl restart celery-ores-worker.service
[01:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:24] chrisalbon: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&orgId=1&refresh=1m&from=now-12h&to=now-1m makes me suspect something getting deadlocked in celery
[01:13:35] Yeah agreed. Based on my understanding of ORES, if we restart celery that should resolve it
[01:13:59] does ORES take a while to start up (to read models or something)?
[01:14:11] (I'm worried about making things worse if I restart too many at once)
[01:14:36] A few minutes, nothing crazy
[01:15:24] do you know if it has a readiness probe defined, or a simple query close to such?
[01:15:35] I don't
[01:17:19] !log ❌ cdanis@ores2001.codfw.wmnet ~ 🕤🍺 sudo systemctl restart uwsgi-ores.service
[01:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:19] rescheduled the icinga check for ores.wikimedia.org
[01:19:28] turned green
[01:20:37] sure, I think the underlying service is still unhealthy
[01:21:16] I'm going to give it a few more minutes and then restart things across the whole cluster; the 'busy web workers' graph correlates quite well with the slowdown
[01:22:33] Yeah, based upon what janruben said in #wikimedia-ai it has been getting worse over the last few hours, which correlates well to the busy web worker surge
[01:23:49] ✔️
[01:25:14] I'm not convinced that my restarts are helping.
[01:25:18] can confirm. there are a couple "systemctl start celery-ores-worker" etc in SAL from the past
[01:25:58] all I see from ores2001 is that it still seems slow to answer queries, and the web worker count is also growing
[01:28:10] yeah I see that too https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&orgId=1&refresh=1m&from=now-12h&to=now-1m
[01:28:12] glanced at two older incident reports about ORES and celery-workers but they don't seem to apply
[01:29:48] ores is active/passive?
[01:30:23] active/passive?
[01:30:31] conftool says ores.discovery.wmnet is depooled in eqiad, but, I still see plenty of traffic going to eqiad in grafana
[01:31:02] all the busy web workers are on codfw hosts though
[01:31:04] chrisalbon: we say active/active when a service is serving from all DCs simultaneously
[01:31:10] ah got it
[01:31:24] and active/passive when there's only supposed to be one active DC at a time (e.g. mediawiki)
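
[aside] A sketch of the conftool check behind the discussion above — reading the dnsdisc pool state for ores in each DC before changing anything. The exact confctl invocation is an assumption inferred from the "conftool action" SAL entries later in this log; verify it against the wikitech conftool documentation before relying on it:

    # on a cluster-management (cumin) host: show whether ores is pooled per DC
    sudo confctl select 'dnsdisc=ores' get
    # the corresponding change, mirroring the SAL entries below (not to be run casually)
    # sudo confctl select 'dnsdisc=ores,name=eqiad' set/pooled=true
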
[01:32:29] so, I think something deeper is wrong here. I don't think there were any deploys recently, so I'm wondering if something has changed about the traffic they're receiving
[01:32:54] yeah there haven't been any deploys for a month
[01:32:58] I don't even know what the ORES API looks like though; is it the kind of thing where it supports queries that vary in cost by orders of magnitude?
[01:34:08] I don't think it should
[01:37:58] You can check out the API endpoint here https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging&revids=964118401&format=json
[01:39:26] the monitoring for ORES Redis in codfw has a bunch of data points missing
[01:39:51] found the "ORES advanced metrics" dashboard and "overload errors". but it is 0 for everything
[01:40:23] also 5xx rate and "unconventional responses"
[01:42:00] I actually suspect that some of the Redises crashed and didn't quite come back up right
[01:42:49] but I'm a bit leery to percussively stop everything
[01:43:31] oh. max number of clients reached... and I bet the prometheus exporter tries to send the INFO command, which doesn't work
[01:43:32] this is for ORES redis: https://grafana.wikimedia.org/d/RLhtAw6mz/ores-redis?orgId=1&refresh=1m
[01:43:42] right
[01:44:01] chrisalbon: do you know if ORES can/should be serving from both codfw and eqiad at the same time?
[01:45:21] https://phabricator.wikimedia.org/T159615
[01:45:25] I'm going to percussively restart ORES in codfw.
[01:45:27] [spec] Active-active setup for ORES across datacenters (eqiad, codfw)
[01:45:28] Yeah it should
[01:46:15] "The prefacing rule is now updating ORES in both datacenters."
[01:48:20] !log cdanis@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=ores,name=eqiad
[01:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:40] !log cdanis@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=ores,name=codfw
[01:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:00] ok, first let's try moving the traffic
[01:49:24] hopefully this doesn't add too much latency for MW, although I thought it was changeprop or the jobqueue that makes the calls to ORES
[01:50:19] btw, looking at the past 7 days of latency, this *almost* happened yesterday, but didn't quite get as bad
[01:50:33] yeah I just noticed that
[01:50:50] do you know if it's easy to grab stack traces from running celery workers?
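
[aside] One way to answer the stack-trace question above, assuming py-spy is installed on the host (nothing in this log says it was): it can dump the Python stacks of running celery workers without restarting them.

    # list candidate worker processes (adjust the pattern to match how celery shows up in ps)
    pgrep -af celery
    # dump the current Python stack of one worker, non-destructively
    sudo py-spy dump --pid <worker-pid>
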
[01:51:15] from a previous incident report: "disables changeprop so that it doesn't impact ores"
[01:51:17] I don't sorry
[01:51:40] what I'm suspecting is happening is that there's no timeouts for the redis requests, or a timeout so long that the query gets killed by the client or at the traffic layer before it has a chance to get through the queueing in redis
[01:51:53] and so then once you get into this state you never have any 'good' throughput
[01:54:00] okay, eqiad seems to be happily handling this traffic
[01:54:31] it's been long enough for TTLs for records pointing to codfw to clear, going to percussively restart things there
[01:55:01] sounds good, yeah looks good, API is up and serving fine
[01:55:03] the redises there are *still* hung on too many clients, btw, and it shouldn't've been getting new queries for at least 3 minutes now
[01:55:18] so once we get into this state, it seems we don't work our way out of it
[01:55:26] Good catch
[01:56:05] !log ❌ cdanis@cumin2001.codfw.wmnet ~ 🕙🍺 sudo cumin 'A:ores and A:codfw' 'systemctl restart celery-ores-worker.service uwsgi-ores.service'
[01:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:59:29] (PS1) Dzahn: add testvm5001.eqsin.wmnet [dns] - https://gerrit.wikimedia.org/r/630319
[01:59:51] (CR) jerkins-bot: [V: -1] add testvm5001.eqsin.wmnet [dns] - https://gerrit.wikimedia.org/r/630319 (owner: Dzahn)
[02:00:02] the number of steady-state connections from the ORES hosts in codfw (which were just restarted and haven't gotten traffic yet) is about 1/3rd to 1/2 of what seems to be the ceiling of allowed # of redis connections, btw
[02:00:17] so that's a second thing that should probably be fixed, not a comfortable amount of headroom
[02:00:54] https://phabricator.wikimedia.org/P12803
[02:01:36] okay, I think ores@codfw is healthier now
[02:02:40] okay thanks, I'll make tickets for those two fixes so we don't return to this situation on a future friday night
[02:02:49] I'm actually going to restore the traffic flow as it was before, depooling eqiad and repooling codfw, because I'm not 100% sure it is meant to be active/active, and that's not the kind of change I want to make on a Friday night :)
[02:03:15] cool
[02:03:36] thanks! can you file a third to verify that ores dnsdisc should have both DCs pooled?
[02:03:47] yep
[02:03:55] (PS2) Dzahn: add testvm5001.eqsin.wmnet [dns] - https://gerrit.wikimedia.org/r/630319
[02:04:03] !log cdanis@cumin2001 conftool action : set/pooled=true; selector: dnsdisc=ores,name=codfw
[02:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:04:25] !log cdanis@cumin2001 conftool action : set/pooled=false; selector: dnsdisc=ores,name=eqiad
[02:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:59] (CR) Dzahn: [C: +2] add testvm5001.eqsin.wmnet [dns] - https://gerrit.wikimedia.org/r/630319 (owner: Dzahn)
[02:13:09] things look pretty good now. I'm going back to my evening :)
[02:13:53] yep they look good, thanks cdanis! I owe you a beer next time we are in person
[02:14:07] 🍻 to 2023!
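
[aside] A minimal sketch of the headroom check implied by the "max number of clients reached" error and the steady-state-connections observation above; host and port are placeholders, since the ORES redis instances are not named in this log:

    # current client connections vs. the configured ceiling
    redis-cli -h <ores-redis-host> -p <port> INFO clients | grep connected_clients
    redis-cli -h <ores-redis-host> -p <port> CONFIG GET maxclients
    # note: once the ceiling is hit, even these commands can be refused, which is
    # consistent with the missing prometheus exporter data points discussed above
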
[02:17:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[02:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:24] (PS1) Dzahn: DHCP: add testvm5001 MAC address [puppet] - https://gerrit.wikimedia.org/r/630320 (https://phabricator.wikimedia.org/T252526)
[05:06:05] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[05:06:05] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[05:13:11] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:47] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:33] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms
[05:17:33] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.57 ms
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200926T0700)
[08:34:36] Operations, Machine Learning Platform, ORES, serviceops: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (jijiki)
[10:35:52] Operations, Editing-team, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Aklapper) @Esanders: Is the #Editing-Team still going to investigate this problem, despite lack of expert...
[12:51:56] Operations, Machine Learning Platform, ORES, serviceops: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (CDanis) "busy web workers" graph, which correlates quite well with the slowdown: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&orgId=1&f...
[16:01:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:03:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:39:25] (PS3) ArielGlenn: new util to display info about revisions for one or more pages from XML input [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/630267 (https://phabricator.wikimedia.org/T263319)
[16:39:37] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 128 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:42:49] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 12 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:43:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:45:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:51:14] cdanis looks like there is a build up again. still small now though
[17:52:03] https://usercontent.irccloud-cdn.com/file/3Ll5oDrI/Screenshot%20from%202020-09-26%2010-51-29.png
[17:54:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:58:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:06:38] Operations, Machine Learning Platform, ORES, serviceops: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (calbon) every time it starts at 16:00
[18:10:34] Operations, Machine Learning Platform, ORES, serviceops: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (calbon) {F32364643}
[18:44:01] Any SRE around?
[18:46:47] I'm not confident I want to do what I am thinking of doing without some SRE to back me up
[18:54:16] When an SRE gets on: ORES is filling up with busy workers like it did yesterday. Here is cdanis's solution that works (at least for 24 hours): https://phabricator.wikimedia.org/T263910
[18:54:16] I don't have permissions to move traffic to eqiad and I don't have cumin, but I _think_ I could stop ORES (sudo systemctl stop uwsgi-ores.service && sudo systemctl stop celery-ores-worker.service) on each ORES codfw box, wait ~10 minutes for redis connections to time out, then restart everything (sudo systemctl start uwsgi-ores.service && sudo systemctl start celery-ores-worker.service)
[18:54:17] But honestly that is a guess and I hesitate to do that
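
[aside] The per-host sequence proposed above, written out as a sketch. It uses only the commands chrisalbon lists; the ~10 minute wait was his guess and, per the reply that follows, stopping the services should already release the redis connections:

    # on each ores2xxx host in codfw, one host at a time
    sudo systemctl stop uwsgi-ores.service celery-ores-worker.service
    # optionally wait for connections to drain, then bring the services back
    sudo systemctl start celery-ores-worker.service uwsgi-ores.service
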
[19:15:42] I think, given that I'm bouncing kids on my lap right now, I'll continue to monitor it and if it really gets back I'll consider doing my nuclear option (described above)
[19:16:02] s/back/bad
[19:18:09] chrisalbon: you don't have to wait 10 minutes -- stopping the services should close the sockets, which IIRC was sufficient last night
[19:18:13] chrisalbon: hey Amir here
[19:18:27] I have done this before, you just need to restart uwsgi
[19:18:40] sudo service uwsgi-ores restart
[19:19:18] even there is something that can do it rollingly (I just made up that word) on the nine nodes of codfw
[19:19:33] okay cool let me try that now
[19:20:03] I assume, we should check the config of ores, I have a feeling something is still trying to connect to eqiad instead of rolling over to codfw (like redis connections)
[19:20:13] but that can wait until Monday
[19:20:15] !log sudo service uwsgi-ores restart
[19:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:23] did that work? I saw cdanis do it yesterday
[19:20:27] Amir1: given the timing of everything I think it's more about external traffic patterns
[19:20:37] yeah I think someone is hitting it hard
[19:20:46] yesterday I did the failover to eqiad so I could restart codfw all at once without worrying about nuking all in-flight queries
[19:20:59] it is likely; there should be some limits on req/ip, we can reduce that limit
[19:21:07] chrisalbon: also to set expectations -- AFAIK we don't have any SLO around ORES, and even for services where we have established an SLO, SRE can't be in the habit of routinely restarting things by hand; it's just not sustainable
[19:21:14] (The poolcounter thingy)
[19:21:54] cdanis: definitely, the ores code needs lots of love
[19:22:07] (or rewrite, depends on the perspective)
[19:22:40] yeah, we just hired our own SRE (Tobias) and I'm going to hire a second in Q3, but right now all you have is me
[19:22:41] fear my code
[19:23:03] :)
[19:23:09] chrisalbon: whatever you're going to write, I guarantee, I have seen worse :D
[19:23:27] I can help as much as possible, let me know (onboarding, etc.)
[19:23:33] btw chrisalbon I'm happy to discuss designing for reliability and production best practices and such -- engaging early on stuff like that is how SRE is "supposed" to work
[19:24:02] chrisalbon: for the problem at hand, do we know if it's just a set of IPs or if it's distributed?
[19:24:22] We can get the numbers out of hadoop (or turnilo if it can be found there)
[19:24:49] Awesome, okay I `sudo service uwsgi-ores restart` on all the ORES 200X boxes
[19:24:52] let's see what that does
[19:25:22] cdanis My plan is to make a replacement for ORES instead of keeping the beast limping along, so I'd love that conversation.
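
[aside] A sketch of the "do it rollingly" idea from above, using cumin's batching options from a cumin host. The batch size and sleep values are illustrative and the flags are assumed from cumin's documented CLI; cdanis's one-shot version of this command appears earlier in the log:

    # restart uwsgi on the codfw ores hosts one at a time, pausing between hosts
    sudo cumin -b 1 -s 60 'A:ores and A:codfw' 'systemctl restart uwsgi-ores.service'
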
[19:26:07] if it's a set of ips, we can reduce the poolcounter limit https://github.com/wikimedia/ores/blob/master/config/00-main.yaml#L64
[19:26:27] (and the next line)
[19:26:38] you can simply override it here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/ores/deploy/+/refs/heads/master/config/00-main.yaml
[19:26:41] Amir1: I don't think we know yet
[19:27:16] judging by the 1600 start time on all 3 days, my guess would be that it's some characteristic of external traffic, but, whatever is happening is not obvious from turnilo (there's not one IP address or user-agent or AS number, etc)
[19:27:45] https://usercontent.irccloud-cdn.com/file/zzVHoHe3/Screenshot%20from%202020-09-26%2012-27-32.png
[19:28:11] 👉 🦋
[19:28:19] lol
[19:28:42] cdanis yeah that was my thought too, especially since we haven't touched ORES in at least a month
[19:28:55] so one thing is, ores itself should restart its workers after a set of requests, since it has a big memleak baked in (don't get me started)
[19:29:50] we can reduce that number
[19:30:09] oh interesting
[19:30:21] cdanis: can I see the turnilo?
[19:30:27] I want to mess with it a bit
[19:30:42] alright, thanks all. I'll keep monitoring. Thanks for all your knowledge on this.
[19:31:06] yeah, we just need to trade off how much effort we spend bailing water and how much effort we spend building a new boat
[19:31:39] as I said before, I'm more than happy bailing water while you're making a new boat
[19:31:40] Amir1: https://w.wiki/dsV the TTFB filter is to highlight the slow requests
[19:41:56] Operations, Machine Learning Platform, ORES, serviceops: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (calbon) It started happening again, I went into each Ores200X box and manually `sudo service uwsgi-ores restart` to restart it. {F32364681}
[22:03:05] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-me
[22:04:43] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[22:22:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:23:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets