[00:03:14] (PS5) Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301
[00:03:46] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/628460 (owner: Dzahn)
[00:03:59] (CR) jerkins-bot: [V: -1] mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 (owner: Dzahn)
[00:04:31] (CR) Dzahn: [C: +2] add testvm5001 to test install5001 [dns] - https://gerrit.wikimedia.org/r/630313 (https://phabricator.wikimedia.org/T252526) (owner: Dzahn)
[00:04:37] (PS2) Dzahn: add testvm5001 to test install5001 [dns] - https://gerrit.wikimedia.org/r/630313 (https://phabricator.wikimedia.org/T252526)
[00:07:02] (Abandoned) Dzahn: mariadb::core_test: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/630301 (owner: Dzahn)
[00:09:02] (PS2) Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - https://gerrit.wikimedia.org/r/630317
[00:10:03] (CR) jerkins-bot: [V: -1] mariadb::core_test: convert role to profile, add data types [puppet] - https://gerrit.wikimedia.org/r/630317 (owner: Dzahn)
[00:18:19] (PS2) Dzahn: cache::ssl::unified: hiera->lookup, add data types [puppet] - https://gerrit.wikimedia.org/r/628460
[00:20:53] (PS3) Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - https://gerrit.wikimedia.org/r/630317
[00:21:52] (CR) jerkins-bot: [V: -1] mariadb::core_test: convert role to profile, add data types [puppet] - https://gerrit.wikimedia.org/r/630317 (owner: Dzahn)
[00:33:29] (PS4) Dzahn: mariadb::core_test: convert role to profile, add data types [puppet] - https://gerrit.wikimedia.org/r/630317
[00:40:40] (PS2) Dzahn: start DHCP service on install5001, stop it on bast5001 [puppet] - https://gerrit.wikimedia.org/r/629849 (https://phabricator.wikimedia.org/T252526)
[00:43:22] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[00:43:22] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[00:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:43:41] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[00:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:44:53] (PS1) Dzahn: Revert "add testvm5001 to test install5001" [dns] - https://gerrit.wikimedia.org/r/630232
[00:46:30] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[00:46:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:47:26] (CR) Dzahn: [C: +2] Revert "add testvm5001 to test install5001" [dns] - https://gerrit.wikimedia.org/r/630232 (owner: Dzahn)
[00:50:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
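
[aside] For a "Check systemd state ... degraded" alert like the one above, the usual first step is to see which unit actually failed. A minimal sketch, assuming shell access to the host named in the alert (an-launcher1002 here); the unit name below is a placeholder:

    # list the failed unit(s) behind the "degraded" system state
    sudo systemctl list-units --state=failed
    # inspect that unit's recent logs (substitute the unit name printed above)
    sudo journalctl -u <failed-unit> --since '1 hour ago'
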
[00:50:26] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[00:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:19] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[00:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:51] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[00:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:23] (PS4) HitomiAkane: Creation of patroller group on arz.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218)
[01:06:16] So, ORES is throwing a bunch of errors. It isn't down but it isn't happy. I've reached out to Aaron about possible causes. Course, if anyone here has ideas I am all ears
[01:06:40] https://logstash.wikimedia.org/app/kibana#/dashboard/ORES?_g=h@44136fa&_a=h@0002e3b
[01:11:53] !log ✔️ cdanis@ores2001.codfw.wmnet ~ 🕘🍺 sudo systemctl restart celery-ores-worker.service
[01:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:12:24] chrisalbon: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&orgId=1&refresh=1m&from=now-12h&to=now-1m makes me suspect something getting deadlocked in celery
[01:13:35] Yeah agreed. Based on my understanding of ORES, if we restart celery that should resolve it
[01:13:59] does ORES take a while to start up (to read models or something)?
[01:14:11] (I'm worried about making things worse if I restart too many at once)
[01:14:36] A few minutes, nothing crazy
[01:15:24] do you know if it has a readiness probe defined, or a simple query close to such?
[01:15:35] I don't
[01:17:19] !log ❌ cdanis@ores2001.codfw.wmnet ~ 🕤🍺 sudo systemctl restart uwsgi-ores.service
[01:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:19:19] rescheduled the icinga check for ores.wikimedia.org
[01:19:28] turned green
[01:20:37] sure, I think the underlying service is still unhealthy
[01:21:16] I'm going to give it a few more minutes and then restart things across the whole cluster; the 'busy web workers' graph correlates quite well with the slowdown
[01:22:33] Yeah, based upon what janruben said in #wikimedia-ai it has been getting worse over the last few hours, which correlates well to the busy web worker surge
[01:23:49] ✔️
[01:25:14] I'm not convinced that my restarts are helping.
[01:25:18] can confirm. there are a couple "systemctl start celery-ores-worker" etc in SAL from the past
[01:25:58] all I see from ores2001 is that it still seems slow to answer queries, and the web worker count is also growing
[01:28:10] yeah I see that too https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&orgId=1&refresh=1m&from=now-12h&to=now-1m
[01:28:12] glanced at two older incident reports about ORES and celery-workers but they don't seem to apply
[01:29:48] ores is active/passive?
[01:30:23] active/passive?
[01:30:31] conftool says ores.discovery.wmnet is depooled in eqiad, but, I still see plenty of traffic going to eqiad in grafana
[01:31:02] all the busy web workers are on codfw hosts though
[01:31:04] chrisalbon: we say active/active when a service is serving from all DCs simultaneously
[01:31:10] ah got it
[01:31:24] and active/passive when there's only supposed to be one active DC at a time (e.g. mediawiki)
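
[aside] A sketch of the conftool check behind the discussion above — reading the dnsdisc pool state for ores in each DC before changing anything. The exact confctl invocation is an assumption inferred from the "conftool action" SAL entries later in this log; verify it against the wikitech conftool documentation before relying on it:

    # on a cluster-management (cumin) host: show whether ores is pooled per DC
    sudo confctl select 'dnsdisc=ores' get
    # the corresponding change, mirroring the SAL entries below (not to be run casually)
    # sudo confctl select 'dnsdisc=ores,name=eqiad' set/pooled=true
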
[01:32:29] so, I think something deeper is wrong here. I don't think there were any deploys recently, so I'm wondering if something has changed about the traffic they're receiving
[01:32:54] yeah there haven't been any deploys for a month
[01:32:58] I don't even know what the ORES API looks like though; is it the kind of thing where it supports queries that vary in cost by orders of magnitude?
[01:34:08] I don't think it should
[01:37:58] You can check out the API endpoint here https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging&revids=964118401&format=json
[01:39:26] the monitoring for ORES Redis in codfw has a bunch of data points missing
[01:39:51] found the "ORES advanced metrics" dashboard and "overload errors". but it is 0 for everything
[01:40:23] also 5xx rate and "unconventional responses"
[01:42:00] I actually suspect that some of the Redises crashed and didn't quite come back up right
[01:42:49] but I'm a bit leery to percussively stop everything
[01:43:31] oh. max number of clients reached... and I bet the prometheus exporter tries to send the INFO command, which doesn't work
[01:43:32] this is for ORES redis: https://grafana.wikimedia.org/d/RLhtAw6mz/ores-redis?orgId=1&refresh=1m
[01:43:42] right
[01:44:01] chrisalbon: do you know if ORES can/should be serving from both codfw and eqiad at the same time?
[01:45:21] https://phabricator.wikimedia.org/T159615
[01:45:25] I'm going to percussively restart ORES in codfw.
[01:45:27] [spec] Active-active setup for ORES across datacenters (eqiad, codfw)
[01:45:28] Yeah it should
[01:46:15] "The prefacing rule is now updating ORES in both datacenters."
[01:48:20] !log cdanis@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=ores,name=eqiad
[01:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:40] !log cdanis@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=ores,name=codfw
[01:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:49:00] ok, first let's try moving the traffic
[01:49:24] hopefully this doesn't add too much latency for MW, although I thought it was changeprop or the jobqueue that makes the calls to ORES
[01:50:19] btw, looking at the past 7 days of latency, this *almost* happened yesterday, but didn't quite get as bad
[01:50:33] yeah I just noticed that
[01:50:50] do you know if it's easy to grab stack traces from running celery workers?
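
[aside] One way to answer the stack-trace question above, assuming py-spy is installed on the host (nothing in this log says it was): it can dump the Python stacks of running celery workers without restarting them.

    # list candidate worker processes (adjust the pattern to match how celery shows up in ps)
    pgrep -af celery
    # dump the current Python stack of one worker, non-destructively
    sudo py-spy dump --pid <worker-pid>
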
[01:51:15] from a previous incident report: "disables changeprop so that it doesn't impact ores"
[01:51:17] I don't sorry
[01:51:40] what I'm suspecting is happening is that there's no timeouts for the redis requests, or a timeout so long that the query gets killed by the client or at the traffic layer before it has a chance to get through the queueing in redis
[01:51:53] and so then once you get into this state you never have any 'good' throughput
[01:54:00] okay, eqiad seems to be happily handling this traffic
[01:54:31] it's been long enough for TTLs for records pointing to codfw to clear, going to percussively restart things there
[01:55:01] sounds good, yeah looks good, API is up and serving fine
[01:55:03] the redises there are *still* hung on too many clients, btw, and it shouldn't've been getting new queries for at least 3 minutes now
[01:55:18] so once we get into this state, it seems we don't work our way out of it
[01:55:26] Good catch
[01:56:05] !log ❌ cdanis@cumin2001.codfw.wmnet ~ 🕙🍺 sudo cumin 'A:ores and A:codfw' 'systemctl restart celery-ores-worker.service uwsgi-ores.service'
[01:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:59:29] (PS1) Dzahn: add testvm5001.eqsin.wmnet [dns] - https://gerrit.wikimedia.org/r/630319
[01:59:51] (CR) jerkins-bot: [V: -1] add testvm5001.eqsin.wmnet [dns] - https://gerrit.wikimedia.org/r/630319 (owner: Dzahn)
[02:00:02] the number of steady-state connections from the ORES hosts in codfw (which were just restarted and haven't gotten traffic yet) is about 1/3rd to 1/2 of what seems to be the ceiling of allowed # of redis connections, btw
[02:00:17] so that's a second thing that should probably be fixed, not a comfortable amount of headroom
[02:00:54] https://phabricator.wikimedia.org/P12803
[02:01:36] okay, I think ores@codfw is healthier now
[02:02:40] okay thanks, I'll make tickets for those two fixes so we don't return to this situation on a future friday night
[02:02:49] I'm actually going to restore the traffic flow as it was before, depooling eqiad and repooling codfw, because I'm not 100% sure it is meant to be active/active, and that's not the kind of change I want to make on a Friday night :)
[02:03:15] cool
[02:03:36] thanks! can you file a third to verify that ores dnsdisc should have both DCs pooled?
[02:03:47] yep
[02:03:55] (PS2) Dzahn: add testvm5001.eqsin.wmnet [dns] - https://gerrit.wikimedia.org/r/630319
[02:04:03] !log cdanis@cumin2001 conftool action : set/pooled=true; selector: dnsdisc=ores,name=codfw
[02:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:04:25] !log cdanis@cumin2001 conftool action : set/pooled=false; selector: dnsdisc=ores,name=eqiad
[02:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:59] (CR) Dzahn: [C: +2] add testvm5001.eqsin.wmnet [dns] - https://gerrit.wikimedia.org/r/630319 (owner: Dzahn)
[02:13:09] things look pretty good now. I'm going back to my evening :)
[02:13:53] yep they look good, thanks cdanis! I owe you a beer next time we are in person
[02:14:07] 🍻 to 2023!
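
[aside] A minimal sketch of the headroom check implied by the "max number of clients reached" error and the steady-state-connections observation above; host and port are placeholders, since the ORES redis instances are not named in this log:

    # current client connections vs. the configured ceiling
    redis-cli -h <ores-redis-host> -p <port> INFO clients | grep connected_clients
    redis-cli -h <ores-redis-host> -p <port> CONFIG GET maxclients
    # note: once the ceiling is hit, even these commands can be refused, which is
    # consistent with the missing prometheus exporter data points discussed above
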
[02:17:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[02:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:24] (PS1) Dzahn: DHCP: add testvm5001 MAC address [puppet] - https://gerrit.wikimedia.org/r/630320 (https://phabricator.wikimedia.org/T252526)
[05:06:05] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[05:06:05] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[05:13:11] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:14:47] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:17:33] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms
[05:17:33] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.57 ms
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200926T0700)
[08:34:36] Operations, Machine Learning Platform, ORES, serviceops: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (jijiki)
[10:35:52] Operations, Editing-team, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Aklapper) @Esanders: Is the #Editing-Team still going to investigate this problem, despite lack of expert...
[12:51:56] Operations, Machine Learning Platform, ORES, serviceops: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (CDanis) "busy web workers" graph, which correlates quite well with the slowdown: https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&orgId=1&f...
[16:01:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:03:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:39:25] (PS3) ArielGlenn: new util to display info about revisions for one or more pages from XML input [dumps/mwbzutils] - https://gerrit.wikimedia.org/r/630267 (https://phabricator.wikimedia.org/T263319)
[16:39:37] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 128 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:42:49] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 12 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:43:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:45:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:51:14] cdanis looks like there is a build up again. still small now though
[17:52:03] https://usercontent.irccloud-cdn.com/file/3Ll5oDrI/Screenshot%20from%202020-09-26%2010-51-29.png
[17:54:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:58:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:06:38] Operations, Machine Learning Platform, ORES, serviceops: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (calbon) every time it starts at 16:00
[18:10:34] Operations, Machine Learning Platform, ORES, serviceops: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (calbon) {F32364643}
[18:44:01] Any SRE around?
[18:46:47] I'm not confident I want to do what I am thinking of doing without some SRE to back me up
[18:54:16] When an SRE gets on: ORES is filling up with busy workers like it did yesterday. Here is cdanis's solution that works (at least for 24 hours): https://phabricator.wikimedia.org/T263910
[18:54:16] I don't have permissions to move traffic to eqiad and I don't have cumin, but I _think_ I could stop ORES (sudo systemctl stop uwsgi-ores.service && sudo systemctl stop celery-ores-worker.service) on each ORES codfw box, wait ~10 minutes for redis connections to time out, then restart everything (sudo systemctl start uwsgi-ores.service && sudo systemctl start celery-ores-worker.service)
[18:54:17] But honestly that is a guess and I hesitate to do that
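
[aside] The per-host sequence proposed above, written out as a sketch. It uses only the commands chrisalbon lists; the ~10 minute wait was his guess and, per the reply that follows, stopping the services should already release the redis connections:

    # on each ores2xxx host in codfw, one host at a time
    sudo systemctl stop uwsgi-ores.service celery-ores-worker.service
    # optionally wait for connections to drain, then bring the services back
    sudo systemctl start celery-ores-worker.service uwsgi-ores.service
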
[19:15:42] I think, given that I'm bouncing kids on my lap right now, I'll continue to monitor it and if it really gets back I'll consider doing my nuclear option (described above)
[19:16:02] s/back/bad
[19:18:09] chrisalbon: you don't have to wait 10 minutes -- stopping the services should close the sockets, which IIRC was sufficient last night
[19:18:13] chrisalbon: hey Amir here
[19:18:27] I have done this before, you just need to restart uwsgi
[19:18:40] sudo service uwsgi-ores restart
[19:19:18] even there is something that can do it rollingly (I just made up that word) on the nine nodes of codfw
[19:19:33] okay cool let me try that now
[19:20:03] I assume, we should check the config of ores, I have a feeling something is still trying to connect to eqiad instead of rolling over to codfw (like redis connections)
[19:20:13] but that can wait until Monday
[19:20:15] !log sudo service uwsgi-ores restart
[19:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:20:23] did that work? I saw cdanis do it yesterday
[19:20:27] Amir1: given the timing of everything I think it's more about external traffic patterns
[19:20:37] yeah I think someone is hitting it hard
[19:20:46] yesterday I did the failover to eqiad so I could restart codfw all at once without worrying about nuking all in-flight queries
[19:20:59] it is likely; there should be some limits on req/ip, we can reduce that limit
[19:21:07] chrisalbon: also to set expectations -- AFAIK we don't have any SLO around ORES, and even for services where we have established an SLO, SRE can't be in the habit of routinely restarting things by hand; it's just not sustainable
[19:21:14] (The poolcounter thingy)
[19:21:54] cdanis: definitely, the ores code needs lots of love
[19:22:07] (or rewrite, depends on the perspective)
[19:22:40] yeah, we just hired our own SRE (Tobias) and I'm going to hire a second in Q3, but right now all you have is me
[19:22:41] fear my code
[19:23:03] :)
[19:23:09] chrisalbon: whatever you're going to write, I guarantee, I have seen worse :D
[19:23:27] I can help as much as possible, let me know (onboarding, etc.)
[19:23:33] btw chrisalbon I'm happy to discuss designing for reliability and production best practices and such -- engaging early on stuff like that is how SRE is "supposed" to work
[19:24:02] chrisalbon: for the problem at hand, do we know if it's just a set of IPs or if it's distributed?
[19:24:22] We can get the numbers out of hadoop (or turnilo if it can be found there)
[19:24:49] Awesome, okay I `sudo service uwsgi-ores restart` on all the ORES 200X boxes
[19:24:52] let's see what that does
[19:25:22] cdanis My plan is to make a replacement for ORES instead of keeping the beast limping along, so I'd love that conversation.
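
[aside] A sketch of the "do it rollingly" idea from above, using cumin's batching options from a cumin host. The batch size and sleep values are illustrative and the flags are assumed from cumin's documented CLI; cdanis's one-shot version of this command appears earlier in the log:

    # restart uwsgi on the codfw ores hosts one at a time, pausing between hosts
    sudo cumin -b 1 -s 60 'A:ores and A:codfw' 'systemctl restart uwsgi-ores.service'
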
[19:26:07] if it's a set of ips, we can reduce the poolcounter limit https://github.com/wikimedia/ores/blob/master/config/00-main.yaml#L64
[19:26:27] (and the next line)
[19:26:38] you can simply override it here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/ores/deploy/+/refs/heads/master/config/00-main.yaml
[19:26:41] Amir1: I don't think we know yet
[19:27:16] judging by the 1600 start time on all 3 days, my guess would be that it's some characteristic of external traffic, but, whatever is happening is not obvious from turnilo (there's not one IP address or user-agent or AS number, etc)
[19:27:45] https://usercontent.irccloud-cdn.com/file/zzVHoHe3/Screenshot%20from%202020-09-26%2012-27-32.png
[19:28:11] 👉 🦋
[19:28:19] lol
[19:28:42] cdanis yeah that was my thought too, especially since we haven't touched ORES in at least a month
[19:28:55] so one thing is, ores itself should restart its workers after a set of requests, since it has a big memleak baked in (don't get me started)
[19:29:50] we can reduce that number
[19:30:09] oh interesting
[19:30:21] cdanis: can I see the turnilo?
[19:30:27] I want to mess with it a bit
[19:30:42] alright, thanks all. I'll keep monitoring. Thanks for all your knowledge on this.
[19:31:06] yeah, we just need to trade off how much effort we spend bailing water and how much effort we spend building a new boat
[19:31:39] as I said before, I'm more than happy bailing water while you're making a new boat
[19:31:40] Amir1: https://w.wiki/dsV the TTFB filter is to highlight the slow requests
[19:41:56] Operations, Machine Learning Platform, ORES, serviceops: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (calbon) It started happening again, I went into each Ores200X box and manually `sudo service uwsgi-ores restart` to restart it. {F32364681}
[22:03:05] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-me
[22:04:43] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[22:22:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:23:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets