[00:04:19] 10Operations, 10Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (10Dzahn) a:03Dzahn
[00:28:15] 10Operations, 10Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (10Peachey88)
[03:29:37] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 841.31 seconds
[03:39:57] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 219.68 seconds
[03:48:37] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[03:50:48] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[04:01:17] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[04:04:37] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[04:09:08] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[04:10:18] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[06:22:47] RECOVERY - Memory correctable errors -EDAC- on cp1049 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1049&var-datasource=eqiad%2520prometheus%252Fops
[06:31:49] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/DigiCert_High_Assurance_CA-3.crt]
[06:48:27] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[06:49:28] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[06:57:29] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:07:58] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[08:10:17] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[08:12:32] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1094 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448532 (owner: 10Jcrespo)
[08:13:43] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1094 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448532 (owner: 10Jcrespo)
[08:16:36] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1094 fully (duration: 00m 57s)
[08:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:03] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1094 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448532 (owner: 10Jcrespo)
[08:34:48] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[08:35:57] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[09:28:27] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 40.97, 35.64, 23.67
[09:31:27] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 59.08, 39.93, 24.75
[09:32:49] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/
[09:32:49] or maybe this blog by freenode staff member Matthew 'mst' Trout https://MattSTrout.com/
[09:41:38] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 9.21, 20.71, 23.66
[09:48:57] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 8.58, 15.29, 23.19
[11:19:29] Anyone from ops around and mind checking something for me?
[11:23:15] (03PS1) 10Jcrespo: Packages for MySQL 8.0.12 and MariaDB 10.3.8 [software] - 10https://gerrit.wikimedia.org/r/448854
[11:26:17] (03PS2) 10Jcrespo: Packages for MySQL 8.0.12 and MariaDB 10.3.8 [software] - 10https://gerrit.wikimedia.org/r/448854
[11:29:47] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[11:34:18] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[11:35:51] (03PS3) 10Jcrespo: Packages for MySQL 8.0.12 and MariaDB 10.3.8 [software] - 10https://gerrit.wikimedia.org/r/448854
[11:42:18] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[11:43:28] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[13:04:49] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/
[13:39:47] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[13:40:48] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[13:50:07] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[13:55:48] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[14:03:47] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[14:04:48] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[15:45:58] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[15:50:43] I've silenced this ^ it is due to ulsfo being depooled
[15:52:48] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[17:35:55] !log restart eventlogging on eventlog1002 after tons of kafka disconnects (still not clear what happened)
[17:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:41] !log Navigation Timing alerts indicate we're losing a significant amount of traffic
[17:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:56] godog: Can you give some context on ulsfo? Was/is there a problem?
[17:55:58] elukey: is it being discussed somewhere?
[17:57:57] Krinkle: I just saw some errors and checked, it is only me for the moment. For some reason the el processors on eventlog1002 at some point get disconnected by kafka, but I am not sure why
[17:58:04] load is higher than usual
[17:58:18] https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1
[17:58:29] For about 2 hours, there were virtually no packets
[17:58:31] yep I know
[17:58:36] https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&from=now-3h&to=now-5m
[17:58:46] I tried to restart it but for some reason it doesn't work
[17:58:49] I assume the load spike was from after it got back, or was there an earlier spike that broke it?
[17:59:39] at around 15:48 UTC (more or less) el1002 started to consume more CPU and load went up, then the processors logged kafka errors and disconnected
[17:59:44] the second spike was puppet, I suppose
[17:59:57] third and fourth are me restarting
[18:01:19] elukey: Aye, okay. Are you looking for a cause, or for a way to mitigate the cause?
[18:02:11] I am trying, yes; I have people over for dinner in a bit and I might need to recheck later on
[18:02:30] it is really weird
[18:06:14] Aye, wanted to know so I can skip what you've done already..
[18:06:27] The Kibana dashboard was broken, I fixed it just now. https://logstash.wikimedia.org/app/kibana#/dashboard/eventlogging
[18:07:35] thanks!
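The "virtually no packets" observation above can be cross-checked directly against the raw client-side topic, independently of the Grafana dashboards. Below is a minimal sketch using kafka-python (the consumer library the processors use); the topic name and broker host are assumptions based on names that appear elsewhere in this log, not verified values.

```python
from kafka import KafkaConsumer

# Hypothetical topic/broker names; adjust to the actual EventLogging setup.
consumer = KafkaConsumer(
    'eventlogging-client-side',
    bootstrap_servers=['kafka-jumbo1005.eqiad.wmnet:9092'],
    auto_offset_reset='latest',
    enable_auto_commit=False,
    consumer_timeout_ms=30000,  # stop iterating if nothing arrives for 30s
)

count = 0
for message in consumer:  # iteration ends after consumer_timeout_ms of silence
    count += 1
    if count >= 10:
        break

print('received %d raw client-side events' % count)
consumer.close()
```

If this sees no events while the upstream producers are healthy, the gap is in the topic itself; if events flow, the processors are the ones stuck.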
[18:08:20] I am not sure what happened at ~15:48 to cause this
[18:11:03] if you have any ideas I am all ears
[18:15:16] trying one setting on the fly for kafka-python, let's see if it works
[18:23:51] (so puppet disabled on eventlog1002 atm, kafka-python set with max_poll_records=100 - default 500)
[18:28:02] Came back for about 15min
[18:30:50] but seems to have dropped again.
[18:31:15] Looking at the hour building up to 15:48, no particular schema stands out as having increased traffic or something like that.
[18:32:12] "raw client side" and "EventError" didn't increase much either. So it's also not a burst of invalid stuff breaking it somehow.
[18:34:11] yeah, my attempt failed
[18:37:06] elukey: so, what is actually happening? CPU exhaustion? run-time fatal in Python?
[18:43:24] Krinkle: not super clear from the logs, still checking (on and off since I have people at home, sorry; will try to put more time in later on)
[18:44:31] elukey: OK. Should we call in someone else as well? I don't actually know what the pager policy is for EL.
[18:44:32] mmm now that I look at it, it might be the producer side of the eventlogging processors that is not working
[18:44:53] Krinkle: in theory it is not something that can page on a weekend
[18:45:32] nothing relevant in the kafka logs
[18:45:53] Okay.
[18:46:51] what I see is that all of a sudden things like
[18:46:53] producer-1| kafka-jumbo1005.eqiad.wmnet:9092/1005: Receive failed: Disconnected etc..
[18:46:57] start popping up
[18:46:58] for a lot of brokers
[18:47:19] that seems to be librdkafka having issues producing, but there's no further indication why
[18:49:03] librdkafka seems really old on eventlog1002
[18:49:04] librdkafka1:amd64 0.9.3-1
[18:49:35] the newer one is 0.11.5-1~bpo9+1
[18:52:22] the el processor processes are at 100% cpu usage
[18:53:28] and python-confluent-kafka is 0.9.2-1
[18:53:44] so not sure if upgrading librdkafka will work fine (it should in theory)
[18:55:32] I also saw errors like "6/6 brokers are down" for the producer (that is confluent-kafka-python IIRC), that seems to be a librdkafka bug
[18:58:07] ok I tested the upgrade of librdkafka in labs, looks promising
[19:07:38] so let's see first if I can add more verbose logging to librdkafka on one processor
[19:14:19] I started only one processor to see if it manages to stay up
[19:14:33] need to go now, but I'll rejoin in hopefully one hour or something
[20:08:50] so from what I can see, this is what happens:
[20:09:21] 1) the el processors for some reason are spinning/using cpu and not showing anything in strace
[20:09:42] 2) after 10 mins, the kafka brokers tear down the connections with them, and hence the disconnect errors
[20:09:51] (because of idle timeout)
[20:10:05] next step is to figure out what the python processes are doing atm
[20:19:05] (afk again, hope to check it later on)
[20:20:18] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/
[21:56:41] (03PS1) 10Alex Monk: Python 3: Fix another bytes vs. string problem breaking JSON formatting [software/cumin] - 10https://gerrit.wikimedia.org/r/448985 (https://phabricator.wikimedia.org/T200622)
[21:59:33] (03CR) 10jerkins-bot: [V: 04-1] Python 3: Fix another bytes vs. string problem breaking JSON formatting [software/cumin] - 10https://gerrit.wikimedia.org/r/448985 (https://phabricator.wikimedia.org/T200622) (owner: 10Alex Monk)
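The cumin change above (448985) deals with a Python 3 bytes-vs-str mismatch that breaks JSON output. The snippet below is only a minimal illustration of that class of failure, not the actual cumin code.

```python
import json

# In Python 3, command output read from a transport or subprocess arrives as
# bytes, and json.dumps() refuses to serialize bytes objects.
output = b'{"host": "db1094", "rc": 0}'

try:
    json.dumps({'result': output})
except TypeError as exc:
    print(exc)  # "Object of type bytes is not JSON serializable"

# Decoding to str first makes the structure serializable again.
print(json.dumps({'result': output.decode('utf-8')}))
```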
[22:01:25] (03Abandoned) 10Alex Monk: phabricator: Attempt to multiply rate limits for WMDE and WMF offices [puppet] - 10https://gerrit.wikimedia.org/r/444124 (https://phabricator.wikimedia.org/T198612) (owner: 10Alex Monk)
[22:03:23] (03CR) 10Alex Monk: "I'm sure there's got to be a legitimate problem in that Jenkins console output somewhere but I don't see it." [software/cumin] - 10https://gerrit.wikimedia.org/r/448985 (https://phabricator.wikimedia.org/T200622) (owner: 10Alex Monk)
[23:38:58] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-processor@client-side-10 eventlogging-processor@client-side-09 eventlogging-processor@client-side-08 eventlogging-processor@client-side-07 eventlogging-processor@client-side-06 eventlogging-processor@client-side-05 eventlogging-processor@client-side-04 eventlogging-processor@client-side-03 eventlogg
[23:38:58] t-side-02 eventlogging-processor@client-side-01 eventlogging-processor@client-side-00
[23:40:54] this is me --^
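On the producer-side disconnects diagnosed at 20:08-20:10 (processors spinning, brokers closing the connections after ~10 minutes): the 10-minute figure matches Kafka's broker-side connections.max.idle.ms default of 600000 ms, which suggests the "Receive failed: Disconnected" lines are a symptom of the processors stalling rather than a broker problem. Below is a rough sketch of how librdkafka's verbose logging could be surfaced on one processor via confluent-kafka-python, along the lines suggested at 19:07; the broker and topic names are placeholders, and the exact options available may differ on the old 0.9.2 client in use.

```python
from confluent_kafka import Producer

# Placeholder broker/topic names for illustration only.
conf = {
    'bootstrap.servers': 'kafka-jumbo1005.eqiad.wmnet:9092',
    # Log librdkafka's broker/protocol state transitions (stderr by default)
    # so an idle-timeout close can be told apart from a real outage.
    'debug': 'broker,protocol',
    # Keep idle-connection close messages visible while investigating.
    'log.connection.close': True,
    # TCP keepalive keeps the socket alive at the OS level, but it does not
    # stop the broker from closing a connection that sends no Kafka requests
    # for connections.max.idle.ms.
    'socket.keepalive.enable': True,
}

producer = Producer(conf)
producer.produce('eventlogging-test', value=b'{"probe": true}')
producer.flush(10)
```

If the debug output shows the producer going quiet before the broker closes the socket, that points back at whatever is pinning the processor at 100% CPU rather than at Kafka itself.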