[00:04:19] 10Operations, 10Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (10Dzahn) a:03Dzahn
[00:28:15] 10Operations, 10Patch-For-Review: setup replacements for maintenance_server (terbium, wasat) on Stretch - https://phabricator.wikimedia.org/T192092 (10Peachey88)
[03:29:37] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 841.31 seconds
[03:39:57] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 219.68 seconds
[03:48:37] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[03:50:48] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[04:01:17] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[04:04:37] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[04:09:08] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[04:10:18] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[06:22:47] RECOVERY - Memory correctable errors -EDAC- on cp1049 is OK: (C)4 ge (W)2 ge 0 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1049&var-datasource=eqiad%2520prometheus%252Fops
[06:31:49] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/DigiCert_High_Assurance_CA-3.crt]
[06:48:27] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[06:49:28] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[06:57:29] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:07:58] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[08:10:17] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[08:12:32] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1094 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448532 (owner: 10Jcrespo)
[08:13:43] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1094 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448532 (owner: 10Jcrespo)
[08:16:36] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1094 fully (duration: 00m 57s)
[08:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:03] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1094 for reimage" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448532 (owner: 10Jcrespo)
[08:34:48] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[08:35:57] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[09:28:27] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 40.97, 35.64, 23.67
[09:31:27] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 59.08, 39.93, 24.75
[09:32:49] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/
[09:32:49] or maybe this blog by freenode staff member Matthew 'mst' Trout https://MattSTrout.com/
[09:41:38] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 9.21, 20.71, 23.66
[09:48:57] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 8.58, 15.29, 23.19
[11:19:29] Anyone from ops around and mind checking something for me?
[11:23:15] (03PS1) 10Jcrespo: Packages for MySQL 8.0.12 and MariaDB 10.3.8 [software] - 10https://gerrit.wikimedia.org/r/448854
[11:26:17] (03PS2) 10Jcrespo: Packages for MySQL 8.0.12 and MariaDB 10.3.8 [software] - 10https://gerrit.wikimedia.org/r/448854
[11:29:47] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[11:34:18] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[11:35:51] (03PS3) 10Jcrespo: Packages for MySQL 8.0.12 and MariaDB 10.3.8 [software] - 10https://gerrit.wikimedia.org/r/448854
[11:42:18] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[11:43:28] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[13:04:49] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/
[13:39:47] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[13:40:48] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[13:50:07] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[13:55:48] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[14:03:47] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[14:04:48] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[15:45:58] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert.
[15:50:43] I've silenced this ^ it is due to ulsfo being depooled
[15:52:48] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting.
[17:35:55] !log restart eventlogging on eventlog1002 after tons of kafka disconnects (still not clear what happened)
[17:35:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:41] !log Navigation Timing alerts indicate we're losing a significant amount of traffic
[17:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:56] godog: Can you give some context on ulsfo? Was/is there a problem?
[17:55:58] elukey: is it being discussed somewhere?
[17:57:57] Krinkle: I just saw some errors and checked, it is only me for the moment. For some reason the el processors on eventlog1002 at some point get disconnected by kafka, but I am not sure why
[17:58:04] load is higher than usual
[17:58:18] https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1
[17:58:29] For about 2 hours, there were virtually no packets
[17:58:31] yep I know
[17:58:36] https://grafana.wikimedia.org/dashboard/db/eventlogging?orgId=1&from=now-3h&to=now-5m
[17:58:46] I tried to restart it but for some reason it doesn't work
[17:58:49] I assume the load spike was from after it got back, or was there an earlier spike that broke it?
[17:59:39] at around 15:48 UTC (more or less) el1002 started to consume more CPU and load went up, then the processors logged kafka errors and disconnected
[17:59:44] the second spike was puppet, I suppose
[17:59:57] third and fourth are me restarting
[18:01:19] elukey: Aye, okay. Are you looking for a cause, or for a way to mitigate the cause?
[18:02:11] I am trying, yes; I have people over for dinner in a bit and I might need to recheck later on
[18:02:30] it is really weird
[18:06:14] Aye, wanted to know so I can skip what you've done already..
[18:06:27] The Kibana dashboard was broken, I fixed it just now. https://logstash.wikimedia.org/app/kibana#/dashboard/eventlogging
[18:07:35] thanks!
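The "virtually no packets" observation above can be cross-checked directly against the raw client-side topic, independently of the Grafana dashboards. Below is a minimal sketch using kafka-python (the consumer library the processors use); the topic name and broker host are assumptions based on names that appear elsewhere in this log, not verified values.

```python
from kafka import KafkaConsumer

# Hypothetical topic/broker names; adjust to the actual EventLogging setup.
consumer = KafkaConsumer(
    'eventlogging-client-side',
    bootstrap_servers=['kafka-jumbo1005.eqiad.wmnet:9092'],
    auto_offset_reset='latest',
    enable_auto_commit=False,
    consumer_timeout_ms=30000,  # stop iterating if nothing arrives for 30s
)

count = 0
for message in consumer:  # iteration ends after consumer_timeout_ms of silence
    count += 1
    if count >= 10:
        break

print('received %d raw client-side events' % count)
consumer.close()
```

If this sees no events while the upstream producers are healthy, the gap is in the topic itself; if events flow, the processors are the ones stuck.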
[18:08:20] I am not sure what happened at ~15:48 to cause this
[18:11:03] if you have any ideas I am all ears
[18:15:16] trying one setting on the fly for kafka-python, let's see if it works
[18:23:51] (so puppet disabled on eventlog1002 atm, kafka-python set with max_poll_records=100 - default 500)
[18:28:02] Came back for about 15min
[18:30:50] but seems to have dropped again.
[18:31:15] Looking at the hour building up to 15:48, no particular schema stands out as having increased traffic or something like that.
[18:32:12] "raw client side" and "EventError" didn't increase much either. So it's also not a burst of invalid stuff breaking it somehow.
[18:34:11] yeah, my attempt failed
[18:37:06] elukey: so, what is actually happening? CPU exhaustion? run-time fatal in Python?
[18:43:24] Krinkle: not super clear from the logs, still checking (on and off since I have people at home, sorry; will try to put more time in later on)
[18:44:31] elukey: OK. Should we call in someone else as well? I don't actually know what the pager policy is for EL.
[18:44:32] mmm now that I look at it, it might be the producer side of the eventlogging processors that is not working
[18:44:53] Krinkle: in theory it is not something that can page on a weekend
[18:45:32] nothing relevant in the kafka logs
[18:45:53] Okay.
[18:46:51] what I see is that all of a sudden things like
[18:46:53] producer-1| kafka-jumbo1005.eqiad.wmnet:9092/1005: Receive failed: Disconnected etc..
[18:46:57] start popping up
[18:46:58] for a lot of brokers
[18:47:19] that seems to be librdkafka having issues producing, but there's no further indication why
[18:49:03] librdkafka seems really old on eventlog1002
[18:49:04] librdkafka1:amd64 0.9.3-1
[18:49:35] the newer one is 0.11.5-1~bpo9+1
[18:52:22] the el processor processes are at 100% cpu usage
[18:53:28] and python-confluent-kafka is 0.9.2-1
[18:53:44] so not sure if upgrading librdkafka will work fine (it should in theory)
[18:55:32] I also saw errors like "6/6 brokers are down" for the producer (that is confluent-kafka-python IIRC), that seems to be a librdkafka bug
[18:58:07] ok I tested the upgrade of librdkafka in labs, looks promising
[19:07:38] so let's see first if I can add more verbose logging to librdkafka on one processor
[19:14:19] I started only one processor to see if it manages to stay up
[19:14:33] need to go now, but I'll rejoin in hopefully one hour or something
[20:08:50] so from what I can see, this is what happens:
[20:09:21] 1) the el processors for some reason are spinning/using cpu and not showing anything in strace
[20:09:42] 2) after 10 mins, the kafka brokers tear down the connections with them, and hence the disconnect errors
[20:09:51] (because of idle timeout)
[20:10:05] next step is to figure out what the python processes are doing atm
[20:19:05] (afk again, hope to check it later on)
[20:20:18] Hey, I thought you guys might be interested in this blog by freenode staff member Bryan 'kloeri' Ostergaard https://bryanostergaard.com/
[21:56:41] (03PS1) 10Alex Monk: Python 3: Fix another bytes vs. string problem breaking JSON formatting [software/cumin] - 10https://gerrit.wikimedia.org/r/448985 (https://phabricator.wikimedia.org/T200622)
[21:59:33] (03CR) 10jerkins-bot: [V: 04-1] Python 3: Fix another bytes vs. string problem breaking JSON formatting [software/cumin] - 10https://gerrit.wikimedia.org/r/448985 (https://phabricator.wikimedia.org/T200622) (owner: 10Alex Monk)
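The cumin change above (448985) deals with a Python 3 bytes-vs-str mismatch that breaks JSON output. The snippet below is only a minimal illustration of that class of failure, not the actual cumin code.

```python
import json

# In Python 3, command output read from a transport or subprocess arrives as
# bytes, and json.dumps() refuses to serialize bytes objects.
output = b'{"host": "db1094", "rc": 0}'

try:
    json.dumps({'result': output})
except TypeError as exc:
    print(exc)  # "Object of type bytes is not JSON serializable"

# Decoding to str first makes the structure serializable again.
print(json.dumps({'result': output.decode('utf-8')}))
```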
[22:01:25] (03Abandoned) 10Alex Monk: phabricator: Attempt to multiply rate limits for WMDE and WMF offices [puppet] - 10https://gerrit.wikimedia.org/r/444124 (https://phabricator.wikimedia.org/T198612) (owner: 10Alex Monk)
[22:03:23] (03CR) 10Alex Monk: "I'm sure there's got to be a legitimate problem in that Jenkins console output somewhere but I don't see it." [software/cumin] - 10https://gerrit.wikimedia.org/r/448985 (https://phabricator.wikimedia.org/T200622) (owner: 10Alex Monk)
[23:38:58] PROBLEM - Check status of defined EventLogging jobs on eventlog1002 is CRITICAL: CRITICAL: Stopped EventLogging jobs: eventlogging-processor@client-side-10 eventlogging-processor@client-side-09 eventlogging-processor@client-side-08 eventlogging-processor@client-side-07 eventlogging-processor@client-side-06 eventlogging-processor@client-side-05 eventlogging-processor@client-side-04 eventlogging-processor@client-side-03 eventlogg
[23:38:58] t-side-02 eventlogging-processor@client-side-01 eventlogging-processor@client-side-00
[23:40:54] this is me --^
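On the producer-side disconnects diagnosed at 20:08-20:10 (processors spinning, brokers closing the connections after ~10 minutes): the 10-minute figure matches Kafka's broker-side connections.max.idle.ms default of 600000 ms, which suggests the "Receive failed: Disconnected" lines are a symptom of the processors stalling rather than a broker problem. Below is a rough sketch of how librdkafka's verbose logging could be surfaced on one processor via confluent-kafka-python, along the lines suggested at 19:07; the broker and topic names are placeholders, and the exact options available may differ on the old 0.9.2 client in use.

```python
from confluent_kafka import Producer

# Placeholder broker/topic names for illustration only.
conf = {
    'bootstrap.servers': 'kafka-jumbo1005.eqiad.wmnet:9092',
    # Log librdkafka's broker/protocol state transitions (stderr by default)
    # so an idle-timeout close can be told apart from a real outage.
    'debug': 'broker,protocol',
    # Keep idle-connection close messages visible while investigating.
    'log.connection.close': True,
    # TCP keepalive keeps the socket alive at the OS level, but it does not
    # stop the broker from closing a connection that sends no Kafka requests
    # for connections.max.idle.ms.
    'socket.keepalive.enable': True,
}

producer = Producer(conf)
producer.produce('eventlogging-test', value=b'{"probe": true}')
producer.flush(10)
```

If the debug output shows the producer going quiet before the broker closes the socket, that points back at whatever is pinning the processor at 100% CPU rather than at Kafka itself.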