[00:26:04] <elukey>	 Krinkle: opened T200630 - I *think* that the issue is related to some events that the Kafka Python consumer cannot digest
[00:26:05] <stashbot>	 T200630: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630
[00:26:16] <elukey>	 but I haven't really figure out how to fix it
[00:31:32] <Krinkle>	 elukey: I find it interesting that a few times a burst did come true, possibly at the times you restart it.
[00:31:49] <Krinkle>	 Is that just replaying the same messages again between the last committed offset and the problem message
[00:32:13] <Krinkle>	 As in, it is not committing between those points?
[00:32:49] <Krinkle>	 Is there a stack trace?
[00:34:27] <elukey>	 IIRC topic=eventlogging-client-side&auto_offset_reset=earliest should force EL's processors to start from the beginning of the queue when restarting, but I am not 100% sure. This might explain why we see bursts - events gets (re)processed until the bad one that causes the hang
[00:34:42] <elukey>	 I don't have any error or relevant things from the logs, it is really strange
[00:35:28] <elukey>	 I was able to enable debug logging for the el's processor (but not the kafka python stuff, still working on it0
[00:35:36] <elukey>	 and I can see logs repeating
[00:35:58] <elukey>	 namely stuff like (MainThread) VirtualPageView is blacklisted, not writing event <Event etc..
[00:36:04] <elukey>	 those are ok since we do blacklist some stuff
[00:36:26] <elukey>	 but it seems repeating logs and ending up to the same event before getting stuck
[00:37:09] <elukey>	 I tried to check the messages stored at the current offset commits via kafkacat, not much joy
[00:37:39] <elukey>	 (from kafka-jumbo1001 - kafka consumer-groups --describe --offsets --group eventlogging_processor_client_side_00, and then kafkacat etc.. -o offset)
[00:38:47] <elukey>	 Krinkle: I was also able to see via gdb that a lot of librdkafka threads are keep doing poll(), so I guess that they are the producers waiting for data
[00:39:12] <elukey>	 if you are able to get a better view of where those python processes are stuck on eventlog1002 please go ahead :)
[00:40:22] <Krinkle>	 Aye, I don't actually have ssh to that.
[00:41:55] <elukey>	 ouch
[00:42:28] <elukey>	 so I am inclined to go to sleep now and restart working on it tomorrow morning
[00:43:30] <Krinkle>	 elukey: Hm.. sorry if I missed it, but I still didn't quite get what actually happens. Given no error, does that mean the process doesn't die?
[00:46:26] <Krinkle>	 Hm. yeah sounds like it. So there's a process reading data from Kafka getting stuck (the consumer process reading the client side raw data?), and there are other processes polling those with nothing left, thus causing CPU increase/
[00:47:02] <elukey>	 I think so yes
[00:47:57] <elukey>	 what I see are the el processors processes stuck at 100% cpu usage
[00:48:54] <elukey>	 gdb (that is not super good in this case but I have no idea what else to use) shows me a lot of librdkafka threads doing polls() (those are the producers since we use kafka-python for consuming, that does not use librdkafka)
[00:49:17] <elukey>	 and another one, probably the kafka python one, that is not super clear what it is doing
[00:49:22] <elukey>	 strace is not useful too
[00:49:52] <elukey>	 after some minutes the kafka brokers are tired of not seeing anything on the producer tcp conns and they tear them down
[00:49:57] <elukey>	 then errors etc..
[00:50:12] <elukey>	 all stopped
[00:50:23] <elukey>	 so the processes don't die, they just stay stuck
[00:53:06] <elukey>	 all right going afk, will try to work on it asap tomorrow and then report back
[00:53:10] <elukey>	 sorry for the delay :(
[01:42:11] <MyIgel>	 Interested in reasonably priced GLOBAL IRC ADVERTISING? Contact me on twitter https://twitter.com/nenolod or linkedin https://www.linkedin.com/in/nenolod
[01:48:23] <paladox>	 Uh
[01:50:44] <closedmouth>	 not a great sales pitch
[01:51:36] <Krenair>	 at some point FreeNode is just going to have to start filtering spam as it enters the network instead of afterwards
[01:55:37] <icinga-wm>	 PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdc1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdc1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[01:56:06] <paladox>	 Krenair: +1
[01:56:19] <paladox>	 BexUse I clicked on that link (as it was twitter)
[01:56:27] <paladox>	 BrexUse =because
[01:56:36] <paladox>	 Obviously I knew it was going to be fake
[02:05:28] <icinga-wm>	 PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:05:37] <icinga-wm>	 PROBLEM - SSH on stat1005 is CRITICAL: Server answer
[02:05:38] <icinga-wm>	 PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:06:07] <icinga-wm>	 PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:06:17] <icinga-wm>	 PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:06:18] <icinga-wm>	 PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:06:27] <icinga-wm>	 PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:09:27] <icinga-wm>	 PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:13:21] <Krinkle>	 >  PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdc1 fstype=xfs instance=ms-be2040:9100
[02:13:24] <Krinkle>	 I didn't know that was possible.
[02:13:49] <Krinkle>	 Shows usage went from 70% to -350% ; https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2040
[02:23:57] <icinga-wm>	 RECOVERY - SSH on stat1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0)
[02:24:37] <icinga-wm>	 RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[02:24:38] <icinga-wm>	 RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[02:24:47] <icinga-wm>	 RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[02:24:48] <icinga-wm>	 RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[02:24:58] <icinga-wm>	 RECOVERY - Disk space on stat1005 is OK: DISK OK
[02:25:08] <icinga-wm>	 RECOVERY - DPKG on stat1005 is OK: All packages OK
[02:25:37] <icinga-wm>	 RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[02:25:45] <wikibugs>	 (03PS3) 10Reedy: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410)
[02:25:45] <wikibugs>	 (03PS4) 10Reedy: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410)
[02:26:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) (owner: 10Reedy)
[02:27:26] <wikibugs>	 (03PS5) 10Reedy: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410)
[02:30:14] <p858snake>	 Krinkle: The reporting was designed because of https://phabricator.wikimedia.org/T199198
[02:32:33] <wikibugs>	 (03CR) 10Reedy: [C: 032] Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) (owner: 10Reedy)
[02:33:43] <wikibugs>	 (03Merged) 10jenkins-bot: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) (owner: 10Reedy)
[02:34:58] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Neuter EducationProgram T188410 (duration: 00m 58s)
[02:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:04] <stashbot>	 T188410: Block creation of new courses on the education extension - https://phabricator.wikimedia.org/T188410
[02:38:23] <wikibugs>	 (03CR) 10jenkins-bot: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) (owner: 10Reedy)
[02:40:19] <wikibugs>	 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10Krinkle) 05Resolved>03Open >>! [#wikimedia-operations] 01:55 UTC <icinga-wm> >  PROBLEM - Filesystem available is greater than filesystem size on ms-be2040...
[02:40:38] <Krinkle>	 p858snake: thx
[02:51:20] <wikibugs>	 (03PS1) 10Alex Monk: toollabs::proxy: Fix logo_height [puppet] - 10https://gerrit.wikimedia.org/r/448999
[03:27:08] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 859.02 seconds
[03:33:17] <icinga-wm>	 PROBLEM - puppet last run on mw2266 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz],File[/usr/share/GeoIP/GeoIPCity.dat.test]
[03:36:38] <icinga-wm>	 PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:44:18] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 273.14 seconds
[04:02:18] <icinga-wm>	 RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:03:58] <icinga-wm>	 RECOVERY - puppet last run on mw2266 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:31:37] <icinga-wm>	 PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/confd]
[06:41:47] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:42:47] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time
[07:02:18] <icinga-wm>	 RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:50:01] <wikibugs>	 (03PS3) 10Thiemo Kreuz (WMDE): Do not leak local $wgWBShared… variables to th eglobal scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632
[09:12:43] <elukey>	 Krinkle: o/ - I think I found the issue, all details in the task. Not sure exactly how to fix it yet though
[09:30:42] <icinga-wm>	 PROBLEM - mysqld processes on pc2006 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[09:32:00] <marostegui>	 o/
[09:32:14] <marostegui>	 Checking that ^
[09:32:36] <godog>	 ack, I'm here too if needed
[09:32:44] * volans|off too
[09:33:51] <marostegui>	 server rebooted itself
[09:33:54] <marostegui>	 I will create a task
[09:35:06] <volans|off>	 ack, let us know if we can help
[09:36:19] <elukey>	 if anybody has time to help with a python issue for eventlogging, I'd be really glad :)
[09:36:29] <volans|off>	 elukey: shoot
[09:36:32] <elukey>	 (Eventlogging is currently not working)
[09:36:39] <elukey>	 volans|off: https://phabricator.wikimedia.org/T200630
[09:36:47] <elukey>	 I am trying to figure out how to overcome that
[09:37:24] * volans|off reading
[09:37:39] <elukey>	 you can skip to the last bits - afaics it seems that a huge UA string causes our ua parser (https://github.com/ua-parser/uap-python0) to hang miserably
[09:38:08] <marostegui>	 https://phabricator.wikimedia.org/T200641
[09:38:35] <volans|off>	 elukey: py2 or py3?
[09:38:53] <elukey>	 py2
[09:39:38] <volans>	 ack, give me some minutes
[09:40:13] <elukey>	 if you want on eventlog1002 there is a python process already hanging
[09:40:17] <elukey>	 under my username
[09:40:29] <volans>	 no need, your repro is pretty straightforward
[09:43:53] <wikibugs>	 (03PS1) 10Marostegui: db-codfw.php: Depool pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449010 (https://phabricator.wikimedia.org/T200641)
[09:45:29] <jynus>	 checking pc2006
[09:46:31] <jynus>	 uptime 19 minutes, checking hardware logs
[09:47:06] <marostegui>	 jynus: https://phabricator.wikimedia.org/T200641
[09:47:15] <marostegui>	 jynus: Let's depool it and just check it on monday
[09:47:25] <jynus>	 oh, thanks
[09:47:25] <marostegui>	 jynus: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/449010/
[09:51:49] <wikibugs>	 (03PS2) 10Marostegui: db-codfw.php: Depool pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449010 (https://phabricator.wikimedia.org/T200641)
[09:53:08] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449010 (https://phabricator.wikimedia.org/T200641) (owner: 10Marostegui)
[09:54:24] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw.php: Depool pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449010 (https://phabricator.wikimedia.org/T200641) (owner: 10Marostegui)
[09:56:16] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool pc2006 T200641 (duration: 00m 57s)
[09:56:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:21] <stashbot>	 T200641: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641
[09:56:48] <marostegui>	 jynus: server depooled and silenced till tuesday, I am logging off :)
[09:58:11] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw.php: Depool pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449010 (https://phabricator.wikimedia.org/T200641) (owner: 10Marostegui)
[10:25:01] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 20 seconds
[10:25:37] <godog>	 womp womp
[10:25:40] <godog>	 I'll take a look
[10:25:52] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 281 bytes in 3.588 second response time
[10:25:54] <jynus>	 meanwhile, I will check high level impact
[10:26:10] <akosiaris_>	 hey
[10:26:20] <akosiaris_>	 what's up?
[10:26:34] <godog>	 hi, thumbor just paged and then recovered
[10:26:38] <jynus>	 no idea, it seems thumbor had a hiccup
[10:26:40] <godog>	 I believe it timed out
[10:26:56] <akosiaris_>	 yeah I just got the recovery
[10:27:55] <godog>	 thumbor seems otherwise ok tho
[10:29:30] <jynus>	 I don't even see a significant error spike
[10:31:04] <godog>	 indeed, I think icinga timed out while waiting for a response
[10:31:16] <godog>	 to which T187765 is part of the fix
[10:31:17] <stashbot>	 T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765
[10:31:27] <godog>	 I'll log off again o/~
[10:32:32] <akosiaris_>	 same here. ttl
[11:29:54] <wikibugs>	 (03PS1) 10Matěj Suchánek: Update several Wikidata-related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449017
[11:31:15] <volans>	 godog: FYI I never got thumbor pages, but I got pc2006 ones before :(
[15:45:28] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0
[15:48:48] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 309 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[15:53:57] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 12 probes of 309 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:43:47] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[16:43:47] <icinga-wm>	 PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[17:55:19] <wikibugs>	 (03PS1) 10Legoktm: Add base stretch image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449032
[18:52:27] <wikibugs>	 (03PS1) 10Legoktm: [WIP] Add php72 base and web images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318)
[18:55:42] <wikibugs>	 (03CR) 10Legoktm: [C: 04-1] "E: Unable to locate package toollabs-webservice" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm)
[19:12:07] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 222.04 ms
[19:12:07] <icinga-wm>	 RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 229.41 ms
[19:36:58] <wikibugs>	 (03CR) 10Zhuyifei1999: Add base stretch image (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449032 (owner: 10Legoktm)
[19:37:59] <wikibugs>	 (03CR) 10Legoktm: Add base stretch image (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449032 (owner: 10Legoktm)
[21:20:36] <wikibugs>	 (03CR) 10Krinkle: [WIP] Add php72 base and web images (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm)
[21:39:50] <wikibugs>	 (03PS3) 10Krinkle: Enable Special:Block Feedback Request (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt)
[21:40:33] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] "Oustanding fixme https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMessages/+/444639/ - fix/backport before enabling." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt)
[22:02:13] <wikibugs>	 (03CR) 10Dbarratt: "> Patch Set 3: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt)