[00:26:04] Krinkle: opened T200630 - I *think* that the issue is related to some events that the Kafka Python consumer cannot digest [00:26:05] T200630: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630 [00:26:16] but I haven't really figured out how to fix it [00:31:32] elukey: I find it interesting that a few times a burst did come through, possibly at the times you restarted it. [00:31:49] Is that just replaying the same messages again between the last committed offset and the problem message [00:32:13] As in, it is not committing between those points? [00:32:49] Is there a stack trace? [00:34:27] IIRC topic=eventlogging-client-side&auto_offset_reset=earliest should force EL's processors to start from the beginning of the queue when restarting, but I am not 100% sure. This might explain why we see bursts - events get (re)processed until the bad one that causes the hang [00:34:42] I don't have any error or relevant things from the logs, it is really strange [00:35:28] I was able to enable debug logging for the EL processor (but not the kafka python stuff, still working on it) [00:35:36] and I can see logs repeating [00:35:58] namely stuff like "(MainThread) VirtualPageView is blacklisted, not writing event" - those are ok since we do blacklist some stuff [00:36:26] but the logs seem to repeat and end up at the same event before getting stuck [00:37:09] I tried to check the messages stored at the current offset commits via kafkacat, not much joy [00:37:39] (from kafka-jumbo1001 - kafka consumer-groups --describe --offsets --group eventlogging_processor_client_side_00, and then kafkacat etc.. -o offset) [00:38:47] Krinkle: I was also able to see via gdb that a lot of librdkafka threads keep doing poll(), so I guess that they are the producers waiting for data [00:39:12] if you are able to get a better view of where those python processes are stuck on eventlog1002 please go ahead :) [00:40:22] Aye, I don't actually have ssh to that. [00:41:55] ouch [00:42:28] so I am inclined to go to sleep now and restart working on it tomorrow morning [00:43:30] elukey: Hm.. sorry if I missed it, but I still didn't quite get what actually happens. Given no error, does that mean the process doesn't die? [00:46:26] Hm. yeah sounds like it. So there's a process reading data from Kafka getting stuck (the consumer process reading the client side raw data?), and there are other processes polling those with nothing left, thus causing the CPU increase? [00:47:02] I think so yes [00:47:57] what I see are the EL processor processes stuck at 100% cpu usage [00:48:54] gdb (which is not super good in this case but I have no idea what else to use) shows me a lot of librdkafka threads doing poll() (those are the producers, since we use kafka-python for consuming, which does not use librdkafka) [00:49:17] and another one, probably the kafka-python one, where it is not super clear what it is doing [00:49:22] strace is not useful either [00:49:52] after some minutes the kafka brokers get tired of not seeing anything on the producer tcp conns and they tear them down [00:49:57] then errors etc.. [00:50:12] all stopped [00:50:23] so the processes don't die, they just stay stuck [00:53:06] all right, going afk, will try to work on it asap tomorrow and then report back [00:53:10] sorry for the delay :(
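The offset inspection elukey describes at 00:37:39 (kafka consumer-groups plus kafkacat) can also be done with kafka-python itself, the same library the processors consume with. A minimal sketch: the bootstrap broker address/port and partition 0 are assumptions for illustration, while the topic and consumer group names are the ones quoted in the channel.

```python
# Peek at the message sitting at a consumer group's committed offset,
# i.e. the candidate "bad" event the processor re-reads after each restart.
from kafka import KafkaConsumer, TopicPartition

BROKER = 'kafka-jumbo1001.eqiad.wmnet:9092'  # assumed bootstrap broker:port
TOPIC = 'eventlogging-client-side'
GROUP = 'eventlogging_processor_client_side_00'
tp = TopicPartition(TOPIC, 0)  # partition 0 picked for illustration

# Read the group's committed offset without joining the group or committing.
reader = KafkaConsumer(bootstrap_servers=BROKER, group_id=GROUP,
                       enable_auto_commit=False)
offset = reader.committed(tp)
reader.close()
print('committed offset for %s[0]: %s' % (TOPIC, offset))

if offset is not None:
    # Seek a standalone consumer to that offset and dump the raw event.
    peeker = KafkaConsumer(bootstrap_servers=BROKER, enable_auto_commit=False,
                           consumer_timeout_ms=10000)
    peeker.assign([tp])
    peeker.seek(tp, offset)
    for msg in peeker:
        print('offset=%d size=%d value=%r'
              % (msg.offset, len(msg.value), msg.value[:300]))
        break
    peeker.close()
```

With group_id set but no subscribe() call, the first consumer only fetches the committed offset, and enable_auto_commit=False keeps both consumers from moving the group's offsets. This also bears on Krinkle's question at 00:31:49: auto_offset_reset='earliest' only applies when the group has no committed offset, so on restart the processors resume from the last commit, which matches the observed replay bursts up to the bad event.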
[01:42:11] Interested in reasonably priced GLOBAL IRC ADVERTISING? Contact me on twitter https://twitter.com/nenolod or linkedin https://www.linkedin.com/in/nenolod [01:48:23] Uh [01:50:44] not a great sales pitch [01:51:36] at some point FreeNode is just going to have to start filtering spam as it enters the network instead of afterwards [01:55:37] PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdc1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdc1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops [01:56:06] Krenair: +1 [01:56:19] Because I clicked on that link (as it was twitter) [01:56:36] Obviously I knew it was going to be fake [02:05:28] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [02:05:37] PROBLEM - SSH on stat1005 is CRITICAL: Server answer [02:05:38] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [02:06:07] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [02:06:17] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [02:06:18] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [02:06:27] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [02:09:27] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [02:13:21] > PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdc1 fstype=xfs instance=ms-be2040:9100 [02:13:24] I didn't know that was possible. [02:13:49] Shows usage went from 70% to -350% ; https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2040 [02:23:57] RECOVERY - SSH on stat1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) [02:24:37] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [02:24:38] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [02:24:47] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [02:24:48] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [02:24:58] RECOVERY - Disk space on stat1005 is OK: DISK OK [02:25:08] RECOVERY - DPKG on stat1005 is OK: All packages OK [02:25:37] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [02:25:45] (03PS3) 10Reedy: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) [02:25:45] (03PS4) 10Reedy: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) [02:26:48] (03CR) 10jerkins-bot: [V: 04-1] Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) (owner: 10Reedy) [02:27:26] (03PS5) 10Reedy: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) [02:30:14] Krinkle: The reporting was designed because of https://phabricator.wikimedia.org/T199198 [02:32:33] (03CR) 10Reedy: [C: 032] Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953
(https://phabricator.wikimedia.org/T188410) (owner: 10Reedy) [02:33:43] (03Merged) 10jenkins-bot: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) (owner: 10Reedy) [02:34:58] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: Neuter EducationProgram T188410 (duration: 00m 58s) [02:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:35:04] T188410: Block creation of new courses on the education extension - https://phabricator.wikimedia.org/T188410 [02:38:23] (03CR) 10jenkins-bot: Remove most EP related userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447953 (https://phabricator.wikimedia.org/T188410) (owner: 10Reedy) [02:40:19] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10Krinkle) 05Resolved>03Open >>! [#wikimedia-operations] 01:55 UTC > PROBLEM - Filesystem available is greater than filesystem size on ms-be2040... [02:40:38] p858snake: thx [02:51:20] (03PS1) 10Alex Monk: toollabs::proxy: Fix logo_height [puppet] - 10https://gerrit.wikimedia.org/r/448999 [03:27:08] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 859.02 seconds [03:33:17] PROBLEM - puppet last run on mw2266 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz],File[/usr/share/GeoIP/GeoIPCity.dat.test] [03:36:38] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz] [03:44:18] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 273.14 seconds [04:02:18] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:03:58] RECOVERY - puppet last run on mw2266 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:31:37] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/logrotate.d/confd] [06:41:47] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:42:47] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [07:02:18] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:50:01] (03PS3) 10Thiemo Kreuz (WMDE): Do not leak local $wgWBShared… variables to the global scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632 [09:12:43] Krinkle: o/ - I think I found the issue, all details in the task.
Not sure exactly how to fix it yet though [09:30:42] PROBLEM - mysqld processes on pc2006 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [09:32:00] o/ [09:32:14] Checking that ^ [09:32:36] ack, I'm here too if needed [09:32:44] * volans|off too [09:33:51] server rebooted itself [09:33:54] I will create a task [09:35:06] ack, let us know if we can help [09:36:19] if anybody has time to help with a python issue for eventlogging, I'd be really glad :) [09:36:29] elukey: shoot [09:36:32] (Eventlogging is currently not working) [09:36:39] volans|off: https://phabricator.wikimedia.org/T200630 [09:36:47] I am trying to figure out how to overcome that [09:37:24] * volans|off reading [09:37:39] you can skip to the last bits - afaics it seems that a huge UA string causes our ua parser (https://github.com/ua-parser/uap-python) to hang miserably [09:38:08] https://phabricator.wikimedia.org/T200641 [09:38:35] elukey: py2 or py3? [09:38:53] py2 [09:39:38] ack, give me some minutes [09:40:13] if you want, on eventlog1002 there is a python process already hanging [09:40:17] under my username [09:40:29] no need, your repro is pretty straightforward [09:43:53] (03PS1) 10Marostegui: db-codfw.php: Depool pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449010 (https://phabricator.wikimedia.org/T200641) [09:45:29] checking pc2006 [09:46:31] uptime 19 minutes, checking hardware logs [09:47:06] jynus: https://phabricator.wikimedia.org/T200641 [09:47:15] jynus: Let's depool it and just check it on monday [09:47:25] oh, thanks [09:47:25] jynus: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/449010/ [09:51:49] (03PS2) 10Marostegui: db-codfw.php: Depool pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449010 (https://phabricator.wikimedia.org/T200641) [09:53:08] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449010 (https://phabricator.wikimedia.org/T200641) (owner: 10Marostegui) [09:54:24] (03Merged) 10jenkins-bot: db-codfw.php: Depool pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449010 (https://phabricator.wikimedia.org/T200641) (owner: 10Marostegui) [09:56:16] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool pc2006 T200641 (duration: 00m 57s) [09:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:21] T200641: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 [09:56:48] jynus: server depooled and silenced till tuesday, I am logging off :) [09:58:11] (03CR) 10jenkins-bot: db-codfw.php: Depool pc2006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449010 (https://phabricator.wikimedia.org/T200641) (owner: 10Marostegui)
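For context on the hang elukey describes at 09:37:39: uap-python matches user agents against a long list of regexes from uap-core, and the likely failure mode with a huge UA string is the regex engine backtracking for so long that the parse effectively never returns, which fits the 100% CPU, no-error symptom from the night before. A minimal sketch of one possible guard (a hypothetical length cap, not necessarily the fix that ended up being deployed):

```python
# Refuse to feed oversized user-agent strings to uap-python, treating them
# as unparseable instead of risking a pathological regex backtrack.
from ua_parser import user_agent_parser

MAX_UA_LENGTH = 800  # hypothetical cap, not a number from the task

def parse_ua(ua_string):
    """Parse a UA string, skipping inputs long enough to risk a regex hang."""
    if len(ua_string) > MAX_UA_LENGTH:
        return None  # caller treats this like any other unparseable UA
    return user_agent_parser.Parse(ua_string)
```

An alternative guard is a hard timeout around the parse call (e.g. signal.alarm), but that is awkward inside a multi-threaded consumer, so a cap on input length is the simpler defensive option.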
[10:26:34] hi, thumbor just paged and then recovered [10:26:38] no idea, it seems thumbor had a hiccup [10:26:40] I believe it timed out [10:26:56] yeah I just got the recovery [10:27:55] thumbor seems otherwise ok tho [10:29:30] I don't even see a significant error spike [10:31:04] indeed, I think icinga timed out while waiting for a response [10:31:16] to which T187765 is part of the fix [10:31:17] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [10:31:27] I'll log off again o/~ [10:32:32] same here. ttl [11:29:54] (03PS1) 10Matěj Suchánek: Update several Wikidata-related configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449017 [11:31:15] godog: FYI I never got thumbor pages, but I got pc2006 ones before :( [15:45:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0 [15:48:48] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 309 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:53:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 12 probes of 309 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [16:43:47] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100% [16:43:47] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:19] (03PS1) 10Legoktm: Add base stretch image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449032 [18:52:27] (03PS1) 10Legoktm: [WIP] Add php72 base and web images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318) [18:55:42] (03CR) 10Legoktm: [C: 04-1] "E: Unable to locate package toollabs-webservice" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm) [19:12:07] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 222.04 ms [19:12:07] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 229.41 ms [19:36:58] (03CR) 10Zhuyifei1999: Add base stretch image (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449032 (owner: 10Legoktm) [19:37:59] (03CR) 10Legoktm: Add base stretch image (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449032 (owner: 10Legoktm) [21:20:36] (03CR) 10Krinkle: [WIP] Add php72 base and web images (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318) (owner: 10Legoktm) [21:39:50] (03PS3) 10Krinkle: Enable Special:Block Feedback Request (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt) [21:40:33] (03CR) 10Krinkle: [C: 04-1] "Oustanding fixme https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMessages/+/444639/ - fix/backport before enabling." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt) [22:02:13] (03CR) 10Dbarratt: "> Patch Set 3: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt)