[00:00:20] 192555 [00:00:28] T192555* [00:00:29] T192555: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 [00:01:17] I see a list of User-Agents on T194380 that I bet came from our logstash system [00:01:18] T194380: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380 [00:02:30] boo. it looks like that was only being collected for a limited time [00:04:13] the problem is that is very tricky to get the header of the requests from inside the app itself. [00:04:46] agreed. its not really even a header. its just a flag that is exchanged in the TLS handshake [00:05:21] on windows i could just use wireshack and check the data if starts with 16 03 03 [00:05:29] but in linux... [00:07:33] MarioFinale: do you need to test if from Toolforge or your own linux host somewhere? [00:07:37] *test it [00:07:49] from toolforge [00:07:54] *nod* [00:10:02] I just looked at the source for Mono.Security.Protocol.Tls.SslClientStream. I don't see an obvious way to log or access the negotiated cypher suite from there. [00:10:36] nope, even in native .NET is tricky [00:12:20] Well... I will try creating a very small program that crashes if connects to tls 1.2-only service. [00:12:51] if fails to connect* [00:13:51] if it works my bot should do it also [00:14:20] i guess [00:14:24] MarioFinale: if you can't make that work, we could try to do a packet capture from inside Toolforge. You will need a root (like me) to help with that. Unfortunately I have to leave soon so I can't help you do that right now [00:14:48] No problem, I need to leave too. Thanks! [00:14:58] sure. sorry this isn't easy [00:15:16] No problem. :) [00:29:38] (03PS1) 10BryanDavis: Check for security status before reporting the title [labs/tools/stashbot] - 10https://gerrit.wikimedia.org/r/443011 (https://phabricator.wikimedia.org/T180081) [00:29:40] (03PS1) 10BryanDavis: Refactor security bug verification process [labs/tools/stashbot] - 10https://gerrit.wikimedia.org/r/443012 (https://phabricator.wikimedia.org/T180081) [00:30:34] (03CR) 10BryanDavis: [C: 032] Check for security status before reporting the title [labs/tools/stashbot] - 10https://gerrit.wikimedia.org/r/443011 (https://phabricator.wikimedia.org/T180081) (owner: 10BryanDavis) [00:30:42] (03CR) 10BryanDavis: [C: 032] Refactor security bug verification process [labs/tools/stashbot] - 10https://gerrit.wikimedia.org/r/443012 (https://phabricator.wikimedia.org/T180081) (owner: 10BryanDavis) [00:31:15] (03Merged) 10jenkins-bot: Check for security status before reporting the title [labs/tools/stashbot] - 10https://gerrit.wikimedia.org/r/443011 (https://phabricator.wikimedia.org/T180081) (owner: 10BryanDavis) [00:31:25] (03Merged) 10jenkins-bot: Refactor security bug verification process [labs/tools/stashbot] - 10https://gerrit.wikimedia.org/r/443012 (https://phabricator.wikimedia.org/T180081) (owner: 10BryanDavis) [13:18:44] !help Croptool is hanging. Before I restart the webservice, I wondered if perhaps someone could check if it's still the same problem as https://phabricator.wikimedia.org/T104799? At least there's lots of open connections: https://tools.wmflabs.org/croptool/server-statistics [13:18:44] danmichaelo: If you don't get a response in 15-30 minutes, please create a phabricator task -- https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projects=wmcs-team [13:24:31] danmichaelo: looking [13:27:25] zhuyifei1999_: Thanks. "netstat -a | grep croptool" reveals lots of connections in CONNECTING state on tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs, so I don't think it's the CLOSE_WAIT problem, but what can it be? Also possibly related: https://phabricator.wikimedia.org/T182070#4305541, but CPU usage seems to be low now [13:27:51] root@tools-webgrid-lighttpd-1413:~# lsof -p 2592 | grep TCP | wc -l [13:27:51] 187 [13:29:22] uh, are you looking for unix sockets or tcp sockets? [13:30:02] Ah, right, netstat -t, I'm a bit of a netstat noob :) [13:31:53] root@tools-webgrid-lighttpd-1413:~# ps uf -u tools.croptool [13:31:53] USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND [13:31:53] tools.c+ 2592 0.1 0.0 54208 5444 ? Ss Jun28 1:39 /usr/sbin/lighttpd -f /var/run/lighttpd/croptool -D [13:31:53] tools.c+ 2598 0.0 0.2 340556 20764 ? Ss Jun28 0:00 \_ /usr/bin/php-cgi [13:31:53] tools.c+ 2600 0.2 0.6 647472 51796 ? Sl Jun28 2:04 | \_ /usr/bin/php-cgi [13:31:53] tools.c+ 2601 0.2 0.7 656452 60020 ? Sl Jun28 2:19 | \_ /usr/bin/php-cgi [13:31:53] tools.c+ 2599 0.0 0.2 340556 20760 ? Ss Jun28 0:00 \_ /usr/bin/php-cgi [13:32:18] (oops, was supposed to send as a paste) [13:32:29] https://www.irccloud.com/pastebin/pj25JDLe/ [13:34:44] PID 2600 's fd 5 is php-cgi 2600 tools.croptool 5u REG 253,1 18 786738 /var/lib/php5/sess_7[REDACTED]4 (deleted) [13:35:06] (why is it locking this file?) [13:36:15] the same for PID 260{1,2,3} [13:36:23] danmichaelo: ^ [13:36:59] Hm, so the php processes are locked waiting for access to some session file? [13:37:23] oh wait, maybe PID 2603 is holding that lock [13:37:39] the strace is futex instead of flock [13:38:00] futex is thread synchronization. deadlock? [13:39:34] https://www.irccloud.com/pastebin/xBBucz3w/ [13:39:37] 2602 bt [13:40:14] https://www.irccloud.com/pastebin/Kf3rlR6T/ [13:40:19] 2603 bt [13:40:38] 2603 definitely has a larger stack [13:42:59] uh... is it really locked in malloc()? [13:44:38] weird [13:47:21] I don't think I have any idea what's going on. malloc shouldn't deadlock [13:50:14] At least it seems like there was several attempts to crop the same file: https://ctrlv.it/id/129957/448843925 (these are the last entries from the croptool log) [13:51:12] Wait, they are not the same, just similar :/ [13:51:17] yep, that function __GI___libc_malloc is glibc confirmed [13:53:19] Hm, so a PHP bug? [13:53:42] * zhuyifei1999_ is reading the source [13:54:01] (at least, trying to figuring out what version of glibc it is) [13:57:08] 2.19 [14:03:37] this lock is ` arena_lock (ar_ptr, bytes);` [14:10:58] zhuyifei1999_: This is unfortunately a bit beyond my competence, but is there more you need from the running processes or can I restart the webservice? [14:11:13] umm [14:11:27] can I make a core dump of the process? [14:11:50] Sure! [14:11:54] just so I can continue debugging (inspect its memory) [14:11:54] ok [14:12:06] It's already been down for many hours, so a few more minutes cannot hurt :D [14:12:27] Thanks a lot for looking into it btw.! [14:12:28] done. you can restart :) [14:12:41] great :) [14:13:48] I got the mutex id = 2 [14:56:41] danmichaelo: the more I look into it, the more I don't understand how it's locking [14:57:05] the mutex_t is defined in malloc_machine.c as an int [14:57:18] and then there's `# define mutex_lock(m) ((*(m) = 1), 0)` [14:57:59] that's only 'registering it's locked', doesn't even wait or anything [14:58:01] ok :) [14:58:18] so weird [14:59:09] and then the disassembly is so clear that a lock happened [15:00:36] I mean, ouch :) [15:14:45] I give up :) too messed up [15:15:23] one thing is for sure: everything's locked [15:15:41] zhuyifei1999_: you are not the first person to wander into php internals and wander out confused. ;) [15:15:54] it's glibc internals... [15:16:24] but still, how can this happen?! o.O [15:20:02] the last stack trace I see as a paste from you looks like -- php ran out of memory while trying to create a backtrace for an exception and then tried to start handling that OOM error when it hit the deadlock. [15:20:41] my guess is that xdebug's tracing is holding a non-reentrant lock [15:21:09] andre__: ummm [15:21:15] Is it possible to view Legobot logs somewhere? Do I need a special account or permissions? [15:22:53] jmcgnh: that sounds like a question for legoktm (that's his bot) [15:25:13] bd808:-- Ideally, yes, it's a question for legotkm, but they haven't responded elsewhere and their userpage says they are on vacation until 7 July. I was hoping that at least some knowledge of how things work was distributed around a bit so we're not completely dependent on individuals. [15:25:27] Cyberpower678: Context is helpful. "ummm" does not help. [15:25:54] andre__: I don't know how, but you managed to mass-subscribe me to pywikibot tasks. :/ [15:26:27] bd808: makes sense. libc itself is holding the lock [15:26:47] danmichaelo: fyi ^ [15:27:04] Cyberpower678, check the Herald transcript and find out how if you'd like to know how? :) [15:27:25] Oh you're right. It was Herald. Hmmmm... [15:27:40] Cyberpower678, https://phabricator.wikimedia.org/H247 says "Description contains IABot". [15:27:47] jmcgnh: legoktm is listed as the only maintainer of that tool. I took a very quick look into its ~/public_html and did not see any obvious log reader links or scripts [15:28:06] so malloc ran out of memory, grid sends php a sigint, php's signal handler gets called and tries to malloc again, non-reentrant [15:28:44] so it just deadlocks on itself [15:28:54] zhuyifei1999_: yeah, I think we could search the web a bit and find that this is a known problem in php 5.x error handling [15:29:23] andre__: yea, I need to fix that. Didn't realize iabot would overlap with pywikipediabot [15:29:37] bd808:-- Thank you very much. I did try to find that spot, based on the pywikibot instructions and found it seemed to be empty. [16:44:59] hm, is the *.wmflabs.org web proxy down or just me? [16:46:08] !log tools Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. T182070 [16:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [16:46:13] T182070: tools-webgrid-lighttpd have ~ 90 procs stuck at 100% CPU time (mostly tools.jembot) - https://phabricator.wikimedia.org/T182070 [16:46:57] brion: I can get to things I just tried. Is there a particular URL that is failing for you? [16:47:26] https://ogvjs-testing.wmflabs.org/wiki/Mandelbrot_set and https://media-streaming.wmflabs.org/clean/transcoded/a/a2/RED_4K_Video_of_Colorful_Liquid_in_Space.webm/RED_4K_Video_of_Colorful_Liquid_in_Space.webm.360p.webm for instance [16:47:54] the main page there works for me -- https://ogvjs-testing.wmflabs.org/wiki/Main_Page [16:48:00] hmm, curious [16:48:21] i can't even reach port 443 or 80 [16:48:25] so does the main page of https://media-streaming.wmflabs.org/ [16:48:56] * brion blames comcast [16:49:06] brion: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue ;) [16:49:24] \o/ [16:49:56] yeah its fine from my linode. sighhhhhhhh [16:50:25] I saw some chatter in another channel about AMS route issues, but nothing about our domestic POPs [16:51:59] I suppose Comcast could be having route issues to eqiad and few people would see that unless they are hitting things that are not behind varnish (like *.wmflabs.org) [16:52:35] can't reach eqiad or esams directly [16:52:50] (i can log in via ulsfo bastion) [16:54:41] brion: *nod* the folks who may be able to troubleshoot that would be in -operations. My core network fu is pretty much providing that wikitech troubleshooting link [16:54:42] i am seeing comcast issues reported by another web hoster too. [16:55:23] i'm filing an issue with details :) [16:56:12] brion http://downdetector.com/status/comcast-xfinity/map/ [16:57:17] funnnn [16:58:59] internet provider snow day \o/ [16:59:40] brion: just walk uphill both ways by tunneling through one of your VPS hosts ;) [17:02:25] i can vpn through the office too, helps sometimes [17:05:33] bd808: zhuyifei1999_: Interesting. I created https://phabricator.wikimedia.org/T198503 to track the issue. Still not sure what to do about it though. Would it perhaps make sense trying to move the tool to a separate VM with PHP 7? [17:06:10] figure out if there is a memory leak? [17:08:36] Problem is that users can try cropping very large files, I'm not sure if I can prevent PHP from sometimes running out of memory [17:08:37] brion https://twitter.com/comcastcares/status/1012744026095620097?ref_src=twsrc%5Etfw%7Ctwcamp%5Eembeddedtimeline%7Ctwterm%5E310747411813376004&ref_url=http%3A%2F%2Fdowndetector.com%2Fstatus%2Fcomcast-xfinity lol [17:09:37] I try to make an estimate based on the filesize and stop users from cropping extremely large files, but I'm not sure how exact I can do the estimate. [17:10:09] hehehe [17:11:46] !log tools Rescheduled jobs away from toole-exec-1404 where linkwatcher is currently stealing most of the CPU (T123121) [17:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [17:11:50] T123121: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121 [17:21:40] zhuyifei1999_: Ok, I added logging of memory usage at php startup, so at least we can see if that increases over time. [17:21:53] k [17:22:17] if you want I can copy the core dump to your home dir [17:22:51] in case you want to analyze the heap [17:24:38] It's not something I'm familiar with doing, but could always try [17:26:00] k [17:32:32] danmichaelo: ok running `pv -L 1m core.2603 > /data/project/croptool/core.2603` [17:35:01] done [17:37:43] made your tool owner and chmodded 400 [17:41:38] !log tools Rescheduling continuous jobs away from tools-exec-1408 where load is high [17:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [22:46:18] (03PS2) 10BryanDavis: Order maintainers by cn [labs/striker] - 10https://gerrit.wikimedia.org/r/421669 [22:46:20] (03PS3) 10BryanDavis: Update UI to use term "Wikimedia developer account" [labs/striker] - 10https://gerrit.wikimedia.org/r/421670 (https://phabricator.wikimedia.org/T190543) [22:46:22] (03PS1) 10BryanDavis: Temporarily replace requirements.txt with `pip --freeze` from prod [labs/striker] - 10https://gerrit.wikimedia.org/r/443203 [22:48:18] (03CR) 10jerkins-bot: [V: 04-1] Temporarily replace requirements.txt with `pip --freeze` from prod [labs/striker] - 10https://gerrit.wikimedia.org/r/443203 (owner: 10BryanDavis) [22:48:20] (03CR) 10jerkins-bot: [V: 04-1] Order maintainers by cn [labs/striker] - 10https://gerrit.wikimedia.org/r/421669 (owner: 10BryanDavis) [22:48:29] (03CR) 10jerkins-bot: [V: 04-1] Update UI to use term "Wikimedia developer account" [labs/striker] - 10https://gerrit.wikimedia.org/r/421670 (https://phabricator.wikimedia.org/T190543) (owner: 10BryanDavis) [22:50:11] * bd808 cries and goes to look at the tox-docker environment [23:11:21] (03CR) 10BryanDavis: "Iff97e7671ba9a1ab6041372cc3cb69c2242dfc7b attempts to fix the tox-docker job for this build." [labs/striker] - 10https://gerrit.wikimedia.org/r/443203 (owner: 10BryanDavis) [23:11:57] (03PS2) 10BryanDavis: Temporarily replace requirements.txt with `pip --freeze` from prod [labs/striker] - 10https://gerrit.wikimedia.org/r/443203 [23:13:53] (03CR) 10jerkins-bot: [V: 04-1] Temporarily replace requirements.txt with `pip --freeze` from prod [labs/striker] - 10https://gerrit.wikimedia.org/r/443203 (owner: 10BryanDavis) [23:16:13] (03PS3) 10BryanDavis: Temporarily replace requirements.txt with `pip --freeze` from prod [labs/striker] - 10https://gerrit.wikimedia.org/r/443203 (https://phabricator.wikimedia.org/T198076) [23:16:39] (03CR) 10jerkins-bot: [V: 04-1] Temporarily replace requirements.txt with `pip --freeze` from prod [labs/striker] - 10https://gerrit.wikimedia.org/r/443203 (https://phabricator.wikimedia.org/T198076) (owner: 10BryanDavis) [23:43:04] !log tools.wm-commons-emoji-bot Stopped webservice. It was in a continuous CrashBackoffLoop state [23:43:05] bd808: Unknown project "tools.wm-commons-emoji-bot" [23:43:09] lies [23:44:34] !log tools.stashbot test [23:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL [23:44:49] I wonder why it hates that other tool name? [23:48:43] !log tool.ores-support-checklist Stopped webservice due to CrashLoopBackOff failures. Caused by "ImportError: No module named 'yamlconf'" [23:48:44] bd808: Unknown project "tool.ores-support-checklist" [23:48:51] really stashbot? [23:49:12] !log tools.ores-support-checklist Stopped webservice due to CrashLoopBackOff failures. Caused by "ImportError: No module named 'yamlconf'" [23:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.ores-support-checklist/SAL [23:49:21] !log tools.wm-commons-emoji-bot Stopped webservice. It was in a continuous CrashBackoffLoop state [23:49:21] bd808: Unknown project "tools.wm-commons-emoji-bot" [23:50:51] !log tools.interactoa Stopped webservice due to CrashLoopBackOff failures. Caused by "ImportError: No module named 'wikidataintegrator'" [23:50:51] bd808: Unknown project "tools.interactoa" [23:52:35] !log tools.interactoa Stopped webservice due to CrashLoopBackOff failures. Caused by "ImportError: No module named 'wikidataintegrator'" [23:52:44] !log tools.interactoa Stopped webservice due to CrashLoopBackOff failures. Caused by "ImportError: No module named 'wikidataintegrator'" [23:52:45] bd808: Unknown project "tools.interactoa"