[00:03:35] This seems like a bit of a bug - the toolforge ascii art on the banner is duplicated - https://cdn.discordapp.com/attachments/563024520101888010/695060022861889536/SPOILER_unknown.png
[00:17:26] https://usercontent.irccloud-cdn.com/file/dMegQIKd/Screenshot_2020-04-01_19-17-07.png
[00:17:34] DSquirrelGM: can't reproduce
[00:24:13] hmm, did it a couple times, then back to normal, idk...
[00:25:08] may have been just a connection hiccup, hard to tell
[00:25:41] wasn't any delay though
[00:36:34] DSquirrelGM: https://github.com/wikimedia/puppet/blob/97971c6d8d54fcfb151693943a5770b316bfab8f/modules/profile/files/toolforge/40-tools-bastion-banner.sh <= this is the script that prints it
[00:37:01] looks unlikely it'll repeat a part for no reason
[00:37:23] I'm more inclined to suspect it's something with windows terminal
[00:38:32] that was the first time I had seen that happen, and for it to happen twice in a row, was rather odd
[00:47:37] anyway, for the section on php, do you think it might be worth mentioning testing for and setting the location via a redirect header on: https://wikitech.wikimedia.org/wiki/News/Toolforge.org
[00:58:25] think you'd better delete and revdel this - keys should not be posted in public - https://wikitech.wikimedia.org/wiki/User:Wolfgang_Kandek
[01:03:34] still here bd808 ? ^
[01:05:04] Those are public keys.
[01:05:05] Nothing at all scary about that
[01:05:53] DsquirrelGM: if they were secrets, you just did the literal worst job of responsible disclosure that you could do
[01:06:14] I know they're public keys
[01:06:40] just saying I don't think it's a good idea to have them posted
[01:06:52] Why?
[01:08:31] If a private key can be derived from the public key then all keys of that cipher suite in the world are compromised
[01:20:50] got to go for now - it just seems that it was a matter of them maybe misreading instructions.
[08:16:28] anyone to check a network issue
[08:53:23] RhinosF1: maybe? can you describe the issue?
[08:54:18] IRC bot hosted on Toolforge disconnecting every few days due to what seems to be network hiccups. Never used to do this.
[08:54:39] https://phabricator.wikimedia.org/T248960
[09:08:13] RhinosF1: we are going to need a bit more evidence / tests showing that the issue is on the cloudvps side
[09:08:42] arturo: what tests would you expect?
[09:09:09] I'm stuck beyond knowing that the bot often times out on network stuff
[09:09:50] some log entries on the bot side. I can only see IRC logs of the bot reconnecting, which can be due to many different reasons.
[09:10:02] we need to know the particular reason the bot times out
[09:12:52] I can see "Server timeout detected after 180s; closing."
[09:21:22] what is server timeout? under what circumstance is that triggered?
[09:23:09] also the pastes in that ticket are not visible to me
[09:23:16] When it doesn't receive anything from the network in so long
[09:23:28] zhuyifei1999_: they won't be, they're marked private
[09:24:38] When it doesn't receive anything from the network in so long <= is it expected to always receive messages under normal conditions? like, could it timeout simply because there are no messages?
[09:25:00] looking at the log, I think it's had a ping message from Freenode and then not returned a pong before timing out
[09:25:23] so I assume it's the sending of the PONG that timed out.
[09:25:34] they won't be, they're marked private <= so they are not useful for me then :)
[09:25:39] zhuyifei1999_: it's IRC, there is a PING/PONG every so often
[09:26:59] https://tools.ietf.org/html/rfc2812#section-3.7.2
[09:27:35] https://www.irccloud.com/pastebin/xE72wnnX/
[09:27:38] server would send a ping to client if the client doesn't do anything
[09:27:38] zhuyifei1999_: ^
[09:28:04] it did, the pong never went and then the bot decided there was a server timeout
[09:28:16] so in theory if the client keeps sending messages the server does not have to ping client
[09:28:46] so you mean the client is pinging server but the server is not pong-ing?
[09:29:06] zhuyifei1999_: server pinged us twice, we never responded.
[09:29:12] we normally do
[09:29:49] I think
[09:30:01] Logs for that are on that paste
[09:30:27] zhuyifei1999_: no, you're right. We ping Freenode and it doesn't return the message
[09:30:55] if the irc server pings the client and the client does not pong after 180 seconds the server would kill the connection with a 'ping timeout' message, not remote disconnect message
[09:31:06] the other 2 disconnects on the task (with no paste) simply had no logs
[09:31:34] zhuyifei1999_: I said it's your way round. Client pings server, Client gets no response, Client assumes server died.
[09:32:13] is the source code of ZppixBot public anywhere?
[09:32:33] zhuyifei1999_: it is, https://github.com/Pix1234/ZppixBot-Source
[09:40:06] RhinosF1: can I directly look at the log file on toolforge? which file is it?
[09:40:17] zhuyifei1999_: sftp://login.tools.wmflabs.org/mnt/nfs/labstore-secondary-tools-project/zppixbot-test/.sopel/logs/default.raw.log.2020-04-01
[09:40:25] k
[09:41:36] zhuyifei1999_, arturo: I've checked with sopel devs and they say no other user has reported issues using same software and it's new to us so it implies a network issue.
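A minimal sketch of the client-side check described at 09:31:34 ("Client pings server, Client gets no response, Client assumes server died"). This is an illustration under stated assumptions, not sopel's actual code: the class `TimeoutWatchdog`, its method names, and the 120-second ping threshold are hypothetical; only the 180-second figure comes from the "Server timeout detected after 180s" log message.

```python
from dataclasses import dataclass


@dataclass
class TimeoutWatchdog:
    """Decide what an IRC client should do based on how long the server has been silent.

    All names and the ping threshold are hypothetical; only the 180s
    timeout is taken from the log message quoted in the channel.
    """
    ping_after: float = 120.0   # assumed: silence before the client sends its own PING
    timeout: float = 180.0      # silence after which the client gives up
    last_received: float = 0.0  # time (seconds) of the last data from the server

    def on_data(self, now: float) -> None:
        # Any line from the server (a PONG, a PING, a channel message) resets the clock.
        self.last_received = now

    def check(self, now: float) -> str:
        silence = now - self.last_received
        if silence >= self.timeout:
            return "close"  # client assumes the server died
        if silence >= self.ping_after:
            return "ping"   # client probes the server and waits for a PONG
        return "ok"


watchdog = TimeoutWatchdog()
watchdog.on_data(0.0)
for t in (60.0, 120.0, 180.0):
    print(t, watchdog.check(t))
```

With a check like this, the client closes the connection whenever nothing at all arrives for the full timeout, whether the cause is a lost PONG, a network stall, or a client thread that stopped reading the socket. That is why the log line alone cannot pin down the cause.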
[09:43:06] so last successful ping was 06:46:56,267 with response at 06:46:56,331, two pings at 06:48:56,395 and 06:49:56,457 and disconnect at 06:49:56,468 hmm
[09:43:21] yep
[09:43:38] "it's new" could have multiple meanings, so we debug
[09:44:23] March 25th was the first time it happened
[09:44:51] zhuyifei1999_: that's what I'm here for, help debugging
[09:45:22] (though, considering that other bots like wm-bot and stashbot did not fail recently, I'm not very inclined to believe network issue is the sole contributor)
[09:45:29] yeah, I'm looking
[09:45:44] march 25... checking SAL
[09:47:34] (don't see anything that stands out)
[09:54:14] ok I see how sopel is doing this
[09:57:16] RhinosF1: this is running on k8s right?
[09:57:23] I'm going to strace it
[09:57:29] zhuyifei1999_: kubectl thing, yes
[10:07:42] RhinosF1: ok I'm stracing
[10:07:49] will note that in the ticket
[10:08:24] zhuyifei1999_: on which tool or both
[10:08:46] zppixbot and zppixbot-test have same issue
[10:09:06] zppixbot-test
[10:09:15] cool
[10:11:35] !log tools.zppixbot-test approved zhuyifei1999_ to debug network issues - T248960
[10:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot-test/SAL
[10:11:38] T248960: Debug random ZppixBot(-test) restarts - https://phabricator.wikimedia.org/T248960
[10:12:10] wait this needed approval? okay. I thought I just jumped in lol
[10:12:36] zhuyifei1999_: it’s just so people know who asked for the help and don’t go wondering
[10:13:16] ok
[10:14:01] anyways, I gotta go to sleep. the strace is relatively quiet so I think I'll see the exact lines where it fails
[10:14:21] and please ping me when you see it fail
[10:14:34] zhuyifei1999_: I will
[10:14:45] I don’t always notice straight away though
[10:16:15] zhuyifei1999_: is it best to run on tools.zppixbot as well. Knowing my luck, the next one will be on that tool. It’s random.
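The intervals between the timestamps quoted at 09:43:06 can be checked with a few lines of Python. This is only a sketch: the helper `seconds_between` and the variable names are illustrative, the timestamps are copied from that message, and the format string assumes the log's `HH:MM:SS,mmm` layout.

```python
from datetime import datetime

# Assumed layout of the log timestamps: hours:minutes:seconds,milliseconds
FMT = "%H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds from `start` to `end` (same day assumed)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps copied from the 09:43:06 message.
last_ping = "06:46:56,267"
last_response = "06:46:56,331"
unanswered_pings = ["06:48:56,395", "06:49:56,457"]
disconnect = "06:49:56,468"

print("round trip of last answered ping:", seconds_between(last_ping, last_response))
print("last ping to first unanswered ping:", seconds_between(last_ping, unanswered_pings[0]))
print("between the two unanswered pings:", seconds_between(*unanswered_pings))
print("last response to disconnect:", seconds_between(last_response, disconnect))
```

The gap between the last response and the disconnect comes out just over 180 seconds, which matches the "Server timeout detected after 180s" message quoted earlier.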
[10:16:43] ok one sec
[10:17:16] !log tools.zppixbot
[10:17:16] RhinosF1: Missing project or message? Expected !log
[10:17:51] !log tools.zppixbot strace requested on zppixbot tool as well - T248960
[10:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.zppixbot/SAL
[10:17:59] T248960: Debug random ZppixBot(-test) restarts - https://phabricator.wikimedia.org/T248960
[10:19:44] ok, stracing
[10:20:02] I would love to trace on select as well, but that generates way too many logs
[10:20:12] the select syscall
[10:21:33] Okay
[10:21:38] Thx for the help
[10:26:20] np
[10:26:44] Now we wait :)
[10:30:34] the more I think about it, the more I think I would need the select on main thread
[10:30:53] argh
[10:31:46] in case it's a main thread deadlock or something
[10:32:04] because main thread is the thread that is receiving and processing messages
[10:33:04] thanks zhuyifei1999_ :-)
[10:33:10] np
[10:34:06] though, given that there are no logs appearing, "in case it's a main thread deadlock or something" is unlikely
[10:34:56] I assume this is running on k8s?
[10:35:13] yes
[10:35:14] I wonder if the new resource limits are hitting this tool
[10:35:34] what resource limits?
[10:35:47] we have now resource limiting / quotas in the new kubernetes cluster
[10:36:28] yeah? but it no network limits right? even if CPU is delaied for a few secongds it should not fail a ping for 180 secs
[10:36:32] arturo: how would we know?
[10:36:40] *delayed
[10:36:43] *seconds
[10:36:50] RhinosF1: the grafana dashboard should tell
[10:37:02] check the k8s-status tool for the links and other information regarding your tool
[10:37:17] https://grafana-labs.wikimedia.org/d/toolforge-k8s-namespace-resources/kubernetes-namespace-resources?orgId=1&var-namespace=tool-zppixbot&from=now-2d&to=now&refresh=5m
[10:37:17] let me find the link
[10:38:18] by looking at the graphs, you don't seem to be affected by resource limits :-)
[10:40:20] arturo: tools.wmflabs.org/k8s-status is slow and https://tools.wmflabs.org/k8s-status/nodes/ is an Internal Server Error
[10:42:45] RhinosF1: you are right, will let the developer know later today
[10:43:00] arturo: any other things we can check?
[10:44:50] yes I can check other stuff, but I will let yifei's tests run first to discard an issue in the source code before diving deeper here
[10:45:05] cool
[10:48:00] https://github.com/sopel-irc/sopel/blob/74f6f4d05418e4bd2e041167bab7eb3c2a6d7e50/sopel/irc/backends.py#L146 this looks very fragile to me. I can construct a message to have this bot ignore the message, but that should be temporary and won't affect future messages....
[10:48:44] like, it wouldn't be sufficient to kill the bot to produce the described issue
[10:49:27] zhuyifei1999_: that shouldn't be related though because per the raw log I linked you to, It's only a ping being received.
[10:49:45] You are free to raise issues upstream though
[10:50:20] it's a pong not being received
[10:51:21] I'm not too interested in raising the issue upstream unless it is at least somewhat exploitable. this doesn't look so
[10:51:44] in 50% of cases, we've seen 2 where Freenode never got a PING as well
[10:51:57] you can see on the task
[10:52:51] 3 disconnects, 1 timeout but connection recovered before disconnect, 2 from sopel not ponging freenode, 2 from server not ponging sopel
[10:53:02] what do you mean by freenode never got a ping?
[10:53:42] zhuyifei1999_: freenode never got a PONG back
[10:54:09] as in, server never sent a PONG to the client? that's the case we are looking at right?
[10:54:19] A ping can be sent by either client or server, both respond in the same way
[10:54:48] zhuyifei1999_: yesterday was PONG not sent to client but we've also had PONG not sent to server twice
[10:54:55] ok
[10:55:36] so server kills the bot with 'ping timeout' message?
[10:56:01] zhuyifei1999_: yes, see https://phabricator.wikimedia.org/T248960#6013917
[10:56:30] k
[10:56:47] that shows the date and time of every incident - raw logs are available after 25/03 at ~12:00
[10:56:50] anyways, gotta zzz.
[10:56:56] night
[10:57:00] ok, will check that
[11:00:37] RhinosF1: /data/project/zppixbot/.sopel/logs/default.raw.log.2020-03-25 ENOENT
[11:00:54] I mean, no such file
[11:01:01] zhuyifei1999_: I zipped some
[11:01:33] ok, will look later then. too sleepy
[11:01:47] * RhinosF1 will leave link
[11:07:54] sftp://login.tools.wmflabs.org/mnt/nfs/labstore-secondary-tools-project/zppixbot/.sopel/logs/old-logs.tar.gz
[17:18:41] Hello here
[17:19:55] I have just installed minikube, and wanted to start contributing to Paws project
[17:20:25] What next should I do
[20:53:25] !log admin codfw1dev clear VM error states and start bastions, puppet master and database
[20:53:56] jeh: Failed to log message to wiki. Somebody should check the error logs.
[20:58:57] * jeh finds stash.bot timed out to wikitech on 443
[20:59:01] !log admin codfw1dev clear VM error states and start bastions, puppet master and database
[20:59:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[23:41:32] !log tools.copyvios Increased limits.cpu and requests.cpu to 4 (T245426)
[23:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.copyvios/SAL
[23:41:35] T245426: Earwig's copyvio tool 504 gateway time-out issues - https://phabricator.wikimedia.org/T245426