[00:25:19] Krinkle, ur guc is not working .. [00:30:32] infact a few global tools are failing..change in API? [01:02:03] Coren: you still around? [02:30:27] still getting 502's .. [02:30:35] andrewbogott_afk: Coren (Cannot contact the database server: Can't connect to MySQL server on '10.68.16.193' (111) (10.68.16.193)) [02:30:41] http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [02:31:54] greg-g: since I can't sleep, might as well debug... [02:31:56] greg-g: hmm, works for me? [02:32:36] not for me nor many others: https://bugzilla.wikimedia.org/show_bug.cgi?id=71764 [02:33:06] wth is core-n's email in bugzilla? [02:33:24] greg-g: try 'mpel*'? [02:33:37] found it [02:33:49] no, marc@uberbox [02:33:59] ah [02:34:13] 3Wikimedia Labs / 3deployment-prep (beta): All page loads fail with "DB connection error: Can't connect to MySQL server on '10.68.16.193' (4) (10.68.16.193)" - 10https://bugzilla.wikimedia.org/71764 (10Greg Grossmeier) [02:35:10] greg-g: ah, I see. works when you're logged out [02:35:21] greg-g: or basically if you're getting it from cache [02:35:28] attempting to login itself fails [02:39:28] 3Wikimedia Labs / 3deployment-prep (beta): Can't connect to MySQL on deployment-db1 - 10https://bugzilla.wikimedia.org/71764 (10Alex Monk) [02:39:34] greg-g: fixed [02:39:58] 3Wikimedia Labs / 3deployment-prep (beta): Can't connect to MySQL on deployment-db1 - 10https://bugzilla.wikimedia.org/71764#c3 (10Greg Grossmeier) All browser tests will fail, all users/developers will complain. Marc/Andrew/Yuvi: Anything going on that might explain this? Any changes from production that m... [02:40:09] greg-g: responding on the bug now [02:40:13] thanks [02:40:29] 3Wikimedia Labs / 3deployment-prep (beta): Can't connect to MySQL on deployment-db1 - 10https://bugzilla.wikimedia.org/71764#c4 (10Greg Grossmeier) (due to a mid-air collision, re-adding Sean. 
Sean, see above) [02:41:25] brb [02:42:15] 3Wikimedia Labs / 3deployment-prep (beta): Can't connect to MySQL on deployment-db1 - 10https://bugzilla.wikimedia.org/71764#c5 (10Yuvi Panda) 5NEW>3RESO/FIX So... There was a 'rogue' mysql server on deployment-db1, which wasn't being recognized by the init script for... some reason. I killed it manuall... [02:42:27] I tried to restart mysql on there a few minutes ago but found I didn't have sudo. [02:42:51] "This incident will be reported.", etc. [02:42:57] Krenair: ah, hmm. [02:43:09] Krenair: I can grant you sudo if greg-g is ok with it (or someone else 'responsible for betalabs' is) [02:43:30] I probably won't use it much. [02:44:11] Krenair: heh :) [02:44:28] Anyway, I'm going to sleep. thanks for fixing it YuviPanda [02:44:36] Krenair: yw. [02:50:21] thanks a ton, YuviPanda [02:50:28] 3Wikimedia Labs / 3deployment-prep (beta): Can't connect to MySQL on deployment-db1 - 10https://bugzilla.wikimedia.org/71764#c6 (10Greg Grossmeier) Thanks Yuvi! (Sorry for the pings Marc/Andrew/Sean :) ) [02:50:38] greg-g: :) thank my insomnia! :) [02:50:42] it's 8:20AM and I can't seem to sleep... [02:50:45] eek [02:51:04] and there's no power as well, so yay! [02:51:04] hope you can sleep at some point :/ [03:00:27] greg-g: yeah, I hope so too... 
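The fix above — a 'rogue' mysql server that the init script didn't recognize — usually means the PID of the running mysqld no longer matches what the pidfile records, so `service mysql stop/restart` does nothing to it and a manual kill is needed. A minimal sketch of that check (the pidfile path and the idea of feeding it PIDs from `pgrep -x mysqld` are assumptions for illustration, not details from the log):

```python
def read_pidfile(path="/var/run/mysqld/mysqld.pid"):
    """Read the PID the init script believes is running, or None.

    The path is a typical Debian/Ubuntu default, assumed here.
    """
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def rogue_mysqld_pids(mysqld_pids, pidfile_pid):
    """Return PIDs of mysqld processes the init script is not tracking.

    mysqld_pids: all running mysqld PIDs (e.g. from `pgrep -x mysqld`);
    pidfile_pid: the PID from the pidfile, or None if it is missing.
    """
    return sorted(p for p in mysqld_pids if p != pidfile_pid)
```

Any PID this returns is invisible to the init script's stop/restart logic, which matches the "killed it manually" resolution in the bug comment.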
[03:00:50] greg-g: also, I'm owed at least 2 drinks at some point in the future for this and previous beta labs features :) [03:01:24] at least while I continue to do so before I move to ops :) [03:34:24] YuviPanda: noted :) [05:26:18] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/a966dfe55009a2dcd786db86e7e12e93901a6abe [05:26:19] 13nagf/06master 14a966dfe 15Timo Tijhof: Update urls from github.com/wikimedia instead of github.com/Krinkle [05:33:00] [13nagf] 15Krinkle 04deleted 06travis at 14d5a7949: 02https://github.com/wikimedia/nagf/commit/d5a7949 [05:48:53] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/2046c375e229f4cbad9253e1aa2efbd33710968d [05:48:54] 13nagf/06master 142046c37 15Timo Tijhof: Offset headers against fixed header... [06:13:55] yes you can delete etherpad-matanya andrewbogott_afk [06:39:55] should everything work again at tool labs? for me some things are missing ("jlocal: command not found"; "exec: /usr/local/bin/portgrabber: not found") [06:56:05] the webservices of Magnus are not functioning properly ie Reasonator, Toolscript etc [07:05:34] anyone who can restart webservices for the tools of Magnus ?? [07:05:37] please ?? [07:15:20] 502 Bad Gateway https://tools.wmflabs.org/bub/ [07:18:42] virt1007 seems rather unhappy [07:23:01] Thanks Nemo_bis [07:23:25] It's not like I helped :P [07:23:31] yes you did [07:23:40] you tried and told me about it [07:23:46] makes me feel less alone [07:23:51] :) [07:24:42] there is a mail about virt1007 being overexstended ... new hardware is needed [07:25:02] and as can be expected (no irony intended) it comes too late [07:25:09] Labs is popular [07:25:41] Oh. Till few days we were told the servers were not overloaded at all [07:25:47] * few days ago [07:29:44] There are more and more data entered from external sources.. more and more queries are done against Wikidata as well [07:30:04] that is ONLY Wikidata.. 
the rest is not less active as far as I understand [07:32:15] Dunno. We have no stats about Tools usage. Last time we had, it was still at some 5 % of Toolserver's hits (ballpark figure blabla) [07:36:25] as a whole it is not that relevant.. it is servers that die because they are overextended [07:36:46] Toolserver is not labs [07:38:47] andrewbogott_afk, Coren: speaking of outages; virt1009 has a failed disk, I think it's like that for quite a while [07:39:32] andrewbogott_afk, Coren: Sep 19th specifically [07:57:59] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (30.00%) [07:58:39] (03PS1) 10Gilles: Add ImageMetrics to the -multimedia channel [labs/tools/pywikibugs] - 10https://gerrit.wikimedia.org/r/165438 [08:43:57] Bad Gateway [08:43:59] Code 502 [08:44:03] for geohack apparently [08:44:26] and quite a few others... [08:44:31] 3Wikimedia Labs / 3deployment-prep (beta): Jenkins can not ssh to deployment-cxserver01 - 10https://bugzilla.wikimedia.org/71783 (10Antoine "hashar" Musso) 3NEW p:3Unprio s:3normal a:3None The Jenkins master on gallium is unable to connect to the deployment-cxserver01.eqiad.wmflabs instance to update... [08:47:28] 3Wikimedia Labs / 3deployment-prep (beta): Jenkins can not ssh to deployment-cxserver01 - 10https://bugzilla.wikimedia.org/71783#c1 (10Antoine "hashar" Musso) The virt1005 compute node died overnight, might explain the issue. [08:48:52] thedj: some labs server had an issue over the night (girt1005) [08:48:55] virt1005 [08:49:11] so maybe the tool ran on instance hosted on that virt1005 machine [08:54:13] 3Wikimedia Labs / 3Infrastructure: Jenkins can not ssh to deployment-cxserver01 (hosted by virt1005) - 10https://bugzilla.wikimedia.org/71783#c2 (10Antoine "hashar" Musso) The instance is hosted on virt1005 which died overnight. I have marked the Jenkins slave as offline: https://integration.wikimedia.org/c... 
[08:54:43] 3Wikimedia Labs / 3Infrastructure: Jenkins can not ssh to deployment-cxserver01 (hosted by virt1005) - 10https://bugzilla.wikimedia.org/71783#c3 (10Antoine "hashar" Musso) Link to instance informations: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000421.eqiad.wmflabs [09:34:38] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK [10:56:40] I've trouble to use the internal proxy, from commandline on webgrid-04 it works but from php I get a 403 [10:56:55] wget "http://tools-webproxy/phetools/hocr_cgi.py?cmd=ping" <-- fine [10:57:00] php -r "file_get_contents('http://tools-webproxy/phetools/hocr_cgi.py?cmd=ping');" [10:57:05] PHP Warning: file_get_contents(http://tools-webproxy/phetools/hocr_cgi.py?cmd=ping): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden [10:57:05] in Command line code on line 1 [11:03:46] phe: blocked by user agent, maybe? [11:09:18] hmm, I use the default UA from php, and it was working this way before yesterday outage [11:14:17] valhallasw`cloud, right, the internal proxy no longer accept an empty user agent [11:29:13] 3Wikimedia Labs / 3deployment-prep (beta): Mobile redirect goes to wrong domain name on beta labs - 10https://bugzilla.wikimedia.org/71079#c3 (10Andre Klapper) Aude: Could you answer comment 2, please? [11:43:43] 3Wikimedia Labs / 3deployment-prep (beta): Mobile redirect goes to wrong domain name on beta labs - 10https://bugzilla.wikimedia.org/71079#c4 (10Aude) @andre I really don't know the setup for the mobile domains and probably can't look soon. Suggest to ask Jan Zerebecki (from our team) [11:46:58] 3Wikimedia Labs / 3deployment-prep (beta): Mobile redirect goes to wrong domain name on beta labs - 10https://bugzilla.wikimedia.org/71079#c5 (10Aude) the main page being broken was a different issue of wikibase being broken due to memcached issues and the main page connected to wikidata beta. I fixed that... [12:57:38] Still outage? 
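The 403 phe hits above comes from the proxy rejecting requests that carry no User-Agent — PHP's `file_get_contents` sends none by default, while `wget` always sends one. The workaround is simply to set the header explicitly; sketched here in Python against the same endpoint quoted in the log (the UA string itself is a made-up example — any non-empty, descriptive value should do):

```python
import urllib.request

# The proxy rejects requests with an empty User-Agent, so set one
# explicitly. Endpoint is the one quoted in the log above.
url = "http://tools-webproxy/phetools/hocr_cgi.py?cmd=ping"
req = urllib.request.Request(
    url,
    headers={"User-Agent": "phetools-client/1.0"},  # hypothetical UA string
)
# urllib.request.urlopen(req)  # would perform the actual request
print(req.get_header("User-agent"))  # -> phetools-client/1.0
```

In PHP the equivalent is passing a stream context to `file_get_contents` with a `'user_agent'` option under the `'http'` wrapper, or setting the `user_agent` ini value.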
[13:07:28] phe: it hasn't accepted an empty user agent for quite a while, I think [13:08:20] Josve05a|cloud: individual tools may need to be restarted [13:08:34] thedj: I just restarted geohack [13:08:47] Coren: no idea why bigbrother didn't restart a bunch of tools [13:09:03] Coren: perhaps we should do a general restart? [13:09:07] they just are serving blank pages... [13:17:48] * Coren ponders. [13:19:02] bigbrother may well have spent all three restart attempts during the outage; it's meant to help failing tools, not failing hardware. [13:23:40] Coren: yeah. I wonder if a general restart of everything on the webgrid might be a good idea. [13:23:44] we probably should do that [13:24:34] Hmm. -webproxy was also down. 10:1 the issue comes from /that/. [13:24:57] Anything restarting while its down would never have gotten its entry back in redis. [13:26:25] But yeah, might be best. Lemme make a quick script to do so gradually. [13:27:10] Coren: ah, that's true as well. [13:27:11] Coren: ok [13:27:21] Coren: put it on puppet as well, so we can use it next time if needed? [13:27:24] * YuviPanda should read up on grid more [13:27:45] -webgrid is a spof. I wonder if we can add some redundance for it? [13:27:59] I mean -webproxy [13:30:38] Coren: yeah, we totally can. [13:30:49] Coren: easy enough to run another instance, and just make it a redis slave. [13:31:03] and use DNS round robin, which should be *ok* here [13:31:28] We'll think about it later. [13:31:31] yeah [13:32:04] I can only restart jobs that use the standard webservice though. [13:39:35] Started. It's a really ugly script and it does them one at a time so it'll take some time. [13:43:07] Coren: ok [13:46:03] Coren one "bug" I discovered recently is kinda interesting. If your error.log is deleted while the webservice is running any script that outputs to error.log stops functioning. Im surprised it just doesnt re-create the file and keep chugging along [13:46:41] Betacommand: That's... huh. 
I'm not even sure how that'd work. [13:47:43] Coren: ? [13:49:20] Betacommand: Well, that'd require the scripts to open the log file without creation; which really atypical for logfiles. [13:51:07] Coren: Ive got a script that uses sderr for some outout. On the webservers its redirected/set to error.log If for some reason the file gets deleted, the script is given an invalid reference to a file [13:51:19] *output [13:51:45] and thus crashes [13:51:50] Coren: maybe lighttpd keeps the file open instead? [13:51:56] or rather... sge [13:52:40] had some fun trying to debug that [13:53:01] and then add some nfs magic to the mix [13:53:15] If you restart the web service it fixes it. [13:54:53] Hmm. That'd mean that lighttpd never reopens its error log. [13:55:03] Or at least, that it doesn't do so automatically. [13:55:17] It has /got/ to have an interface for log rotation though - probably a signal. [13:55:41] Coren: aaaah, it didn't last time I checked [13:56:05] Coren: you need to do a 'reload' [13:56:14] Eeeew. [13:56:27] huh, some sources say HUP might work, some say it is supposed to but often doesn't in practice [13:56:44] "some sources". Heh. {{cn}} [13:56:52] Ima read the source. Works all the time. [13:57:09] see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=380080 [13:57:19] oh, it's from 2006 :| [13:57:22] maybe fixed in the meantime [13:58:22] (03CR) 10Gergő Tisza: [C: 031] Add ImageMetrics to the -multimedia channel [labs/tools/pywikibugs] - 10https://gerrit.wikimedia.org/r/165438 (owner: 10Gilles) [14:00:37] Coren: I always find the fun bugs [14:18:41] Coren: Krinkle wrote this the other day, to provide ganglia'ish views for graphite... 
https://tools.wmflabs.org/nagf/?project=tools [14:27:33] YuviPanda, empty ua to -webproxy worked 36 hours ago, and it always worked this way [14:28:02] phe: ah, I suspect the nginx restart I did at that time (to block TweetmemeBot) also applied the (dormant) code to block empty UA [14:29:01] ok, I'll workaround that, I don't see the point to block empty ua for an internal proxy, but ok [14:34:59] phe: it's not an 'internal proxy', since it's also the same thing accessible from the internet [14:35:24] it's also accessible from inside tools, but it's primarily used from the internet... [14:36:38] YuviPanda, I'm talking about http://tools-webproxy/phetools/hocr_cgi.py?cmd=ping [14:36:49] phe: it's the same as tools.wmflabs.org [14:37:04] not reachable from my home [14:37:23] tools.wmflabs.org and tools-webproxy are the same host, served by the same services. [14:37:42] so if you replace tools-webproxy with tools.wmflabs.org, you'll get exactly the same results [15:29:09] Coren: I think that labs has hit some kind of connection limit with freenode… Danny mentioned having trouble last night and now I'm getting the same error (when proxying via labs) [15:29:22] Are you in touch with whoever manages those limits? [15:29:41] I am, but where are you proxying /from/? [15:30:06] Because there are different limits for different things. [15:33:50] Coren: in this case it would've been from util-abogott [15:33:58] In the 'testlabs' project [15:34:13] In Danny's case, from the 'wildcat' project I believe. [15:34:25] …if that's what you meant [15:37:20] That is what I meant. Outgoing from instances without public IPs are all lumped together and can't have identd - there is a stringent limit on that which freenode doesn't like raising. [15:37:49] But also, I'm pretty sure we don't *want* random irc clients from all over labs. [15:38:33] That is why exec nodes have IPs and identd running; this way bots can connect and have proper ident. 
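The deleted-error.log behaviour Betacommand and Coren puzzled over earlier (around 13:46) is ordinary POSIX semantics: a process holding a file open keeps writing to the now-unlinked inode, and the path is never recreated until something reopens it — which is why restarting the webservice fixes it. A self-contained demonstration:

```python
import os
import tempfile

# A webserver keeps its error.log open; deleting the file does not make
# the writer recreate it -- writes keep going to the unlinked inode.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "error.log")

log = open(path, "a")
log.write("before delete\n")
log.flush()

os.unlink(path)              # someone deletes error.log out from under it
log.write("after delete\n")  # still succeeds, into the now-unlinked inode
log.flush()

recreated = os.path.exists(path)  # False: only reopening recreates the path
log.close()
print(recreated)
```

On a local filesystem the writes are merely lost; on NFS (part of the mix here, as noted in the log) a deleted file can instead turn into a stale handle and make writes fail outright, which would match the crash Betacommand describes. Either way, reopening the log — e.g. via `webservice restart` or lighttpd's log-rotation reload — recreates the path.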
[15:40:46] ok… well, maybe it's time for me to find another bouncer, anyway. [15:40:58] is NovaProxy not working for others either? [15:41:09] milimetric: can you be more specific? [15:41:12] I saw my proxies there yesterday, now it's not showing any [15:41:17] https://wikitech.wikimedia.org/wiki/Special:NovaProxy [15:41:24] milimetric: this is a running gag this week -- log out and in, they should reappear. [15:41:31] i did that, knowing the gag [15:41:34] and still - no proxies :) [15:41:56] i'll try clearing cache and stuff [15:43:32] milimetric: looks broken to me too [15:43:40] k [15:43:51] i verified they are still working (by hitting the urls) [15:43:56] but they're just not showing up on wikitech [15:43:58] hm, curious [15:44:01] let me know if i can help debug [15:44:05] YuviPanda|food: any ideas? [15:44:15] And who the heck is yuvipanda35 ? [15:44:42] hm - maybe yuvi's android clone he leaves in the channel when he's eating... [15:48:54] andrewbogott: It may be reasonable to have an instance specifically to allow bouncers instead of allowing them in random places. [15:49:16] Coren: yeah… I feel like we've discussed that a few times and there's never concensus so I just kept doing what I'd always done. [15:49:26] bah, I absolutely cannot spell that word. [15:49:32] consensus? [15:49:35] *shrug* [15:50:41] * bd808 points out that bouncers in labs raise interesting security questions for anyone who has access to restricted channels [15:50:51] I've offered four or five times to other people to let them use my dircproxy setup but as best I can tell most everyone but me has their own private server. [15:50:58] bd808: yeah [15:51:32] ^d is working on an official un-official shared bouncer for WMF staff [15:52:07] http://irc.anyonecanedit.org/ [15:52:44] shout + znc [15:53:01] So the znc bit would work for folks who want a "real" client [15:53:29] bd808: great, that's what I need! Just, I need it a couple of years ago :) [15:53:59] yeah. 
I have a vps just for znc thatI'll be glad to kill actually [15:55:14] milimetric: So… I'm diverted right now trying to fix the session bug. I predict that Yuvi will appear and magically fix the proxy issue but if not I'll look at it after lunch. [15:56:00] andrewbogott: Make sure your bouncer is on an instance with (a) an IP and (b) identd turned on and you'll get reliable bouncing. Anyone else will hit the limit. :-) [16:03:24] (thanks andrew, I shall pray to my Yuvi shrine) [16:11:20] hey milimetric [16:11:25] * YuviPanda reads backscroll [16:14:14] andrewbogott: hmm, how does wikitech read the data from the proxy host? [16:14:24] it looks like the proxies are still working... [16:14:32] so problem is perhaps in wikitech <-> proxy [16:15:26] YuviPanda: via the rest interface? [16:15:30] hmm [16:15:35] oh, I forgot I wrote that [16:15:39] There should be a log of queries/responses someplace right? [16:15:40] let me go see what it's doing [16:15:42] Yeah, you wrote it :) [16:15:42] yeah [16:16:02] YuviPanda: I'm dumping it on you mostly 'cause I haven't had breakfast and it is now lunchtime so I'm about to disappear for a bit. [16:16:08] heh :) [16:16:09] ok [16:16:12] But also you're the expert! [16:17:09] milimetric: try now? [16:17:35] !log project-proxy restarted dynamicproxy-api on dynamicproxy-gateway [16:17:38] Logged the message, Master [16:17:39] YuviPanda: looks fixed to me [16:17:39] andrewbogott: is back up for me [16:17:41] yeah [16:17:46] Do we just need an upstart script for that? [16:17:47] I guess it didn't restart properly... [16:17:48] +beer YuviPanda, thanks :) [16:17:54] andrewbogott: we have an upstart script for that [16:18:01] I just did restart dynamicproxy-api [16:18:06] andrewbogott: it was running as well. [16:18:27] Oh, that's unfortunate. [16:18:41] and I don't think it keeps logs... [16:18:48] btw, YuviPanda (and everyone) Reedy and I have a good theory for why the session bug has been happening like crazy this week. 
This will be deployed later in the day and should fix it… https://gerrit.wikimedia.org/r/#/c/165501/ [16:18:58] The bad news, it turns out that keystone sessions are probably expiring every 60 minutes. [16:18:59] w0000tttt [16:19:02] oh wow [16:19:05] Which means… you need to log out and in semi-constantly [16:19:09] hmm, it hasn't been that frequent for me... [16:19:23] Yeah, which means my patch might not fix it :( [16:19:41] andrewbogott: btw, Krinkle wrote a ganglia-ish viewer for graphite... https://tools.wmflabs.org/nagf/?project=tools [16:19:43] kinda. [16:19:47] milimetric: ^ [16:20:55] that's a lotta graphs :) [16:21:54] YuviPanda: so I just added "http://pentaho.wmflab.org/" pointing to port 8080 on a new instance [16:22:03] and usually it works right away [16:22:08] but it's having trouble now... [16:22:13] * YuviPanda checks [16:22:48] milimetric: hmm, I get redirected to http://pentaho.wmflabs.org/pentaho/Login [16:23:19] milimetric: if you had accessed the URL before adding the entry, your local DNS might've cached it [16:23:37] ah, good call - i was hitting it before you restarted the proxy thingamagig [16:23:38] thx [16:23:42] :) [16:24:43] YuviPanda: thanks for the quick fix [16:24:57] :) [16:25:09] I'm just slightly pissed at whoever wrote that thing in the first place for not adding logging [16:25:11] stupid person [16:28:45] * bd808 sees pentaho and shudders [16:29:10] it vaguely sounds like the state you live in, bd808 [16:29:22] Open core and GWT front end [16:29:29] ow [16:29:36] * YuviPanda runs away from GWT [16:29:42] I have a lot of past trauma :) [16:29:48] heh [16:29:56] bd808: did you see the thing Krinkle built? https://tools.wmflabs.org/nagf/?project=tools [16:30:03] this? http://en.wikipedia.org/wiki/Pentaho [16:30:10] YuviPanda: I did. [16:30:19] chasemp: yup [16:30:53] that big table in the middle of that article should perhaps not be there... 
[16:31:30] We used it at Kount to give clients access to a data warehouse and I drew the short straw of integrating it with our systems. [16:35:28] coren: andrewbogott: hello! I am awaited for dinner, but I noticed a virt1005 instance refuses to reboot following yesterday outage. It is deployment-cxserver01 on deployment-prep project. Filled bug with info at https://bugzilla.wikimedia.org/show_bug.cgi?id=71783 [16:35:58] and off for dinner sorry [16:36:09] hashar: I'm heading out for lunch, will look when I return [16:36:24] andrewbogott: awesome. Bon apétit! [17:18:12] Four of my toollabs projects were restarted correctly, but "osm" is one of the projects with a bigbrother that doesn't start. What could be the reason? [17:20:33] Kolossos: It's possible that it tried the maximum number of times (3 per 24h) before the failed host came back up. [17:21:13] bigbrother helps, but was designed to help failed individual tools and doesn't really have support for "omg, server asplode!" [17:22:16] Your bigbrother log will tell you; it probably has a "failed too many times" message in it; that means it'll try again 24h past the third-to-last attempt. [17:22:27] But you can always restart it manually. [17:23:10] Ok, i will restart it now, but I hope it will be more stable in the future. As a volunteer I can not look each hour that all my tools are running. There was a promise to be more stable than the Toolserver... [17:28:26] The interesting thing is that in project "osm" is no log-file, but in correct restarted projects like templatetiger. [17:33:10] Kolossos: I don't want to knock toolserver, but last time I remember an entire server failing there the effects were much more severe and longer lasting. [17:40:35] It seems I have a problem to start webservice for project osm: [17:40:58] tools.osm@tools-login:~$ webservice start [17:41:00] Starting webservice... started. 
[17:41:01] tools.osm@tools-login:~$ qstat [17:41:03] job-ID prior name user state submit/start at queue slots ja-task-ID [17:41:04] ----------------------------------------------------------------------------------------------------------------- [17:41:06] 4653796 0.00000 lighttpd-o tools.osm qw 10/08/2014 17:38:18 1 [17:41:07] tools.osm@tools-login:~$ qstat [17:41:09] tools.osm@tools-login:~$ [17:55:10] It looks to me as though it starts then immediately stops. What is in your error.log? [17:57:06] 2014-05-05 20:31:34: (log.c.166) server started [17:57:07] 2014-05-05 20:32:51: (server.c.1512) server stopped by UID = 0 PID = 24825 [17:57:09] 2014-05-05 20:32:51: (server.c.1502) unlink failed for: /var/run/lighttpd/osm.pid 2 No such file or directory [17:57:10] Duplicate config variable in conditional 0 global: server.dir-listing [17:57:47] No idea what this means. Projects osm host only some js-libs and css files. [17:58:38] Sorry, old stuff: [17:58:44] New things: [17:58:46] 2014-10-08 14:30:42: (configfile.c.912) source: /var/run/lighttpd/osm.conf line: 548 pos: 16 parser failed somehow near here: fastcgi.server [17:59:25] Hm, this is probably because you .lighttpd.conf doesn't end with a newline as it must. [17:59:54] * Coren tests. [18:00:53] (03PS1) 10BearND: Output the current time stamp to stdout and stderr [labs/tools/wikipedia-android-builds] - 10https://gerrit.wikimedia.org/r/165520 [18:01:48] Hurray, now it works. Thanks. [18:02:20] Yeah, I was about to tell you that I tested it with a missing newline and that cause the same issue. :-) [18:06:28] 3Wikimedia Labs / 3Infrastructure: Jenkins can not ssh to deployment-cxserver01 (hosted by virt1005) - 10https://bugzilla.wikimedia.org/71783#c4 (10Andrew Bogott) If this instance has any important data I can try to reclaim the drive contents. Otherwise you should just delete and recreate. [18:29:50] YuviPanda: Around? Do you happen to know if our current Mariadb setup has a session time limit? 
[18:30:21] multichill: hey, I don't know, but I think it will timeout if held idle for a certain period of time [18:30:25] The runtime of one of my bots exploded and now I'm getting _mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away') [18:30:25] * YuviPanda still doesn't have root [18:30:50] multichill: yeah, that happens reasonably quickly, I think. You need to use a pool, or do conn.ping() [18:31:00] you can't keep a connection open forever with mysql.... [18:31:14] I don't think it's a long query [18:32:14] cursor.execute(query, (countrycode, lang, monumentId)). query = u"""SELECT `name`, `commonscat`, `monument_article`, `source` FROM monuments_all WHERE (country=%s AND lang=%s AND id=%s) LIMIT 1"""; [18:32:28] So basically lookup a monument and return relevant info [18:32:33] Thats subsecond [18:32:44] country/lang/id is the primary key [19:06:39] multichill: I'd wager that it's not the /first/ query that dies, but the last of a long series? [19:06:51] Coren: check [19:07:03] multichill: it's not a 'long query' but the *connection* that has been open for long [19:07:10] multichill: when was the connection that you got cursor from open? [19:07:18] For hours [19:07:24] yeah, that'll fail [19:07:49] can anyone help out with a template import to beta labs, or y'all scrambling to fix the outage? [19:09:53] yuvipanda: Why exactly? Is there some sort of max session time? [19:10:04] multichill: yes! I'm looking for the exact docs, moment [19:10:09] mysql kills them after a certain 'idle' time [19:10:25] It's not idle, it's doing loads of small queries [19:10:50] multichill: https://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_wait_timeout [19:11:02] … guess i answered my own question. [19:11:07] On my firewalls I usually have two settings: One for inactivity (say 30 minutes) and for maximum session time (say 24 hours) [19:11:14] Maryana: which templates? i can take a look... 
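The error 2006 ("MySQL server has gone away") discussed above happens when a long-lived connection sits idle past the server's `wait_timeout` between queries. A common pattern is to ping the connection (reconnecting if needed) before each use; a sketch assuming a pymysql-style connection object — `ping(reconnect=True)` is pymysql's signature, and other drivers differ (older MySQLdb's `ping()` takes no argument):

```python
def run_query(conn, query, params=()):
    """Execute a query on a possibly-idle connection.

    conn is assumed to be pymysql-style: ping(reconnect=True) transparently
    re-establishes the connection if the server closed it after sitting
    idle past wait_timeout (the cause of error 2006).
    """
    conn.ping(reconnect=True)   # revive the connection if it idled out
    with conn.cursor() as cur:
        cur.execute(query, params)
        return cur.fetchall()
```

The alternative, as mentioned in the channel, is a connection pool that validates connections on checkout; either way, a connection held open "for hours" between queries cannot be assumed to still be alive.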
[19:46:13] 3Wikimedia Labs / 3Infrastructure: Jenkins can not ssh to deployment-cxserver01 (hosted by virt1005) - 10https://bugzilla.wikimedia.org/71783#c5 (10Antoine "hashar" Musso) (In reply to Andrew Bogott from comment #4) > If this instance has any important data I can try to reclaim the drive > contents. Otherwi... [19:46:43] andrewbogott: thank you to have looked at deployment-cxserver01 (dead instance related to virt1005 outage) [19:47:04] hashar: you're welcome -- I don't know what's gone wrong with it though :( [19:47:12] andrewbogott: will check with i18n folks and we will most probably recreate it. Don't waste your precious time trying to recover it :] [19:47:25] andrewbogott: unless i18n team says they need data on there, but that is unlikely [19:47:41] shit happens :D [19:47:59] also I noticed at least one occurrence of a corrupted file on a labs instance (solved though) [19:48:07] might or might not be related to virt1005 outage [19:49:36] It's possible. The system was too far gone for me to shut it down gracefully [20:04:46] Coren: bd808: I noticed a strange and significant increase in incoming packets on most instances in different projects. Not sure what they have in common. [20:05:04] Krinkle: What kind of timeline? [20:05:31] Coren: Around Sep 28. [20:05:47] From N packets received to 2N packets sent, it went from 2N packets sent and 4N packets received [20:06:30] Krinkle: That's... odd. Any idea where that new traffic comes from? [20:06:57] No idea, but it affects completely unrelated projects [20:07:02] maybe internal services [20:07:35] Monitoring is the only significant thing I can think of that'd have wide effects like this, but I'd expect it to tip traffic in the other direction. 
[20:14:22] Coren: Here's a few examples [20:14:22] https://tools.wmflabs.org/nagf/?project=integration#h_integration-puppetmaster_network-packets [20:14:35] https://tools.wmflabs.org/nagf/?project=bastion#h_bastion2_network-packets [20:14:49] https://tools.wmflabs.org/nagf/?project=tools#h_tools-mail_network-packets [20:14:54] Yeah, that's noticable indeed. [20:15:08] https://tools.wmflabs.org/nagf/?project=cvn#h_cvn-dev_network-packets [20:15:09] YuviPanda: ^^ could this be your monitoring? [20:15:47] * YuviPanda reads backscroll [20:16:12] Coren: Krinkle shouldn't be, since diamond is the only thing that runs on all machines, and that's been on for quite a long time [20:16:22] Hm. [20:16:29] * Coren investigates further then. [20:22:35] Krinkle: At first glance, it doesn't look like a sudden increase so much as a quiet period (Sept-16 Sept-29) [20:23:44] The effect is small, in absolute terms - it's only noticable on instances that have naturally low traffic. [20:42:30] 3Wikimedia Labs / 3deployment-prep (beta): VisualEditor:"404 file Not Found Error" when logging into betalabs - 10https://bugzilla.wikimedia.org/71806#c4 (10Andre Klapper) 5UNCO>3NEW a:3None I can confirm comment 3, same outcome here, simply because http://login.wikimedia.beta.wmflabs.org/wiki/Special:... [20:46:11] YuviPanda: Do you want me to put commits to nagf in -labs or -qa? [20:46:28] Krinkle: -labs, yeah [20:46:31] k [20:46:35] Krinkle: I'll respond to the email soon, I think [20:46:49] I've updated github config [20:46:51] updating travis now [20:46:59] cool [20:47:14] Krinkle: it should probably live in gerrit tho [20:47:34] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/a76aa8be7f75ad1c877774def61a3f10b8a40630 [20:47:34] 13nagf/06master 14a76aa8b 15Timo Tijhof: travis: Send notifications to #wikimedia-labs [20:47:40] YuviPanda: I'd rather be productive. 
[20:47:44] Krinkle: heh :) [20:47:46] fair enough [20:47:50] I'll wait for phab [20:47:55] I've given you all access [20:47:59] via @wikimedia [20:48:16] ssh tools-login; become nagf, cd src/nagf; git pull [20:48:19] (to deploy) [20:48:21] cool [20:48:27] and to github as well [20:48:34] Krinkle: I'm slightly nagged about using PHP, but that's ok, I think. [20:48:39] though you may wanna do a local branch pull request if you want code review. [20:48:41] Krinkle: I don't mind just building on top of this. [20:48:50] Krinkle: yeah, I'll fork and PR. [20:48:55] cool :) [20:49:14] Krinkle: I think if we add ability to customize things with JSON/YAML blobs, this should be enough... [20:49:22] YuviPanda: It's what we know :) [20:49:30] Krinkle: heh :) [20:49:34] Krinkle: well, most ops tools are python... [20:49:57] Krinkle: I've been keeping an eye on the network traffic, and I see nothing out of the ordinary. I'm thinking at least part of the reduction for a week was caused by the LDAP outage? [20:50:03] The dates fit [20:50:04] wikimedia/nagf#14 (master - a76aa8b: Timo Tijhof) The build was broken. - http://travis-ci.org/wikimedia/nagf/builds/37439138 [20:50:18] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/6034ab45adad28d52ddc877c1804bfd75819b63e [20:50:18] 13nagf/06master 146034ab4 15Timo Tijhof: Remove unused settings.json... [20:51:38] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/7e89da5a6460d25eacecbd61a957bb78398db9ec [20:51:38] 13nagf/06master 147e89da5 15Timo Tijhof: travis: Exclude vendor/ from linting and coding style [20:51:48] YuviPanda: Yeah, I had that in place ^ (settings.json) but factored it out before the first commit. [20:52:00] we should bring that in at some point to allow configuration. [20:52:21] Krinkle: yeah. 
toollabs should probably have some more stats (Grid stats, for which I haven't written a collector yet, ugh) [20:52:40] Right, you want to add extra graphs. Makes sense. [20:52:52] I think grafana and other dashboards are more suitable for custom graphs though [20:52:57] wikimedia/nagf#15 (master - 6034ab4: Timo Tijhof) The build is still failing. - http://travis-ci.org/wikimedia/nagf/builds/37439382 [20:53:05] wikimedia/nagf#16 (master - 7e89da5: Timo Tijhof) The build was fixed. - http://travis-ci.org/wikimedia/nagf/builds/37439479 [20:53:22] Krinkle: true, but don't want to have two places... [20:53:51] YuviPanda: Does ganglia support custom graphs? How do they do it? [20:54:21] Krinkle: https://github.com/ganglia/ganglia-web/wiki#Defining_Custom_Graphs_Via_JSON [20:54:50] YuviPanda: https://github.com/wikimedia/nagf/blob/master/inc/Nagf.php#L30-L43 [20:55:10] it's mostly abstracted already. I guess we can have a json file by project-name that adds extra ones [20:55:31] yeah [20:55:55] ./graphs/default.json ./graphs/tools.json something like that [20:56:00] yeah [20:56:19] I can do that at the weekend, or maybe you wanna get your hands dirty :) [20:56:48] Krinkle: I'll try, but currently feeling my way around the OpenStackManager code... :) [20:57:06] I'd also like feedback on the code in general. [20:57:09] Caching is a bit flaky [20:57:13] but should work [20:57:48] Krinkle: hmm, inline PHP. I'm not fully sure how I feel about that :) [20:57:59] I wrote a better cache interface for my krinkle tools, https://github.com/Krinkle/toollabs-base/blob/master/src/Cache.php [20:58:17] YuviPanda: only where appropiate, it's a template (index.php), logic is elsewhere. [20:58:25] true [20:58:29] hence not fully sure :) [20:59:17] Krinkle: interesting. why multi store? [20:59:37] YuviPanda: So that I don't rely on redis never being cleared. I can repopulate from disk. 
[20:59:43] but more importantly, the php memory layer [20:59:51] well, it's cache so it should be expendable... [20:59:58] for repeated access without needing class static instance caching everywhere [21:00:22] ah, hmm [21:01:00] Doesn't all the code get 're-run' for every request anyway? [21:01:13] causing memory layer cache to be a bit useless [21:01:31] No, this was a data-informed choice [21:01:40] my tools tend to interact a lot with information that's queried [21:01:49] e.g. form urls for a certain wiki [21:01:55] and there'll be lots of wikis relevant in a single request. [21:01:57] ah, inside the same request? [21:01:59] right [21:02:01] that makes sense. [21:02:14] it made the code a lot easier [21:02:37] https://github.com/Krinkle/toollabs-base/blob/master/src/Wiki.php#L196-L202 [21:02:39] yeah, I think it'd be nice for that use case. [21:02:43] although I do have static caching there as well it seems [21:03:21] oh well, what you do as a mediawiki engineer, you keep recreating parts of MediaWiki in small ways [21:03:28] hehe :D [21:03:34] * YuviPanda has more python habits than PHP habits [21:03:51] My condolences, I want my semi colon back. [21:03:56] haha :D [21:04:06] Krinkle: not as bad as Java [21:04:07] nah, python is nice. [21:04:18] YuviPanda: php not as bad as java? [21:04:44] Krinkle: nothing is as bad as java :) [21:04:57] it's pretty fast, but the language / surrounding toolset is sometimes terrible. [21:12:42] YuviPanda: I was going to write it in nodejs first, but I didn't want the bother of setting up a dedicated server. Now it's just plug-and-play in my /git localhost [21:12:42] running apache [21:12:46] same for any other language. [21:12:56] It's just convenient, lazy. [21:13:05] Krinkle: true, but if you wrote it in node you wouldn't need apache... [21:13:10] Krinkle: node in toollabs is still fucked, though [21:13:12] (I think?) 
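[Editor's note: the multi-store cache being discussed (an in-process memory layer in front of redis/disk, with backfill so repeated reads in one request stay cheap) can be sketched like this. Class and store names are illustrative, not Krinkle's toollabs-base API; see the linked Cache.php for the real thing.]

```python
import time

class MemoryStore:
    """In-process layer: serves repeated lookups within a single
    request, replacing ad-hoc static instance caching everywhere."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires = self._data.get(key, (None, 0))
        return value if expires > time.time() else None

    def set(self, key, value, ttl=300):
        self._data[key] = (value, time.time() + ttl)

class MultiCache:
    """Tries each store in order (fastest first); on a hit, backfills
    the faster stores so later reads in the same request are cheap.
    A redis- or disk-backed store would expose the same get/set."""
    def __init__(self, stores):
        self.stores = stores

    def get(self, key):
        for i, store in enumerate(self.stores):
            value = store.get(key)
            if value is not None:
                for faster in self.stores[:i]:
                    faster.set(key, value)
                return value
        return None

    def set(self, key, value):
        for store in self.stores:
            store.set(key, value)
```

The backfill step is the point made at 20:59:37: if redis gets cleared, the slower store repopulates it transparently, so the cache really is expendable at every layer.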
[21:13:15] there will always be apache [21:13:30] yes, it's very outdated and probably issues with paths/permissions. [21:13:33] nope, I've been trying to make sure there's no apache in most services I touch ;) [21:13:40] and ports, will need lighttpd proxy [21:14:01] I mean, I will always have apache or nginx with php running locally on my dev docroot. [21:14:19] so that I can navigate it statically for simple html projects, and projects like php [21:14:32] ah, right [21:15:19] so from an end-user point, using php means it just works. Be it in my local apache under /git/wikimedia/nagf or at tool labs with the lighttpd/php-cgi set up that works by default there [21:15:31] true [21:15:41] deploying php has a very low barrier [21:17:09] but I guess I've done enough uwsgi+python deployments for them to seem like nbd to me... [21:23:46] nbd? [21:24:12] no big deal [21:24:20] right [21:24:38] I've not deployed a node app before, but I just did one yesterday and it was not too hard either... [21:26:57] yeah, it's not hard. I run a few, even in labs. [21:27:32] But when doing serious web-facing stuff that involves lots of html, I still prefer a synchronous language that's overloaded with built-in features. [21:27:34] aka php [21:28:02] but I'll never again write an irc bot in php [21:28:08] that was a horrying experience [21:28:13] horrifying [21:28:28] I'll use python or node instead [21:41:29] YuviPanda: there's false positives in nagf for labmon1001 since it's not a real project it seems [21:41:54] having a way to identify them would be nice (without duck typing) [21:42:08] Krinkle: true. need to build an OpenStack API [21:42:18] andrewbogott: do you have a few minutes? I've one question re OS [21:42:42] what's up? [21:43:08] andrewbogott: is there an ldap user/password that I can read in code, and use to perform actions against OpenStack API?
[21:43:27] andrewbogott: I want to write a readonly API endpoint that lists projects and instances per project, but without needing authentication [21:43:39] andrewbogott: but since I need to authenticate to call the openstack API, I need a dummy user with just read perms... [21:44:07] There isn't anything like that currently. I'm not sure if it's something we could safely create. [21:44:13] :( [21:44:28] andrewbogott: can't we treat the password similar to the db password and put it in the same place? [21:44:30] Wikimedia Labs / deployment-prep (beta): "404 file Not Found Error" when logging into betalabs - https://bugzilla.wikimedia.org/71806#c5 (spage) p:Unprio>High Raising priority (and changed summary) since most Flow browser tests are failing. Login through the API may be working. [21:44:43] But OSM can certainly gather that info. So adding api calls to OSM is probably the right approach. [21:44:43] Wikimedia Labs / deployment-prep (beta): "404 file Not Found Error" when logging into betalabs - https://bugzilla.wikimedia.org/71806#c6 (Matthew Flaschen) http://login.wikimedia.beta.wmflabs.org/ is entirely down. Normally there is a whole wiki there (albeit one normally used only for login). [21:45:02] YuviPanda, you can do that with OAuth then just store the token in config. [21:45:07] Is OAuth enabled on wikitech? [21:45:18] andrewbogott: I looked at the code in NovaInstance, and it seems to be performing operations requiring an ldap token [21:45:29] andrewbogott: in fact the OS class doesn't seem to be instantiable without a token... [21:45:43] superm401: nah, this script needs to run from puppet to hit wikitech API to pull in instances/projects metadata... [21:45:50] superm401: so OAuth isn't an option. [21:46:16] YuviPanda, why not? If it were installed, you could do a one-time setup to get the token, then store it in the private puppet repo.
[21:46:32] superm401: if we do it in wikitech, it'll be useful for a lot of other tools too (like nagf) [21:47:45] ^ https://bugzilla.wikimedia.org/show_bug.cgi?id=71806 [21:49:01] YuviPanda: the 'novaenv' user should be able to read everything from keystone. And its password is stored in the wikitech config. [21:49:12] andrewbogott: ah, perfect. that's what I was looking for. [21:49:13] thanks [21:49:36] superm401: is it -labs or -qa where we report problems with betalabs [21:49:57] probably -qa [21:50:39] YuviPanda: that would be wgOpenStackManagerLDAPUsername and wgOpenStackManagerLDAPUserPassword [21:50:47] I know that with those values I can query nova from the cmdline with --all-tenants [21:50:55] aah [21:50:56] cool [21:51:08] I'll re-familiarize myself with PHP/MW in a bit and try something out [22:04:25] !log deployment-prep updated OCG to version def24eca [22:04:27] Logged the message, Master [22:57:56] Does anyone other than Magnus control TUSC? Because it's throwing a 403, which is causing problems for us downstream users. And he's almost never on to fix it. [22:58:39] Magog_the_Ogre: He's the only listed maintainer, but I can give it a kick. [22:58:49] Yes please :) [22:59:46] Magog_the_Ogre: It looks like it's working after a kick. [23:00:24] :-/ it is not working Coren [23:00:41] Hm. It's not just the webservice then. Lemme try to see if I can see what's up. [23:01:31] ... I don't see 403s in the log at all, and the service seems to work for me. What URI are you hitting, exactly? [23:02:00] Also we really need to get OpenID working. *grumble* [23:03:50] Magog_the_Ogre: I'm not seeing anything wrong with tusc; it authenticates me well enough. Can you point me at where you get the error? [23:04:07] through the bot interface [23:04:08] 403 [23:04:29] I'm someone else having related issues: https://commons.wikimedia.org/wiki/User_talk:Magnus_Manske#JIRA_system_seems_down_And_403_Error_with_commonsapi [23:05:22] That refers to a toolserver.org address. 
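[Editor's note: the read-only listing discussed above (authenticate with the `novaenv` credentials, then query nova with `all_tenants`) would look roughly like the sketch below. It assumes the Keystone v2 password-token API that OpenStack deployments of this era exposed; the tenant name, endpoint URLs, and function names are illustrative, and the actual credentials live in the wikitech config as `wgOpenStackManagerLDAPUsername`/`wgOpenStackManagerLDAPUserPassword`.]

```python
def keystone_auth_payload(username, password, tenant="observer"):
    """Request body for POST <keystone>/v2.0/tokens (Keystone v2
    password auth). The tenant name here is a placeholder; it would
    be whatever project the service user belongs to."""
    return {
        "auth": {
            "passwordCredentials": {
                "username": username,
                "password": password,
            },
            "tenantName": tenant,
        }
    }

def list_servers_request(nova_endpoint, token):
    """URL and headers for a cross-project, read-only server listing.
    all_tenants=1 needs an admin-ish role, which is why a shared
    read-only user was being asked for in the first place."""
    url = nova_endpoint.rstrip("/") + "/servers/detail?all_tenants=1"
    headers = {"X-Auth-Token": token, "Accept": "application/json"}
    return url, headers
```

The two-step shape (get a scoped token from Keystone, then pass it as `X-Auth-Token` to Nova) is the same flow the `nova --all-tenants` command line performs internally.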
Toolserver is dead. [23:05:33] oh [23:05:36] well I'm not using that [23:05:37] here: [23:05:56] Coren, the URL is http://tools.wmflabs.org/tusc/tusc.php?check=1&botmode=1&user=$tusc_user&language=commons&project=wikimedia&password=$tusc_pswd [23:06:08] feel free to use whatever tool you'd like to perform the POST request [23:06:12] That doesn't 403 for me, it returns 0 [23:06:35] you have to actually type in the right password and username [23:06:54] 0 means "not authenticated". If I use my own credentials with a POST, it returns 1 [23:07:12] What IP are you trying from? Maybe I can see you in the logs. [23:07:47] OK, let me try again [23:09:12] Coren: 2601:1:400:721:C59F:977B:87AB:6064 [23:09:45] Magog_the_Ogre: Sadly, Labs doesn't speak v6. :-) [23:13:03] Magog_the_Ogre: I see exactly one failed attempt from you, about 20 minutes ago (before I kicked tusc). No other errors since, and no attempts for that matter [23:13:22] I just tried again and it worked [23:13:26] in soapui [23:13:28] but not in my bot [23:13:30] I don't know why [23:13:37] I'll have to investigate [23:13:39] I wonder if it's an IPv6 issue [23:22:02] Coren, might it have to do with user agent? [23:25:07] Coren, that was it [23:25:16] Labs was rejecting my call because it lacked a user agent [23:29:13] Ah! Yes, that's something that was always on but recently noticed to have been accidentally disabled. Sorry for the inconvenience; the projects have been normally rejecting bad UAs for years now.
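[Editor's note: the root cause found above was a POST with no User-Agent header, which Tool Labs rejects. A minimal client-side fix looks like this; the user-agent string and function name are placeholders, and the URL/parameters are the TUSC endpoint quoted at 23:05:56.]

```python
import urllib.parse
import urllib.request

def tusc_check_request(user, password, user_agent="my-bot/1.0 (contact: ...)"):
    """Build the TUSC verification request with an explicit
    User-Agent. Requests without one get rejected (the 403 seen
    above), so identify your bot honestly in the header."""
    data = urllib.parse.urlencode({
        "check": "1",
        "botmode": "1",
        "user": user,
        "language": "commons",
        "project": "wikimedia",
        "password": password,
    }).encode()
    return urllib.request.Request(
        "http://tools.wmflabs.org/tusc/tusc.php",
        data=data,
        headers={"User-Agent": user_agent},
    )
```

Per the conversation, the endpoint then answers `1` for valid credentials and `0` for "not authenticated"; the 403 only appears when the header is missing.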