[00:25:19] Krinkle, ur guc is not working .. [00:30:32] infact a few global tools are failing..change in API? [01:02:03] Coren: you still around? [02:30:27] still getting 502's .. [02:30:35] andrewbogott_afk: Coren (Cannot contact the database server: Can't connect to MySQL server on '10.68.16.193' (111) (10.68.16.193)) [02:30:41] http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page [02:31:54] greg-g: since I can't sleep, might as well debug... [02:31:56] greg-g: hmm, works for me? [02:32:36] not for me nor many others: https://bugzilla.wikimedia.org/show_bug.cgi?id=71764 [02:33:06] wth is core-n's email in bugzilla? [02:33:24] greg-g: try 'mpel*'? [02:33:37] found it [02:33:49] no, marc@uberbox [02:33:59] ah [02:34:13] 3Wikimedia Labs / 3deployment-prep (beta): All page loads fail with "DB connection error: Can't connect to MySQL server on '10.68.16.193' (4) (10.68.16.193)" - 10https://bugzilla.wikimedia.org/71764 (10Greg Grossmeier) [02:35:10] greg-g: ah, I see. works when you're logged out [02:35:21] greg-g: or basically if you're getting it from cache [02:35:28] attempting to login itself fails [02:39:28] 3Wikimedia Labs / 3deployment-prep (beta): Can't connect to MySQL on deployment-db1 - 10https://bugzilla.wikimedia.org/71764 (10Alex Monk) [02:39:34] greg-g: fixed [02:39:58] 3Wikimedia Labs / 3deployment-prep (beta): Can't connect to MySQL on deployment-db1 - 10https://bugzilla.wikimedia.org/71764#c3 (10Greg Grossmeier) All browser tests will fail, all users/developers will complain. Marc/Andrew/Yuvi: Anything going on that might explain this? Any changes from production that m... [02:40:09] greg-g: responding on the bug now [02:40:13] thanks [02:40:29] 3Wikimedia Labs / 3deployment-prep (beta): Can't connect to MySQL on deployment-db1 - 10https://bugzilla.wikimedia.org/71764#c4 (10Greg Grossmeier) (due to a mid-air collision, re-adding Sean. 
Sean, see above) [02:41:25] brb [02:42:15] 3Wikimedia Labs / 3deployment-prep (beta): Can't connect to MySQL on deployment-db1 - 10https://bugzilla.wikimedia.org/71764#c5 (10Yuvi Panda) 5NEW>3RESO/FIX So... There was a 'rogue' mysql server on deployment-db1, which wasn't being recognized by the init script for... some reason. I killed it manuall... [02:42:27] I tried to restart mysql on there a few minutes ago but found I didn't have sudo. [02:42:51] "This incident will be reported.", etc. [02:42:57] Krenair: ah, hmm. [02:43:09] Krenair: I can grant you sudo if greg-g is ok with it (or someone else 'responsible for betalabs' is) [02:43:30] I probably won't use it much. [02:44:11] Krenair: heh :) [02:44:28] Anyway, I'm going to sleep. thanks for fixing it YuviPanda [02:44:36] Krenair: yw. [02:50:21] thanks a ton, YuviPanda [02:50:28] 3Wikimedia Labs / 3deployment-prep (beta): Can't connect to MySQL on deployment-db1 - 10https://bugzilla.wikimedia.org/71764#c6 (10Greg Grossmeier) Thanks Yuvi! (Sorry for the pings Marc/Andrew/Sean :) ) [02:50:38] greg-g: :) thank my insomnia! :) [02:50:42] it's 8:20AM and I can't seem to sleep... [02:50:45] eek [02:51:04] and there's no power as well, so yay! [02:51:04] hope you can sleep at some point :/ [03:00:27] greg-g: yeah, I hope so too... 
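The fix above — a 'rogue' mysql server that the init script didn't recognize — usually means the PID of the running mysqld no longer matches what the pidfile records, so `service mysql stop/restart` does nothing to it and a manual kill is needed. A minimal sketch of that check (the pidfile path and the idea of feeding it PIDs from `pgrep -x mysqld` are assumptions for illustration, not details from the log):

```python
def read_pidfile(path="/var/run/mysqld/mysqld.pid"):
    """Read the PID the init script believes is running, or None.

    The path is a typical Debian/Ubuntu default, assumed here.
    """
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def rogue_mysqld_pids(mysqld_pids, pidfile_pid):
    """Return PIDs of mysqld processes the init script is not tracking.

    mysqld_pids: all running mysqld PIDs (e.g. from `pgrep -x mysqld`);
    pidfile_pid: the PID from the pidfile, or None if it is missing.
    """
    return sorted(p for p in mysqld_pids if p != pidfile_pid)
```

Any PID this returns is invisible to the init script's stop/restart logic, which matches the "killed it manually" resolution in the bug comment.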
[03:00:50] greg-g: also, I'm owed at least 2 drinks at some point in the future for this and previous beta labs features :) [03:01:24] at least while I continue to do so before I move to ops :) [03:34:24] YuviPanda: noted :) [05:26:18] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/a966dfe55009a2dcd786db86e7e12e93901a6abe [05:26:19] 13nagf/06master 14a966dfe 15Timo Tijhof: Update urls from github.com/wikimedia instead of github.com/Krinkle [05:33:00] [13nagf] 15Krinkle 04deleted 06travis at 14d5a7949: 02https://github.com/wikimedia/nagf/commit/d5a7949 [05:48:53] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/2046c375e229f4cbad9253e1aa2efbd33710968d [05:48:54] 13nagf/06master 142046c37 15Timo Tijhof: Offset headers against fixed header... [06:13:55] yes you can delete etherpad-matanya andrewbogott_afk [06:39:55] should everything work again at tool labs? for me some things are missing ("jlocal: command not found"; "exec: /usr/local/bin/portgrabber: not found") [06:56:05] the webservices of Magnus are not functioning properly ie Reasonator, Toolscript etc [07:05:34] anyone who can restart webservices for the tools of Magnus ?? [07:05:37] please ?? [07:15:20] 502 Bad Gateway https://tools.wmflabs.org/bub/ [07:18:42] virt1007 seems rather unhappy [07:23:01] Thanks Nemo_bis [07:23:25] It's not like I helped :P [07:23:31] yes you did [07:23:40] you tried and told me about it [07:23:46] makes me feel less alone [07:23:51] :) [07:24:42] there is a mail about virt1007 being overexstended ... new hardware is needed [07:25:02] and as can be expected (no irony intended) it comes too late [07:25:09] Labs is popular [07:25:41] Oh. Till few days we were told the servers were not overloaded at all [07:25:47] * few days ago [07:29:44] There are more and more data entered from external sources.. more and more queries are done against Wikidata as well [07:30:04] that is ONLY Wikidata.. 
the rest is not less active as far as I understand [07:32:15] Dunno. We have no stats about Tools usage. Last time we had, it was still at some 5 % of Toolserver's hits (ballpark figure blabla) [07:36:25] as a whole it is not that relevant.. it is servers that die because they are overextended [07:36:46] Toolserver is not labs [07:38:47] andrewbogott_afk, Coren: speaking of outages; virt1009 has a failed disk, I think it's like that for quite a while [07:39:32] andrewbogott_afk, Coren: Sep 19th specifically [07:57:59] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (30.00%) [07:58:39] (03PS1) 10Gilles: Add ImageMetrics to the -multimedia channel [labs/tools/pywikibugs] - 10https://gerrit.wikimedia.org/r/165438 [08:43:57] Bad Gateway [08:43:59] Code 502 [08:44:03] for geohack apparently [08:44:26] and quite a few others... [08:44:31] 3Wikimedia Labs / 3deployment-prep (beta): Jenkins can not ssh to deployment-cxserver01 - 10https://bugzilla.wikimedia.org/71783 (10Antoine "hashar" Musso) 3NEW p:3Unprio s:3normal a:3None The Jenkins master on gallium is unable to connect to the deployment-cxserver01.eqiad.wmflabs instance to update... [08:47:28] 3Wikimedia Labs / 3deployment-prep (beta): Jenkins can not ssh to deployment-cxserver01 - 10https://bugzilla.wikimedia.org/71783#c1 (10Antoine "hashar" Musso) The virt1005 compute node died overnight, might explain the issue. [08:48:52] thedj: some labs server had an issue over the night (girt1005) [08:48:55] virt1005 [08:49:11] so maybe the tool ran on instance hosted on that virt1005 machine [08:54:13] 3Wikimedia Labs / 3Infrastructure: Jenkins can not ssh to deployment-cxserver01 (hosted by virt1005) - 10https://bugzilla.wikimedia.org/71783#c2 (10Antoine "hashar" Musso) The instance is hosted on virt1005 which died overnight. I have marked the Jenkins slave as offline: https://integration.wikimedia.org/c... 
[08:54:43] 3Wikimedia Labs / 3Infrastructure: Jenkins can not ssh to deployment-cxserver01 (hosted by virt1005) - 10https://bugzilla.wikimedia.org/71783#c3 (10Antoine "hashar" Musso) Link to instance informations: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-00000421.eqiad.wmflabs [09:34:38] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK [10:56:40] I've trouble to use the internal proxy, from commandline on webgrid-04 it works but from php I get a 403 [10:56:55] wget "http://tools-webproxy/phetools/hocr_cgi.py?cmd=ping" <-- fine [10:57:00] php -r "file_get_contents('http://tools-webproxy/phetools/hocr_cgi.py?cmd=ping');" [10:57:05] PHP Warning: file_get_contents(http://tools-webproxy/phetools/hocr_cgi.py?cmd=ping): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden [10:57:05] in Command line code on line 1 [11:03:46] phe: blocked by user agent, maybe? [11:09:18] hmm, I use the default UA from php, and it was working this way before yesterday outage [11:14:17] valhallasw`cloud, right, the internal proxy no longer accept an empty user agent [11:29:13] 3Wikimedia Labs / 3deployment-prep (beta): Mobile redirect goes to wrong domain name on beta labs - 10https://bugzilla.wikimedia.org/71079#c3 (10Andre Klapper) Aude: Could you answer comment 2, please? [11:43:43] 3Wikimedia Labs / 3deployment-prep (beta): Mobile redirect goes to wrong domain name on beta labs - 10https://bugzilla.wikimedia.org/71079#c4 (10Aude) @andre I really don't know the setup for the mobile domains and probably can't look soon. Suggest to ask Jan Zerebecki (from our team) [11:46:58] 3Wikimedia Labs / 3deployment-prep (beta): Mobile redirect goes to wrong domain name on beta labs - 10https://bugzilla.wikimedia.org/71079#c5 (10Aude) the main page being broken was a different issue of wikibase being broken due to memcached issues and the main page connected to wikidata beta. I fixed that... [12:57:38] Still outage? 
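The 403 phe hits above comes from the proxy rejecting requests that carry no User-Agent — PHP's `file_get_contents` sends none by default, while `wget` always sends one. The workaround is simply to set the header explicitly; sketched here in Python against the same endpoint quoted in the log (the UA string itself is a made-up example — any non-empty, descriptive value should do):

```python
import urllib.request

# The proxy rejects requests with an empty User-Agent, so set one
# explicitly. Endpoint is the one quoted in the log above.
url = "http://tools-webproxy/phetools/hocr_cgi.py?cmd=ping"
req = urllib.request.Request(
    url,
    headers={"User-Agent": "phetools-client/1.0"},  # hypothetical UA string
)
# urllib.request.urlopen(req)  # would perform the actual request
print(req.get_header("User-agent"))  # -> phetools-client/1.0
```

In PHP the equivalent is passing a stream context to `file_get_contents` with a `'user_agent'` option under the `'http'` wrapper, or setting the `user_agent` ini value.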
[13:07:28] phe: it hasn't accepted an empty user agent for quite a while, I think [13:08:20] Josve05a|cloud: individual tools may need to be restarted [13:08:34] thedj: I just restarted geohack [13:08:47] Coren: no idea why bigbrother didn't restart a bunch of tools [13:09:03] Coren: perhaps we should do a general restart? [13:09:07] they just are serving blank pages... [13:17:48] * Coren ponders. [13:19:02] bigbrother may well have spent all three restart attempts during the outage; it's meant to help failing tools, not failing hardware. [13:23:40] Coren: yeah. I wonder if a general restart of everything on the webgrid might be a good idea. [13:23:44] we probably should do that [13:24:34] Hmm. -webproxy was also down. 10:1 the issue comes from /that/. [13:24:57] Anything restarting while its down would never have gotten its entry back in redis. [13:26:25] But yeah, might be best. Lemme make a quick script to do so gradually. [13:27:10] Coren: ah, that's true as well. [13:27:11] Coren: ok [13:27:21] Coren: put it on puppet as well, so we can use it next time if needed? [13:27:24] * YuviPanda should read up on grid more [13:27:45] -webgrid is a spof. I wonder if we can add some redundance for it? [13:27:59] I mean -webproxy [13:30:38] Coren: yeah, we totally can. [13:30:49] Coren: easy enough to run another instance, and just make it a redis slave. [13:31:03] and use DNS round robin, which should be *ok* here [13:31:28] We'll think about it later. [13:31:31] yeah [13:32:04] I can only restart jobs that use the standard webservice though. [13:39:35] Started. It's a really ugly script and it does them one at a time so it'll take some time. [13:43:07] Coren: ok [13:46:03] Coren one "bug" I discovered recently is kinda interesting. If your error.log is deleted while the webservice is running any script that outputs to error.log stops functioning. Im surprised it just doesnt re-create the file and keep chugging along [13:46:41] Betacommand: That's... huh. 
I'm not even sure how that'd work. [13:47:43] Coren: ? [13:49:20] Betacommand: Well, that'd require the scripts to open the log file without creation; which really atypical for logfiles. [13:51:07] Coren: Ive got a script that uses sderr for some outout. On the webservers its redirected/set to error.log If for some reason the file gets deleted, the script is given an invalid reference to a file [13:51:19] *output [13:51:45] and thus crashes [13:51:50] Coren: maybe lighttpd keeps the file open instead? [13:51:56] or rather... sge [13:52:40] had some fun trying to debug that [13:53:01] and then add some nfs magic to the mix [13:53:15] If you restart the web service it fixes it. [13:54:53] Hmm. That'd mean that lighttpd never reopens its error log. [13:55:03] Or at least, that it doesn't do so automatically. [13:55:17] It has /got/ to have an interface for log rotation though - probably a signal. [13:55:41] Coren: aaaah, it didn't last time I checked [13:56:05] Coren: you need to do a 'reload' [13:56:14] Eeeew. [13:56:27] huh, some sources say HUP might work, some say it is supposed to but often doesn't in practice [13:56:44] "some sources". Heh. {{cn}} [13:56:52] Ima read the source. Works all the time. [13:57:09] see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=380080 [13:57:19] oh, it's from 2006 :| [13:57:22] maybe fixed in the meantime [13:58:22] (03CR) 10Gergő Tisza: [C: 031] Add ImageMetrics to the -multimedia channel [labs/tools/pywikibugs] - 10https://gerrit.wikimedia.org/r/165438 (owner: 10Gilles) [14:00:37] Coren: I always find the fun bugs [14:18:41] Coren: Krinkle wrote this the other day, to provide ganglia'ish views for graphite... 
https://tools.wmflabs.org/nagf/?project=tools [14:27:33] YuviPanda, empty ua to -webproxy worked 36 hours ago, and it always worked this way [14:28:02] phe: ah, I suspect the nginx restart I did at that time (to block TweetmemeBot) also applied the (dormant) code to block empty UA [14:29:01] ok, I'll workaround that, I don't see the point to block empty ua for an internal proxy, but ok [14:34:59] phe: it's not an 'internal proxy', since it's also the same thing accessible from the internet [14:35:24] it's also accessible from inside tools, but it's primarily used from the internet... [14:36:38] YuviPanda, I'm talking about http://tools-webproxy/phetools/hocr_cgi.py?cmd=ping [14:36:49] phe: it's the same as tools.wmflabs.org [14:37:04] not reachable from my home [14:37:23] tools.wmflabs.org and tools-webproxy are the same host, served by the same services. [14:37:42] so if you replace tools-webproxy with tools.wmflabs.org, you'll get exactly the same results [15:29:09] Coren: I think that labs has hit some kind of connection limit with freenode… Danny mentioned having trouble last night and now I'm getting the same error (when proxying via labs) [15:29:22] Are you in touch with whoever manages those limits? [15:29:41] I am, but where are you proxying /from/? [15:30:06] Because there are different limits for different things. [15:33:50] Coren: in this case it would've been from util-abogott [15:33:58] In the 'testlabs' project [15:34:13] In Danny's case, from the 'wildcat' project I believe. [15:34:25] …if that's what you meant [15:37:20] That is what I meant. Outgoing from instances without public IPs are all lumped together and can't have identd - there is a stringent limit on that which freenode doesn't like raising. [15:37:49] But also, I'm pretty sure we don't *want* random irc clients from all over labs. [15:38:33] That is why exec nodes have IPs and identd running; this way bots can connect and have proper ident. 
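The deleted-error.log behaviour Betacommand and Coren puzzled over earlier (around 13:46) is ordinary POSIX semantics: a process holding a file open keeps writing to the now-unlinked inode, and the path is never recreated until something reopens it — which is why restarting the webservice fixes it. A self-contained demonstration:

```python
import os
import tempfile

# A webserver keeps its error.log open; deleting the file does not make
# the writer recreate it -- writes keep going to the unlinked inode.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "error.log")

log = open(path, "a")
log.write("before delete\n")
log.flush()

os.unlink(path)              # someone deletes error.log out from under it
log.write("after delete\n")  # still succeeds, into the now-unlinked inode
log.flush()

recreated = os.path.exists(path)  # False: only reopening recreates the path
log.close()
print(recreated)
```

On a local filesystem the writes are merely lost; on NFS (part of the mix here, as noted in the log) a deleted file can instead turn into a stale handle and make writes fail outright, which would match the crash Betacommand describes. Either way, reopening the log — e.g. via `webservice restart` or lighttpd's log-rotation reload — recreates the path.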
[15:40:46] ok… well, maybe it's time for me to find another bouncer, anyway. [15:40:58] is NovaProxy not working for others either? [15:41:09] milimetric: can you be more specific? [15:41:12] I saw my proxies there yesterday, now it's not showing any [15:41:17] https://wikitech.wikimedia.org/wiki/Special:NovaProxy [15:41:24] milimetric: this is a running gag this week -- log out and in, they should reappear. [15:41:31] i did that, knowing the gag [15:41:34] and still - no proxies :) [15:41:56] i'll try clearing cache and stuff [15:43:32] milimetric: looks broken to me too [15:43:40] k [15:43:51] i verified they are still working (by hitting the urls) [15:43:56] but they're just not showing up on wikitech [15:43:58] hm, curious [15:44:01] let me know if i can help debug [15:44:05] YuviPanda|food: any ideas? [15:44:15] And who the heck is yuvipanda35 ? [15:44:42] hm - maybe yuvi's android clone he leaves in the channel when he's eating... [15:48:54] andrewbogott: It may be reasonable to have an instance specifically to allow bouncers instead of allowing them in random places. [15:49:16] Coren: yeah… I feel like we've discussed that a few times and there's never concensus so I just kept doing what I'd always done. [15:49:26] bah, I absolutely cannot spell that word. [15:49:32] consensus? [15:49:35] *shrug* [15:50:41] * bd808 points out that bouncers in labs raise interesting security questions for anyone who has access to restricted channels [15:50:51] I've offered four or five times to other people to let them use my dircproxy setup but as best I can tell most everyone but me has their own private server. [15:50:58] bd808: yeah [15:51:32] ^d is working on an official un-official shared bouncer for WMF staff [15:52:07] http://irc.anyonecanedit.org/ [15:52:44] shout + znc [15:53:01] So the znc bit would work for folks who want a "real" client [15:53:29] bd808: great, that's what I need! Just, I need it a couple of years ago :) [15:53:59] yeah. 
I have a vps just for znc thatI'll be glad to kill actually [15:55:14] milimetric: So… I'm diverted right now trying to fix the session bug. I predict that Yuvi will appear and magically fix the proxy issue but if not I'll look at it after lunch. [15:56:00] andrewbogott: Make sure your bouncer is on an instance with (a) an IP and (b) identd turned on and you'll get reliable bouncing. Anyone else will hit the limit. :-) [16:03:24] (thanks andrew, I shall pray to my Yuvi shrine) [16:11:20] hey milimetric [16:11:25] * YuviPanda reads backscroll [16:14:14] andrewbogott: hmm, how does wikitech read the data from the proxy host? [16:14:24] it looks like the proxies are still working... [16:14:32] so problem is perhaps in wikitech <-> proxy [16:15:26] YuviPanda: via the rest interface? [16:15:30] hmm [16:15:35] oh, I forgot I wrote that [16:15:39] There should be a log of queries/responses someplace right? [16:15:40] let me go see what it's doing [16:15:42] Yeah, you wrote it :) [16:15:42] yeah [16:16:02] YuviPanda: I'm dumping it on you mostly 'cause I haven't had breakfast and it is now lunchtime so I'm about to disappear for a bit. [16:16:08] heh :) [16:16:09] ok [16:16:12] But also you're the expert! [16:17:09] milimetric: try now? [16:17:35] !log project-proxy restarted dynamicproxy-api on dynamicproxy-gateway [16:17:38] Logged the message, Master [16:17:39] YuviPanda: looks fixed to me [16:17:39] andrewbogott: is back up for me [16:17:41] yeah [16:17:46] Do we just need an upstart script for that? [16:17:47] I guess it didn't restart properly... [16:17:48] +beer YuviPanda, thanks :) [16:17:54] andrewbogott: we have an upstart script for that [16:18:01] I just did restart dynamicproxy-api [16:18:06] andrewbogott: it was running as well. [16:18:27] Oh, that's unfortunate. [16:18:41] and I don't think it keeps logs... [16:18:48] btw, YuviPanda (and everyone) Reedy and I have a good theory for why the session bug has been happening like crazy this week. 
This will be deployed later in the day and should fix it… https://gerrit.wikimedia.org/r/#/c/165501/ [16:18:58] The bad news, it turns out that keystone sessions are probably expiring every 60 minutes. [16:18:59] w0000tttt [16:19:02] oh wow [16:19:05] Which means… you need to log out and in semi-constantly [16:19:09] hmm, it hasn't been that frequent for me... [16:19:23] Yeah, which means my patch might not fix it :( [16:19:41] andrewbogott: btw, Krinkle wrote a ganglia-ish viewer for graphite... https://tools.wmflabs.org/nagf/?project=tools [16:19:43] kinda. [16:19:47] milimetric: ^ [16:20:55] that's a lotta graphs :) [16:21:54] YuviPanda: so I just added "http://pentaho.wmflab.org/" pointing to port 8080 on a new instance [16:22:03] and usually it works right away [16:22:08] but it's having trouble now... [16:22:13] * YuviPanda checks [16:22:48] milimetric: hmm, I get redirected to http://pentaho.wmflabs.org/pentaho/Login [16:23:19] milimetric: if you had accessed the URL before adding the entry, your local DNS might've cached it [16:23:37] ah, good call - i was hitting it before you restarted the proxy thingamagig [16:23:38] thx [16:23:42] :) [16:24:43] YuviPanda: thanks for the quick fix [16:24:57] :) [16:25:09] I'm just slightly pissed at whoever wrote that thing in the first place for not adding logging [16:25:11] stupid person [16:28:45] * bd808 sees pentaho and shudders [16:29:10] it vaguely sounds like the state you live in, bd808 [16:29:22] Open core and GWT front end [16:29:29] ow [16:29:36] * YuviPanda runs away from GWT [16:29:42] I have a lot of past trauma :) [16:29:48] heh [16:29:56] bd808: did you see the thing Krinkle built? https://tools.wmflabs.org/nagf/?project=tools [16:30:03] this? http://en.wikipedia.org/wiki/Pentaho [16:30:10] YuviPanda: I did. [16:30:19] chasemp: yup [16:30:53] that big table in the middle of that article should perhaps not be there... 
[16:31:30] We used it at Kount to give clients access to a data warehouse and I drew the short straw of integrating it with our systems. [16:35:28] coren: andrewbogott: hello! I am awaited for dinner, but I noticed a virt1005 instance refuses to reboot following yesterday outage. It is deployment-cxserver01 on deployment-prep project. Filled bug with info at https://bugzilla.wikimedia.org/show_bug.cgi?id=71783 [16:35:58] and off for dinner sorry [16:36:09] hashar: I'm heading out for lunch, will look when I return [16:36:24] andrewbogott: awesome. Bon apétit! [17:18:12] Four of my toollabs projects were restarted correctly, but "osm" is one of the projects with a bigbrother that doesn't start. What could be the reason? [17:20:33] Kolossos: It's possible that it tried the maximum number of times (3 per 24h) before the failed host came back up. [17:21:13] bigbrother helps, but was designed to help failed individual tools and doesn't really have support for "omg, server asplode!" [17:22:16] Your bigbrother log will tell you; it probably has a "failed too many times" message in it; that means it'll try again 24h past the third-to-last attempt. [17:22:27] But you can always restart it manually. [17:23:10] Ok, i will restart it now, but I hope it will be more stable in the future. As a volunteer I can not look each hour that all my tools are running. There was a promise to be more stable than the Toolserver... [17:28:26] The interesting thing is that in project "osm" is no log-file, but in correct restarted projects like templatetiger. [17:33:10] Kolossos: I don't want to knock toolserver, but last time I remember an entire server failing there the effects were much more severe and longer lasting. [17:40:35] It seems I have a problem to start webservice for project osm: [17:40:58] tools.osm@tools-login:~$ webservice start [17:41:00] Starting webservice... started. 
[17:41:01] tools.osm@tools-login:~$ qstat [17:41:03] job-ID prior name user state submit/start at queue slots ja-task-ID [17:41:04] ----------------------------------------------------------------------------------------------------------------- [17:41:06] 4653796 0.00000 lighttpd-o tools.osm qw 10/08/2014 17:38:18 1 [17:41:07] tools.osm@tools-login:~$ qstat [17:41:09] tools.osm@tools-login:~$ [17:55:10] It looks to me as though it starts then immediately stops. What is in your error.log? [17:57:06] 2014-05-05 20:31:34: (log.c.166) server started [17:57:07] 2014-05-05 20:32:51: (server.c.1512) server stopped by UID = 0 PID = 24825 [17:57:09] 2014-05-05 20:32:51: (server.c.1502) unlink failed for: /var/run/lighttpd/osm.pid 2 No such file or directory [17:57:10] Duplicate config variable in conditional 0 global: server.dir-listing [17:57:47] No idea what this means. Projects osm host only some js-libs and css files. [17:58:38] Sorry, old stuff: [17:58:44] New things: [17:58:46] 2014-10-08 14:30:42: (configfile.c.912) source: /var/run/lighttpd/osm.conf line: 548 pos: 16 parser failed somehow near here: fastcgi.server [17:59:25] Hm, this is probably because you .lighttpd.conf doesn't end with a newline as it must. [17:59:54] * Coren tests. [18:00:53] (03PS1) 10BearND: Output the current time stamp to stdout and stderr [labs/tools/wikipedia-android-builds] - 10https://gerrit.wikimedia.org/r/165520 [18:01:48] Hurray, now it works. Thanks. [18:02:20] Yeah, I was about to tell you that I tested it with a missing newline and that cause the same issue. :-) [18:06:28] 3Wikimedia Labs / 3Infrastructure: Jenkins can not ssh to deployment-cxserver01 (hosted by virt1005) - 10https://bugzilla.wikimedia.org/71783#c4 (10Andrew Bogott) If this instance has any important data I can try to reclaim the drive contents. Otherwise you should just delete and recreate. [18:29:50] YuviPanda: Around? Do you happen to know if our current Mariadb setup has a session time limit? 
[18:30:21] multichill: hey, I don't know, but I think it will timeout if held idle for a certain period of time [18:30:25] The runtime of one of my bots exploded and now I'm getting _mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away') [18:30:25] * YuviPanda still doesn't have root [18:30:50] multichill: yeah, that happens reasonably quickly, I think. You need to use a pool, or do conn.ping() [18:31:00] you can't keep a connection open forever with mysql.... [18:31:14] I don't think it's a long query [18:32:14] cursor.execute(query, (countrycode, lang, monumentId)). query = u"""SELECT `name`, `commonscat`, `monument_article`, `source` FROM monuments_all WHERE (country=%s AND lang=%s AND id=%s) LIMIT 1"""; [18:32:28] So basically lookup a monument and return relevant info [18:32:33] Thats subsecond [18:32:44] country/lang/id is the primary key [19:06:39] multichill: I'd wager that it's not the /first/ query that dies, but the last of a long series? [19:06:51] Coren: check [19:07:03] multichill: it's not a 'long query' but the *connection* that has been open for long [19:07:10] multichill: when was the connection that you got cursor from open? [19:07:18] For hours [19:07:24] yeah, that'll fail [19:07:49] can anyone help out with a template import to beta labs, or y'all scrambling to fix the outage? [19:09:53] yuvipanda: Why exactly? Is there some sort of max session time? [19:10:04] multichill: yes! I'm looking for the exact docs, moment [19:10:09] mysql kills them after a certain 'idle' time [19:10:25] It's not idle, it's doing loads of small queries [19:10:50] multichill: https://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_wait_timeout [19:11:02] … guess i answered my own question. [19:11:07] On my firewalls I usually have two settings: One for inactivity (say 30 minutes) and for maximum session time (say 24 hours) [19:11:14] Maryana: which templates? i can take a look... 
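The error 2006 ("MySQL server has gone away") discussed above happens when a long-lived connection sits idle past the server's `wait_timeout` between queries. A common pattern is to ping the connection (reconnecting if needed) before each use; a sketch assuming a pymysql-style connection object — `ping(reconnect=True)` is pymysql's signature, and other drivers differ (older MySQLdb's `ping()` takes no argument):

```python
def run_query(conn, query, params=()):
    """Execute a query on a possibly-idle connection.

    conn is assumed to be pymysql-style: ping(reconnect=True) transparently
    re-establishes the connection if the server closed it after sitting
    idle past wait_timeout (the cause of error 2006).
    """
    conn.ping(reconnect=True)   # revive the connection if it idled out
    with conn.cursor() as cur:
        cur.execute(query, params)
        return cur.fetchall()
```

The alternative, as mentioned in the channel, is a connection pool that validates connections on checkout; either way, a connection held open "for hours" between queries cannot be assumed to still be alive.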
[19:46:13] 3Wikimedia Labs / 3Infrastructure: Jenkins can not ssh to deployment-cxserver01 (hosted by virt1005) - 10https://bugzilla.wikimedia.org/71783#c5 (10Antoine "hashar" Musso) (In reply to Andrew Bogott from comment #4) > If this instance has any important data I can try to reclaim the drive > contents. Otherwi... [19:46:43] andrewbogott: thank you to have looked at deployment-cxserver01 (dead instance related to virt1005 outage) [19:47:04] hashar: you're welcome -- I don't know what's gone wrong with it though :( [19:47:12] andrewbogott: will check with i18n folks and we will most probably recreate it. Don't waste your precious time trying to recover it :] [19:47:25] andrewbogott: unless i18n team says they need data on there, but that is unlikely [19:47:41] shit happens :D [19:47:59] also I noticed at least one occurrence of a corrupted file on a labs instance (solved though) [19:48:07] might or might not be related to virt1005 outage [19:49:36] It's possible. The system was too far gone for me to shut it down gracefully [20:04:46] Coren: bd808: I noticed a strange and significant increase in incoming packets on most instances in different projects. Not sure what they have in common. [20:05:04] Krinkle: What kind of timeline? [20:05:31] Coren: Around Sep 28. [20:05:47] From N packets received to 2N packets sent, it went from 2N packets sent and 4N packets received [20:06:30] Krinkle: That's... odd. Any idea where that new traffic comes from? [20:06:57] No idea, but it affects completely unrelated projects [20:07:02] maybe internal services [20:07:35] Monitoring is the only significant thing I can think of that'd have wide effects like this, but I'd expect it to tip traffic in the other direction. 
[20:14:22] Coren: Here's a few examples [20:14:22] https://tools.wmflabs.org/nagf/?project=integration#h_integration-puppetmaster_network-packets [20:14:35] https://tools.wmflabs.org/nagf/?project=bastion#h_bastion2_network-packets [20:14:49] https://tools.wmflabs.org/nagf/?project=tools#h_tools-mail_network-packets [20:14:54] Yeah, that's noticable indeed. [20:15:08] https://tools.wmflabs.org/nagf/?project=cvn#h_cvn-dev_network-packets [20:15:09] YuviPanda: ^^ could this be your monitoring? [20:15:47] * YuviPanda reads backscroll [20:16:12] Coren: Krinkle shouldn't be, since diamond is the only thing that runs on all machines, and that's been on for quite a long time [20:16:22] Hm. [20:16:29] * Coren investigates further then. [20:22:35] Krinkle: At first glance, it doesn't look like a sudden increase so much as a quiet period (Sept-16 Sept-29) [20:23:44] The effect is small, in absolute terms - it's only noticable on instances that have naturally low traffic. [20:42:30] 3Wikimedia Labs / 3deployment-prep (beta): VisualEditor:"404 file Not Found Error" when logging into betalabs - 10https://bugzilla.wikimedia.org/71806#c4 (10Andre Klapper) 5UNCO>3NEW a:3None I can confirm comment 3, same outcome here, simply because http://login.wikimedia.beta.wmflabs.org/wiki/Special:... [20:46:11] YuviPanda: Do you want me to put commits to nagf in -labs or -qa? [20:46:28] Krinkle: -labs, yeah [20:46:31] k [20:46:35] Krinkle: I'll respond to the email soon, I think [20:46:49] I've updated github config [20:46:51] updating travis now [20:46:59] cool [20:47:14] Krinkle: it should probably live in gerrit tho [20:47:34] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/a76aa8be7f75ad1c877774def61a3f10b8a40630 [20:47:34] 13nagf/06master 14a76aa8b 15Timo Tijhof: travis: Send notifications to #wikimedia-labs [20:47:40] YuviPanda: I'd rather be productive. 
[20:47:44] Krinkle: heh :) [20:47:46] fair enough [20:47:50] I'll wait for phab [20:47:55] I've given you all access [20:47:59] via @wikimedia [20:48:16] ssh tools-login; become nagf, cd src/nagf; git pull [20:48:19] (to deploy) [20:48:21] cool [20:48:27] and to github as well [20:48:34] Krinkle: I'm slightly nagged about using PHP, but that's ok, I think. [20:48:39] though you may wanna do a local branch pull request if you want code review. [20:48:41] Krinkle: I don't mind just building on top of this. [20:48:50] Krinkle: yeah, I'll fork and PR. [20:48:55] cool :) [20:49:14] Krinkle: I think if we add ability to customize things with JSON/YAML blobs, this should be enough... [20:49:22] YuviPanda: It's what we know :) [20:49:30] Krinkle: heh :) [20:49:34] Krinkle: well, most ops tools are python... [20:49:57] Krinkle: I've been keeping an eye on the network traffic, and I see nothing out of the ordinary. I'm thinking at least part of the reduction for a week was caused by the LDAP outage? [20:50:03] The dates fit [20:50:04] wikimedia/nagf#14 (master - a76aa8b: Timo Tijhof) The build was broken. - http://travis-ci.org/wikimedia/nagf/builds/37439138 [20:50:18] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/6034ab45adad28d52ddc877c1804bfd75819b63e [20:50:18] 13nagf/06master 146034ab4 15Timo Tijhof: Remove unused settings.json... [20:51:38] [13nagf] 15Krinkle pushed 1 new commit to 06master: 02https://github.com/wikimedia/nagf/commit/7e89da5a6460d25eacecbd61a957bb78398db9ec [20:51:38] 13nagf/06master 147e89da5 15Timo Tijhof: travis: Exclude vendor/ from linting and coding style [20:51:48] YuviPanda: Yeah, I had that in place ^ (settings.json) but factored it out before the first commit. [20:52:00] we should bring that in at some point to allow configuration. [20:52:21] Krinkle: yeah. 
toollabs should probably have some more stats (Grid stats, for which I haven't written a collector yet, ugh) [20:52:40] Right, you want to add extra graphs. Makes sense. [20:52:52] I think grafana and other dashboards are more suitable for custom graphs though [20:52:57] wikimedia/nagf#15 (master - 6034ab4: Timo Tijhof) The build is still failing. - http://travis-ci.org/wikimedia/nagf/builds/37439382 [20:53:05] wikimedia/nagf#16 (master - 7e89da5: Timo Tijhof) The build was fixed. - http://travis-ci.org/wikimedia/nagf/builds/37439479 [20:53:22] Krinkle: true, but don't want to have two places... [20:53:51] YuviPanda: Does ganglia support custom graphs? How do they do it? [20:54:21] Krinkle: https://github.com/ganglia/ganglia-web/wiki#Defining_Custom_Graphs_Via_JSON [20:54:50] YuviPanda: https://github.com/wikimedia/nagf/blob/master/inc/Nagf.php#L30-L43 [20:55:10] it's mostly abstracted already. I guess we can have a json file by project-name that adds extra ones [20:55:31] yeah [20:55:55] ./graphs/default.json ./graphs/tools.json something like that [20:56:00] yeah [20:56:19] I can do that at the weekend, or maybe you wanna get your hands dirty :) [20:56:48] Krinkle: I'll try, but currently feeling my way around the OpenStackManager code... :) [20:57:06] I'd also like feedback on the code in general. [20:57:09] Caching is a bit flaky [20:57:13] but should work [20:57:48] Krinkle: hmm, inline PHP. I'm not fully sure how I feel about that :) [20:57:59] I wrote a better cache interface for my krinkle tools, https://github.com/Krinkle/toollabs-base/blob/master/src/Cache.php [20:58:17] YuviPanda: only where appropiate, it's a template (index.php), logic is elsewhere. [20:58:25] true [20:58:29] hence not fully sure :) [20:59:17] Krinkle: interesting. why multi store? [20:59:37] YuviPanda: So that I don't rely on redis never being cleared. I can repopulate from disk. 
[20:59:43] but more importantly, the php memory layer [20:59:51] well, it's cache so it should be expendable... [20:59:58] for repeated access without needing class static instance caching everywhere [21:00:22] ah, hmm [21:01:00] Doesn't all the code get 're-run' for every request anyway? [21:01:13] causing memory layer cache to be a bit useless [21:01:31] No, this was a data-informed choice [21:01:40] my tools tend to interact a lot with information that's queried [21:01:49] e.g. form urls for a certain wiki [21:01:55] and there'll be lots of wikis relevant in a single request. [21:01:57] ah, inside the same request? [21:01:59] right [21:02:01] that makes sense. [21:02:14] it made the code a lot easier [21:02:37] https://github.com/Krinkle/toollabs-base/blob/master/src/Wiki.php#L196-L202 [21:02:39] yeah, I think it'd be nice for that use case. [21:02:43] although I do have static caching there as well it seems [21:03:21] oh well, what you do as a mediawiki engineer, you keep recreating parts of MediaWiki in small ways [21:03:28] hehe :D [21:03:34] * YuviPanda has more python habits than PHP habits [21:03:51] My condolences, I want my semi colon back. [21:03:56] haha :D [21:04:06] Krinkle: not as bad as Java [21:04:07] nah, python is nice. [21:04:18] YuviPanda: php not as bad as java? [21:04:44] Krinkle: nothing is as bad as java :) [21:04:57] it's pretty fast, but the language / surrounding toolset is sometimes terrible. [21:12:42] YuviPanda: I was going to write it in nodejs first, but I didn't want the bother of setting up a dedicated server. Now it's just plug-and-play in my /git localhost [21:12:42] running apache [21:12:46] same for any other language. [21:12:56] It's just convenient, lazy. [21:13:05] Krinkle: true, but if you wrote it in node you wouldn't need apache... [21:13:10] Krinkle: node in toollabs is still fucked, though [21:13:12] (I think?) 
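[Editor's note: the multi-store cache being discussed (an in-process memory layer in front of redis/disk, with backfill so repeated reads in one request stay cheap) can be sketched like this. Class and store names are illustrative, not Krinkle's toollabs-base API; see the linked Cache.php for the real thing.]

```python
import time

class MemoryStore:
    """In-process layer: serves repeated lookups within a single
    request, replacing ad-hoc static instance caching everywhere."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires = self._data.get(key, (None, 0))
        return value if expires > time.time() else None

    def set(self, key, value, ttl=300):
        self._data[key] = (value, time.time() + ttl)

class MultiCache:
    """Tries each store in order (fastest first); on a hit, backfills
    the faster stores so later reads in the same request are cheap.
    A redis- or disk-backed store would expose the same get/set."""
    def __init__(self, stores):
        self.stores = stores

    def get(self, key):
        for i, store in enumerate(self.stores):
            value = store.get(key)
            if value is not None:
                for faster in self.stores[:i]:
                    faster.set(key, value)
                return value
        return None

    def set(self, key, value):
        for store in self.stores:
            store.set(key, value)
```

The backfill step is the point made at 20:59:37: if redis gets cleared, the slower store repopulates it transparently, so the cache really is expendable at every layer.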
[21:13:15] there will always be apache [21:13:30] yes, it's very outdated and probably issues with paths/permissions. [21:13:33] nope, I've been trying to make sure there's no apache in most services I touch ;) [21:13:40] and ports, will need lighttpd proxy [21:14:01] I mean, I will always have apache or nginx with php running locally on my dev docroot. [21:14:19] so that I can navigate it statically for simple html projects, and projects like php [21:14:32] ah, right [21:15:19] so from an end-user point, using php means it just works. Be it in my local apache under /git/wikimedia/nagf or at tool labs with the lighttpd/php-cgi set up that works by default there [21:15:31] true [21:15:41] deploying php has a very low barrier [21:17:09] but I guess I've done enough uwsgi+python deployments for them to seem like nbd to me... [21:23:46] nbd? [21:24:12] no big deal [21:24:20] right [21:24:38] I've not deployed a node app before, but I just did one yesterday and it was not too hard either... [21:26:57] yeah, it's not hard. I run a few, even in labs. [21:27:32] But when doing serious web-facing stuff that involves lots of html, I still prefer a synchronous language that's overloaded with built-in features. [21:27:34] aka php [21:28:02] but I'll never again write an irc bot in php [21:28:08] that was a horrying experience [21:28:13] horrifying [21:28:28] I'll use python or node instead [21:41:29] YuviPanda: there's false positives in nagf for labmon1001 since it's not a real project it seems [21:41:54] having a way to identify them would be nice (without duck typing) [21:42:08] Krinkle: true. need to build an OpenStack API [21:42:18] andrewbogott: do you have a few minutes? I've one question re OS [21:42:42] what's up? [21:43:08] andrewbogott: is there an ldap user/password that I can read in code, and use to perform actions against OpenStack API?
[21:43:27] andrewbogott: I want to write a readonly API endpoint that lists projects and instances per project, but without needing authentication [21:43:39] andrewbogott: but since I need to authenticate to call the openstack API, I need a dummy user with just read perms... [21:44:07] There isn't anything like that currently. I'm not sure if it's something we could safely create. [21:44:13] :( [21:44:28] andrewbogott: can't we treat the password similar to the db password and put it in the same place? [21:44:30] Wikimedia Labs / deployment-prep (beta): "404 file Not Found Error" when logging into betalabs - https://bugzilla.wikimedia.org/71806#c5 (spage) p:Unprio>High Raising priority (and changed summary) since most Flow browser tests are failing. Login through the API may be working. [21:44:43] But OSM can certainly gather that info. So adding api calls to OSM is probably the right approach. [21:44:43] Wikimedia Labs / deployment-prep (beta): "404 file Not Found Error" when logging into betalabs - https://bugzilla.wikimedia.org/71806#c6 (Matthew Flaschen) http://login.wikimedia.beta.wmflabs.org/ is entirely down. Normally there is a whole wiki there (albeit one normally used only for login). [21:45:02] YuviPanda, you can do that with OAuth then just store the token in config. [21:45:07] Is OAuth enabled on wikitech? [21:45:18] andrewbogott: I looked at the code in NovaInstance, and it seems to be performing operations requiring an ldap token [21:45:29] andrewbogott: in fact the OS class doesn't seem to be instantiable without a token... [21:45:43] superm401: nah, this script needs to run from puppet to hit wikitech API to pull in instances/projects metadata... [21:45:50] superm401: so OAuth isn't an option. [21:46:16] YuviPanda, why not? If it were installed, you could do a one-time setup to get the token, then store it in the private puppet repo.
[21:46:32] superm401: if we do it in wikitech, it'll be useful for a lot of other tools too (like nagf) [21:47:45] ^ https://bugzilla.wikimedia.org/show_bug.cgi?id=71806 [21:49:01] YuviPanda: the 'novaenv' user should be able to read everything from keystone. And its password is stored in the wikitech config. [21:49:12] andrewbogott: ah, perfect. that's what I was looking for. [21:49:13] thanks [21:49:36] superm401: is it -labs or -qa where we report problems with betalabs [21:49:57] probably -qa [21:50:39] YuviPanda: that would be wgOpenStackManagerLDAPUsername and wgOpenStackManagerLDAPUserPassword [21:50:47] I know that with those values I can query nova from the cmdline with --all-tenants [21:50:55] aah [21:50:56] cool [21:51:08] I'll re-familiarize myself with PHP/MW in a bit and try something out [22:04:25] !log deployment-prep updated OCG to version def24eca [22:04:27] Logged the message, Master [22:57:56] Does anyone other than Magnus control TUSC? Because it's throwing a 403, which is causing problems for us downstream users. And he's almost never on to fix it. [22:58:39] Magog_the_Ogre: He's the only listed maintainer, but I can give it a kick. [22:58:49] Yes please :) [22:59:46] Magog_the_Ogre: It looks like it's working after a kick. [23:00:24] :-/ it is not working Coren [23:00:41] Hm. It's not just the webservice then. Lemme try to see if I can see what's up. [23:01:31] ... I don't see 403s in the log at all, and the service seems to work for me. What URI are you hitting, exactly? [23:02:00] Also we really need to get OpenID working. *grumble* [23:03:50] Magog_the_Ogre: I'm not seeing anything wrong with tusc; it authenticates me well enough. Can you point me at where you get the error? [23:04:07] through the bot interface [23:04:08] 403 [23:04:29] I'm someone else having related issues: https://commons.wikimedia.org/wiki/User_talk:Magnus_Manske#JIRA_system_seems_down_And_403_Error_with_commonsapi [23:05:22] That refers to a toolserver.org address. 
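[Editor's note: the read-only listing discussed above (authenticate with the `novaenv` credentials, then query nova with `all_tenants`) would look roughly like the sketch below. It assumes the Keystone v2 password-token API that OpenStack deployments of this era exposed; the tenant name, endpoint URLs, and function names are illustrative, and the actual credentials live in the wikitech config as `wgOpenStackManagerLDAPUsername`/`wgOpenStackManagerLDAPUserPassword`.]

```python
def keystone_auth_payload(username, password, tenant="observer"):
    """Request body for POST <keystone>/v2.0/tokens (Keystone v2
    password auth). The tenant name here is a placeholder; it would
    be whatever project the service user belongs to."""
    return {
        "auth": {
            "passwordCredentials": {
                "username": username,
                "password": password,
            },
            "tenantName": tenant,
        }
    }

def list_servers_request(nova_endpoint, token):
    """URL and headers for a cross-project, read-only server listing.
    all_tenants=1 needs an admin-ish role, which is why a shared
    read-only user was being asked for in the first place."""
    url = nova_endpoint.rstrip("/") + "/servers/detail?all_tenants=1"
    headers = {"X-Auth-Token": token, "Accept": "application/json"}
    return url, headers
```

The two-step shape (get a scoped token from Keystone, then pass it as `X-Auth-Token` to Nova) is the same flow the `nova --all-tenants` command line performs internally.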
Toolserver is dead. [23:05:33] oh [23:05:36] well I'm not using that [23:05:37] here: [23:05:56] Coren, the URL is http://tools.wmflabs.org/tusc/tusc.php?check=1&botmode=1&user=$tusc_user&language=commons&project=wikimedia&password=$tusc_pswd [23:06:08] feel free to use whatever tool you'd like to perform the POST request [23:06:12] That doesn't 403 for me, it returns 0 [23:06:35] you have to actually type in the right password and username [23:06:54] 0 means "not authenticated". If I use my own credentials with a POST, it returns 1 [23:07:12] What IP are you trying from? Maybe I can see you in the logs. [23:07:47] OK, let me try again [23:09:12] Coren: 2601:1:400:721:C59F:977B:87AB:6064 [23:09:45] Magog_the_Ogre: Sadly, Labs doesn't speak v6. :-) [23:13:03] Magog_the_Ogre: I see exactly one failed attempt from you, about 20 minutes ago (before I kicked tusc). No other errors since, and no attempts for that matter [23:13:22] I just tried again and it worked [23:13:26] in soapui [23:13:28] but not in my bot [23:13:30] I don't know why [23:13:37] I'll have to investigate [23:13:39] I wonder if it's an IPv6 issue [23:22:02] Coren, might it have to do with user agent? [23:25:07] Coren, that was it [23:25:16] Labs was rejecting my call because it lacked a user agent [23:29:13] Ah! Yes, that's something that was always on but recently noticed to have been accidentally disabled. Sorry for the inconvenience; the projects have been normally rejecting bad UAs for years now.
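[Editor's note: the root cause found above was a POST with no User-Agent header, which Tool Labs rejects. A minimal client-side fix looks like this; the user-agent string and function name are placeholders, and the URL/parameters are the TUSC endpoint quoted at 23:05:56.]

```python
import urllib.parse
import urllib.request

def tusc_check_request(user, password, user_agent="my-bot/1.0 (contact: ...)"):
    """Build the TUSC verification request with an explicit
    User-Agent. Requests without one get rejected (the 403 seen
    above), so identify your bot honestly in the header."""
    data = urllib.parse.urlencode({
        "check": "1",
        "botmode": "1",
        "user": user,
        "language": "commons",
        "project": "wikimedia",
        "password": password,
    }).encode()
    return urllib.request.Request(
        "http://tools.wmflabs.org/tusc/tusc.php",
        data=data,
        headers={"User-Agent": user_agent},
    )
```

Per the conversation, the endpoint then answers `1` for valid credentials and `0` for "not authenticated"; the 403 only appears when the header is missing.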