[01:41:56] &ping
[01:41:56] Pinging all local filesystems, hold on
[01:41:57] Written and deleted 4 bytes on /tmp in 00:00:00.0006120
[01:41:58] Written and deleted 4 bytes on /data/project in 00:00:00.0057690
[01:44:54] &ping
[01:44:54] Pinging all local filesystems, hold on
[01:44:55] Written and deleted 4 bytes on /tmp in 00:00:00.0005100
[01:47:02] Written and deleted 4 bytes on /data/project in 00:02:08.0577890
[02:02:16] !log tools robots.txt: "Disallow: /"
[02:02:21] Logged the message, Master
[03:07:47] !log tools tools-webproxy: Temporarily serving 403s to AhrefsBot/bingbot/Googlebot/PaperLiBot/TweetmemeBot/YandexBot until they reread robots.txt
[03:07:48] Logged the message, Master
[14:13:04] Coren: so how did the nfs stuff go? my labs instances aren't looking very happy....
[14:14:31] It didn't; the copy took almost 8h instead of the 2-3 I was counting on.
[14:14:49] It's done though, so all that is left is a brief rsync.
[14:15:06] (This is why I rescheduled for this afternoon)
[14:15:47] Coren: ah. I didn't see the reschedule. That's cool.
[14:16:02] I can't really shell into any of my labs machines any more...
[14:16:05] Turns out there is already a LOT of stuff people saved on those disks. :-)
[14:16:13] I bet!
[14:16:46] manybubbles: That shouldn't happen; afaik the only issue is the stalls at irregular intervals; they are annoying as [bleep] but don't actually break anything. Which instance are you having problems with?
[14:18:06] Coren: solr-mw3, elasticsearch0, elasticsearch1, elasticsearch2, elasticsearch3, solr-mw are the ones iirc
[14:18:12] certainly the first one isn't working
[14:18:18] I'll recheck the rest
[14:19:53] Hm. tools seem to be working fine, and NFS is responsive.
[14:20:44] Wait, you're not using NFS at all!
[14:21:09] solr-mw3, at least, is on gluster.
[14:24:46] !log tools chmod 644 ~magnus/.forward
[14:24:48] Logged the message, Master
[14:26:47] Coren: hmmm - odd. in any case I can't connect to them.
[14:27:41] manybubbles: That's because /home is completely dead. Not sure why yet, it's not just because one of the bricks is confused since that only puts it readonly.
[14:27:49] * Coren checks.
[14:28:02] Coren: thanks!
[14:31:03] !log tools tools-webserver-01: "dpkg --configure -a" on apt-get's advice
[14:32:00] Logged the message, Master
[14:32:52] Hm. Gluster is broken in a way I've yet to see. Should be a slogan for them: "Gluster! Breaks in all sorts of imaginative ways!"
[14:33:24] But it's Web Scale!
[14:36:06] * Coren grumbles.
[14:36:30] I'm not able to start the volume. The diagnostic message is of great clarity: "operation failed"
[14:39:27] Coren: amazing!
[14:39:54] manybubbles: I think I managed to restart it. It required some seriously evil maneuvers (ugh. kill -9)
[14:40:43] and now everything lets me connect to it! you are a wizard
[15:14:01] !log tools tools-webserver-01: Simplified /usr/local/bin/php-wrapper
[15:14:03] Logged the message, Master
[15:16:50] Coren: Re Magnus's tool (and the impending NFS copy), have you done any forensics you wanted to do so the directory can be restored?
[15:17:38] scfc_de: No, I left the original filesystem intact. But odds are you are correct that this is leftover from the old corruption: it's exactly the same symptoms, and the dates fit.
[15:18:05] It might be simpler to just restore once I switched to the copy this afternoon.
[15:19:31] Coren: No, I meant to ask whether the directory can be "repaired" (permissions fixed, directories recreated, etc.). This should be done obviously after the switch, but I wanted to check /whether/ it can be done.
[15:19:49] Yes, there is no information left to be gotten.
[15:20:46] k
[15:21:57] I do hope that the reboot afterwards might bring tools-webserver-01 back to a "defined state" :-).
[15:22:39] Coren: Do you know whether global_access.log is used by anything but awstats?
[15:23:12] is it still common to have a need to add a public key both to wikitech.wikimedia.org and to gerrit?
[15:23:55] it took me 5 minutes to get tired from "permission denied (publickey)" and figure this out
[15:24:19] Not as far as I know.
[15:24:42] I've always had to add keys to both.
[15:26:51] well, that is what happened to me on tools-login
[15:28:49] [bz] (NEW - created by: Tim Landscheidt, priority: Unprioritized - normal) [Bug 52196] Labslogbot creates shell request pages for users in "limbo" - https://bugzilla.wikimedia.org/show_bug.cgi?id=52196
[15:31:03] wizardist: I believe that gerrit's auth infrastructure isn't able to tie into LDAP for the key; so yeah, both are still necessary.
[15:31:21] I think Ryan_Lane is working on something in re auth between wikitech and gerrit though.
[15:31:30] * Ryan_Lane is not
[15:31:41] I guess I could actually make it sync the keys, though
[15:32:11] but that would mean also somehow disabling that management in gerrit and I don't know if that's possible
[15:54:11] *Argl*. Looking an hour at the Apache conf, only to find that access.log is generated by the logsplitter. "Assumption is the mother of all fuckups."
[16:47:08] Ryan_Lane: Why disable; I don't think it'd harm things by being there even if it is redundant.
[16:52:08] does amir ladsgroup IRC?
[17:32:12] Change on mediawiki a page Developer access was Jeremyb, changed by https://www.mediawiki.org/w/index.php?diff=764768 link [+31] link to [[How to contribute]] edit summary: $6
[17:32:26] Change on mediawiki a page Developer access was Jeremyb, changed by https://www.mediawiki.org/w/index.php?diff=764769 link [+1] edit summary: $6
[17:43:12] [bz] (NEW - created by: Antoine "hashar" Musso, priority: Normal - enhancement) [Bug 49779] sync articles from production wikis (css/gadgets) - https://bugzilla.wikimedia.org/show_bug.cgi?id=49779
[19:09:52] [bz] (NEW - created by: Antoine "hashar" Musso, priority: High - critical) [Bug 52867] monitor application server are responding - https://bugzilla.wikimedia.org/show_bug.cgi?id=52867
[19:49:26] * sumanah looks at https://meta.wikimedia.org/wiki/Grants:IdeaLab/Labs2
[20:06:51] Isn't the WMF data crunching apparatus (Kraken?) operational yet?
[20:09:09] scfc_de: hmm. looking at https://www.mediawiki.org/wiki/Analytics/Kraken
[20:09:31] it looks like something is working :)
[20:11:27] Hi all
[20:11:41] Hi Elph!
[20:11:49] I have a Q
[20:12:14] when i want to login in my account on labs
[20:12:42] i face "Server unexpectedly closed connection"
[20:13:08] Elph: Note the topic of the channel, and yesterday's announcement on labs-l. :-) The NFS server is currently down for maintenance.
[20:14:32] for how long?
[20:15:46] 20:30 UTC :)
[20:16:04] Thanks all
[20:18:04] Looks like I can't shell into deployment-bastion again....
[20:22:49] ah NFS maintenance, that explains it...
[20:25:34] ah! he said it'd be this afternoon
[20:26:36] "this afternoon" -> "local time". Fun with timezones.
[20:26:50] Which is why I also stated UTC for lower ambiguity.
[20:28:37] * Coren grumbles.
[20:29:36] That's actually pessimistic, I think only 10 more minutes will do.
[20:29:49] yay
[20:30:17] Blame cyberbot. It managed to cause ~100G of new deltas to update.
:-)
[20:30:50] Coren, probably a bad time, but just to refresh your memory. In HK we talked about excluding a cache dir from timetravel. I'm not quite sure what to request on bugzilla for thet.
[20:30:54] that
[20:31:09] will that be on a separate partition/volume?
[20:31:16] or is that a per directory setting
[20:31:28] It would be. It's also a little moot atm since I'm currently disabling timetravel.
[20:32:22] ha
[20:32:56] ok, 'll just proceed for now with my setup and will follow up on this once the cache dir is filling
[20:49:24] Rebooting.
[21:01:46] Broadcast message from root@tools-login // (unknown) at 20:59 ... // The system is going down for halt NOW! // Power button pressed
[21:01:48] quite a long reboot ;-)
[21:01:50] what's this?
[21:02:22] see channel topic line
[21:04:09] * Coren is having issues.
[21:04:16] "yeay"
[21:06:47] back in business! thanks
[21:07:08] ... not quite.
[21:07:09] oh, that was a bit premature
[21:07:12] Not sure why either.
[21:07:20] login kicks me out right after MOTD
[21:07:41] Yeah, it's not getting the filesystem right. Shouldn't be long.
[21:17:49] Ah, there we go. I think.
[21:18:27] ... not quite, seemingly.
[21:19:29] Ah, just was slow for a bit.
[21:20:51] But, as expected, I'll have to reboot the grid. How... annoying.
[21:27:17] Coren: You only worked on labnfs.pmtpa.wmnet:/tools/*, so Toolsbeta or beta don't need to be rebooted, right?
[21:28:02] hm, when I do "become projectname" the project's .bashrc is not parsed and my user defined aliases are gone, too
[21:28:10] scfc_de: Nope. As I've said in the original email, they almost certainly will.
[21:28:25] how can I persistently set shell command aliases?
[21:28:35] dschwen: "become" gives you a login shell, it won't source .bashrc by default. You can source it from its .profile though.
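The .profile approach Coren describes might look roughly like the fragment below. This is a minimal sketch, assuming the tool account's login shell is bash and that the aliases live in the tool's own ~/.bashrc; `source_if_readable` is a hypothetical helper name chosen for illustration, not something Tool Labs provides.

```shell
# Sketch of a fragment for the tool account's ~/.profile, which login shells
# (such as the one "become" starts) read instead of ~/.bashrc.

# Hypothetical helper: source a file only if it exists and is readable.
source_if_readable() {
    if [ -r "$1" ]; then
        . "$1"
    fi
}

# Only bash should read .bashrc; other shells that read .profile skip it.
if [ -n "${BASH_VERSION:-}" ]; then
    source_if_readable "$HOME/.bashrc"
fi
```

With something like this in place, aliases defined in .bashrc should be available both in login shells and in ordinary interactive shells.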
[21:28:45] ok
[21:29:03] (login shells use .profile, not .bashrc; typically one's .profile sources the latter so that you get it in both cases)
[21:29:29] That should probably end up in a FAQ somewhere.
[21:29:51] Coren: toolsbeta-login.pmtpa.wmflabs works fine for me without stale handles; "labnfs.pmtpa.wmnet:/toolsbeta/home on /home type nfs".
[21:30:13] scfc_de: and ls -l /home works?
[21:31:26] I think the probability that a reboot will be required depends on whether there were open files at the time of the switch. I would have expected that /some/ at least would have been.
[21:32:02] At any rate, now comes the waiting game. We wait to see whether we still get stalls.
[21:32:15] If not, then the boo-boo was hardware.
[21:32:20] Coren: Yes, fine. Don't know about open files, though.
[21:32:40] So we just need to wait 15 minutes? :-)
[21:32:59] scfc_de: Heh. I'm going to hold off on judgment until at least a day passes.
[21:33:20] But yeah, typically at this time the stalls occurred several times an hour.
[21:33:59] logging in to tools-dev still does not work
[21:34:06] ... works for me.
[21:34:17] Oh, tools-*dev*
[21:34:19] i get kicked out after the motd
[21:34:23] forgot to reboot that one. Sorry.
[21:34:36] do we care about tools-dev?
[21:35:25] We do. I just forgot it. :-)
[21:35:30] I'm trying to compile stuff, but it seems that the libtiff4-dev package is not installed on tools-login (the bugzilla only mentioned it being installed in the exec_environment)
[21:35:52] should I configure and compile using qsub? ;-)
[21:36:14] Actually *-dev should normally be only installed on dev_environment, never exec_. Musta been an error
[21:36:33] dschwen: Is that the Bugzilla entry that's still open?
[21:36:44] -dev should be okay now
[21:37:02] https://bugzilla.wikimedia.org/show_bug.cgi?id=52717 reopened
[21:37:08] but due to a different package
[21:38:16] Coren: IIRC exec_env is a subset of dev_env.
[21:39:12] and apparently tools-dev is the dev_env (and NOT tools-login). That should be obvious from the name.
[21:39:14] scfc_de: It is.
[21:39:30] dschwen: Actually, -login is also dev_env
[21:39:46] dschwen: tools-login and tools-dev should be absolutely the same regarding packages & Co.
[21:39:56] They are intended to be identical, you /can/ do everything on both.
[21:39:57] but libtiff4-dev was not found configuring on tools-login, only worked on tools-dev
[21:40:11] odd
[21:40:38] Is a bug.
[21:40:39] puppet.log on -login shows "libtiff4-dev : Depends: libtiff4 (= 3.9.5-2ubuntu1.5) but 3.9.6-11 is to be installed"
[21:41:36] Ah. Ops's policy of only using "ensure => installed" rather than "ensure => latest" strikes again.
[21:41:58] I can understand why the policy, doesn't mean it's not a pain in practice.
[21:42:57] [bz] (NEW - created by: Daniel Schwen, priority: Unprioritized - normal) [Bug 52902] Please install libfcgi-dev - https://bugzilla.wikimedia.org/show_bug.cgi?id=52902
[21:43:27] Just to clarify: Puppet installs always the latest version of a package, and such lib* and lib*-dev are not in sync?
[21:45:28] No, that can't be it: libtiff4 is installed in a newer version than the version libtiff4-dev is to be installed.
[21:45:56] So libtiff4 has been updated /manually/ in the past?
[21:48:08] Rebooting tools-exec-01, doesn't respond.
[21:57:29] is FastCGI even going to work on tool labs?
[22:06:33] [bz] (NEW - created by: Daniel Schwen, priority: Unprioritized - normal) [Bug 52903] Please remove service group 'qicbot' - https://bugzilla.wikimedia.org/show_bug.cgi?id=52903
[22:24:29] Hello, there seems to be something wrong with "X!'s Edit Counter" -- it's telling me that I do not exist.
[22:24:40] hi tucoxn
[22:24:54] tucoxn: have you checked whether it also has that problem for other usernames?
[22:26:05] Huon, a relatively new admin, also doesn't exist
[22:26:19] checking another....
[22:26:25] tucoxn: do older usernames show up?
yeah, sorry, let me wait
[22:26:36] "ThaddeusB does not exist."
[22:26:50] he's been around for a while (but so have I)
[22:28:00] http://tools.wmflabs.org/xtools/pcount/index.php?name=Sumanah&lang=en&wiki=wikipedia
[22:28:02] I think it's having an existential crisis: "Sumanah does not exist."
[22:28:08] :-(
[22:28:10] tucoxn: you see the MySQL error at the top?
[22:28:19] yes
[22:28:24] Warning: mysql_connect(): Can't connect to MySQL server on 'enwiki.labsdb' (110) in /data/project/xtools/public_html/counter_commons/Database.php on line 62 Warning: mysql_select_db() expects parameter 2 to be resource, boolean given in /data/project/xtools/public_html/counter_commons/Database.php on line 63 etc.
[22:28:30] ok, so that could be an issue here. Coren ^
[22:28:32] <^d> nfs :\
[22:28:35] ahhhh
[22:28:40] that's the error I get
[22:28:43] <^d> Same thing that's hurting beta atm :(
[22:28:51] tucoxn: We're currently having a problem with file storage on Wikimedia Labs.
[22:29:04] tucoxn: We're trying to fix it as soon as we can.
[22:29:15] Understandable. I'm glad it's a known problem, then.
[22:29:44] [[WP:NORUSH]]
[22:30:03] are things working now?
[22:30:24] tucoxn: just so you know, the next time you try to use a tool at wmflabs.org and you see an error message or warning of some kind, mentioning it when you mention the problem would be great
[22:30:27] helps us track things down
[22:31:09] ... actually, as far as I can tell, there /are/ no filesystem issues atm
[22:31:32] The error is really long and I didn't think it would help
[22:31:33] I'll do that next time
[22:31:56] tucoxn: thanks! even just an excerpt and a link to the page where you see it would help
[22:32:28] Coren: http://pastebin.ca/2432441 is the error
[22:32:37] example of how to get it: http://tools.wmflabs.org/xtools/pcount/index.php?name=Sumanah&lang=en&wiki=wikipedia
[22:33:20] sumanah: Ah, that was unrelated to NFS, and fixed.
[22:33:41] That link of yours points me at a pretty graph without an error. Force reload, maybe?
[22:33:58] Coren: oh, ok. lemme try
[22:34:15] yep, back up now
[22:34:35] Thank you very much!
[22:34:42] tucoxn: glad to help :-)
[22:34:43] Looks like it's working again.
[22:34:45] tucoxn: yeah
[22:34:53] tucoxn: sounds like it was a momentary blip
[22:35:06] a hiccup
[22:35:17] tucoxn: you may wish to inform TParis that it happened, in case it repeats
[22:35:26] intermittent problems are hard to find and diagnose and fix
[22:36:11] I'll paste the error on his talk page
[22:36:48] #wikimedia-tech
[22:37:08] dschwen: is that where I should paste it?
[22:37:18] sigh
[22:37:26] tucoxn: his talk page will be fine :-)
[22:37:27] i can't IRC
[22:37:32] :-(
[22:37:33] thanks!
[22:37:42] tucoxn: dschwen meant to *join* #wikimedia-tech and accidentally *said* it instead
[22:37:49] hi dschwen! hope you are over your jet lag
[22:38:01] barely
[22:38:04] At least one bit of good news: not using the controller that was suspected of being faulty results in apparently no stalls on NFS
[22:39:06] It does not cleanly isolate the issue as much as I'd have hoped, but it does mean that we are functional in the meantime.
[22:41:10] tucoxn: I copied the error message into http://pastebin.ca/2432441 so you can use that when you tell TParis about the problem
[23:17:36] Coren: So we're using a different hardware controller now?
[23:17:41] But of the same type?
[23:18:26] scfc_de: Of /nearly/ the same type. We're using the onboard H700 as opposed to the external H800
[23:19:17] In theory, it's the same chipset; it's definitely the same driver.
[23:21:40] Well, the Ganglia CPU graph looks very promising.
[23:25:12] hrm, beta labs still down it seems
[23:27:21] and zero log files under @bastion1:/data/project$
[23:35:40] chrismcmahon: Are there stale NFS handles?
[23:35:50] scfc_de: not that I can see
[23:36:23] scfc_de: I try to ssh from bastion to deployment-* hosts and just get kicked out immediately. the hosts seem to be up, but I can't get to them
[23:37:45] chrismcmahon: I think that's the condition Coren talked about in his labs-l post so I think that the (all) instances need to be rebooted.
[23:39:19] scfc_de: I rebooted them all off the labconsole Special:NovaInstance page about an hour ago
[23:40:45] Coren: ^