[01:25:02] marktraceur: updated patchset with another bugfix for skipping tutorial [01:54:40] zz_YuviPanda: What a strange channel to use [02:40:49] [bz] (8NEW - created by: 2Matthew Flaschen, priority: 4Unprioritized - 6enhancement) [Bug 50999] Install git-review on Labs - https://bugzilla.wikimedia.org/show_bug.cgi?id=50999 [04:00:11] [bz] (8NEW - created by: 2Tilman Bayer, priority: 4Unprioritized - 6normal) [Bug 50556] Total number of Wikipedia articles is ca. 11 million too high - https://bugzilla.wikimedia.org/show_bug.cgi?id=50556 [06:23:39] * zhuyifei1999 wonders why no one is saying a word [07:44:18] * zhuyifei1999 wonders why zeljkof is flooding this channel [07:48:54] zhuyifei1999: it should stop soon [07:49:00] I spoke with him [07:49:17] he's disabling his client b/c he's not on a very good connection right now [07:49:55] Jasper_Deng: ok. Thanks for telling me. [07:52:32] hi [07:54:11] * zhuyifei1999 says hi back [08:03:43] Hi petan :) [08:15:14] Unable to parse the feed from http://rss.gmane.org/messages/excerpts/gmane.org.wikimedia.labs this url is probably not a valid rss, the feed will be disabled, until you re-enable it by typing @rss+ mail [08:16:25] @rss+ mail [08:16:26] Permission denied [10:58:14] [bz] (8RESOLVED - created by: 2Matthew Flaschen, priority: 4Unprioritized - 6enhancement) [Bug 50999] Install git-review on Labs - https://bugzilla.wikimedia.org/show_bug.cgi?id=50999 [11:35:26] Cyberpower678: Can you wait for your cloak before you join the channel? [11:35:54] zhuyifei1999, ??? [11:36:45] (11:34:29) Cyberpower678!~Cyberpowe@spruce-4.stoweaccess.com just joined the channel [11:36:46] (11:34:29) Cyberpower678!~Cyberpowe@spruce-4.stoweaccess.com has quit [Changing host] [11:36:47] (11:34:29) Cyberpower678!~Cyberpowe@wikipedia/Cyberpower678 just joined the channel [11:37:03] time converted to UTC [11:37:08] Out of my control. [11:38:04] but you can wait for your cloak in !status [11:39:38] Cyberpower678: http://freenode.net/faq.shtml#nocloakonjoin [11:55:22] Cyberpower678: That's better [11:55:48] zhuyifei1999, I didn't do anything. The network is just stupid [11:56:12] :o [11:57:37] Yeah, you're using X-chat, which is in http://freenode.net/faq.shtml#nocloakonjoin [12:00:29] Cyberpower678: But how come I havn't got a cloak yet? [12:00:43] I submitted long ago [12:02:45] LOL [12:38:12] hi, how is prior calculated for sge jobs? [12:57:46] can someone help me with a labs instance that goes into the "stopped" state when I try to reboot it? [13:03:15] so now my instance is up but seems to be in 10.4.0 instead of 10.4.1 and I get no route to host when I try to connect to it. [13:04:23] manybubbles: seems "they" are still sleeping [13:04:31] perhaps Thehelpfulone can help though [13:04:58] AzaToth: man, sleeping, what a pain. I wish I didn't have to do it. [13:05:04] heh [13:05:17] * AzaToth is listening to Pokarekare Ana (Vocalise) by Traditional [13:06:58] o.o [13:07:23] AzaToth what's up [13:07:36] space [13:08:24] clouds [13:08:39] birds [13:10:11] * YuviPanda is getting a new Linode VPS [13:33:10] YuviPanda: Linode rokcs [13:33:18] drdee indeed [13:33:29] drdee: i have one at prgmr, but they *were* cheaper. Linode is better at same cost right now... [13:33:50] drdee: I tried to sign up for hetzner, but they asked me for a copy of passport + credit card photocopy.... not happening :) [14:29:31] I'm sure no one is in who can fix this either but I'll just throw it out there - when I try to use nfs for /data/project it just hangs. 
[14:29:51] Coren: ^ [14:29:57] (he was active elsewhere a short while ago) [14:30:33] "when you try to use nfs"? Context, instance? [14:30:45] Coren: solr-mw2. [14:30:58] solr is on NFS? [14:31:33] Coren: solr-mw2 is my host, sorry. when I apply role::labsnfs::client then `ls /data/project/` I hang [14:32:58] Did you reboot it after applying? autofs is completely unable to cope with changing a mount properly. [14:33:11] * Coren checks status [14:34:28] Coren: I didn't. let me try [14:34:46] Yeah, automount is wedged. [14:35:22] Automount works very well as long as you don't try to fiddle its settings while it runs. :-) [14:36:49] Coren: what logs did you check for automount? [14:37:22] manybubbles: Not logs, I just looked at the running processes and noted they were unresponsive to SIGUSR1 [14:37:25] :-) [14:37:57] Coren: ok! [14:38:11] Coren: well I'll keep going from here. thanks! [14:38:29] I note that the instance is back up and properly uses NFS now. \o/ [14:43:37] hey Coren [14:43:47] when are you going to start that mediawiki hosting project? [14:44:06] want any help w that? :0 [14:44:12] Probably after Wikimania. [14:44:16] ok [14:46:31] YuviPanda: So whats the status on wsgi? I saw the bug mail [14:46:45] lazyktm: yeah, should be done by this week, says Coren [14:46:52] yay :D [14:46:55] I know that there are puppet patches being merged, so :) [14:47:05] lazyktm: it also won't be dealing with apache, so will be significantly faster too [14:47:25] Nonono. I said it /might/ be in this week, but that's optimistic and that next week is more likely. [14:47:56] I can't hear you! :) [14:47:58] YuviPanda: There's still an Apache in front to proxy so that I can rewrite the headers. [14:48:22] Coren: of course, but the CGI overhead will be gone, and uwsgi is in general much faster than apache+mod_wsgi (or so I am told) [14:49:29] Coren: you know you can use toolsbeta to test apaches? :P [14:49:53] I'm not touching the apache config, really. [14:50:01] I think you will need to [14:50:08] We already /have/ a proxy. :-) [14:50:11] but... I meant upgrade to 2.4 more than configs [14:50:32] Ah, yes, that upgrade is postponed for quite a while. It'll go to toolsbeta first certainly. [14:50:37] Coren: hmm, also are the web servers accesible from other labs instances? [14:50:38] cool [14:51:14] YuviPanda: They are, but not through the floating IPs (Openstack limitation) [14:51:27] 'floating IPs'? [14:54:01] hi [14:54:46] anyone around who can increase our CPU quotas for the wikidata-dev project ? [14:56:01] andrewbogott: ^ [14:56:12] or Coren [14:56:14] sure, how many do you need? [14:56:45] Abraham_WMDE, increase by how much? [14:56:50] we might also like an instance for [14:56:52] solr stuff [14:57:12] i don't know how much we need [14:57:17] 10 more CPUs would be fine [14:57:19] :) [14:57:22] yeah [14:57:23] thx. [14:57:37] and one public IP address would also be very helpful [14:58:09] denny is moving some of his tools to labs [14:58:14] wikidata stuff [14:58:27] Abraham_WMDE, ok, should be all set. [14:58:55] andrewbogott: great! thank you [14:59:10] thanks! 
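For anyone hitting the same hang manybubbles describes above (an `ls /data/project` that freezes after role::labsnfs::client is applied), a minimal illustrative probe follows. It is not something from the channel or from puppet, just a sketch that times out instead of wedging your shell; the path and timeout are example values only:

    import os
    import threading

    def nfs_responds(path="/data/project", timeout=10):
        """Return True if listing `path` completes within `timeout` seconds.

        The listing runs in a daemon thread, so if the automounter is wedged
        (the hang described above) this function times out instead of
        freezing the caller along with it."""
        result = []

        def probe():
            try:
                os.listdir(path)
                result.append(True)
            except OSError:
                result.append(False)

        t = threading.Thread(target=probe)
        t.daemon = True   # let the interpreter exit even if the thread stays stuck
        t.start()
        t.join(timeout)
        return bool(result and result[0])

    if __name__ == "__main__":
        print("mount responsive" if nfs_responds() else "mount hung, missing or unreadable")

Per Coren above, a mount that stays wedged like this generally needs the instance rebooted rather than any amount of poking from userspace.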
[15:00:33] [bz] (8NEW - created by: 2Chris McMahon, priority: 4Unprioritized - 6major) [Bug 50622] Special:NewPagesFeed error - https://bugzilla.wikimedia.org/show_bug.cgi?id=50622 [15:00:47] [bz] (8REOPENED - created by: 2Chris McMahon, priority: 4Unprioritized - 6major) [Bug 50623] Entering AFTv5 feedback causes error - https://bugzilla.wikimedia.org/show_bug.cgi?id=50623 [15:01:09] andrewbogott: the public IP is not posible to allocate, does this take some time? [15:01:30] Abraham_WMDE, sorry, I missed the thing about IPs. One moment. [15:03:48] Unable to parse the feed from http://rss.gmane.org/messages/excerpts/gmane.org.wikimedia.labs this url is probably not a valid rss, the feed will be disabled, until you re-enable it by typing @rss+ mail [15:04:35] Abraham_WMDE: OK, raised. [15:06:21] @rss- mail [15:06:21] Item was removed from db [15:10:08] [bz] (8NEW - created by: 2orenbochman, priority: 4High - 6major) [Bug 50222] tool-labs is not configured correctly to run phpunit tests - https://bugzilla.wikimedia.org/show_bug.cgi?id=50222 [15:10:22] [bz] (8NEW - created by: 2Yuvi Panda, priority: 4Unprioritized - 6normal) [Bug 50422] Replicate the Gerrit mysql database to labsdb - https://bugzilla.wikimedia.org/show_bug.cgi?id=50422 [15:10:23] [bz] (8NEW - created by: 2Peter Bena, priority: 4Normal - 6normal) [Bug 48930] (Tracking) Database replication services - https://bugzilla.wikimedia.org/show_bug.cgi?id=48930 [15:11:10] * dan-nl waves at andrewbogott [15:11:18] [bz] (8RESOLVED - created by: 2billinghurst, priority: 4Low - 6normal) [Bug 35488] bots.wmflabs.org no https - https://bugzilla.wikimedia.org/show_bug.cgi?id=35488 [15:12:49] andrewbogott: still having a problem accessing a labs instance … i can get into bastion, but i'm never allowed into an instance. created a new instnace and tried to get into it without success. do you know what might be wrong? [15:13:04] [bz] (8NEW - created by: 2Tim Landscheidt, priority: 4Low - 6enhancement) [Bug 50585] Silence the qacct transfer jobs and monitor them with Icinga instead - https://bugzilla.wikimedia.org/show_bug.cgi?id=50585 [15:51:51] Coren: Is /shared/viewstats/ abandoned, or is whatever copies the files just broken since May 31? [15:52:36] anomie: That data is provided for by one of the volunteers, so I don't know for certain how it is updated. I'd ask on labs-l [15:53:33] what is that? [15:53:56] Coren: Ok. Someone asked something on wikitech-l that seems to be related (subject line "[Wikitech-l] How to get the number of pages in a category"), so I mentioned its existence. [15:54:27] ... that doesn't seem related, actually. :-) [15:55:39] Coren: He asked about page view stats for pages in the categories, too [15:55:52] Ah, /that/ is related. :-P [15:56:04] I see the thread. [15:59:43] is anyone else having an issue logging into their instances? [16:04:00] hi is there a list of the all the available dbases being replicated on labs ? [16:05:12] OrenBochman: It should be "all of them", but you can get a cheat list by taking a look at /etc/hosts on any of the tool labs instances; they are enumerated there. [16:14:36] thanks - i'm adding support to my tools for all dbs in a bit ... [16:19:41] Coren: Could you update the list at /Help so that we have something authoritative on what is supposed to work? [16:25:24] is there an issue with creating new instances and logging into them? 
[16:25:57] just created another instance and while i can get into bastion1 i cannot log into the instance [16:27:24] i am logging into toollabs, after a while [16:27:48] how did i get http://dpaste.com/1292684/ as my .profile file> [16:27:49] ? [16:27:57] * aude did not do that! [16:29:03] aude: so you get into bastion1 and then you're able to ssh into tool labs? [16:29:34] ssh tools-login [16:29:50] and i can get into other instances [16:30:34] aude: thanks for checking … i'm getting nothing on my instances, even a new one i just created [16:31:06] no idea why [16:31:08] aude: You're the only user with such a .profile (which file identifies as "VAX COFF executable" -- wow). [16:31:28] yikes [16:31:50] i think this is my second time logging in [16:32:33] * aude does not know what VAX COFF executable is [16:34:14] aude: Did you compile anything on Tools? (Even though it would be highly unlikely that a compilation product is saved as ~/.profile.) [16:34:52] My assembler-fu is too weak to dissect that further. [16:35:28] scfc_de: i doubt it [16:35:30] * aude checks [16:36:19] who can i ask for help in troubleshooting why i cannot log into an instance? [16:36:20] wtf, it seems my tool got deleted! [16:36:27] * aude can't find my stuff [16:37:20] gah! [16:37:38] andrewbogott: Can you help dan-nl? [16:37:46] aude: What was the tool's name? [16:37:47] Coren: help! [16:37:50] geonotice [16:37:58] ah, yes, dan-nl, thought I missed you. [16:38:06] * aude making something that displays the geonotices [16:38:10] * andrewbogott reads backscroll [16:38:26] i see the directory still in /data/projects [16:38:37] aude: /data/project/geonotice is *file*, should be a directory. [16:38:44] wtf! [16:38:50] i do have a copy, but still [16:38:59] who's deleting tools? or how? [16:39:02] Another "VAX COFF executable". [16:39:45] Coren: Could you take a look at that? It seems as it has been this way since May 27 12:38. [16:40:31] doesn't leave me confident to have tools there [16:40:33] andrewbogott: i may be doing something wrong of course, but the login routine i'm using is the same i've been using for about 9 months … i can get into bastion1 and then when i ssh into an instance ssh just hangs on offering the rya pub key … it indicates the correct path for the key, but it never connects with the instance [16:41:17] aude: That's the only file in /data/project, all the other tools seem fine. Was it Java/Perl/PHP/Python/...? [16:41:22] dan-nl, tell me project and instance again? [16:41:29] php i think [16:41:40] if that [16:41:46] could be just html and js [16:42:03] andrewbogott: ryan accidentally deleted the original so i created a new one this afternoon …. https://wikitech.wikimedia.org/wiki/Nova_Resource:I-0000080b [16:42:38] ah, yes i remember...... [16:42:49] aude: But nothing where something would be compiled and a "--prefix $HOME" could screw something up? [16:42:53] it was just a js file i think [16:42:59] and maybe index.html or index.php [16:43:11] * aude doesn't know how to do that [16:43:20] no python [16:43:55] Coren: Could aude's problems be an artifact of the XFS troubles? [16:44:06] http://en.wikipedia.org/w/index.php?title=User:Aude/common.js&diff=prev&oldid=557189866 [16:44:13] dan-nl, try again now? [16:44:18] i used the js file initially for testing my changes [16:44:31] and then with the intention to have something to view the geonotices [16:44:48] to see what areas they cover [16:44:49] andrewbogott: that worked, thanks! what was the problem? 
[16:45:16] /home wasn't mounting properly because of a stuck process on the file server. [16:46:38] andrewbogott: thanks. now i can get to setting everything up again … [16:47:00] andrewbogott: is there anyway i could have figured that out on my own? [16:47:09] dan-nl, not really :( [16:47:31] aude: We had a filesystem corruption last week (cf. labs-l), so this may be related to that. But we need to wait for Coren to investigate that. [16:47:50] scfc_de: could be [16:48:01] since i had not much there, not a big deal [16:48:23] but reminder to always have a copy of everything in github or gerrit [16:48:42] aude: +1000 :-). [16:49:28] my .profile is in group adminbot [16:49:32] is that normal? [16:50:08] my .bashrc is owned by reza? [16:50:17] (another user?) [16:50:28] my.cache owned by amire80 [16:50:31] .cache [16:50:55] the .bashrc is nonsense [16:51:02] liek a list of wiki pages [16:51:06] like [16:51:37] .bash_logout also owned by amir [16:52:30] aude: No, that's not normal. [16:52:36] not normal at all [16:52:59] aude: All files in your user directory should be owned by aude.wikidev. [16:53:09] yeah :) [16:54:42] * aude waits for coren [17:10:14] * Coren is back. [17:11:24] Coren: any idea what happened to my tool and toollabs account? [17:11:41] aude: There was filesystem corruption, but as far as I could tell there was no data loss except for the rollback (because we restored from a snapshot dating ~1.5h before the problem) [17:11:43] home directory files not owned by me [17:11:57] when did this occur? [17:12:12] aude: You're the only user having reported any issues, so it might be unrelated. [17:12:16] aude: Early last week. [17:12:22] ok [17:12:32] aude: May I go take a look by myself? [17:12:34] there's no way i could have done this stuff [17:12:35] yes [17:12:58] like my .bash_rc is interesting [17:13:08] owned by someone else and its contents [17:13:19] .bashrc [17:13:31] That is... wow. How completely random. [17:13:37] yeah! [17:13:48] and my tool directory disappearing [17:13:59] What's the name of your tool? [17:14:03] geonotice [17:14:15] ! 
[17:14:15] There are multiple keys, refine your input: !log, $realm, $site, *, :), ?, access, account, account-questions, accountreq, add, addresses, addshore, afk, airport-centre, alert, amend, ask, bang, bastion, beta, bible, blehlogging, blueprint-dns, bot, bots, botsdocs, broken, bug, bz, chmod, cmds, console, cookies, coren, Coren, credentials, cs, Cyberpower678, damianz, damianz's-reset, db, del, demon, deployment-beta-docs-1, deployment-prep, doc, docs, domain, enwp, epad, etherpad, extension, failure, flow, forwarding, gerrit, gerritsearch, gerrit-wm, ghsh, git, git-puppet, gitweb, google, group, hashar, help, helpmebot, hexmode, home, htmllogs, hyperon, info, initial-login, instance, instance-json, instancelist, instanceproject, ip, keys, labs, labsconf, labsconsole, labsconsole.wiki, labs-home-wm, labs-l, labs-morebots, labs-nagios-wm, labs-project, labswiki, leslie's-reset, link, linux, load, load-all, logs, logsearch, mac, magic, mail, manage-projects, meh, mobile-cache, monitor, morebots, msys-git, nagios, nagios.wmflabs.org, nagios-fix, nc, newgrp, new-labsuser, new-ldapuser, nova-resource, op_on_duty, openstack-manager, origin/test, os-change, osm-bug, pageant, password, pastebin, pathconflict, petan, petan..., petan-build, petan-forgot, ping, pl, pong, port-forwarding, project-access, project-discuss, projects, proxy, puppet, puppetmaster::self, puppetmasterself, puppet-variables, putty, pxe, pypi, python, pythonguy, pythonwalkthrough, queue, quilt, rb, remove, replicateddb, report, requests, resource, revision, rights, rq, rt, rules, Ryan, Ryan_Lane, ryanland, sal, SAL, say, screenfix, search, searchlog, security, security-groups, seen, sexytime, shellrequests, single-node-mediawiki, snapshits, socks-proxy, ssh, sshkey, start, stats, status, stucked, sudo, sudo-policies, sudo-policy, svn, taskinfo, terminology, test, Thehelpfulone, tooldocs, tools-admin, tools-bug, tools-help, tools-request, tools-web, tunnel, tygs, unicorn, venue, vim, vmem, whatIwant, whitespace, wiki, wikitech, wikiversity-sandbox, windows, wl, wm-bot, wmflabs, [17:14:16] i didn't really do much work on it yet, so no data loss [17:14:24] blergh [17:14:25] That... wow. [17:14:26] huh :) [17:14:30] !coren [17:14:30] The toolmeister: http://www.mediawiki.org/wiki/User:MPelletier_(WMF) [17:14:45] * addshore gently slaps Coren with a fish for the ping :P [17:14:46] AzaToth: Yeah, I wasn't talking to wm-bot. :-) [17:14:50] hehe [17:15:03] !:) [17:15:03] /me laughs [17:15:11] !* [17:15:12] $* [17:15:19] !addshore [17:15:19] fail [17:15:22] :D [17:15:22] !!log [17:15:22] petan needs a new hobby :P [17:15:27] * Coren inventories the other tools' homes. [17:15:44] gwicke: can't decide? [17:16:29] AzaToth: switching from wireless to wired triggers a bip reconnect.. [17:16:58] k [17:17:01] aude: As far as I can tell, your home is the only victim. What in blazes could have been different about you and your tool? [17:17:40] my tool was nothing essentially [17:17:45] aude: Your tool's home has been... replaced by a regular file. [17:17:54] i had a js file in it and maybe index.php or index.html [17:17:56] # file geonotice [17:17:56] geonotice: VAX COFF executable - version 24279 [17:17:57] Coren: not even a link? [17:17:59] yeah [17:18:11] * Coren boggles. [17:18:36] seems to be a random file, unless someone is actually using vax [17:19:04] Coren: strings geonotice? [17:19:06] AzaToth: I haven't actually used vaxen in ~15 years, I doubt I still have executables lying around. 
:-) [17:19:12] anything intresting in it [17:19:13] * aude doesn't know what vax is [17:19:25] aude: Do you remember when your last login was? [17:19:26] aude: what your daddy used [17:19:37] :) [17:19:39] Coren: I have never used VAX [17:19:41] probably end of may [17:19:49] AzaToth: No text; just almost-strings from code. It's either actual program text or its random binary junk. [17:20:03] hmm [17:20:06] http://en.wikipedia.org/w/index.php?title=User:Aude/common.js&diff=prev&oldid=557223001 [17:20:12] * aude removed the js from my user js then [17:20:17] and I assume objdump doesn't reveal anything either? ツ [17:20:33] not sure i did anything since then, though possible i logged in again [17:20:34] AzaToth: I'd have to install multiplatform binutils. [17:20:43] aude: I think he meant login to the server [17:20:55] aude: I see only aude pts/41 bastion1.pmtpa.w Tue Jul 9 16:26 - 17:01 (00:34) in July [17:20:56] i logged in today [17:21:06] i was in bastion and then ssh tools-login [17:21:18] not sure that's the right way to do but should work, right? [17:22:10] aude: It's not necessary, but it works (you can login directly to tools-login.wmflabs.org also) [17:22:13] Coren: did you chmod the geonotice? [17:22:15] chown* [17:22:25] I see it's owned by local-geonotice now at least [17:22:27] aude: Do you mind if I take a few more minutes to inspect this before I fix your accounts? [17:22:54] AzaToth: That doesn't mean much, my toolwatcher daemon would have chowned it. [17:22:55] go ahead [17:23:17] k [17:25:16] * liangent guess d8927129973ab0fd1ea1b58332555cfdf9a5f827^ is enough [17:25:54] Coren: on tools-login I made ls -l /home [17:26:01] and the groups are fucked up imo [17:26:12] AzaToth: How so? [17:26:32] * Coren sees nothing amiss. [17:26:33] 30% is group svn [17:26:44] AzaToth: That's normal for hysterical raisins. [17:26:50] oh [17:26:59] AzaToth: It depends how the account was originally created for gerrit. [17:27:03] ok [17:27:17] (I.e. group svn have users who existed back then) [17:27:33] I assume as well the ownership of /home/icinga is correct as well then? [17:28:09] Also, but for other reasons. [17:28:12] k [17:28:53] anyway, /home/pleclown is chmodded 777 [17:28:58] which must be wrong [17:29:20] aude: I can't seem to find rhyme or reason to the oddities in your home. [17:29:21] that's not good [17:29:32] [bz] (8NEW - created by: 2Quim Gil, priority: 4Unprioritized - 6enhancement) [Bug 51050] Connecting wikitech.wikimedia.org user profiles with community metrics - https://bugzilla.wikimedia.org/show_bug.cgi?id=51050 [17:29:33] Coren: we can just see if it happens again [17:29:52] as always, once i have anything i care about, i will have a backup and stuff in git [17:29:55] Coren: is /home/nettrom a "real" normal file, or an other fubar? [17:29:58] AzaToth: It is -- but it's also not entirely surprising. User education on why to not chmod 777 their homes is... harder than first appears. :-) [17:30:11] although worried about my files ending up some place else [17:30:17] Coren: I don't think you as a normal user can rechmod your own home dir? [17:30:26] AzaToth: Sure you can. You own it. [17:30:33] aint that dependent on the permission of /home? [17:30:37] e.g. for bots, not sure where people put passwords etc [17:30:40] * Coren shakes head. [17:30:44] AzaToth: I chmoded it to copy files. 
[17:30:45] k [17:30:51] would be bad for something like that of someones to end up some place else [17:31:02] or be readable [17:31:19] aude: As far as I can tell, that's a freak accident and you seem to be the only one to whom that happened. I'm going to preuse the entire filesystem to see if anything else is out of place, but it seems unlikely. [17:31:21] I was able to log as projectname, but I couldn't access anything.... [17:31:27] Coren: ok [17:31:43] Coren: /home/nettrom? [17:31:44] and can you restore my tool to a new state [17:31:49] e.g. recreate it [17:31:52] Also, it looks like things may have been misplaced, but permissions unaffected (as one would expect if inodes got shuffled around) [17:32:00] yeah [17:32:09] * aude would like to work more on the tool [17:32:19] So it's not a major security concern, even though it's troubling. [17:32:27] Coren: ok [17:32:28] aude: Your tool is back. I'll clean up your home now. [17:32:32] yay! [17:33:57] Oh! [17:34:07] huh? [17:34:13] You seem to have gotten *hardlinks* to random other files. [17:34:30] Including, get this, a hardlink to someone else's home *directory* [17:34:35] oh really? [17:35:14] ... a hardlink to a directory is definitely filesystem corruption. Damn. [17:35:25] makes sense [17:35:28] Apparently, the snapshot I restored from already had some corruption. [17:35:38] It just hadn't surfaced yet. [17:35:41] hopefully it's isolated [17:36:15] I've had nobody else report issues, and it's been over a week, so what cases there may be seem to be rare. [17:36:26] hope so [17:36:38] i'm still getting used to toollabs [17:36:40] * Coren finishes cleanup and tries to find out more precisely the extent of the problem. [17:36:46] not logging in so often yet [17:37:25] aude: Yeah, that should hopefuly be a freak even that will not reoccur; having a filesystem crash this way is not commonplace. [17:37:39] sure [17:37:39] anyone who can add me to deployment-prep? [17:38:27] aude: was that 'sure' for me? :D [17:38:39] YuviPanda: no [17:38:41] aww [17:39:02] don't know if i have permission to do that [17:39:11] YuviPanda: Normally, any member of the project can add you. [17:39:17] looks like i can [17:39:20] i can add you [17:39:23] aude: woo! [17:39:36] Coren: hmm, chrismcmahon says he doesn't have the perms? [17:39:44] success [17:40:00] woo [17:40:02] yurik: Oh, it's possible that this was restricted to project /admins/ [17:40:13] tabcomplete strikes again! [17:40:16] i don't know about project admin, though [17:40:41] let me log in [17:40:51] looks like most people are project admin [17:40:55] even me :) [17:40:56] chrismcmahon: you are project admin, so you should've been able to add me. [17:40:56] yeah [17:41:05] aude: I think I've cleaned up the residue right. [17:41:10] is... everyone project admin?! [17:41:11] Coren: thanks [17:41:17] YuviPanda: most everyone [17:41:28] * aude trusts you not to go around deleting stuff [17:41:36] :D [17:42:07] !log deployment-prep added Yuvipanda to the project [17:42:09] Logged the message, Master [17:42:31] aude: Just in case you're curious, a cursory random sample of other homes and project dirs show no obviously out of place files. [17:42:40] Coren: interesting [17:42:50] * aude wonders how i got singled out :) [17:43:16] anyway, thanks Coren for fixing stuff [17:43:24] things happen.... [17:43:30] aude: That's what I'm here for. 
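For anyone wanting to check their own home for the kind of ownership oddities aude found above (files owned by reza, amire80, group adminbot), a rough sketch follows. It only illustrates the check scfc_de describes (everything under your home should be owned by you); it is not an existing Tool Labs utility:

    import os
    import pwd

    def foreign_files(root=os.path.expanduser("~")):
        """Yield (path, owner) for entries under `root` that are not owned
        by the current user, the symptom aude reports in this log."""
        me = os.getuid()
        for dirpath, dirnames, filenames in os.walk(root):
            for name in dirnames + filenames:
                path = os.path.join(dirpath, name)
                try:
                    st = os.lstat(path)
                except OSError:
                    continue
                if st.st_uid != me:
                    try:
                        owner = pwd.getpwuid(st.st_uid).pw_name
                    except KeyError:
                        owner = str(st.st_uid)
                    yield path, owner

    if __name__ == "__main__":
        for path, owner in foreign_files():
            print("%s\towned by %s" % (path, owner))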
[17:43:36] :) [17:43:44] alright, time to go home and eat [17:43:49] * aude around later [17:44:37] * Damianz looks at Coren [17:44:43] Coren: I seem to be having the same problem dan-nl was having earlier. ssh's to bastion, can't to others from there. stuck at 'debug1: Offering RSA public key: /Users/yuvipanda/.ssh/licheking' [17:45:28] * YuviPanda adds more vs [17:45:49] debug2 tells me debug2: we sent a publickey packet, wait for reply [17:45:59] YuviPanda: Needz moar contexts! What instance? [17:46:08] Coren: deployment-sql02 [17:46:11] in deployment project [17:46:22] nothing extra in debug3 [17:46:51] YuviPanda: At first glance, the box is dead/dying. [17:47:06] too many things seem to be dying around here :( [17:47:29] YuviPanda: AFAICT, it's trashing its poor little core out. Runaway processes? [17:47:40] Coren: probably. i can ssh to other places fine [17:47:45] deployment-bastion works [17:47:54] hmm, I see *** /dev/vdb will be checked for errors at next reboot *** [17:47:57] wonder if that is ominous [17:48:05] mm, petan is listed as 'emergency contact' :D [17:48:34] petan: around? [17:48:58] YuviPanda: It normally isn't (omnious); debian preemptively checks filesystems at interval and after a number of mounts. [17:49:06] ah, okay [17:50:27] YuviPanda: Yeah, that box's userspace is dead or dying; but I can't log in with root to see what's going on. What's on the console log? [17:51:10] Coren: just DHP noise [17:51:21] Coren: but last entry was in Jul 2 06:29:49 deployment-sql02 dhclient: bound to 10.4.0.248 -- renewal in 54 seconds. [17:52:23] Krinkle: would you know where I can run maintanance scripts on deployment-prep? [17:52:53] YuviPanda: I think we need to call TOD on that box. [17:53:01] TOD? [17:53:06] Total "O" "D"? [17:53:06] Time of Death. :-) [17:53:08] ah [17:53:11] :) [17:53:22] hmm, if only things had documentation... :( [17:53:39] YuviPanda: There is a main host in the beta cluster (like "fenari" and "tin [17:53:39] * YuviPanda runs a find [17:53:46] YuviPanda: There is a main host in the beta cluster (like "fenari" and "tin" in production) that should be used for things like this [17:53:53] Krinkle: bastion? [17:53:55] Krinkle: [17:53:55] I don't know which one it is, but I bet it is in the documentation. [17:54:05] Should be 'tin' nowadays. [17:54:13] Last I heard, anyways. [17:54:15] Well, not in beta, or is it called tin there as well? [17:54:25] there's no 'tin' here, bastion looks closest [17:54:30] tin isn't a bastion, fenari is. You ssh from fenari to tin. [17:54:31] let me try to find documentation [17:54:49] use the host where the canonical source of mediawiki and extensins and wmf-config are [17:54:57] from where you run sync/scap for beta etc. [17:55:00] hmm, okay [17:55:11] Krinkle: just to confirm, betalbs is the deployment-prep project, right? [17:55:18] Yes [17:56:12] okay, https://wikitech.wikimedia.org/wiki/Deployment/Overview is what I found [17:56:45] ah [17:56:45] https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep [18:11:07] petan: You could use https://wikitech.wikimedia.org/w/api.php?action=ask&query=%5b%5bCategory%3aShell%20Access%20Requests%5d%5d%20%5b%5bIs%20Completed%3a%3aNo%5d%5d%7c%3fShell%20Request%20User%20Name&format=json and https://wikitech.wikimedia.org/w/api.php?action=ask&query=%5b%5bCategory%3aTools%20Access%20Requests%5d%5d%20%5b%5bIs%20Completed%3a%3aNo%5d%5d%7c%3fTools%20Request%20User%20Name&format=json for wm-bot's @requests. 
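A sketch of consuming the first of the two ask queries scfc_de links just above (open shell access requests, with the requesting user name as a printout). The nested query/results/printouts layout used below is the usual Semantic MediaWiki ask response shape, but verify it against a live response before relying on it:

    import json
    import urllib.request

    URL = ("https://wikitech.wikimedia.org/w/api.php?action=ask"
           "&query=%5b%5bCategory%3aShell%20Access%20Requests%5d%5d"
           "%20%5b%5bIs%20Completed%3a%3aNo%5d%5d"
           "%7c%3fShell%20Request%20User%20Name&format=json")

    def pending_shell_requests(url=URL):
        """Yield (request page, requesting user names) for open requests."""
        with urllib.request.urlopen(url) as resp:
            data = json.loads(resp.read().decode("utf-8"))
        results = data.get("query", {}).get("results", {})
        for page, info in results.items():
            names = info.get("printouts", {}).get("Shell Request User Name", [])
            yield page, names

    if __name__ == "__main__":
        for page, names in pending_shell_requests():
            print(page, names)

The second URL (Tools Access Requests) works the same way with the category and printout names swapped accordingly.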
[19:34:36] !log deployment-prep Attempting to reboot a bunch of instances prevent ssh access because /home is borked . uploadtest08 uploadtest07 -cache-upload04 -cache-text01 parsoid2 cache-mobile01 deployment-sql02 cache-upload03 [19:34:39] Logged the message, Master [19:35:45] hashar: home is messed up? [19:36:05] yeah that randomly happen from time to time [19:36:16] Coren: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Labs%20NFS%20cluster%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1373398541&g=cpu_report&z=large [19:36:17] hashar: thanks for the help with -prep :) [19:36:17] though that the first time I see it happening on that many instances [19:36:24] it is usually a single instace [19:37:02] hashar: in which way is it screwed up? [19:37:07] YuviPanda: now you are an expert :-] [19:37:16] :D [19:37:19] Ryan_Lane: I see nothing worrisome in those graphs? Something you want me to notice? [19:37:23] hashar: apparently [19:37:38] Ryan_Lane: ssh on port 22 works and get us authentified, but never reach the mot / prompt [19:37:42] Coren: hashar is saying /home is screwed up on a bunch of instances [19:37:52] NFS instances? [19:37:59] that is why I wanted to add the role::labsnfs directly in base.pp [19:38:03] and the cpu started spiking [19:38:09] * Coren checks. [19:38:20] hashar: did you see my −1 rather than -2? [19:38:22] deployment-prep has been screwed since at least sunday evening (eu time) [19:38:31] hashar: dit betalabs just die? [19:38:38] Ryan_Lane: only looked at the emails replies :-] [19:38:50] hashar: is this nfs or gluster homes? [19:39:26] Coren: waitio is at like 90% right now [19:39:56] Ryan_Lane: good question. I have no idea honestly [19:39:59] As far as I can tell, there is quite a bit of access going on, but the server is responsive from tools at least. [19:40:50] YuviPanda: seems good again [19:40:58] yeah was just out for a minute [19:41:24] I see access stabilizing down to more typical levels. Someone started something very heavy on an instance? [19:42:01] There's still ~2000 opens/second going on. [19:43:00] * Coren wishes there was a clean way to figure out /where/ that load came frmo. [19:43:16] Smells like a grep -R [19:43:27] Coren: I restarted a bunch of instances a few minutes ago [19:43:46] I... ran a find, but I killed it after a few seconds. shouldn't be it [19:44:06] The server nevertheless responds relatively well to the load, as far as I can tell. [19:44:20] we had a similar issue months and months ago, some cronjob on the dumps project was taking all the GlusterFS I/O for at least a couple hours every day at 6am utc [19:46:02] It's a big write pattern, actually, so something like a git clone of something /huge/ [19:46:17] I had a bug or RT filled a while ago to request a disk io plugin for Ganglia [19:46:21] can't find it anymore :( [19:47:21] Whatever it is, it started a bit under four hours ago or so. [19:47:55] ahh https://bugzilla.wikimedia.org/show_bug.cgi?id=36994 :-] [19:47:58] I should puppetize it [19:48:09] a few weeks ago I have added a plugin for Jenkins :-] [19:48:43] 16:00 UTC. Does that time have any significance for deployment-prep? [19:49:15] https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&c=Labs+NFS+cluster+pmtpa&m=cpu_report&s=by+name&mc=2&g=cpu_report isn't all that ambiguous about when it started. :-) [19:50:16] Coren: Can you see from the (relative) network traffic which instance is causing this? 
[19:51:16] maybe by looking at the labs ganglia http://ganglia.wmflabs.org/latest/ [19:51:37] hmm it is missing most of the projects apparently [19:51:42] * Coren installs ntop [19:52:00] Coren: solr project [19:52:23] Coren: it has a huuuuge network spike that started around that time [19:52:49] week view of solr-mw2 http://ganglia.wmflabs.org/latest/?r=week&cs=&ce=&c=solr&h=solr-mw2&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [19:52:59] manybubbles: are you doing something nasty on solr-mw2 instance ? :) [19:53:04] 10.4.0.106 seems to be the biggest culprit [19:53:21] hashar: absolutely. [19:53:45] I can stop it and start it if you need me to [19:53:53] can it be paused ? [19:54:12] though it might be unrelated to the issue Coren is investigating [19:54:25] paused [19:55:27] manybubbles: is your script some NFS server maybe ? [19:55:42] hashar: it certainly beats nfs as hard as it can [19:56:06] * hashar likes stress tests [19:56:46] Yep. Definitely solr-mw2. That things is hammering on the NFS server. :-) [19:56:46] I like having a nice big dataset to work with. [19:56:50] it should be mostly stopped [19:57:02] Ryan_Lane: finally managed to get the instance to reboot (it did not work for some reason on monday). They are all back up with /home on glusterfs [19:57:07] manybubbles: It's not that horrible, actually, and the stress test is not bad per se. :-) [19:57:38] I was pleased to see that tools-login remained reasonably responsive despite the poor NFS server crying out in pain. :-) [19:57:50] Coren: I didn't mean to break stuff. It shouldn't be doing it any more. [19:58:35] manybubbles: No worries there; you broke nothing and the whole setup /does/ need to be able to keep working in such conditions. We were just a little surprised at the sudden load. [19:59:37] hashar: That said, whatever affected your /homes could not have been that. File I/O was a little laggy, but still working on bastion and tools. [19:59:59] Coren: yeah that was a different issue I guess. Rebooting fixed it up anyway [20:00:23] hashar: Do you know which of those instances are on gluster vs NFS? [20:00:45] oh it wasn't my network responsible for the very slowness? [20:01:09] hashar: If they're not all one or the other, the problem probably needs to be looked into more closely. [20:01:52] henna: It may or may not have been. The filesystem was certainly not very fast, but you'd only see this in file I/O, not in - say - responsiveness to keyboard input and such. [20:02:47] Coren: I'm looking at etc/hosts - how do I find out the dbname for say he.wikipedia ? [20:03:08] OrenBochman: Ah, you don't know the convention? It'd be named 'hewiki' [20:03:22] Hey Ryan_Lane [20:03:32] I'm writing up the role::puppet::self documentation [20:03:33] So you'd want to connect to the hewiki_p database on hewiki.labsdb [20:04:02] I noticed that role::puppet::self is in the list of default enable-able classes on the instance configuration page [20:04:15] OrenBochman: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Help#Production_replicas [20:04:19] but the parameters that are settable ther are 'is_labs_puppet_master' and 'is_puppet_master' [20:04:32] role::puppet:;self requires a configurable paramater just named 'puppetmaster' [20:04:48] I use enwiki_p for english wikipedia [20:05:24] ottomata, I added rule::puppet::self to that list. [20:05:41] I think the 'is_*_puppet_master' vars are for something unrelated. 
[20:05:47] i thikn so too [20:06:04] Feel free to grep for things on that page and remove them if they don't turn up in the puppet repo. [20:06:08] it looks like the paramaters are just all stuck together at the bottom of the class grouping [20:06:17] how do I remove/add things to that page? [20:06:20] Yeah, probably should make a new group and rearrange things. [20:06:33] ottomata, here: https://wikitech.wikimedia.org/wiki/Special:NovaPuppetGroup [20:06:39] If you have privs for that page. [20:06:49] If not, just send me a laundry list and I'll do the cleanup. [20:07:13] yeah, i thikn I can only modify things in my projects there [20:07:27] ok. [20:07:35] hey, so what should I do to my thing so I can turn it on without squishing everyone? [20:07:47] Coren: files weren't opening :) [20:07:47] Are you planning to make a new doc page, or write inline docs, or edit this page? https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [20:07:51] but I was having network isseus at the time as well [20:08:49] henna: Maybe a little bit from column a, a little bit from column b. :-) [20:09:38] andrewbogott: edit that page [20:09:50] i asked Ryan_Lane if we shoudl deprecate puppetmaster::self in favor of role::puppet::self [20:09:50] ottomata: Cool. Can I make requests? [20:09:51] he said yes [20:09:54] totally [20:09:57] i was about to hit save :) [20:10:06] Yep, I agree that we should deprecate puppetmaster::self. [20:10:38] I'd like there to be a simple section at the top about self-hosting instances, and then an 'advanced' section about the client/server use with a disclaiming telling people that they probably don't actually need to do that. [20:10:57] Hm, ok [20:11:00] i kinda have it in two sections [20:11:03] but i'll add the disclaimer [20:11:11] I mention this because people periodically show up here assuming that they need to set up separate master/client, so folks are already inclined to do that... [20:11:22] so would be good to disuade folks unless they really know what they're doing. [20:11:25] thanks. [20:13:29] andrewbogott: how's it look [20:13:30] https://wikitech.wikimedia.org/wiki/Help:Self-hosted_puppetmaster [20:13:40] comments and more requests appreciated [20:14:44] [bz] (8RESOLVED - created by: 2Matthew Flaschen, priority: 4Unprioritized - 6enhancement) [Bug 50999] Install git-review on Labs - https://bugzilla.wikimedia.org/show_bug.cgi?id=50999 [20:16:10] So yeah, on the positive side, that impromptu stress test showed that the NFS server handles load gracefully and keeps on tickin'. [20:17:25] Coren: so can I put my load back one it?:) [20:18:18] manybubbles: You can, though it would be polite if you could lighten it a little bit so that other labs user don't suffer quite so much. Can you throttle your speed a little bit? Otherwise, don't worry about it - it's part and parcel of running a shared infrastructure. :-) [20:18:29] Coren: Damn, I miss the 'randomly lockup, oom and die with splitbrain files' feature of our storage [20:18:32] looks good, ottomata, thanks for updating. [20:19:07] Coren: let me see if I can slow it down. I can certainly crank down the number of concurrent inserts. [20:19:13] manybubbles: It might also be a nice thing to send an email to labs-l giving a heads' up when you start a large job like this so that people don't wonder where all the performance is gone. :-) [20:19:28] great, ok [20:19:30] andrewbogott: [20:19:32] Coren: I wonder if I'm even on that one.... [20:19:33] ... "inserts"? 
[20:19:44] I think we need to leave those extra is_blabla_puppetmaster parameters [20:19:45] manybubbles: You should be, this is where Labs announcements go. :-) [20:19:49] i see them used in puppet by main labs puppetmaster [20:20:00] but, we also need a 'puppetmaster' [20:20:02] paramater [20:20:03] added [20:20:07] ottomata, yeah, they're used, but I'll create a different group. [20:20:08] so that role::self::puppetmaster can be configured [20:20:10] manybubbles: Did you say "inserts"? Do you have a *database* on NFS? [20:20:13] ok danke [20:20:24] This lemon-lime and mind drink is weird... hmm [20:20:30] sorry: role::puppet::self* [20:20:33] * Damianz goes back to seeing if vagrant will work with Parallels [20:23:15] Coren: yup - it is the only place with space that isn't gluster. It was suggested to me yesterday. [20:23:25] ottomata, it's just the one role, right? And it's switched via the global? [20:23:30] 'puppetmaster'? [20:23:44] manybubbles: It'll /work/, but by god the performance will be teh suxx0rz. :-) [20:23:45] Coren: I've started it up again with fewer actors. I've also requested to get into labs-l. [20:24:05] Coren: surprisingly, the performance is better than local disk and way way way better than gluster. [20:24:45] manybubbles: not surprising its better than gluster [20:24:59] yes, just one role [20:24:59] Ryan_Lane: everything is better than gluster [20:24:59] manybubbles: when you say local disk, you mean /mnt? [20:25:00] exactly [20:25:10] I'd find that slightly surprising [20:25:15] though we're not using raw disk [20:25:34] Ryan_Lane: yeah - [20:25:46] Ryan_Lane: what kind of storage backends those two? [20:25:46] manybubbles: so, yeah, one thing you may have run into yesterday [20:25:47] ottomata, ok, click on the config page for an instance and tell me if the options look right? [20:26:14] manybubbles: one of your instances was on a host that had its disk fill up [20:26:21] andrewbogott: looks good to me [20:26:29] 'k thanks [20:26:31] manybubbles: local storage is 10k SAS [20:26:42] i'll send an email to labs list [20:26:43] very doubtful the NFS server has better performance :) [20:26:54] it may have more contention, though [20:26:56] Ryan_Lane: yeah - I think I was involved in that. I destroyed it because it wasn't responding at all and rebuilt it half size. [20:27:22] manybubbles: that would be why 170G suddenly became available :D [20:27:34] I was wondering [20:27:43] it's best not to use too much local disk [20:27:49] we only have 1.1T per host [20:27:56] Ryan_Lane: that is just about what I was using, yeah. The machine wasn't connectable. [20:27:57] and they are all getting quite full [20:28:29] Ryan_Lane: if we're worried about local disk you might want to make the instance limits lower with instrucions to use nfs.... [20:29:01] well, there's some legitimate cases [20:29:11] and your use of NFS is already stretching its limits [20:29:36] we may need to get some attached storage for the hosts [20:31:24] Ryan_Lane: I wish I had something better that didn't hurt everyone but I really do need to have a bunch of machines all connected together which labs is good for. [20:33:00] andrewbogott: Opinion on https://gerrit.wikimedia.org/r/#/c/72504/ ? How hard would it be to add project-wide puppet groups/variables? [20:36:11] manybubbles: yeah, our storage options are limited [20:36:28] they work for like 99% of use cases, but we could use some other options [20:36:53] Ryan_Lane: High performance storage sucks in all the cloud infrastructures I've ever heard of. 
[20:36:54] Ryan_Lane: SSDs! [20:37:11] Damianz: SSD + ceph + cinder would be a good option [20:37:18] Ryan_Lane: we had it pretty nice at lulu but the price per Gb was super duper stupid high. [20:37:22] I'd love to see ceph in labs [20:37:58] Life without SSDs suck after you've SSD'd your life, it's the future [20:38:53] Coren: Adding project-wide puppet settings that get added at instance creation time would not be hard. Adding them so they can be changed for existing instances... [20:39:05] I'm not sure. [20:39:08] Damianz: :D [20:39:16] does anyone know who maintains the labs Ganglia ? [20:39:21] manybubbles: yeah. even a small SSD cluster will be pricy [20:39:26] andrewbogott: I was considering that the "easy" solution would be to simply apply an union of both whenever we'd have applied the per-instance? [20:39:28] hashar: no one, really [20:39:39] hashar: sara set it up ages ago and no one has touched it since [20:39:44] oh [20:40:06] "sara"? I'm going to sound extra dumb now, but who is "sara"? [20:40:07] Can we have sara back, google don't need her :P [20:40:12] Coren, right, but I'm not positive that we can do that without modifying the puppet client. [20:40:15] Ryan_Lane: we used netapps - far and away more expensive the regular old ssds [20:40:16] Ah, this answers that. [20:40:46] Ryan_Lane: should I fill an infrastructure bug regarding ganglia on labs? [20:41:01] who is Sara? [20:41:01] andrewbogott: I'm not clear how that's different? Or does the puppet group interface twiddle directly the LDAP info rather than make a 'commit' to it on change? [20:41:05] Coren: she was a part time engineer. [20:41:09] Well, wait... [20:41:20] hm. [20:41:27] hashar: yes. it's slightly broken [20:41:31] I don't know exactly why [20:41:33] manybubbles: The whole netapp, normal disks with ssd shelf is pretty neat tbf [20:41:41] it re-breaks itself every once in a while [20:41:47] Ryan_Lane: we could get whoever in prod knows ganglia to look at it :-) [20:41:51] there's a cron that should be running that isn't [20:41:59] and I think puppet is what's breaking it [20:42:08] this isn't in prod [20:42:11] it's on a labs instanc [20:42:11] Coren, you're right. [20:42:13] *instance [20:42:17] There should be several ways to do it. [20:42:17] I love how we half use puppet features and half just make it create cron files that do what puppet could... totally could use puppetdb and sexify stuff loads :( [20:42:48] then you'd have 1/3 puppet features, 1/3 cron files that do what puppet could and then 1/3 puppetdb doing... :D [20:42:56] Damianz: it is nice but eats money. [20:43:14] Damianz: we can't use exported resources in labs [20:43:32] because the resources leak between projects and could be used to project hop [20:43:37] [bz] (8NEW - created by: 2Antoine "hashar" Musso, priority: 4Unprioritized - 6normal) [Bug 51068] ganglia.wmflabs.org is missing most projects - https://bugzilla.wikimedia.org/show_bug.cgi?id=51068 [20:43:56] Ryan_Lane: Sucks [20:44:11] resources work properly for environments, right? [20:44:33] Damianz: You're not kidding. Have you seen the ugly hack I had to do in toollabs:: to work around the lack of exported resources when sharing host keys? 
:-) [20:44:37] Think so, not touched them in mine for a while [20:44:56] environment support won't be around for a while, though [20:45:09] Coren: Yeah - I've got lovely python to generate monitoring stuff from ldap to get around puppet [20:45:12] Coren: I'm not a fan of using puppet for host keys anyway [20:45:19] we may as well not even use host keys [20:45:33] Ryan_Lane: For public keys? How is that problematic? [20:45:47] host gets owned, changes its host key, it propogates [20:45:54] no one knows the wiser [20:46:20] I guess it's more for MITM, but if we have MITM in Labs we're totally fucked [20:46:22] The end result is no different no matter /where/ to collect host keys, though. [20:46:32] because that means someone owned the network node [20:46:35] Why go to that effort when I could just root bastion [20:46:36] :P [20:46:41] Damianz: indeed ;) [20:46:58] Ryan_Lane: regarding ganglia on labs, can you dump the bits you know about it on https://bugzilla.wikimedia.org/show_bug.cgi?id=51068 please ? [20:47:04] yep [20:47:21] Ryan_Lane: i could look it up with one of the european folks. Any bits would help :) I could even write a bit of doc on wikitech along the way [20:47:35] I like bugzilla, my 2 year old tickets got marked as fixed, a year after they where fixed =D [20:47:46] andrewbogott: If you want, I'll take a crack at it but I'll expect you to review my changesets. :-) [20:48:29] I think it's only the UI that's tricky, really. [20:48:48] But yeah, have at. [20:49:25] andrewbogott: I expect I'll be able to copypasta much of the per-instance UI and add it to the per-project configuration now that it exists. [20:50:14] It would be good to be able to see both when looking at an instance config. [20:50:39] Like, settings that are inherited from the project should appear checked/filled in/whatever but be disabled with some indication that they come from the project page [20:50:57] Or be overridable per instance [20:51:10] Yeah, I guess if we want them overridable then we don't have to do anything at all... [20:51:19] the values will appear in the per-instance gui anyway [20:51:24] So, maybe this is easy :) [20:53:41] hashar: look what up? [20:54:05] Ryan_Lane: I mean, find out what is broken on the ganglia labs instance [20:54:21] hashar: ah, yeah. would be great if you and ariel could look at it [20:54:29] cc her on the bug? [20:54:48] hashar: I merged your change [20:54:54] Ryan_Lane: thankkkk you :-] [20:55:07] will make sure to report anything going wrong [20:55:24] * Ryan_Lane nods [20:55:43] andrewbogott: this is regarding per-project puppet settings? [20:57:26] Ryan_Lane, yeah. [20:58:27] I'd kind of like to straighten all of that up [20:58:56] I think a puppet group should be a collection of roles and variables needed for a specific thing, rather than just a grouping [20:59:01] How do you mean? [20:59:17] right now we just collect all nfs roles and variables into a grouping [20:59:22] same with apache [20:59:25] etc [20:59:29] and mediawiki [20:59:34] so, let's take MW as an example [20:59:49] I am off, see you tomorrow [20:59:57] we'd have a grouping like: single-instance-mediawiki [21:00:01] hashar: see ya [21:00:14] Ryan_Lane: Yep, I agree. But you're talking about doing that by convention, not by automating it right? 
[21:00:16] andrewbogott: and it would only show classes and variables specific to that use [21:00:20] yeah [21:01:08] I'm not totally sure how to handle project-level classes and variables [21:01:09] I'd still love to import all the variables from classes automagically to the interface and present them... but it's a bit tricky [21:01:17] Damianz: yeah, that would be ideal [21:01:27] I'd like for us to only allow role classes and variables [21:01:42] roles would be perfect [21:01:45] but we still have things that aren't in role classes [21:02:50] andrewbogott: one way to handle project level puppet stuff [21:03:05] Ryan_Lane, Coren is going to work a bit on making a GUI for making project-level settings. Presumably it'll reuse existing code so we won't have to fix that part twice. [21:03:17] when a variable or class is added, it would modify all existing instances [21:03:31] * andrewbogott nods [21:03:39] and when new instances are created, it would add them [21:03:52] a worry, of course, is that people will add classes that make puppet not run for new instances [21:03:55] The only problem with that is how to handle conflicts when a value is overridden for an instance. [21:04:06] the instance config always wins [21:04:16] Ryan_Lane: For variables, sure. [21:04:32] if there's a conflict with classes, then people just need to fix that [21:04:36] Ryan_Lane: For classes, I intended to disable the checkbox in the instance config with a 'set in the project-wide' tooltip. [21:04:52] Coren: why disable the checkbox? it should just show as enabled [21:04:58] ah. I see what you mean [21:04:59] Yes, set but disabled. [21:05:37] setting disabled isn't really doable [21:05:40] So, a project has $foo=1 [21:05:44] and a new instance is created [21:05:51] and then the instance sets $foo=2 [21:05:58] well, i guess it is in this case [21:06:00] and then someone changes the project setting to $foo=3 [21:06:01] what happens? [21:06:23] Do we simply not allow the instance to set it if it has a value in the project? [21:06:31] andrewbogott: when it's going through and setting things, it should be checking to see if its set on the instance fist [21:06:32] *first [21:06:39] if it's already set, skip that instance [21:06:49] ok. [21:07:15] So we're storing the project settings in ldap but puppet itself never actually looks at that record, it's just used for reference by the web ui? [21:07:39] we should probably store the setting in the databasse [21:07:42] database [21:07:53] hm. [21:08:00] actually, that's problematic [21:08:24] yeah, we should add the puppet class to the project [21:08:24] It's a lot easier if we do it in ldap because we can use the same code to read/write [21:08:27] and add the info to the project [21:08:56] then if we implement this somewhere other than MW it'll still work [21:08:58] that too :) [21:09:15] so, another way to handle this... [21:09:20] we could write an ENC [21:09:34] it would look at both the instance and the project [21:09:46] and it would combine the classes and variabls [21:09:50] *variables [21:09:58] it's simpler and less error prone [21:10:13] ENC = a puppet thing? [21:10:18] and we could also add support for paramaterized classes, if we wanted [21:10:18] Yeah, that'd work too. [21:10:26] andrewbogott: yep [21:10:31] external node classifier [21:10:41] it can be written in any language, as well [21:10:48] ENC is definitely the more elegant and general way to go, but I thought we wanted to avoid that? [21:10:52] Perl! Perl! 
[21:10:57] hahaha [21:11:03] python, obviously :) [21:11:11] anyway, I didn't want to avoid it [21:11:18] Bah! Haskell! [21:11:19] I kind of wanted to do this a while ago [21:11:26] exactly for this use case [21:11:35] but we've been working around it [21:11:53] Yeah, it sounds like the right tool for the job. [21:12:22] I don't care so much about parameterized classes since I'm kind of a fan of having everything wrapped in roles anyway. [21:12:29] same [21:12:30] Being able to support parametrized classes would be doubleplusgood since it'd allow us to eschew global variables. [21:12:41] hm... [21:12:57] Coren: we're trying to eliminate parameterized classes in our node definitions as a whole [21:13:24] global variables are teh evilz! [21:13:28] Yeah, although, are we being circular? Was there a reason for eliminating them other than the fact that it was hard to handle them with the web ui? [21:13:47] andrewbogott: no, we've been working towards eliminating them in production too [21:13:52] to use roles [21:14:09] handling them in the ui will definitely suck, though [21:14:11] I still haven't figured out why roles can't be parametrized. [21:14:18] Coren, Ideally we don't have either globals /or/ parametrized classes, just facts and roles. [21:14:21] Coren: why do they need to be? [21:14:52] role::blah::production [21:14:55] this-is-a-x-for-y is a clear pattern for roles, so is this-is-the-foo-of-bar [21:14:56] role::blah::labs [21:16:47] I still don't get why role::foo::bar has any intrinstic value over role::foo(something => bar) unless you're trying to do 1:1 between hosts and roles, in which case you are doing nodes wrong. :-) [21:17:53] Coren: because the former requires the use of param classes, where the latter doesn't [21:18:06] and there's no major gain with doing the former over the latter [21:18:23] Ryan_Lane: You're being circular now. You want to avoid using param classes in favor of roles, and you use roles to avoid using param classes? :-) [21:18:33] with the former you need to handle the logic using ifs or cases [21:18:36] Ryan_Lane: Well, you necessarily save at least one level of indirection. [21:19:12] in either case, if you need to handle something differently, you need to modify the manifests [21:19:31] using param classes only makes things harder in this case [21:20:04] and it's not a 1:1 between hosts and roles [21:20:11] I still don't see it. It's also not an issue as I've said before -- I've no strong philosophical stance either way so long as we pick one and are clear. :-) I just don't get it is all. [21:20:12] it's a 1:1 between a use-case and a role [21:21:01] supporting param classes doesn't gain us much and adds a requirement to our UI and our backends [21:21:29] I think it gains a great deal of clarity and expressiveness, but YMMV [21:21:44] only if you're going to do logic in your nodes [21:21:46] which is evil :) [21:22:06] * Coren feels it oddly akin to making functions multiplyby2(float), multiplyby3(float), multiplyby4(float)... :-) [21:22:37] (i.e.: poor factoring) [21:22:54] a role is meant more as a configuration item [21:25:18] float multiply2by3() { return multiplyby3(multiplyby2(1.0)); } [21:25:19] :-) [21:26:14] * Damianz rewrites Coren in go [21:27:57] * YuviPanda rewrites Damianz in rust [21:28:18] * Damianz feels all concurrent [21:38:25] New review: coren; "That's... impressively thorough." 
[labs/toollabs] (master) C: 2; - https://gerrit.wikimedia.org/r/71115 [21:38:39] New review: coren; "LGM" [labs/toollabs] (master) C: 2; - https://gerrit.wikimedia.org/r/71114 [22:09:59] petan: is bots still being used? [22:14:43] Ryan_Lane: yes [22:14:53] ok [22:15:04] needing to start freeing up disk space in places [22:15:10] bots sql is eating like 90G [22:15:33] 90G? That's less than my ~/code :P [22:15:59] heh [22:16:27] yeah, but it's eating up local disk [22:16:38] Coren, andrewbogott: please update https://www.mediawiki.org/wiki/Wikimedia_engineering_report/2013/June
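As a postscript to the external node classifier idea Ryan_Lane and andrewbogott discuss above (combine project-level and instance-level classes and variables, with the instance winning on conflict), a minimal sketch of what such an ENC could look like. The two lookup functions are stand-ins, since the real data would come from the project and instance records (e.g. in LDAP), which is not shown here:

    #!/usr/bin/env python
    # Minimal ENC sketch: Puppet invokes the script with a node name and
    # expects a YAML document listing classes and top-scope parameters.
    # The merge rule follows the discussion above: the instance config wins.
    import sys

    def project_config(fqdn):
        # Hypothetical project-wide defaults ($foo=1 in the example above).
        return {"classes": ["role::labsnfs::client"], "parameters": {"foo": "1"}}

    def instance_config(fqdn):
        # Hypothetical per-instance settings ($foo=2 in the example above).
        return {"classes": ["role::puppet::self"], "parameters": {"foo": "2"}}

    def main():
        fqdn = sys.argv[1] if len(sys.argv) > 1 else "unknown"
        proj, inst = project_config(fqdn), instance_config(fqdn)
        classes = sorted(set(proj["classes"]) | set(inst["classes"]))
        params = dict(proj["parameters"])
        params.update(inst["parameters"])    # "the instance config always wins"
        print("---")
        print("classes:")
        for cls in classes:
            print("  - %s" % cls)
        print("parameters:")
        for key, value in sorted(params.items()):
            print("  %s: %s" % (key, value))

    if __name__ == "__main__":
        main()

Wiring it up would be a matter of setting node_terminus = exec and external_nodes to the script's path in puppet.conf on the master, though that part was not covered in the channel.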