[00:02:40] maplebed: I wonder if there is way to keep those in ram more
[00:03:24] * AaronSchulz wonders if the swiftstack people have any experience with this
[00:03:36] It's on my list of things to ask Joe about tomorrow.
[00:03:44] I'm pretty sure I know what his answer will be.
[00:03:49] SSDs?
[00:03:52] 'put the container listings on SSDs.'
[00:06:32] curious, the docs say that Rackspace puts their containers on the same boxes as those running object and account servers
[00:06:49] unless they have some SSDs mounted and have the containers just map to those
[00:07:12] ...then probably don't do that
[00:07:49] or maybe those docs are just outdated ;)
[00:11:17] same boxes != same disks, necessarily.
[00:11:51] what we'll likely do, if Joe confirms SSDs, is put 2 in each storage node
[00:12:11] map containers and accounts to those and objects to the spinning media.
[00:18:22] that's what I was saying above
[00:18:38] either they do that or don't use ssds
[00:18:53] in any case, that sounds reasonable
[00:20:06] maplebed: do we have ssds lying around now?
[00:20:24] nope. I'd have to order them.
[00:20:28] rats
[00:20:45] along with some special adapters so that they'll fit in the c2100 chassis.
[00:21:00] so puring will just suck until then "p
[00:21:05] * :p
[00:21:17] *purging
[00:21:21] * AaronSchulz sits up straight now
[00:21:51] the 3-5s ones shouldn't be so bad. the 30s ones are an issue.
[00:21:59] did you look at the gdoc graph I shared with you?
[00:22:08] it shows you the distribution of container listing times
[00:23:19] yeah, I saw
[00:24:31] it's 4.5-5 for the p50 at peak hours, and the deletes add .5 sec or so...still sucks to me
[00:24:37] but at least tolerable
[00:25:24] p90 is atrocious though
[00:25:49] New patchset: Bhartshorne; "adding kaldari and bsitu to the list of people that can deploy software" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8432
[00:25:55] binasher: would you review ^^^ for me?
[00:26:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8432
[00:26:26] sure
[00:27:17] New patchset: Bhartshorne; "adding kaldari and bsitu to the list of people that can deploy software RT-2957 RT-2958" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8432
[00:27:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8432
[00:27:57] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/8432
[00:28:26] thanks binasher
[00:28:34] maplebed: did kaldari have shell access and then have it revoked at some point? weird
[00:29:07] yeah, I promise I won't hack the donation system again ;)
[00:29:13] lol
[00:29:16] uh huh.
[00:29:49] maplebed: yeah actually, is my key already in there?
[00:29:56] yeah, his account was already there and enabled, just not in the mortals list (or others).
[00:30:08] kaldari: yup.
[00:30:33] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8432
[00:30:35] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8432
[00:30:53] andrewbogott_: are you still here?
[00:31:05] your partman change is still waiting to be merged into sockpuppet. can I merge it?
[00:31:32] (the one that adds backslashes to the lines with only periods)
[00:31:52] hmm. idle 6 hrs; probably not.
[00:31:59] well, it looks like a valid change. merging anyways.
[00:32:07] maplebed: that's a very good question. untested.
[00:32:49] it'll likely only hurt the systems that use it if it's not right.
[00:33:11] ha
[00:33:18] it barfs
[00:33:22] just tested
[00:33:45] i suspect the issue is with how the definition is named when called
[00:33:51] Duplicate definition: Useradd[mark]
[00:34:54] although--it seems like a Very Bad Idea to include users in multiple groups
[00:35:44] really?
[00:35:45] the whole point of doing it this way is to take control of the insane inheritance and scoping issues we have now which result in it being virtually impossible to predict what group membership, etc will look like
[00:35:51] you're part of the fundraising group and you're part of the ops group.
[00:36:14] well
[00:36:17] if each person can exist in only one group, groups will become individuals.
[00:36:28] what do you mean by group exactly?
[00:36:38] do you mean i.e. the junk at the bottom of admin.pp?
[00:36:48] the thing you added to the host in your example.
[00:37:04] ah, that's one of many ways you can add a user to a host
[00:37:07] you said 'include sysops and include fr-tech.'
[00:37:11] right
[00:37:22] you can just as easily do it right in the node definition
[00:37:35] useradd { 'awjrichards': jgreen => 'absent' }
[00:37:36] for example
[00:37:38] I would expect you to be in both the fr-tech and sysops group.
[00:37:57] and in the end what would that mean exactly?
[00:38:15] that you have access to the system.
[00:38:30] the problem we have now is that once you have access you have full access
[00:38:40] what I needed was knobs for different flavors of access
[00:39:09] i.e. sysops end up with more supplemental groups for example
[00:39:16] huh.
[00:39:40] I think I need to talk through it in person. I was expecting users:xxxx was a collection of people, but maybe it's a collection of people/privs?
[00:40:13] oh, maybe we're not speaking in the same terms
[01:41:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 242 seconds
[01:44:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds
[02:05:34] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[02:27:28] New patchset: Jeremyb; "dedupe code: foreachwiki vs. foreachwikiindblist" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8434
[02:27:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8434
[02:42:34] is the purge-checkuser script still in use? i'm betting it doesn't do what's expected
[02:42:56] 2>&1 > filename != > filename 2>&1
[02:43:11] also, should be >> ?
[02:44:57] there's a comment in misc-servers.pp that says it's a hume cronjob
[02:53:44] hello
[03:00:58] New patchset: Hashar; "make 'puppet parser validate' errors monospaced" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4145
[03:01:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4145
[03:20:03] hashar: i think maybe you just need to make it \n\n
[03:20:11] anyway, lets see what yours does ;)
[03:20:16] well
[03:20:24] jeremyb: I try to test it
[03:20:30] but I guess my bash skill is very limited :-(
[03:20:38] i can help with bash
[03:20:41] oh
[03:20:56] i'm hacking scap and friends now ;)
[03:21:03] OH MY GOD
[03:21:08] then you will hack the parser haha
[03:21:23] let me dpaste stuff
[03:21:45] (I am really happy to have someone to support me at 5:20am )
[03:22:38] haha
[03:22:58] huh, i wonder if php -l is recursive
[03:23:09] (lint or syntax check i guess?)
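The redirection-order point at [02:42:56] is easy to verify; this is a minimal sketch (the log file names are illustrative, not the real cron target) showing why the two spellings differ:

```shell
#!/bin/sh
# Redirections are processed left to right.
# Wrong order: stderr is duplicated onto the OLD stdout (the terminal)
# before stdout is moved to the file, so the file never sees stderr.
sh -c 'echo out; echo err 1>&2' 2>&1 > wrong.log
# wrong.log contains only "out"; "err" went to the terminal.

# Right order: stdout goes to the file first, then stderr follows it there.
sh -c 'echo out; echo err 1>&2' > right.log 2>&1
# right.log contains both "out" and "err".
```

As [02:43:11] adds, a cronjob that is meant to accumulate output would also want `>>` rather than `>`.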
[03:23:30] * jeremyb goes to just run it on beta and see
[03:24:09] here is my super dumb bash skill http://dpaste.org/ebF1I/
[03:24:22] jeremyb: php -l is not recursive you have to pass it one file after the other
[03:24:45] which makes it slow thanks to PHP cli startup overhead
[03:25:09] hashar: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=files/misc/scripts/sync-dir;hb=HEAD#l21
[03:25:36] see
[03:25:39] php -l $FILE
[03:25:48] (disregard the fact that FILE can be a directory)
[03:26:13] that is a bug great
[03:26:22] I guess that sync-dir has been copy/pasted from sync-file
[03:28:53] hashar: find -type f -iname '*.php' /home/wikipedia/common/$FILE -exec php -l \;
[03:28:56] ?
[03:29:05] err
[03:29:25] and .inc & .phtml
[03:30:17] and we should abort as soon as a faulty one is found
[03:31:05] find /home/wikipedia/common/$FILE -type f -a \( -iname '*.php' -o -iname '*.inc' -o -iname '*.phtml' \) -a -exec php -l \;
[03:31:09] idk about fail early
[03:33:08] > If any invocation of the command exits with a status of 255, xargs will stop immediately without reading any further input. An error message is issued on stderr when this happens.
[03:34:50] from my test, GNU find happily execute all commands
[03:35:01] maybe something like... find /home/wikipedia/common/$FILE -type f -a \( -iname '*.php' -o -iname '*.inc' -o -iname '*.phtml' \) -a -print0 | xargs -0 -n 1 bash -c 'php -l "$2" || exit 255'
[03:35:42] \O/
[03:35:45] ll
[03:35:49] 11?
[03:36:02] then you will want to run php -l jobs in parallel
[03:36:23] what's 11?
[03:36:32] maybe using GNU make -j
[03:36:43] but that would be evil
[03:37:08] idk how good make is about failing early. or maybe that's just cc or something
[03:37:55] xargs has --max-procs
[03:38:01] not sure what it does though
[03:38:48] Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time.
Use the -n option with -P; otherwise chances are that only one exec will be done.
[03:40:33] so -n 1 --max-procs 2
[03:40:37] that might work
[03:42:16] gfind /srv/trunk -name '*.php' -print0 | gxargs -0 -n 1 --max-procs 2 php -l
[03:42:17] yeah!
[03:44:13] failed early?
[03:45:33] oh I have no idea
[03:46:36] well then the not early version's going in ;)
[03:46:37] yeah seems to fail early
[03:46:51] I have just made the second file to get a failure
[03:46:55] and that correctly crashed
[03:47:39] make php a wrapper script that logs when it's called and see if it really does get called less ;)
[03:47:51] what's a good syntax error?
[03:48:13] error
[03:48:30] can't you get the message from php -l ?
[03:48:34] it gives the line number iirc
[03:48:47] ugh, damnit
[03:49:07] i made the mistake of writing to $HOME on labs ;P
[03:49:21] only took 3 secs to write though, i was lucky
[03:51:03] damn labs is slow
[03:52:50] http://ganglia.wikimedia.org/latest/?c=Virtualization%20cluster%20pmtpa&h=virt1.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2#mg_load_div
[03:53:24] it must be hung on removing a swapfile or something
[03:53:29] i just exited without save
[03:53:40] ;-(
[03:53:47] and it's not responding
[03:54:01] (where it is vim)
[03:54:10] well /home is mounted on some virtual instance NFS export
[03:54:16] back!
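The fail-fast linting idea worked out above (from [03:35:01] onward) can be sketched self-contained; a stand-in check replaces `php -l` so it runs without PHP installed, and the file names are made up:

```shell
#!/bin/sh
# xargs stops reading input as soon as one invocation exits 255,
# which is what gives the pipeline its fail-fast behaviour.
printf '%s\0' good1.php bad.php good2.php |
  xargs -0 -n 1 sh -c '
    echo "checking $1"
    [ "$1" != "bad.php" ] || exit 255    # simulate a lint failure
  ' sh > results.txt 2>/dev/null || true

cat results.txt
# good2.php never appears: xargs aborted after bad.php "failed".
```

For the real thing, the pattern would be `find ... -print0 | xargs -0 -n 1 -P 2 sh -c 'php -l "$1" || exit 255' sh`, combining the `-n 1`/`--max-procs` discussion above. Note too that the bare `-exec php -l \;` lines pasted earlier are missing their `{}` placeholder.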
[03:54:18] which in turn uses a Gluster FS file system
[03:54:19] yeah
[03:54:22] yeah
[03:54:24] which seems to have tons of issues
[03:54:37] that basically makes labs hard to use whenever Gluster is f***ed
[03:54:45] as it seems to be the case right now
[03:54:50] guuaaae
[03:55:06] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[03:56:18] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[03:57:00] GlusterFS reports a surge in incoming traffic
[03:57:04] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=network_report&s=by+name&c=Glusterfs+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[03:57:08] 30MBytes/sec
[03:57:20] it is most probably what is killing it
[03:57:44] !log GlusterFS receiving 30Mbytes/sec of input traffic. Killing labs again :-D
[03:57:49] Logged the message, Master
[04:00:21] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[04:00:34] it's not 6:30 yet! i promise!
[04:01:01] not being dead for me actually
[04:01:24] i've not been speaking/watching here because it *has* been responsive
[04:01:27] ahh
[04:01:40] got 30Mbits of output traffic from dataset1001
[04:01:51] 30MBytes
[04:01:51] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time
[04:02:08] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=dataset1001.wikimedia.org&m=network_report&r=hour&s=by%20name&hc=4&mc=2
[04:02:31] so that must be a file transfer between dataset1001 and some labs instance
[04:02:45] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /var/log/squid/orange-ivory-coast.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[04:02:46] looks like it slowed down
[04:06:05] so, tcpdump tomorrow morning on dataset1001 ;)
[04:06:10] but it's a fairly new box...
[04:06:20] (also auth.log)
[04:11:41] hashar: anyway, the short version is throw some double quotes around the token at the end of ssh -p 29418 hashar@gerrit.wikimedia.org gerrit $ARGS
[04:11:51] so "$ARGS"
[04:12:13] might also need to have some parens earlier at ARGS="$@"
[04:12:26] like ARGS=( "$@" )
[04:12:34] what are parens for ?
[04:12:46] (too lazy to read man bash right now :-D )
[04:14:36] that works btw
[04:14:39] hmm
[04:14:45] with and without?
[04:15:00] without parens
[04:15:04] change closed by the way
[04:15:18] which?
[04:16:15] * jeremyb wonders what the deal is with gerrit-wm in #mediawiki
[04:16:23] * hashar opens a change
[04:16:53] 21 19:13:56 -!- mode/#mediawiki [-q gerrit-wm!*@*] by Reedy
[04:17:04] https://gerrit.wikimedia.org/r/#/c/8436/ \O/
[04:17:22] that's not what was quieted
[04:17:33] don't spend anytime on that gerrit-wm bot
[04:17:34] haha
[04:17:48] we had a discussion about it 7 hours ago during our weekly meeting
[04:17:53] and?
[04:17:53] we all agreed it was low priority
[04:18:08] somehow some people complain that a translation bot is spamming the channel for 10 minutes once per day
[04:18:14] so they want the bot to be quieted
[04:18:23] translation?!!
[04:18:24] don't waste your time on that
[04:18:26] :-D
[04:18:26] oh
[04:18:29] l10n bot yes
[04:18:39] another topic you don't want to start looking at
[04:18:41] ;-D
[04:18:42] now i get it. i thought you meant logmsgbot in $-tech
[04:18:47] #-tech*
[04:19:08] basically a bot submit translations changes made to all mw extensions
[04:19:11] which is only 4 lines per day
[04:19:12] and approve them automatically
[04:19:14] yeah, i got it
[04:19:30] (which is a totally dumb process but he ..
nothing better to do right now)
[04:19:43] so about the short bash story
[04:19:51] I got ARGS="$@"
[04:19:56] I would expect it to quote my args
[04:19:59] but it does not :-(
[04:20:19] that shell escaping has always confused me
[04:25:31] ahh it is broken again
[04:25:32] yeah
[04:31:16] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[04:31:52] hashar: ARGS=( "$@" )
[04:32:10] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[04:32:37] jeremyb: yeah that does not work with multiline comments :-(
[04:33:03] maybe it's not bash's fault?
[04:33:09] maybe ssh
[04:33:14] no
[04:33:15] well the way the args are passed to ssh
[04:33:17] gerrit's
[04:33:40] ssh -p 29418 hashar@gerrit.wikimedia.org gerrit "review -m '
Beginpre text
' --submit 8c92ff0f399b1889342dfa0c2cd041b0a9b82232"
[04:33:42] that one does work
[04:33:55] that's one line
[04:34:06] printf '%s\n' "$ARGS"
[04:34:14] in place of or right before the ssh line
[04:35:35] I am not more wasting my time on that gerrit() stuff
[04:36:05] http://dpaste.org/rpp24/
[04:36:05] heh
[04:36:11] I just used the normal way
[04:36:49] I sent my review with the last line prefixed with a space
[04:37:08] that made all the text to be rendered monospaced
[04:37:08] https://gerrit.wikimedia.org/r/#/c/8436/
[04:37:10] ;(
[04:39:28] ahhh
[04:39:50] gerrit-wm, we missed you!
[04:46:36] hashar: i have the answer
[04:46:41] i don't know what i was thinking
[04:47:39] New patchset: Hashar; "make 'puppet parser validate' errors monospaced" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4145
[04:47:45] for "$ARGS", double quotes aren't enough
[04:47:58] you need "${ARGS[@]}"
[04:48:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4145
[04:48:03] oh my god
[04:48:11] that looks really hacky :-D
[04:48:11] "$@" is a special case
[04:48:25] not really
[04:48:29] it's in the manual
[04:48:45] it's not actually contortionary
[04:51:42] hmm
[04:51:50] $@ refers to the args passed to the function
[04:52:08] where as ${ARGS[@]} are the one passed to the script ?
[04:52:10] ;-D
[04:52:15] don't waste your time on that anyway
[04:52:16] New patchset: Jeremyb; "cleanup scap scripts, sql, etc." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8438
[04:52:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8438
[04:52:40] ARGS was only ARGS because that's what you assigned too
[04:52:42] to*
[04:52:51] you can do that with any variable
[04:52:56] ohh
[04:53:12] so I should use the magic trick in the command line shouldn't i ?
[04:53:18] the ssh cmd I mean
[04:53:26] erm?
[04:53:41] ssh -p 29418 hashar@gerrit.wikimedia.org gerrit "${ARGS[@]}"
[04:54:37] did you try that?
[04:54:43] yeah does not work
[04:54:48] that was the last try
[04:54:56] I am giving up already spent too much time on that
[04:55:16] ARGS may be special. try either just "$@" or an arg name that's definitely not special
[04:55:24] anyway I probably fixed the long standing https://gerrit.wikimedia.org/r/#/c/4145/ ;-D
[04:55:28] going to write ryan a mail
[04:55:35] Ryan_Lane's here ;)
[04:55:49] no I'm not
[04:56:07] ;-)
[04:56:54] New review: Jeremyb; "see also I994fda0b2819ff499b83a04bc5632962475f5d1f" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/8434
[04:57:19] New review: Jeremyb; "see also I994fda0b2819ff499b83a04bc5632962475f5d1f" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/5778
[04:58:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[04:58:35] New review: Jeremyb; "see also Ie297cf8cbb2fe209c19fe5bb4a6e7f7708a43ec1, Icd1d431cb92d4b3f32b62c4e12fb2cea86a8a2f2" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/8438
[05:00:33] New review: Hashar; "Sent a private mail to Ryan so he get a look at that patch again."
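The `"$@"` vs `"${ARGS[@]}"` confusion above comes down to one distinction, which this bash sketch demonstrates (the `nargs` helper and the sample arguments are invented for illustration):

```shell
#!/bin/bash
# ARGS="$@" joins all arguments into ONE string; ARGS=( "$@" ) keeps each
# argument as a separate array element. "${ARGS[@]}" then expands each
# element as its own word, exactly like the special case "$@" does.
nargs() { echo $#; }    # throwaway helper: prints its argument count

set -- "commit message with spaces" --submit abc123

flat="$@"               # flattened into a single string
nargs $flat             # unquoted expansion re-splits on whitespace: 6

arr=( "$@" )            # array copy preserves argument boundaries
nargs "${arr[@]}"       # still 3, spaces intact
```

So `ssh ... gerrit "${ARGS[@]}"` can only preserve a multi-line `-m` message if `ARGS` was built with `ARGS=( "$@" )` in the first place; after `ARGS="$@"` the boundaries are already gone, no matter how the variable is later quoted.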
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4145
[05:06:15] and it's back
[05:06:17] ok, nacht
[05:06:20] * jeremyb hopes for some reviews ;-)
[05:09:29] 7am there
[05:09:36] wife woke up
[05:09:47] so it is breakfast time before our crying daughter starts … crying
[05:09:52] see you late
[05:10:00] thanks for the support jeremyb
[05:23:17] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[05:24:47] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[05:27:56] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[05:30:47] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time
[05:34:41] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active
[05:40:50] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[08:09:28] would anyone know where the stdout/stderr is redirected when using debian start-stop-daemon ?
[08:09:55] I am looking at the mw-job-runner script
[08:19:02] looks like both STDOUT & STDERR are unconditionally sent to /dev/null
[08:56:22] mornin
[08:56:57] hashar: yes, daemons go to /dev/null :)
[08:57:14] hello :)
[08:57:17] yeah figured that out
[08:57:38] anyway I found out that the runJobs script are logging using the MediaWiki logging infrastructure
[08:57:42] though nothing is logged
[08:57:44] ;-D
[08:57:54] and now I am trying to figure out what happened to udp2log
[08:58:02] which is running but not writing anywhere haha
[08:59:14] nice
[08:59:28] ahhhh
[08:59:39] so its writing to some file descriptor #11
[08:59:47] which I have no idea what it could actually be
[09:00:03] so what does /proc/nn/fd/11 point to?
[09:00:25] nothing
[09:00:41] I keep forgetting about /proc/
[09:00:48] oh no
[09:00:52] it changed
[09:01:01] that FD is great
[09:01:08] -> pipe:[263821]
[09:01:09] \O/
[09:01:43] so now you can find out who has the other end
[09:01:47] I love my job, it is like solving puzzle every days
[09:01:50] heh
[09:02:04] open("/home/wikipedia/logs/cli.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = -1 EACCES (Permission denied)
[09:02:05] YEAHHH
[09:02:10] I love linux
[09:02:13] heh
[09:02:49] ah /home/wikipedia/logs belong to root:root
[09:02:51] thanks apergos !!
[09:02:56] sure!
[09:02:58] for the /proc/pid tip
[09:03:32] * hashar dig in puppet to find out why /home/wikipedia/log no more belong to udp2log
[09:04:53] git blame : 7b523a68 (Antoine Musso
[09:04:54] yeah
[09:05:00] :-D
[09:05:00] that guy is breaking everything
[09:05:08] gotta watch him :-D
[09:07:09] hashar: lsof is useful too.
[09:12:35] paravoid: lsof is great indeed
[09:13:11] so /home/wikipedia/logs was not belonging to udp2log user https://gerrit.wikimedia.org/r/8442 fix it
[09:13:24] paravoid | apergos > could you look / merge 8442 above (in test branch)
[09:13:26] please ;)
[09:17:34] Greek ops are lunching ;-D
[09:17:44] no, it's too early
[09:17:47] I'm looking at the change
[09:18:16] and musign over the "wikidevs not available in labs" comment
[09:18:19] *musing
[09:18:48] yeah there is no wikidev group yet
[09:18:59] we might use deployment-prep
[09:19:03] or the svn (550) one
[09:19:05] just seems odd
[09:19:06] I am not sure
[09:19:10] something I need to write about
[09:19:13] anyways it doesn't matter right now
[09:19:44] the issue is that users are in the 550 (svn) user group per default :D
[09:20:06] and I have no idea how to create a wikidev group on labs and have users from deployment-prep to use that as a primary group
[09:21:32] are you able to push out to test now
[09:21:38] or do I need to do something after the merge?
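The fd-11 hunt above generalizes to a two-line recipe on Linux. Here we inspect our own shell rather than a daemon (for a daemon you would substitute its pid, or use `lsof -p <pid>` as suggested at [09:07:09]); the log file path is illustrative:

```shell
#!/bin/sh
# Open a descriptor, then ask /proc where it points — the same
# /proc/<pid>/fd trick used to identify udp2log's fd #11 above.
exec 3>>/tmp/fd-demo.log          # open fd 3 on a log file
readlink "/proc/$$/fd/3"          # prints /tmp/fd-demo.log
exec 3>&-                         # close it again
```

The strace output quoted above (`open(...) = -1 EACCES`) is the complementary tool: `/proc/<pid>/fd` tells you where an open descriptor already points, while strace shows you the open that failed.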
[09:21:41] I can apply puppetd -tv
[09:21:46] but not able to submit a change / merge it
[09:21:54] ohhh
[09:22:05] once merged a cronjob pull the change every minute
[09:22:06] ;)D
[09:22:08] thanks!
[09:22:10] ok great
[09:23:19] works!
[09:23:42] fixed!
[09:24:22] I got logs!!
[09:24:24] yeah
[09:24:27] thanks apergos !!
[09:24:33] sure
[09:25:55] sorry, I was having breakfast
[09:26:14] no worries
[09:31:03] it s ok paravoid ;-D
[09:31:52] New patchset: Dzahn; "add eqiad labs-hosts a-d subnets to autoinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8443
[09:32:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8443
[09:35:42] New review: Dzahn; "Andrew Bogott will need them." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8443
[09:35:44] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8443
[09:45:08] New patchset: Dzahn; "and add eqiad private1-c and 1-d subnets to autoinstall as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8444
[09:45:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8444
[09:46:47] New review: Dzahn; "info from existing reverse DNS in 10.in-addr.arpa" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8444
[09:46:49] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8444
[09:56:21] ugh, "Cannot use opaque URLs 'puppet::///files..."
[09:56:26] opaque?
[09:57:02] ok..hmm
[10:01:19] New patchset: Dzahn; "fix puppet runs on brewster, typo, one : too much and you get "Cannot user opaque URLs.."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8446
[10:01:39] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8446
[10:01:57] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8446
[10:02:00] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8446
[10:03:01] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue May 22 10:02:46 UTC 2012
[10:14:21] New review: Dzahn; "this appears to have broken puppet on sodium:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6798
[10:18:51] I got two easy changes for review, meant to ask git to ignore './private/' directory at the root of operations/puppet https://gerrit.wikimedia.org/r/#/c/6471/ https://gerrit.wikimedia.org/r/#/c/6470/
[11:34:31] New patchset: Hashar; "database & memcached configuration for 'beta' project" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8448
[11:34:37] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/8448
[11:35:39] New review: Hashar; "I will do the FileRepo and squid stuff in a similar way." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8448
[12:06:40] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[13:01:45] anyone around able to deal with the lag on db12?
[13:01:54] it's getting pretty bad
[13:11:54] hey mark, you around?
[13:12:04] question about the lighttpd_config define
[13:12:10] ok
[13:12:49] https://gist.github.com/2768958
[13:12:50] so
[13:12:53] oh oops
[13:12:55] not that
[13:13:17] https://gist.github.com/2768958
[13:13:18] ok fixed
[13:13:19] yeah so
[13:13:24] if $install == true
[13:13:28] it puts the enabled file in place
[13:13:30] and then
[13:13:31] no matter what
[13:13:41] creates a symlink to the enabled file in the available/ dir
[13:13:56] i would think you would want
[13:14:07] to create the available no matter what
[13:14:21] no
[13:14:26] and create the enabled symlink conditionally
[13:14:27] no?
[13:14:28] it may be created by other means, outside that definition
[13:14:48] ah I see
[13:15:03] so by default
[13:15:13] lighttpd_config just creates the symlink?
[13:15:25] yes
[13:16:32] hmm, ok
[13:16:34] in that case
[13:16:42] can I change it so that install => false still defines the file
[13:16:46] but does nothing but ensure => present?
[13:17:08] it seems the spots where install => false are for pre-packaged config files that come with lighttpd, right?
[13:17:26] or if you want to generate it with a template
[13:17:42] why would you want to do that?
[13:17:43] naw, its ok, i'm just trying to fix the reload exec
[13:17:52] and I had it subscribed to the available file too
[13:17:57] i guess I could just not subscribe it to that
[13:17:59] just notify the service
[13:18:02] if install => false
[13:18:02] yeah
[13:18:04] in the other direction
[13:18:15] i think peter was trying that and was having trouble
[13:18:18] but i don't remember why
[13:18:19] so I did this
[13:18:42] that's just because peter is a trouble maker ;-)
[13:18:44] but now I'm sorry I did because I don't really have anything to do with what they were working on, and now I have to fix it :p
[13:18:44] heheh
[13:18:57] ok, will just try that
[13:22:18] so, I'm not going to bother with this right now, but for defines like this, it's good if you build them with an $ensure parameter
[13:22:27] so that they work like most of puppet's native resource types
[13:22:54] so in this case, one should be able to do ensure => false and have the symlink removed
[13:22:59] and even
[13:23:06] ensure => purged and have the available file removed
[13:25:46] New patchset: Ottomata; "generic-defintions.pp - fixing lighttpd reload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8454
[13:26:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8454
[13:26:25] mutante, could you check that and see if it fixes the problem on sodium?
[13:27:42] mutante: re: mailman, just followed wikitech to create the list?
[13:33:37] New patchset: Ottomata; "check_udp2log_log_age - Adding orange-ivory-coast to list of slow logs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8456
[13:33:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8456
[13:53:16] heyyyyy hashar
[13:53:18] you around?
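The `$ensure` semantics mark proposes above (from [13:22:18] onward) can be sketched in plain shell, independent of puppet; the directory names here are illustrative, not lighttpd's actual layout:

```shell
#!/bin/sh
# Sketch of the proposed ensure states for a config-file define:
#   present -> symlink in enabled/ pointing at the available/ file
#   absent  -> symlink removed, available file left alone
#   purged  -> symlink and available file both removed
conf_ensure() {
  name=$1 state=$2
  case $state in
    present) ln -sf "../available/$name" "enabled/$name" ;;
    absent)  rm -f "enabled/$name" ;;
    purged)  rm -f "enabled/$name" "available/$name" ;;
  esac
}

mkdir -p available enabled
echo 'server.modules += ( "mod_status" )' > available/10-status.conf
conf_ensure 10-status.conf present   # enabled/10-status.conf now exists
conf_ensure 10-status.conf absent    # link gone, available file remains
conf_ensure 10-status.conf purged    # both gone
```

Mapped back to puppet, this would be an `ensure` parameter on the define whose values drive the underlying `file` resources, so the define behaves like puppet's native resource types.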
[13:54:55] want to ask you about your misc::contint::jdk class
[13:55:01] i'd like to make it generic
[14:07:52] i like generic :D
[14:13:48] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 184 seconds
[14:14:06] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 194 seconds
[14:14:12] great
[14:16:57] ottomata: go ahead ;-D
[14:17:17] ottomata: that was one of my very first puppet class, I probably copy/pasted from some internet site
[14:17:32] ottomata: then some op probably made me rewrite it entirely (twice or so) ;-D
[14:17:36] yeah, i see it, it looks cool
[14:17:38] hah
[14:17:44] but, there is a module from puppetlabs for it too
[14:17:46] http://forge.puppetlabs.com/puppetlabs/java
[14:17:49] so i'm going to try that first
[14:17:52] and maybe leave your stuff alone
[14:18:27] ottomata: one problem is that Oracle no more provide Debian packages of Sun JDK
[14:18:36] nor security update
[14:18:51] so we need to provides whatever free / new version is packaged in ubuntu instead
[14:19:10] what repo is the .deb from?
[14:19:14] in your stuff?
[14:19:20] ottomata: I think that was written by LeslieCarr btw (she on vacation in Europe this week I think)
[14:19:27] oh ok
[14:19:38] sun-java6-jre
[14:19:56] that's the package name, right?
[14:19:57] .deb packages are always served from http://apt.wikimedia.org/wikimedia/
[14:20:00] ahhhh
[14:20:00] k
[14:20:11] sun-java6-jre was packaged by Oracle IIRC
[14:20:13] that's fine then, as long as the package name is the same it should be fine
[14:20:16] ja?
[14:20:16] or if not, by ubuntu
[14:20:25] it looks like by oracle into canonical?
[14:20:26] maybe?
[14:20:28] and Oracle or Ubuntu just decided to drop support for that packaging
[14:20:53] so once upon a time, we had a nice company Sun
[14:20:55] then it died
[14:20:57] :-(
[14:21:18] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 181 seconds
[14:21:20] ha, yeah
[14:21:30] hmmm, i don't see the java debs in apt.wm.org
[14:22:06] ahhh
[14:22:06] sorry
[14:22:09] sun starts with S
[14:22:10] not with J
[14:22:29] don't use sun-java* anymore
[14:22:32] ottomata: here is the reference https://lists.ubuntu.com/archives/ubuntu-security-announce/2011-December/001528.html
[14:22:36] they have known security vulnerabilities
[14:22:40] relevant RT ticket is 2147
[14:22:51] and Oracle is not allowing people to distribute the new versions anymore
[14:22:53] http://apt.wikimedia.org/wikimedia/pool/main/s/sun-java6/
[14:23:13] okay, we have to fix that
[14:23:14] that's what we are using now, right?
[14:23:15] yeah we should not do that anymore :-D
[14:23:19] ok
[14:23:20] you should use openjdk
[14:23:26] hmm ok
[14:23:59] should be equivalent
[14:29:11] hmm, with the openjdk stuff
[14:29:17] I guess we don't have to agree to the license?
[14:29:50] which license?
[14:29:54] openjdk is gplv2+
[14:29:58] for sun
[14:30:01] um
[14:30:03] (kind of)
[14:30:15] https://gist.github.com/2769415
[14:30:18] that's what we are doing now
[14:30:28] that's the DLJ
[14:30:32] this is only for binary releases
[14:30:41] and it's not being offered anymore by Oracle
[14:30:59] so, no? I can just install the openjdk package?
[14:31:13] it was a special license that was created back in '06 for distributions to be able to distribute binary releases until OpenJDK caught on
[14:31:27] ottomata: no; yes :)
[14:31:27] hm, ok
[14:31:38] cool
[14:31:47] yes, isn't free software nice? :)
[14:31:49] man forget this module then, I just need to install the package!
[14:31:51] so much easier
[14:32:42] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[14:42:57] i think we can still use java sun jdk: https://wiki.ubuntu.com/LucidLynx/ReleaseNotes#Sun_Java_moved_to_the_Partner_repository
[14:43:24] AFAIK, there is a big performance diff between openjdk and sun jdk
[14:49:17] a bad one? heheh
[14:50:44] paravoid: yes, more or less. to just create a new one i use the web: https://lists.wikimedia.org/mailman/create , there are some policies around naming (no -l suffix anymore, etc) and always gotta make sure if public/private, the archiving options etc
[14:50:53] drdee2
[14:50:53] sorry, gotta run for travel. out towards Berlin
[14:50:56] OpenJDK is handy to have on a development system as it has more source for you to step into when debugging something. OpenJDK and Sun JDK mainly differ in (native?) rendering/AWT/Swing code, which is not relevant for any http://wiki.apache.org/hadoop/MapReduce Jobs that aren't creating images as part of their work.
[14:50:59] from http://wiki.apache.org/hadoop/HadoopJavaVersions
[14:54:21] ottomata: I think I had to install Sun JDK because of the android platform or at least as a requisite of some software used by the WMF mobile team
[14:54:30] ottomata: that might have changed since then ;-D
[14:58:57] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[15:22:38] paravoid: So, I'm now progressed to the point of getting an error message on virt1001, and it's telling me there's more info on virtual terminal 4. Any idea how I can actually view terminal 4?
[15:26:25] what's telling you that?
[15:26:26] d-i?
[15:26:44] yeah.
[15:26:52] Well, d-i or the partition tool that it's running.
[15:27:17] It also suggests I look at a logfile, but since I don't have a file-system, that seems a bit useless.
[15:27:28] Oh, wait, maybe I do have a filesystem... seems to be booting natively this time. Lemme see...
[15:27:47] No, my mistake, just trying to install again. [15:28:04] virtual terminal 4 is tty4 [15:28:10] in a normal system you would do alt+f4 [15:28:20] but now I presume that you run over serial redirection? [15:28:24] or do you have a kvm? [15:28:24] right. [15:28:47] I mean, from my point of view I'm ssh'ing. But I believe I'm actually connecting via a serial terminal. [15:30:19] lemme check our puppet config [15:30:19] It only displays that message for ~60 seconds before automatically disconnecting. I'm just about to come 'round to the message again. [15:30:36] oh? [15:31:26] 'Check /var/log/syslog or see virtual console 4 for the details.' [15:31:30] I'm back! But not for long :( [15:31:42] that means d-i failed somewhere [15:31:51] let's see how we can see that while on serial [15:31:54] I know! But I would like to see the log file. [15:32:02] thanks [15:32:35] I'm pretty excited that there's even the hint of a log... googling suggests that usually when partman fails it's totally silent. [15:34:24] It appears that by twiddling the arrow keys I can keep that message alive for longer. [15:38:45] can you navigate through the menu? [15:38:54] hit Go Back? [15:39:00] and then go on the menu? [15:39:08] and then spawn a shell? [15:39:27] andrewbogott_: ^ [15:39:48] Maybe! [15:40:26] Any idea where 'spawn a shell' is in install process? [15:40:33] in the main menu [15:40:37] should be among the last entries [15:40:47] Oh, it scrolls! [15:40:49] OK, found it. [15:40:54] andrewbogott_: or just use alt fX [15:40:58] where X is number 1-12 [15:41:06] that should spawn shell too, maybe [15:41:18] OK, I have a shell [15:41:32] petan|wk: he's on serial :-) [15:41:37] oh [15:41:47] ...and here's a syslog. let's see what's in here. [15:41:49] andrewbogott_: tail /var/log/syslog (captain obvious) [15:42:08] sad that I don't have vi or less [15:42:18] there should be a "more" [15:42:25] and nano probably [15:42:27] yeah, I just hate it... 
[15:42:32] indeed [15:42:33] Can I search in 'more'? [15:42:56] depends on the more [15:43:00] this is busybox, so probably no [15:43:05] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [15:43:07] the error should be on the last lines though [15:43:09] try tailing [15:51:55] andrewbogott_: feel free to paste lines from the log [15:52:21] Everything in the log looks fine :( [15:53:13] what are the last lines in the log? [15:56:25] It does say '10 multiraid doesn't exist' a lot of times. But it seems to say that things don't exist often, right before loading them. [15:57:56] Here we go... May 21 23:42:40 partman-auto-raid: mdadm: invalid raid level: raid8 [15:57:56] May 21 23:42:40 partman-auto-raid: Error creating array /dev/md0 [15:58:24] why are we doing software raid btw? [15:58:40] No hardware raid support on these boxes, reportedly. [15:59:16] oh, okay [15:59:21] trying cat /proc/mdstat [15:59:44] The partman recipe, by the way, is in the puppet repo: files/autoinstall/partman/virt-raid10.cfg [16:00:11] I suck at that :) [16:00:23] No more than everyone else [16:00:23] could never get the hang of it [16:00:28] # cat /proc/mdstat [16:00:28] Personalities : [16:00:29] unused devices: [16:01:09] nice [16:01:11] so no arrays assembled [16:01:14] try mdadm -A [16:01:28] well, wait, my partman file is obviously wrong, since I tell it to use raid 8 when I should be using raid 10. [16:01:39] So, that's a good place to start. [16:01:56] um... well, actually... [16:02:03] raid 8? that's a new one for me [16:02:22] It's a typo. I'm missing the raid type field, and I'm configuring 8 drives. [16:02:34] ha [16:02:35] So, that is most likely the problem. Or at least /a/ problem. [16:03:00] what's weird is, that line is copy/pasted from another working file... [16:05:03] New patchset: Andrew Bogott; "Try using raid "10" rather than raid "". Maybe that'll help." 
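The "invalid raid level: raid8" error above is exactly what a missing field in a partman-auto-raid recipe produces: with the raid-type value omitted, the device count (8) gets parsed as the level. A corrected recipe line looks roughly like this. This is a hedged illustration, not the real virt-raid10.cfg; the fields are raid-type, device count, spare count, filesystem, mount point, then the '#'-separated device list:

```
d-i partman-auto-raid/recipe string \
    10 8 0 ext4 / \
        /dev/sda1#/dev/sdb1#/dev/sdc1#/dev/sdd1#/dev/sde1#/dev/sdf1#/dev/sdg1#/dev/sdh1 \
    .
```

With the raid-type field empty, every value shifts left and mdadm is handed a nonsense level, failing exactly as in the syslog excerpt.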
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/8463 [16:05:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8463 [16:05:46] New review: Andrew Bogott; "Lookin' good, dude!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8463 [16:05:48] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8463 [16:07:34] OK, so, another thing I don't understand: How do I get my patch from git into actual production so I can test it? [16:08:33] we could tell you, but then we'd have to kill you [16:09:34] :-D [16:09:53] on that note, why do we have both stafford and sockpuppet? [16:10:08] it's called legacy [16:10:15] we'll make sockpuppet go away at some point [16:10:23] it's very popular, the legacy thing ;-) [16:10:28] yes [16:10:39] My standard approach is to wait a while, and then ben and/or leslie say "do you want this merged?" and I say "yes" and then a miracle occurs. [16:10:44] what's pending though? [16:10:52] just someone to do the work [16:10:54] I'm cool with that system, except when they're not online. [16:11:03] migrate the CA off sockpuppet, a few other tidbits [16:11:18] documented somewhere? [16:11:23] of course not [16:11:25] what are you thinking [16:11:38] care to tell me so I can do it? :) [16:11:45] if I had to document what was going to need to happen I might as well just do it in the same time ;-) [16:12:44] I dunno, just make stafford work as the CA [16:12:55] make sure everyone syncs on stafford instead [16:12:59] that's probably it by now [16:13:12] we've turned off dashboard [16:13:18] so sockpuppet isn't needed for that anymore [16:13:38] I wonder if that's more than an scp [16:13:41] but we'll see [16:13:53] clients may complain about the different cert [16:14:23] So... my patch? 
[16:16:43] andrewbogott_: https://labsconsole.wikimedia.org/wiki/Git#Production [16:18:06] New patchset: Jgreen; "adding khorn to deploy group, remove deprecated users from erzurumi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8464 [16:18:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8464 [16:18:29] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 28 seconds [16:18:52] mark: And when leslie and/or ben ping me, that's because they're merging their own changes and notice mine jumbling up their diff? [16:18:59] exactly [16:19:07] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8464 [16:19:10] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8464 [16:19:16] And on production you just do all of the fetch&merge stuff as root? [16:20:03] andrewbogott_: I've caught your diffs [16:20:08] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 12 seconds [16:20:23] i'll merge them now if there's no objection [16:20:45] No objection, other than that I should do it myself someday. [16:20:54] k [16:20:57] You're doing the merge on sockpuppet, right? Where do the files actually live? [16:21:48] afaik they're on sockpuppet and on stafford [16:22:21] although I haven't looked very closely how they're kept sync'd there, looks like maybe a post merge rsync? [16:24:05] I mean, where /on/ sockpuppet? (I could of course search for them myself, just trying to save myself a 'find' ) [16:24:40] andrewbogott_: /root/puppet [16:24:48] Jeff_Green: it's just a pull from sockpuppet to stafford [16:24:50] that's easy to remember :) [16:25:03] anyway... Jeff_Green, merged? 
[16:25:05] mark: oic [16:25:11] yup merged [16:25:14] ah yes, that too [16:25:22] paravoid: change the git remote to gerrit [16:25:25] on stafford [16:26:31] * andrewbogott_ reruns autoinstall, crosses fingers [16:41:45] andrewbogott_: so? [16:42:10] so... my patch isn't actually taking effect for some reason. Exact same behavior as last time. [16:42:59] did you run puppetd -vt on brewster? [16:43:21] nope! [16:43:34] See, there turn out to be all these things that Ryan did last night and didn't narrate :( [16:43:54] Doing that now. [16:44:21] paravoid: Are there any more steps, that you know of? [16:44:30] not that I know of :-) [16:44:51] ok, puppet run on brewster was clean. So, lemme cycle virt1001 yet again. [16:48:13] isn't brewster the tftp host? [16:48:23] or did I just tell you to run puppet on a random host? :) [16:48:49] Dunno... 'brewster' sounds vaguely familiar from yesterday. [16:49:08] And anyway, it 'should be harmless' to run puppet on a random host, since it runs itself now and then anyway. [16:59:37] paravoid: Ok, still failing, but failing differently! [17:06:22] paravoid: OK, I suspect that my problem now has to do with logical vs. primary partitions (something that, strangely, I never understood properly even in the 1990's.) Note, for example, that cp-varnish.cfg just marks all three partitions as primary, and I do not. [17:06:26] any thoughts? [17:07:05] sec [17:08:31] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8448 [17:10:03] andrewbogott_: I think you should mark them as primary [17:10:11] Sure, worth a try :) [17:10:31] you can't have more than 4 primary partitions [17:10:40] hence the need for logical, which are number 5 onwards [17:10:57] I rarely use logical partitions, I usually have 2 or 3 primary and then use lvm [17:11:00] Huh, I thought the rule was 'can't have more than two' for some reason. 
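paravoid's constraint above (an MBR partition table holds at most four primary partitions; logical ones live inside an extended partition and are numbered from 5 up) maps to the $primary{ } flag in partman recipes. An illustrative expert_recipe fragment, not the actual cp-varnish.cfg, with made-up sizes:

```
d-i partman-auto/expert_recipe string \
    boot-root :: \
        300 500 500 ext3 \
            $primary{ } $bootable{ } \
            method{ format } format{ } \
            use_filesystem{ } filesystem{ ext3 } \
            mountpoint{ /boot } \
        . \
        5000 10000 -1 ext3 \
            $primary{ } \
            method{ format } format{ } \
            use_filesystem{ } filesystem{ ext3 } \
            mountpoint{ / } \
        .
```

Omitting $primary{ } leaves a partition logical, which is fine until something downstream (a bootloader or a raid recipe) expects a primary device.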
[17:11:09] thankfully no :) [17:11:39] ok, let's see if I remember these other steps [17:11:43] New patchset: Andrew Bogott; "Make all the partitions primary because why not?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8467 [17:12:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8467 [17:12:19] New review: Andrew Bogott; "You're on a roll today!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8467 [17:12:21] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8467 [17:12:47] you're changing the source (gerrit), deploying it on the puppetmasters (two at the moment) and then instruct the server to fetch changes from the master and apply them [17:14:19] ok... git is asking me for the root password on stafford. [17:14:30] I'm pretty sure I've never needed that before. Does that mean I'm doing something wrong? [17:14:51] * andrewbogott_ thinks this conversation should immediately move to mediawiki_security [17:15:27] analytics1001.eqiad.wmnet [17:15:33] 1001-1010 [17:15:39] oops wrong room [17:15:46] <^demon|away> andrewbogott_: "password" ;-) [17:16:15] It makes no sense that a 'git merge' should ask me for a password. And, the second time, it didn't. I'm baffled. [17:16:35] <^demon> If it was pulling it over ssh and you didn't have a key, it would. [17:17:10] But I did a 'fetch' beforehand. [17:17:14] So 'merge' should be totally local. [17:17:27] <^demon> hrm. [17:17:37] I am going to pretend that that didn't happen, for now. [17:18:07] <^demon> That's been my outlook on transient git failures :) [17:18:15] <^demon> Did it work the second time? Yes? Ok moving on :) [17:18:59] modifying sockpuppet should cause a consequent rsync (or something) on stafford. So maybe I've caused things to be out-of-sync now. mark, does this worry you? 
[17:21:14] hi guys [17:21:29] is there a generic place I can include the debconf-utils package? [17:21:37] i can put it in a standalone class and just include it [17:21:48] i'm just not sure where that class should live [17:22:52] or, is debconf utils pretty much always available? [17:30:32] woo segfault! [17:31:00] andrewbogott_: there's a post-merge hook that rsyncs to the other box [17:31:45] yeah, which didn't happen in my case. so I wonder if my config is coming from sockpuppet or stafford [17:31:53] cat .git/hooks/post-merge [17:31:57] and run that by hand [17:33:17] Getting lots of this "partman-auto-raid: mdadm: partition table exists on /dev/sdb2 but will be lost or..." [17:34:05] anyway, I have to run... back after lunch. [17:38:05] New patchset: Aaron Schulz; "Moved all wikis to use Swift thumb copy hook." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8472 [17:38:11] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/8472 [17:39:17] New patchset: Bhartshorne; "settting swift to write thumbs to no wikis; mediawiki will write thumbs for all wikis." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8473 [17:39:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8473 [17:42:05] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8473 [17:42:07] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8473 [17:47:08] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8472 [17:47:10] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8472 [18:12:10] New patchset: Ottomata; "Adding new java class to generically manage JRE and JDK installation." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/8477 [18:12:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8477 [18:12:35] yay! [18:12:43] let's see now, who can review that for me [18:12:45] its really cool! [18:12:48] mayyyybeeeeeee [18:12:56] mark would like it if he was around [18:13:05] maplebed would probably like it, but I think he is busy [18:13:31] maybe notpeter? [18:13:38] ^ [18:13:44] i dunno, who likes pretty puppet stuff? [18:15:14] New patchset: Ottomata; "Adding new java class to generically manage JRE and JDK installation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8477 [18:15:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8477 [18:17:51] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8477 [18:17:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8477 [18:18:31] <^demon> Ryan_Lane: Don't suppose you could poke those commits I asked about yesterday? [18:18:47] I have like 10 mins [18:18:51] if I can do so in that time, yes :) [18:18:55] best work fast then ;) [18:19:16] you want me to push through this gerrit change right before I get on a plane? [18:19:34] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6005 [18:20:33] ^demon: https://gerrit.wikimedia.org/r/#/c/6005/ fails to merge [18:20:38] <^demon> 6005 needs rebasing probably, not as urgent. 
[18:21:10] <^demon> 6578 and 8037 are the 2 most urgent and should merge [18:21:15] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6578 [18:21:18] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6578 [18:22:12] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8037 [18:22:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8037 [18:22:55] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7724 [18:22:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7724 [18:23:29] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7727 [18:23:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7727 [18:27:03] New patchset: Demon; "Re-attempting links for RT and CodeReview." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6005 [18:29:59] boarding. [18:30:03] * Ryan_Lane waves [18:36:54] New patchset: Asher; "first commit of dbtree" [operations/software] (master) - https://gerrit.wikimedia.org/r/8481 [18:39:13] dooodeee doooo, gonna poke again, then head to a cafe [18:39:15] ummmm [18:39:19] paravoid? [18:39:26] apergos maybe? [18:39:29] yes? [18:39:37] https://gerrit.wikimedia.org/r/#/c/8477/ :D [18:39:48] OH!!! [18:39:50] NM [18:39:54] Ryan_Lane already did it! [18:39:56] sorry bout that! [18:40:05] yes, thought so [18:40:12] thanks Ryan_Lane!!! 
[18:40:17] he left already [18:40:19] and to you for responding, faidon :) [18:40:21] aye [18:47:57] New patchset: Asher; "first commit of dbtree patch 2: fix ishmael links" [operations/software] (master) - https://gerrit.wikimedia.org/r/8481 [18:59:29] any of my fellow ops folks wanna review my dns changes before i commit? [19:00:53] RobH: sure. [19:05:09] RobH: I can't see the section of the file that determines the second and third octet of the IP address, but so long as the analytics, labstore, and osm folks are in 10.65.3 and the ms-be are in 10.65.5, we're good to go. +1 commit. [19:06:54] PROBLEM - Host db1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:06] RECOVERY - Host db1001 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [19:17:49] cool [19:18:19] !log dns update for new servers mgmt ips [19:18:23] Logged the message, RobH [19:24:49] New patchset: Ottomata; "java.pp - need to accept license before installing sun-java6 packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8486 [19:29:55] New patchset: Ottomata; "java.pp - need to accept license before installing sun-java6 packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8486 [19:31:17] !lot reimaging db1001 and db1020 [19:31:43] !log reimaging db1001 and db1020 [19:31:46] Logged the message, notpeter [19:32:41] paravoid, Ryan is actually gone this time [19:32:49] can I bother you with an approval request? [19:32:53] https://gerrit.wikimedia.org/r/8486 [19:34:49] done [19:34:56] variable in the class definition? [19:34:57] PROBLEM - Host db1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:00] didn't even know you could do that [19:35:32] doesn't look very clean to me, although it may just be that I'm not used to it... 
[19:35:41] parameterized classes, ja [19:35:42] PROBLEM - Host db1020 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:55] http://docs.puppetlabs.com/guides/parameterized_classes.html [19:36:50] oh I know about parameterized classes [19:37:14] I was referring to class java::jre($package_prefix) { [19:37:18] aaah [19:37:24] I thought I read jre::$package_prefix [19:37:38] ok, it's getting late, I really should stop [19:37:50] I had a major "wtf does this even work" [19:37:52] thanks! [19:40:30] RECOVERY - Host db1001 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [19:41:15] RECOVERY - Host db1020 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [19:43:30] PROBLEM - MySQL Recent Restart on db1001 is CRITICAL: Connection refused by host [19:43:48] PROBLEM - MySQL Slave Running on db1001 is CRITICAL: Connection refused by host [19:43:57] PROBLEM - MySQL disk space on db1001 is CRITICAL: Connection refused by host [19:44:06] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: Connection refused by host [19:44:15] PROBLEM - MySQL Idle Transactions on db1001 is CRITICAL: Connection refused by host [19:44:15] PROBLEM - MySQL Slave Running on db1020 is CRITICAL: Connection refused by host [19:44:15] PROBLEM - SSH on db1020 is CRITICAL: Connection refused [19:44:33] PROBLEM - Full LVS Snapshot on db1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:44:33] PROBLEM - SSH on db1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:33] PROBLEM - MySQL Slave Delay on db1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:44:51] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:00] PROBLEM - MySQL disk space on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:09] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
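The construct paravoid double-takes on, class java::jre($package_prefix), is an ordinary parameterized class. A stripped-down sketch, loosely modeled on the java.pp change under review; the names are illustrative, not the committed manifest:

```puppet
# A parameterized class: the caller supplies the package family.
# No default is given, so the parameter is mandatory -- matching the
# follow-up commit "ah, there is no default to these".
class java::jre($package_prefix) {
    package { "${package_prefix}-jre":
        ensure => present,
    }
}

# Parameterized classes are declared resource-style so the arguments
# can be passed explicitly:
class { 'java::jre':
    package_prefix => 'openjdk-6',
}
```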
[19:45:18] PROBLEM - MySQL Recent Restart on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:36] PROBLEM - Full LVS Snapshot on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:45] PROBLEM - MySQL Idle Transactions on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:47:15] RECOVERY - SSH on db1020 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:47:24] RECOVERY - SSH on db1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:47:48] New patchset: Ottomata; "java.pp - ah, there is no default to these. Just a comment doc change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8488 [19:52:18] hey sooooo, is anyone available to help me with this RT? [19:52:19] https://rt.wikimedia.org/Ticket/Display.html?id=2992 [19:52:21] RECOVERY - MySQL Recent Restart on db1001 is OK: OK seconds since restart [19:52:28] I need to commit something to the private puppet files repo [19:52:31] RECOVERY - MySQL Slave Running on db1001 is OK: OK replication [19:52:31] RECOVERY - MySQL disk space on db1001 is OK: DISK OK [19:52:39] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay seconds [19:52:48] RECOVERY - MySQL Idle Transactions on db1001 is OK: OK longest blocking idle transaction sleeps for seconds [19:53:06] RECOVERY - MySQL Slave Delay on db1001 is OK: OK replication delay seconds [19:53:06] RECOVERY - Full LVS Snapshot on db1001 is OK: OK no full LVM snapshot volumes [19:55:09] !log starting xtrabackup dump from db1033 to db1001 for new eqiad s1 slave [19:55:13] Logged the message, notpeter [20:00:36] RECOVERY - MySQL disk space on db1020 is OK: DISK OK [20:00:54] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay seconds [20:00:54] RECOVERY - MySQL Recent Restart on db1020 is OK: OK seconds since restart [20:01:12] RECOVERY - Full LVS Snapshot on db1020 is OK: OK no full LVM snapshot volumes [20:01:30] RECOVERY - 
MySQL Slave Running on db1020 is OK: OK replication [20:01:39] RECOVERY - MySQL Idle Transactions on db1020 is OK: OK longest blocking idle transaction sleeps for seconds [20:02:15] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay seconds [20:05:40] !log starting xtrabackup dump from db1004 to db1020 for new eqiad s4 slave [20:05:43] Logged the message, notpeter [20:24:06] !log powering up db1003 [20:24:10] Logged the message, notpeter [20:28:26] RECOVERY - Host db1003 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [20:29:07] New patchset: Pyoungmeister; "decom of db13 and storage2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8494 [20:31:48] paravoid, do you have access to private puppet repo? [20:32:11] PROBLEM - mysqld processes on db1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [20:35:42] maplebed, I know you are busy, but do you know who I should ask to get something into the private puppet repo? [20:38:06] woosters, do you know who I should ask to get a file committed to private puppet repo? [20:39:13] ben is interviewing, asher out for lunch ...let me try jeff_green [20:39:28] i'm here [20:39:48] hi Jeff, do you have access to the private puppet repo? [20:39:56] I need to check in a GeoIP.conf file that has some license keys in it [20:39:59] https://rt.wikimedia.org/Ticket/Display.html?id=2992 [20:40:08] i do [20:41:03] yay, got a sec to put this file in there somewhere? [20:41:09] i guess I want to be able to do [20:41:19] source => "puppet:///volatile/GeoIP.conf" [20:41:25] or whatever path it ends up being [20:42:32] looking [20:51:28] ottomata: sure, I can add it to the private repo [20:51:40] where can I fetch it? [20:52:05] the one attached to the ticket as-is? 
[20:52:05] should be attached to rt ticket [20:52:07] yeah [20:52:23] once it is in place [20:52:29] which of those two solutions do you think is better [20:52:49] the first one requires fewer changes, and keeps the license keys only on the puppetmaster [20:53:07] the 2nd is less complicated, but has the keys on each machine that uses geoip [20:53:13] ? [20:55:44] brb [20:56:01] i think we already do and dislike #2 somewhere [21:03:47] so #1 is better then? [21:03:49] i'm fine with that [21:04:00] i'm not sure [21:04:02] the geoipupdate script will install the files in /usr/share/GeoIP [21:04:23] if we are going to have puppet distribute the downloaded .dat files from puppetmaster [21:04:28] I vaguely remember a thread about this somewhere in the past few months, and people had opinions on both approaches, and I don't remember the outcome [21:04:36] I'll have to symlink that directory into somewhere that puppet fileserver can get at [21:04:40] right [21:04:43] or add a fileserver module for that dir [21:04:53] i kinda like #1 better [21:05:00] because then these files only have to be downloaded once [21:05:03] right [21:05:05] and then distributed internally [21:05:13] if for some reason we had a huge # of machines using GeoIP [21:05:16] then they'd all have to dl the same file [21:05:20] burning question though is whether puppet is sane for propagating huge files [21:05:53] it puts all that file serving pressure on puppetmaster [21:07:32] ottomata: I think it'd be wise to run #1 by ops@ and see if anyone has an objection [21:07:46] that's true [21:08:16] I think the City one is around 50M or 60M [21:08:16] so yeah [21:09:04] yeah huge [21:10:28] ok, ops@wikimedia.org [21:10:28] ? 
[21:11:25] ya [21:11:28] cool [21:11:30] ok, well either way [21:11:35] the GeoIP.conf file needs to be in puppet repo [21:11:46] ok I'll tweeze it now [21:11:48] so if you get that in there and let me know how to distribute it, I can implement either solution [21:11:48] cool [21:15:08] it'll be at puppet:///private/geoip/GeoIP.conf [21:15:56] danke! [21:15:58] thanks so much! [21:16:02] for #1 I think you need to pull the file to stafford now [21:16:27] ? [21:16:34] I'm a little confused about why we still have host sockpuppet [21:16:43] puppet points to stafford now [21:16:44] ha, I don't know anything about the puppetmaster setups [21:16:59] I don't have access to those machines, so if you could pull them for me, I'd be much obliged [21:17:07] we're in a funky in-between state [21:17:59] the geoip conf is on both stafford and sockpuppet now [21:18:41] I'm not sure what you're asking re. pulling files, are you hoping to get the dat files onto the active puppetmaster right now? [21:21:46] naw, i'm going to wait til they answer me [21:21:50] just emailed ops@ [21:22:07] um, you said [21:22:07] or #1 I think you need to pull the file to stafford now [21:22:25] and I was responding, maybe I misunderstood you [21:22:45] I will set up puppet stuff to put the file in place wherever it ends up being [21:22:50] i just needed it available, which you have done for me [21:22:52] so thank youuuuuuuuuu [21:25:21] ha ok [21:26:07] yeah, the conf file is available now, i just meant if you're doing #1 it's stafford that will need to retrieve the dat files and serve them to puppet clients, not sockpuppet [21:28:43] hmmm [21:28:46] virt0.wikimedia.org [21:28:47] ? [21:28:53] seems to be what is distributing them right now [21:29:02] maybe? [21:29:26] where do you see that? 
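Once the file is in the private repo, consuming it boils down to a file resource pointing at the path Jeff_Green quoted. A hedged sketch of what ottomata's puppet stuff might look like; everything here except the source URL is illustrative:

```puppet
# Install GeoIP.conf (which carries the MaxMind license keys) so that
# geoipupdate can fetch the .dat files into /usr/share/GeoIP.
# Under option #1 in the discussion, only the puppetmasters would
# include this class and then serve the downloaded .dat files on to
# clients; under option #2 every geoip client would include it.
class geoip::updater_config {
    file { '/etc/GeoIP.conf':
        ensure => file,
        owner  => 'root',
        group  => 'root',
        mode   => '0600',   # keep the license keys root-only
        source => 'puppet:///private/geoip/GeoIP.conf',
    }
}
```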
[21:30:13] ahh, I see the others too now [21:30:16] i betcha that is labs or something [21:30:17] um [21:30:24] openstack::controller [21:30:43] but yeah, both stafford and sockpuppet include puppetmaster class [21:30:53] which is where I would configure the GeoIP.conf installation for #1 [21:30:54] so yeah [21:32:01] yeah, that makes sense [21:34:33] !log updating dns for mgmt of new servers in eqiad [21:34:36] Logged the message, RobH [22:07:28] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [22:20:13] PROBLEM - udp2log processes for emery on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/local/bin/packet-loss, /var/log/squid/filters/india-filter, /usr/local/bin/sqstat, /var/log/squid/filters/latlongCountry-writer, [22:22:01] RECOVERY - udp2log processes for emery on emery is OK: OK: all filters present [22:31:20] hello opsen [22:31:36] im trying to create an rt ticket but i keep getting the following error: [22:31:36] Error 324 (net::ERR_EMPTY_RESPONSE): The server closed the connection without sending any data. [22:36:53] New patchset: Bhartshorne; "updating ring files to move container storage to two dedicated drives on ms-be1 to test container read speed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8512 [23:01:17] New patchset: Ryan Lane; "Fixing l10n user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8554 [23:04:44] New patchset: Ryan Lane; "Fixing l10n user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8554 [23:06:12] New patchset: Ryan Lane; "Fixing l10n user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8554 [23:28:55] binasher: ping [23:29:16] Ryan_Lane: ping [23:29:23] pong [23:29:36] preilly: ? [23:29:48] Can someone invalidate the mobile caches!? 
[23:29:50] * Reedy guesses [23:31:50] !log flushed the varnish cache for mobile [23:31:53] Logged the message, Master [23:32:27] Reedy: good guess [23:32:40] :d [23:49:43] !log flushed the varnish cache for mobile again [23:49:46] Logged the message, Master