[00:02:40] maplebed: I wonder if there is way to keep those in ram more
[00:03:24] * AaronSchulz wonders if the swiftstack people have any experience with this
[00:03:36] It's on my list of things to ask Joe about tomorrow.
[00:03:44] I'm pretty sure I know what his answer will be.
[00:03:49] SSDs?
[00:03:52] 'put the container listings on SSDs.'
[00:06:32] curious, the docs say that Rackspace puts their containers on the same boxes as those running object and account servers
[00:06:49] unless they have some SSDs mounted and have the containers just map to those
[00:07:12] ...then probably don't do that
[00:07:49] or maybe those docs are just outdated ;)
[00:11:17] same boxes != same disks, necessarily.
[00:11:51] what we'll likely do, if Joe confirms SSDs, is put 2 in each storage node
[00:12:11] map containers and accounts to those and objects to the spinning media.
[00:18:22] that's what I was saying above
[00:18:38] either they do that or don't use ssds
[00:18:53] in any case, that sounds reasonable
[00:20:06] maplebed: do we have ssds lying around now?
[00:20:24] nope. I'd have to order them.
[00:20:28] rats
[00:20:45] along with some special adapters so that they'll fit in the c2100 chassis.
[00:21:00] so puring will just suck until then "p
[00:21:05] * :p
[00:21:17] *purging
[00:21:21] * AaronSchulz sits up straight now
[00:21:51] the 3-5s ones shouldn't be so bad. the 30s ones are an issue.
[00:21:59] did you look at the gdoc graph I shared with you?
[00:22:08] it shows you the distribution of container listing times
[00:23:19] yeah, I saw
[00:24:31] it's 4.5-5 for the p50 at peak hours, and the deletes add .5 sec or so...still sucks to me
[00:24:37] but at least tolerable
[00:25:24] p90 is atrocious though
[00:25:49] New patchset: Bhartshorne; "adding kaldari and bsitu to the list of people that can deploy software" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8432
[00:25:55] binasher: would you review ^^^ for me?
[00:26:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8432
[00:26:26] sure
[00:27:17] New patchset: Bhartshorne; "adding kaldari and bsitu to the list of people that can deploy software RT-2957 RT-2958" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8432
[00:27:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8432
[00:27:57] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/8432
[00:28:26] thanks binasher
[00:28:34] maplebed: did kaldari have shell access and then have it revoked at some point? weird
[00:29:07] yeah, I promise I won't hack the donation system again ;)
[00:29:13] lol
[00:29:16] uh huh.
[00:29:49] maplebed: yeah actually, is my key already in there?
[00:29:56] yeah, his account was already there and enabled, just not in the mortals list (or others).
[00:30:08] kaldari: yup.
[00:30:33] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8432
[00:30:35] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8432
[00:30:53] andrewbogott_: are you still here?
[00:31:05] your partman change is still waiting to be merged into sockpuppet. can I merge it?
[00:31:32] (the one that adds backslashes to the lines with only periods)
[00:31:52] hmm. idle 6 hrs; probably not.
[00:31:59] well, it looks like a valid change. merging anyways.
[00:32:07] maplebed: that's a very good question. untested.
[00:32:49] it'll likely only hurt the systems that use it if it's not right.
[00:33:11] ha
[00:33:18] it barfs
[00:33:22] just tested
[00:33:45] i suspect the issue is with how the definition is named when called
[00:33:51] Duplicate definition: Useradd[mark]
[00:34:54] although--it seems like a Very Bad Idea to include users in multiple groups
[00:35:44] really?
[00:35:45] the whole point of doing it this way is to take control of the insane inheritance and scoping issues we have now which result in it being virtually impossible to predict what group membership, etc will look like
[00:35:51] you're part of the fundraising group and you're part of the ops group.
[00:36:14] well
[00:36:17] if each person can exist in only one group, groups will become individuals.
[00:36:28] what do you mean by group exactly?
[00:36:38] do you mean i.e. the junk at the bottom of admin.pp?
[00:36:48] the thing you added to the host in your example.
[00:37:04] ah, that's one of many ways you can add a user to a host
[00:37:07] you said 'include sysops and include fr-tech.'
[00:37:11] right
[00:37:22] you can just as easily do it right in the node definition
[00:37:35] useradd { 'awjrichards': jgreen => 'absent' }
[00:37:36] for example
[00:37:38] I would expect you to be in both the fr-tech and sysops group.
[00:37:57] and in the end what would that mean exactly?
[00:38:15] that you have access to the system.
[00:38:30] the problem we have now is that once you have access you have full access
[00:38:40] what I needed was knobs for different flavors of access
[00:39:09] i.e. sysops end up with more supplemental groups for example
[00:39:16] huh.
[00:39:40] I think I need to talk through it in person. I was expecting users:xxxx was a collection of people, but maybe it's a collection of people/privs?
[00:40:13] oh, maybe we're not speaking in the same terms
[01:41:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 242 seconds
[01:44:16] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds
[02:05:34] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[02:27:28] New patchset: Jeremyb; "dedupe code: foreachwiki vs. foreachwikiindblist" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8434
[02:27:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8434
[02:42:34] is the purge-checkuser script still in use? i'm betting it doesn't do what's expected
[02:42:56] 2>&1 > filename != > filename 2>&1
[02:43:11] also, should be >> ?
[02:44:57] there's a comment in misc-servers.pp that says it's a hume cronjob
[02:53:44] hello
[03:00:58] New patchset: Hashar; "make 'puppet parser validate' errors monospaced" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4145
[03:01:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4145
[03:20:03] hashar: i think maybe you just need to make it \n\n
[03:20:11] anyway, lets see what yours does ;)
[03:20:16] well
[03:20:24] jeremyb: I try to test it
[03:20:30] but I guess my bash skill is very limited :-(
[03:20:38] i can help with bash
[03:20:41] oh
[03:20:56] i'm hacking scap and friends now ;)
[03:21:03] OH MY GOD
[03:21:08] then you will hack the parser haha
[03:21:23] let me dpaste stuff
[03:21:45] (I am really happy to have someone to support me at 5:20am )
[03:22:38] haha
[03:22:58] huh, i wonder if php -l is recursive
[03:23:09] (lint or syntax check i guess?)
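The redirection-order point at [02:42:56] is easy to verify; this is a minimal sketch (the log file names are illustrative, not the real cron target) showing why the two spellings differ:

```shell
#!/bin/sh
# Redirections are processed left to right.
# Wrong order: stderr is duplicated onto the OLD stdout (the terminal)
# before stdout is moved to the file, so the file never sees stderr.
sh -c 'echo out; echo err 1>&2' 2>&1 > wrong.log
# wrong.log contains only "out"; "err" went to the terminal.

# Right order: stdout goes to the file first, then stderr follows it there.
sh -c 'echo out; echo err 1>&2' > right.log 2>&1
# right.log contains both "out" and "err".
```

As [02:43:11] adds, a cronjob that is meant to accumulate output would also want `>>` rather than `>`.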
[03:23:30] * jeremyb goes to just run it on beta and see
[03:24:09] here is my super dumb bash skill http://dpaste.org/ebF1I/
[03:24:22] jeremyb: php -l is not recursive you have to pass it one file after the other
[03:24:45] which makes it slow thanks to PHP cli startup overhead
[03:25:09] hashar: https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=files/misc/scripts/sync-dir;hb=HEAD#l21
[03:25:36] see
[03:25:39] php -l $FILE
[03:25:48] (disregard the fact that FILE can be a directory)
[03:26:13] that is a bug great
[03:26:22] I guess that sync-dir has been copy/pasted from sync-file
[03:28:53] hashar: find -type f -iname '*.php' /home/wikipedia/common/$FILE -exec php -l \;
[03:28:56] ?
[03:29:05] err
[03:29:25] and .inc & .phtml
[03:30:17] and we should abort as soon as a faulty one is found
[03:31:05] find /home/wikipedia/common/$FILE -type f -a \( -iname '*.php' -o -iname '*.inc' -o -iname '*.phtml' \) -a -exec php -l \;
[03:31:09] idk about fail early
[03:33:08] > If any invocation of the command exits with a status of 255, xargs will stop immediately without reading any further input. An error message is issued on stderr when this happens.
[03:34:50] from my test, GNU find happily execute all commands
[03:35:01] maybe something like... find /home/wikipedia/common/$FILE -type f -a \( -iname '*.php' -o -iname '*.inc' -o -iname '*.phtml' \) -a -print0 | xargs -0 -n 1 bash -c 'php -l "$2" || exit 255'
[03:35:42] \O/
[03:35:45] ll
[03:35:49] 11?
[03:36:02] then you will want to run php -l jobs in parallel
[03:36:23] what's 11?
[03:36:32] maybe using GNU make -j
[03:36:43] but that would be evil
[03:37:08] idk how good make is about failing early. or maybe that's just cc or something
[03:37:55] xargs has --max-procs
[03:38:01] not sure what it does though
[03:38:48] Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time.
Use the -n option with -P; otherwise chances are that only one exec will be done.
[03:40:33] so -n 1 --max-procs 2
[03:40:37] that might work
[03:42:16] gfind /srv/trunk -name '*.php' -print0 | gxargs -0 -n 1 --max-procs 2 php -l
[03:42:17] yeah!
[03:44:13] failed early?
[03:45:33] oh I have no idea
[03:46:36] well then the not early version's going in ;)
[03:46:37] yeah seems to fail early
[03:46:51] I have just made the second file to get a failure
[03:46:55] and that correctly crashed
[03:47:39] make php a wrapper script that logs when it's called and see if it really does get called less ;)
[03:47:51] what's a good syntax error?
[03:48:13] error
[03:48:30] can't you get the message from php -l ?
[03:48:34] it gives the line number iirc
[03:48:47] ugh, damnit
[03:49:07] i made the mistake of writing to $HOME on labs ;P
[03:49:21] only took 3 secs to write though, i was lucky
[03:51:03] damn labs is slow
[03:52:50] http://ganglia.wikimedia.org/latest/?c=Virtualization%20cluster%20pmtpa&h=virt1.pmtpa.wmnet&m=load_one&r=hour&s=by%20name&hc=4&mc=2#mg_load_div
[03:53:24] it must be hung on removing a swapfile or something
[03:53:29] i just exited without save
[03:53:40] ;-(
[03:53:47] and it's not responding
[03:54:01] (where it is vim)
[03:54:10] well /home is mounted on some virtual instance NFS export
[03:54:16] back!
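The fail-fast linting idea worked out above (from [03:35:01] onward) can be sketched self-contained; a stand-in check replaces `php -l` so it runs without PHP installed, and the file names are made up:

```shell
#!/bin/sh
# xargs stops reading input as soon as one invocation exits 255,
# which is what gives the pipeline its fail-fast behaviour.
printf '%s\0' good1.php bad.php good2.php |
  xargs -0 -n 1 sh -c '
    echo "checking $1"
    [ "$1" != "bad.php" ] || exit 255    # simulate a lint failure
  ' sh > results.txt 2>/dev/null || true

cat results.txt
# good2.php never appears: xargs aborted after bad.php "failed".
```

For the real thing, the pattern would be `find ... -print0 | xargs -0 -n 1 -P 2 sh -c 'php -l "$1" || exit 255' sh`, combining the `-n 1`/`--max-procs` discussion above. Note too that the bare `-exec php -l \;` lines pasted earlier are missing their `{}` placeholder.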
[03:54:18] which in turn uses a Gluster FS file system
[03:54:19] yeah
[03:54:22] yeah
[03:54:24] which seems to have tons of issues
[03:54:37] that basically makes labs hard to use whenever Gluster is f***ed
[03:54:45] as it seems to be the case right now
[03:54:50] guuaaae
[03:55:06] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[03:56:18] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[03:57:00] GlusterFS reports a surge in incoming traffic
[03:57:04] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=network_report&s=by+name&c=Glusterfs+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[03:57:08] 30MBytes/sec
[03:57:20] it is most probably what is killing it
[03:57:44] !log GlusterFS receiving 30Mbytes/sec of input traffic. Killing labs again :-D
[03:57:49] Logged the message, Master
[04:00:21] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[04:00:34] it's not 6:30 yet! i promise!
[04:01:01] not being dead for me actually
[04:01:24] i've not been speaking/watching here because it *has* been responsive
[04:01:27] ahh
[04:01:40] got 30Mbits of output traffic from dataset1001
[04:01:51] 30MBytes
[04:01:51] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time
[04:02:08] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=dataset1001.wikimedia.org&m=network_report&r=hour&s=by%20name&hc=4&mc=2
[04:02:31] so that must be a file transfer between dataset1001 and some labs instance
[04:02:45] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /var/log/squid/orange-ivory-coast.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[04:02:46] looks like it slowed down
[04:06:05] so, tcpdump tomorrow morning on dataset1001 ;)
[04:06:10] but it's a fairly new box...
[04:06:20] (also auth.log)
[04:11:41] hashar: anyway, the short version is throw some double quotes around the token at the end of ssh -p 29418 hashar@gerrit.wikimedia.org gerrit $ARGS
[04:11:51] so "$ARGS"
[04:12:13] might also need to have some parens earlier at ARGS="$@"
[04:12:26] like ARGS=( "$@" )
[04:12:34] what are parens for ?
[04:12:46] (too lazy to read man bash right now :-D )
[04:14:36] that works btw
[04:14:39] hmm
[04:14:45] with and without?
[04:15:00] without parens
[04:15:04] change closed by the way
[04:15:18] which?
[04:16:15] * jeremyb wonders what the deal is with gerrit-wm in #mediawiki
[04:16:23] * hashar opens a change
[04:16:53] 21 19:13:56 -!- mode/#mediawiki [-q gerrit-wm!*@*] by Reedy
[04:17:04] https://gerrit.wikimedia.org/r/#/c/8436/ \O/
[04:17:22] that's not what was quieted
[04:17:33] don't spend anytime on that gerrit-wm bot
[04:17:34] haha
[04:17:48] we had a discussion about it 7 hours ago during our weekly meeting
[04:17:53] and?
[04:17:53] we all agreed it was low priority
[04:18:08] somehow some people complain that a translation bot is spamming the channel for 10 minutes once per day
[04:18:14] so they want the bot to be quieted
[04:18:23] translation?!!
[04:18:24] don't waste your time on that
[04:18:26] :-D
[04:18:26] oh
[04:18:29] l10n bot yes
[04:18:39] another topic you don't want to start looking at
[04:18:41] ;-D
[04:18:42] now i get it. i thought you meant logmsgbot in $-tech
[04:18:47] #-tech*
[04:19:08] basically a bot submit translations changes made to all mw extensions
[04:19:11] which is only 4 lines per day
[04:19:12] and approve them automatically
[04:19:14] yeah, i got it
[04:19:30] (which is a totally dumb process but he ..
nothing better to do right now)
[04:19:43] so about the short bash story
[04:19:51] I got ARGS="$@"
[04:19:56] I would expect it to quote my args
[04:19:59] but it does not :-(
[04:20:19] that shell escaping has always confused me
[04:25:31] ahh it is broken again
[04:25:32] yeah
[04:31:16] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[04:31:52] hashar: ARGS=( "$@" )
[04:32:10] PROBLEM - Puppet freshness on brewster is CRITICAL: Puppet has not run in the last 10 hours
[04:32:37] jeremyb: yeah that does not work with multiline comments :-(
[04:33:03] maybe it's not bash's fault?
[04:33:09] maybe ssh
[04:33:14] no
[04:33:15] well the way the args are passed to ssh
[04:33:17] gerrit's
[04:33:40] ssh -p 29418 hashar@gerrit.wikimedia.org gerrit "review -m '
Beginpre text
' --submit 8c92ff0f399b1889342dfa0c2cd041b0a9b82232"
[04:33:42] that one does work
[04:33:55] that's one line
[04:34:06] printf '%s\n' "$ARGS"
[04:34:14] in place of or right before the ssh line
[04:35:35] I am not more wasting my time on that gerrit() stuff
[04:36:05] http://dpaste.org/rpp24/
[04:36:05] heh
[04:36:11] I just used the normal way
[04:36:49] I sent my review with the last line prefixed with a space
[04:37:08] that made all the text to be rendered monospaced
[04:37:08] https://gerrit.wikimedia.org/r/#/c/8436/
[04:37:10] ;(
[04:39:28] ahhh
[04:39:50] gerrit-wm, we missed you!
[04:46:36] hashar: i have the answer
[04:46:41] i don't know what i was thinking
[04:47:39] New patchset: Hashar; "make 'puppet parser validate' errors monospaced" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4145
[04:47:45] for "$ARGS", double quotes aren't enough
[04:47:58] you need "${ARGS[@]}"
[04:48:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4145
[04:48:03] oh my god
[04:48:11] that looks really hacky :-D
[04:48:11] "$@" is a special case
[04:48:25] not really
[04:48:29] it's in the manual
[04:48:45] it's not actually contortionary
[04:51:42] hmm
[04:51:50] $@ refers to the args passed to the function
[04:52:08] where as ${ARGS[@]} are the one passed to the script ?
[04:52:10] ;-D
[04:52:15] don't waste your time on that anyway
[04:52:16] New patchset: Jeremyb; "cleanup scap scripts, sql, etc." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8438
[04:52:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8438
[04:52:40] ARGS was only ARGS because that's what you assigned too
[04:52:42] to*
[04:52:51] you can do that with any variable
[04:52:56] ohh
[04:53:12] so I should use the magic trick in the command line shouldn't i ?
[04:53:18] the ssh cmd I mean
[04:53:26] erm?
[04:53:41] ssh -p 29418 hashar@gerrit.wikimedia.org gerrit "${ARGS[@]}"
[04:54:37] did you try that?
[04:54:43] yeah does not work
[04:54:48] that was the last try
[04:54:56] I am giving up already spent too much time on that
[04:55:16] ARGS may be special. try either just "$@" or an arg name that's definitely not special
[04:55:24] anyway I probably fixed the long standing https://gerrit.wikimedia.org/r/#/c/4145/ ;-D
[04:55:28] going to write ryan a mail
[04:55:35] Ryan_Lane's here ;)
[04:55:49] no I'm not
[04:56:07] ;-)
[04:56:54] New review: Jeremyb; "see also I994fda0b2819ff499b83a04bc5632962475f5d1f" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/8434
[04:57:19] New review: Jeremyb; "see also I994fda0b2819ff499b83a04bc5632962475f5d1f" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/5778
[04:58:16] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[04:58:35] New review: Jeremyb; "see also Ie297cf8cbb2fe209c19fe5bb4a6e7f7708a43ec1, Icd1d431cb92d4b3f32b62c4e12fb2cea86a8a2f2" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/8438
[05:00:33] New review: Hashar; "Sent a private mail to Ryan so he get a look at that patch again."
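The `"$@"` vs `"${ARGS[@]}"` confusion above comes down to one distinction, which this bash sketch demonstrates (the `nargs` helper and the sample arguments are invented for illustration):

```shell
#!/bin/bash
# ARGS="$@" joins all arguments into ONE string; ARGS=( "$@" ) keeps each
# argument as a separate array element. "${ARGS[@]}" then expands each
# element as its own word, exactly like the special case "$@" does.
nargs() { echo $#; }    # throwaway helper: prints its argument count

set -- "commit message with spaces" --submit abc123

flat="$@"               # flattened into a single string
nargs $flat             # unquoted expansion re-splits on whitespace: 6

arr=( "$@" )            # array copy preserves argument boundaries
nargs "${arr[@]}"       # still 3, spaces intact
```

So `ssh ... gerrit "${ARGS[@]}"` can only preserve a multi-line `-m` message if `ARGS` was built with `ARGS=( "$@" )` in the first place; after `ARGS="$@"` the boundaries are already gone, no matter how the variable is later quoted.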
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4145
[05:06:15] and it's back
[05:06:17] ok, nacht
[05:06:20] * jeremyb hopes for some reviews ;-)
[05:09:29] 7am there
[05:09:36] wife woke up
[05:09:47] so it is breakfast time before our crying daughter starts … crying
[05:09:52] see you late
[05:10:00] thanks for the support jeremyb
[05:23:17] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[05:24:47] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[05:27:56] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[05:30:47] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time
[05:34:41] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active
[05:40:50] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours
[08:09:28] would anyone know where the stdout/stderr is redirected when using debian start-stop-daemon ?
[08:09:55] I am looking at the mw-job-runner script
[08:19:02] looks like both STDOUT & STDERR are unconditionally sent to /dev/null
[08:56:22] mornin
[08:56:57] hashar: yes, daemons go to /dev/null :)
[08:57:14] hello :)
[08:57:17] yeah figured that out
[08:57:38] anyway I found out that the runJobs script are logging using the MediaWiki logging infrastructure
[08:57:42] though nothing is logged
[08:57:44] ;-D
[08:57:54] and now I am trying to figure out what happened to udp2log
[08:58:02] which is running but not writing anywhere haha
[08:59:14] nice
[08:59:28] ahhhh
[08:59:39] so its writing to some file descriptor #11
[08:59:47] which I have no idea what it could actually be
[09:00:03] so what does /proc/nn/fd/11 point to?
[09:00:25] nothing
[09:00:41] I keep forgetting about /proc/
[09:00:48] oh no
[09:00:52] it changed
[09:01:01] that FD is great
[09:01:08] -> pipe:[263821]
[09:01:09] \O/
[09:01:43] so now you can find out who has the other end
[09:01:47] I love my job, it is like solving puzzle every days
[09:01:50] heh
[09:02:04] open("/home/wikipedia/logs/cli.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = -1 EACCES (Permission denied)
[09:02:05] YEAHHH
[09:02:10] I love linux
[09:02:13] heh
[09:02:49] ah /home/wikipedia/logs belong to root:root
[09:02:51] thanks apergos !!
[09:02:56] sure!
[09:02:58] for the /proc/pid tip
[09:03:32] * hashar dig in puppet to find out why /home/wikipedia/log no more belong to udp2log
[09:04:53] git blame : 7b523a68 (Antoine Musso
[09:04:54] yeah
[09:05:00] :-D
[09:05:00] that guy is breaking everything
[09:05:08] gotta watch him :-D
[09:07:09] hashar: lsof is useful too.
[09:12:35] paravoid: lsof is great indeed
[09:13:11] so /home/wikipedia/logs was not belonging to udp2log user https://gerrit.wikimedia.org/r/8442 fix it
[09:13:24] paravoid | apergos > could you look / merge 8442 above (in test branch)
[09:13:26] please ;)
[09:17:34] Greek ops are lunching ;-D
[09:17:44] no, it's too early
[09:17:47] I'm looking at the change
[09:18:16] and musign over the "wikidevs not available in labs" comment
[09:18:19] *musing
[09:18:48] yeah there is no wikidev group yet
[09:18:59] we might use deployment-prep
[09:19:03] or the svn (550) one
[09:19:05] just seems odd
[09:19:06] I am not sure
[09:19:10] something I need to write about
[09:19:13] anyways it doesn't matter right now
[09:19:44] the issue is that users are in the 550 (svn) user group per default :D
[09:20:06] and I have no idea how to create a wikidev group on labs and have users from deployment-prep to use that as a primary group
[09:21:32] are you able to push out to test now
[09:21:38] or do I need to do something after the merge?
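The fd-11 hunt above generalizes to a two-line recipe on Linux. Here we inspect our own shell rather than a daemon (for a daemon you would substitute its pid, or use `lsof -p <pid>` as suggested at [09:07:09]); the log file path is illustrative:

```shell
#!/bin/sh
# Open a descriptor, then ask /proc where it points — the same
# /proc/<pid>/fd trick used to identify udp2log's fd #11 above.
exec 3>>/tmp/fd-demo.log          # open fd 3 on a log file
readlink "/proc/$$/fd/3"          # prints /tmp/fd-demo.log
exec 3>&-                         # close it again
```

The strace output quoted above (`open(...) = -1 EACCES`) is the complementary tool: `/proc/<pid>/fd` tells you where an open descriptor already points, while strace shows you the open that failed.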
[09:21:41] I can apply puppetd -tv
[09:21:46] but not able to submit a change / merge it
[09:21:54] ohhh
[09:22:05] once merged a cronjob pull the change every minute
[09:22:06] ;)D
[09:22:08] thanks!
[09:22:10] ok great
[09:23:19] works!
[09:23:42] fixed!
[09:24:22] I got logs!!
[09:24:24] yeah
[09:24:27] thanks apergos !!
[09:24:33] sure
[09:25:55] sorry, I was having breakfast
[09:26:14] no worries
[09:31:03] it s ok paravoid ;-D
[09:31:52] New patchset: Dzahn; "add eqiad labs-hosts a-d subnets to autoinstall" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8443
[09:32:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8443
[09:35:42] New review: Dzahn; "Andrew Bogott will need them." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8443
[09:35:44] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8443
[09:45:08] New patchset: Dzahn; "and add eqiad private1-c and 1-d subnets to autoinstall as well" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8444
[09:45:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8444
[09:46:47] New review: Dzahn; "info from existing reverse DNS in 10.in-addr.arpa" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8444
[09:46:49] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8444
[09:56:21] ugh, "Cannot use opaque URLs 'puppet::///files..."
[09:56:26] opaque?
[09:57:02] ok..hmm
[10:01:19] New patchset: Dzahn; "fix puppet runs on brewster, typo, one : too much and you get "Cannot user opaque URLs.."" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8446
[10:01:39] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8446
[10:01:57] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8446
[10:02:00] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8446
[10:03:01] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue May 22 10:02:46 UTC 2012
[10:14:21] New review: Dzahn; "this appears to have broken puppet on sodium:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6798
[10:18:51] I got two easy changes for review, meant to ask git to ignore './private/' directory at the root of operations/puppet https://gerrit.wikimedia.org/r/#/c/6471/ https://gerrit.wikimedia.org/r/#/c/6470/
[11:34:31] New patchset: Hashar; "database & memcached configuration for 'beta' project" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8448
[11:34:37] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/8448
[11:35:39] New review: Hashar; "I will do the FileRepo and squid stuff in a similar way." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/8448
[12:06:40] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[13:01:45] anyone around able to deal with the lag on db12?
[13:01:54] it's getting pretty bad
[13:11:54] hey mark, you around?
[13:12:04] question about the lighttpd_config define
[13:12:10] ok
[13:12:49] https://gist.github.com/2768958
[13:12:50] so
[13:12:53] oh oops
[13:12:55] not that
[13:13:17] https://gist.github.com/2768958
[13:13:18] ok fixed
[13:13:19] yeah so
[13:13:24] if $install == true
[13:13:28] it puts the enabled file in place
[13:13:30] and then
[13:13:31] no matter what
[13:13:41] creates a symlink to the enabled file in the available/ dir
[13:13:56] i would think you would want
[13:14:07] to create the available no matter what
[13:14:21] no
[13:14:26] and create the enabled symlink conditionally
[13:14:27] no?
[13:14:28] it may be created by other means, outside that definition
[13:14:48] ah I see
[13:15:03] so by default
[13:15:13] lighttpd_config just creates the symlink?
[13:15:25] yes
[13:16:32] hmm, ok
[13:16:34] in that case
[13:16:42] can I change it so that install => false still defines the file
[13:16:46] but does nothing but ensure => present?
[13:17:08] it seems the spots where install => false are for pre-packaged config files that come with lighttpd, right?
[13:17:26] or if you want to generate it with a template
[13:17:42] why would you want to do that?
[13:17:43] naw, its ok, i'm just trying to fix the reload exec
[13:17:52] and I had it subscribed to the available file too
[13:17:57] i guess I could just not subscribe it to that
[13:17:59] just notify the service
[13:18:02] if install => false
[13:18:02] yeah
[13:18:04] in the other direction
[13:18:15] i think peter was trying that and was having trouble
[13:18:18] but i don't remember why
[13:18:19] so I did this
[13:18:42] that's just because peter is a trouble maker ;-)
[13:18:44] but now I'm sorry I did because I don't really have anything to do with what they were working on, and now I have to fix it :p
[13:18:44] heheh
[13:18:57] ok, will just try that
[13:22:18] so, I'm not going to bother with this right now, but for defines like this, it's good if you build them with an $ensure parameter
[13:22:27] so that they work like most of puppet's native resource types
[13:22:54] so in this case, one should be able to do ensure => false and have the symlink removed
[13:22:59] and even
[13:23:06] ensure => purged and have the available file removed
[13:25:46] New patchset: Ottomata; "generic-defintions.pp - fixing lighttpd reload" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8454
[13:26:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8454
[13:26:25] mutante, could you check that and see if it fixes the problem on sodium?
[13:27:42] mutante: re: mailman, just followed wikitech to create the list?
[13:33:37] New patchset: Ottomata; "check_udp2log_log_age - Adding orange-ivory-coast to list of slow logs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8456
[13:33:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8456
[13:53:16] heyyyyy hashar
[13:53:18] you around?
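The `$ensure` semantics mark proposes above (from [13:22:18] onward) can be sketched in plain shell, independent of puppet; the directory names here are illustrative, not lighttpd's actual layout:

```shell
#!/bin/sh
# Sketch of the proposed ensure states for a config-file define:
#   present -> symlink in enabled/ pointing at the available/ file
#   absent  -> symlink removed, available file left alone
#   purged  -> symlink and available file both removed
conf_ensure() {
  name=$1 state=$2
  case $state in
    present) ln -sf "../available/$name" "enabled/$name" ;;
    absent)  rm -f "enabled/$name" ;;
    purged)  rm -f "enabled/$name" "available/$name" ;;
  esac
}

mkdir -p available enabled
echo 'server.modules += ( "mod_status" )' > available/10-status.conf
conf_ensure 10-status.conf present   # enabled/10-status.conf now exists
conf_ensure 10-status.conf absent    # link gone, available file remains
conf_ensure 10-status.conf purged    # both gone
```

Mapped back to puppet, this would be an `ensure` parameter on the define whose values drive the underlying `file` resources, so the define behaves like puppet's native resource types.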
[13:54:55] want to ask you about your misc::contint::jdk class
[13:55:01] i'd like to make it generic
[14:07:52] i like generic :D
[14:13:48] PROBLEM - MySQL Replication Heartbeat on db12 is CRITICAL: CRIT replication delay 184 seconds
[14:14:06] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 194 seconds
[14:14:12] great
[14:16:57] ottomata: go ahead ;-D
[14:17:17] ottomata: that was one of my very first puppet class, I probably copy/pasted from some internet site
[14:17:32] ottomata: then some op probably made me rewrite it entirely (twice or so) ;-D
[14:17:36] yeah, i see it, it looks cool
[14:17:38] hah
[14:17:44] but, there is a module from puppetlabs for it too
[14:17:46] http://forge.puppetlabs.com/puppetlabs/java
[14:17:49] so i'm going to try that first
[14:17:52] and maybe leave your stuff alone
[14:18:27] ottomata: one problem is that Oracle no more provide Debian packages of Sun JDK
[14:18:36] nor security update
[14:18:51] so we need to provides whatever free / new version is packaged in ubuntu instead
[14:19:10] what repo is the .deb from?
[14:19:14] in your stuff?
[14:19:20] ottomata: I think that was written by LeslieCarr btw (she on vacation in Europe this week I think)
[14:19:27] oh ok
[14:19:38] sun-java6-jre
[14:19:56] that's the package name, right?
[14:19:57] .deb packages are always served from http://apt.wikimedia.org/wikimedia/
[14:20:00] ahhhh
[14:20:00] k
[14:20:11] sun-java6-jre was packaged by Oracle IIRC
[14:20:13] that's fine then, as long as the package name is the same it should be fine
[14:20:16] ja?
[14:20:16] or if not, by ubuntu
[14:20:25] it looks like by oracle into canonical?
[14:20:26] maybe?
[14:20:28] and Oracle or Ubuntu just decided to drop support for that packaging
[14:20:53] so once upon a time, we had a nice company Sun
[14:20:55] then it died
[14:20:57] :-(
[14:21:18] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 181 seconds
[14:21:20] ha, yeah
[14:21:30] hmmm, i don't see the java debs in apt.wm.org
[14:22:06] ahhh
[14:22:06] sorry
[14:22:09] sun starts with S
[14:22:10] not with J
[14:22:29] don't use sun-java* anymore
[14:22:32] ottomata: here is the reference https://lists.ubuntu.com/archives/ubuntu-security-announce/2011-December/001528.html
[14:22:36] they have known security vulnerabilities
[14:22:40] relevant RT ticket is 2147
[14:22:51] and Oracle is not allowing people to distribute the new versions anymore
[14:22:53] http://apt.wikimedia.org/wikimedia/pool/main/s/sun-java6/
[14:23:13] okay, we have to fix that
[14:23:14] that's what we are using now, right?
[14:23:15] yeah we should not do that anymore :-D
[14:23:19] ok
[14:23:20] you should use openjdk
[14:23:26] hmm ok
[14:23:59] should be equivalent
[14:29:11] hmm, with the openjdk stuff
[14:29:17] I guess we don't have to agree to the license?
[14:29:50] which license?
[14:29:54] openjdk is gplv2+
[14:29:58] for sun
[14:30:01] um
[14:30:03] (kind of)
[14:30:15] https://gist.github.com/2769415
[14:30:18] that's what we are doing now
[14:30:28] that's the DLJ
[14:30:32] this is only for binary releases
[14:30:41] and it's not being offered anymore by Oracle
[14:30:59] so, no? I can just install the openjdk package?
[14:31:13] it was a special license that was created back in '06 for distributions to be able to distribute binary releases until OpenJDK caught on
[14:31:27] ottomata: no; yes :)
[14:31:27] hm, ok
[14:31:38] cool
[14:31:47] yes, isn't free software nice? :)
[14:31:49] man forget this module then, I just need to install the package!
[14:31:51] so much easier
[14:32:42] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[14:42:57] i think we can still use java sun jdk: https://wiki.ubuntu.com/LucidLynx/ReleaseNotes#Sun_Java_moved_to_the_Partner_repository
[14:43:24] AFAIK, there is a big performance diff between openjdk and sun jdk
[14:49:17] a bad one? heheh
[14:50:44] paravoid: yes, more or less. to just create a new one i use the web: https://lists.wikimedia.org/mailman/create , there are some policies around naming (no -l suffix anymore, etc) and always gotta make sure if public/private, the archiving options etc
[14:50:53] drdee2
[14:50:53] sorry, gotta run for travel. out towards Berlin
[14:50:56] OpenJDK is handy to have on a development system as it has more source for you to step into when debugging something. OpenJDK and Sun JDK mainly differ in (native?) rendering/AWT/Swing code, which is not relevant for any http://wiki.apache.org/hadoop/MapReduce Jobs that aren't creating images as part of their work.
[14:50:59] from http://wiki.apache.org/hadoop/HadoopJavaVersions
[14:54:21] ottomata: I think I had to install Sun JDK because of the android platform or at least as a requisite of some software used by the WMF mobile team
[14:54:30] ottomata: that might have changed since then ;-D
[14:58:57] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[15:22:38] paravoid: So, I'm now progressed to the point of getting an error message on virt1001, and it's telling me there's more info on virtual terminal 4. Any idea how I can actually view terminal 4?
[15:26:25] what's telling you that?
[15:26:26] d-i?
[15:26:44] yeah.
[15:26:52] Well, d-i or the partition tool that it's running.
[15:27:17] It also suggests I look at a logfile, but since I don't have a file-system, that seems a bit useless.
[15:27:28] Oh, wait, maybe I do have a filesystem... seems to be booting natively this time. Lemme see...
[15:27:47] No, my mistake, just trying to install again. [15:28:04] virtual terminal 4 is tty4 [15:28:10] in a normal system you would do alt+f4 [15:28:20] but now I presume that you run over serial redirection? [15:28:24] or do you have a kvm? [15:28:24] right. [15:28:47] I mean, from my point of view I'm ssh'ing. But I believe I'm actually connecting via a serial terminal. [15:30:19] lemme check our puppet config [15:30:19] It only displays that message for ~60 seconds before automatically disconnecting. I'm just about to come 'round to the message again. [15:30:36] oh? [15:31:26] 'Check /var/log/syslog or see virtual console 4 for the details.' [15:31:30] I'm back! But not for long :( [15:31:42] that means d-i failed somewhere [15:31:51] let's see how we can see that while on serial [15:31:54] I know! But I would like to see the log file. [15:32:02] thanks [15:32:35] I'm pretty excited that there's even the hint of a log... googling suggests that usually when partman fails it's totally silent. [15:34:24] It appears that by twiddling the arrow keys I can keep that message alive for longer. [15:38:45] can you navigate through the menu? [15:38:54] hit Go Back? [15:39:00] and then go on the menu? [15:39:08] and then spawn a shell? [15:39:27] andrewbogott_: ^ [15:39:48] Maybe! [15:40:26] Any idea where 'spawn a shell' is in install process? [15:40:33] in the main menu [15:40:37] should be among the last entries [15:40:47] Oh, it scrolls! [15:40:49] OK, found it. [15:40:54] andrewbogott_: or just use alt fX [15:40:58] where X is number 1-12 [15:41:06] that should spawn shell too, maybe [15:41:18] OK, I have a shell [15:41:32] petan|wk: he's on serial :-) [15:41:37] oh [15:41:47] ...and here's a syslog. let's see what's in here. [15:41:49] andrewbogott_: tail /var/log/syslog (captain obvious) [15:42:08] sad that I don't have vi or less [15:42:18] there should be a "more" [15:42:25] and nano probably [15:42:27] yeah, I just hate it... 
[15:42:32] indeed [15:42:33] Can I search in 'more'? [15:42:56] depends on the more [15:43:00] this is busybox, so probably no [15:43:05] PROBLEM - Puppet freshness on sodium is CRITICAL: Puppet has not run in the last 10 hours [15:43:07] the error should be on the last lines though [15:43:09] try tailing [15:51:55] andrewbogott_: feel free to paste lines from the log [15:52:21] Everything in the log looks fine :( [15:53:13] what are the last lines in the log? [15:56:25] It does say '10 multiraid doesn't exist' a lot of times. But it seems to say that things don't exist often, right before loading them. [15:57:56] Here we go... May 21 23:42:40 partman-auto-raid: mdadm: invalid raid level: raid8 [15:57:56] May 21 23:42:40 partman-auto-raid: Error creating array /dev/md0 [15:58:24] why are we doing software raid btw? [15:58:40] No hardware raid support on these boxes, reportedly. [15:59:16] oh, okay [15:59:21] trying cat /proc/mdstat [15:59:44] The partman recipe, by the way, is in the puppet repo: files/autoinstall/partman/virt-raid10.cfg [16:00:11] I suck at that :) [16:00:23] No more than everyone else [16:00:23] could never get the hang of it [16:00:28] # cat /proc/mdstat [16:00:28] Personalities : [16:00:29] unused devices: [16:01:09] nice [16:01:11] so no arrays assembled [16:01:14] try mdadm -A [16:01:28] well, wait, my partman file is obviously wrong, since I tell it to use raid 8 when I should be using raid 10. [16:01:39] So, that's a good place to start. [16:01:56] um... well, actually... [16:02:03] raid 8? that's a new one for me [16:02:22] It's a typo. I'm missing the raid type field, and I'm configuring 8 drives. [16:02:34] ha [16:02:35] So, that is most likely the problem. Or at least /a/ problem. [16:03:00] what's weird is, that line is copy/pasted from another working file... [16:05:03] New patchset: Andrew Bogott; "Try using raid "10" rather than raid "". Maybe that'll help." 
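The "invalid raid level: raid8" error above is exactly what a missing field in a partman-auto-raid recipe produces: with the raid-type value omitted, the device count (8) gets parsed as the level. A corrected recipe line looks roughly like this. This is a hedged illustration, not the real virt-raid10.cfg; the fields are raid-type, device count, spare count, filesystem, mount point, then the '#'-separated device list:

```
d-i partman-auto-raid/recipe string \
    10 8 0 ext4 / \
        /dev/sda1#/dev/sdb1#/dev/sdc1#/dev/sdd1#/dev/sde1#/dev/sdf1#/dev/sdg1#/dev/sdh1 \
    .
```

With the raid-type field empty, every value shifts left and mdadm is handed a nonsense level, failing exactly as in the syslog excerpt.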
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/8463 [16:05:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8463 [16:05:46] New review: Andrew Bogott; "Lookin' good, dude!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8463 [16:05:48] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8463 [16:07:34] OK, so, another thing I don't understand: How do I get my patch from git into actual production so I can test it? [16:08:33] we could tell you, but then we'd have to kill you [16:09:34] :-D [16:09:53] on that note, why do we have both stafford and sockpuppet? [16:10:08] it's called legacy [16:10:15] we'll make sockpuppet go away at some point [16:10:23] it's very popular, the legacy thing ;-) [16:10:28] yes [16:10:39] My standard approach is to wait a while, and then ben and/or leslie say "do you want this merged?" and I say "yes" and then a miracle occurs. [16:10:44] what's pending though? [16:10:52] just someone to do the work [16:10:54] I'm cool with that system, except when they're not online. [16:11:03] migrate the CA off sockpuppet, a few other tidbits [16:11:18] documented somewhere? [16:11:23] of course not [16:11:25] what are you thinking [16:11:38] care to tell me so I can do it? :) [16:11:45] if I had to document what was going to need to happen I might as well just do it in the same time ;-) [16:12:44] I dunno, just make stafford work as the CA [16:12:55] make sure everyone syncs on stafford instead [16:12:59] that's probably it by now [16:13:12] we've turned off dashboard [16:13:18] so sockpuppet isn't needed for that anymore [16:13:38] I wonder if that's more than an scp [16:13:41] but we'll see [16:13:53] clients may complain about the different cert [16:14:23] So... my patch? 
[16:16:43] andrewbogott_: https://labsconsole.wikimedia.org/wiki/Git#Production [16:18:06] New patchset: Jgreen; "adding khorn to deploy group, remove deprecated users from erzurumi" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8464 [16:18:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8464 [16:18:29] RECOVERY - MySQL Replication Heartbeat on db12 is OK: OK replication delay 28 seconds [16:18:52] mark: And when leslie and/or ben ping me, that's because they're merging their own changes and notice mine jumbling up their diff? [16:18:59] exactly [16:19:07] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8464 [16:19:10] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8464 [16:19:16] And on production you just do all of the fetch&merge stuff as root? [16:20:03] andrewbogott_: I've caught your diffs [16:20:08] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 12 seconds [16:20:23] i'll merge them now if there's no objection [16:20:45] No objection, other than that I should do it myself someday. [16:20:54] k [16:20:57] You're doing the merge on sockpuppet, right? Where do the files actually live? [16:21:48] afaik they're on sockpuppet and on stafford [16:22:21] although I haven't looked very closely how they're kept sync'd there, looks like maybe a post merge rsync? [16:24:05] I mean, where /on/ sockpuppet? (I could of course search for them myself, just trying to save myself a 'find' ) [16:24:40] andrewbogott_: /root/puppet [16:24:48] Jeff_Green: it's just a pull from sockpuppet to stafford [16:24:50] that's easy to remember :) [16:25:03] anyway... Jeff_Green, merged? 
[16:25:05] mark: oic [16:25:11] yup merged [16:25:14] ah yes, that too [16:25:22] paravoid: change the git remote to gerrit [16:25:25] on stafford [16:26:31] * andrewbogott_ reruns autoinstall, crosses fingers [16:41:45] andrewbogott_: so? [16:42:10] so... my patch isn't actually taking effect for some reason. Exact same behavior as last time. [16:42:59] did you run puppetd -vt on brewster? [16:43:21] nope! [16:43:34] See, there turn out to be all these things that Ryan did last night and didn't narrate :( [16:43:54] Doing that now. [16:44:21] paravoid: Are there any more steps, that you know of? [16:44:30] not that I know of :-) [16:44:51] ok, puppet run on brewster was clean. So, lemme cycle virt1001 yet again. [16:48:13] isn't brewster the tftp host? [16:48:23] or did I just tell you to run puppet on a random host? :) [16:48:49] Dunno... 'brewster' sounds vaguely familiar from yesterday. [16:49:08] And anyway, it 'should be harmless' to run puppet on a random host, since it runs itself now and then anyway. [16:59:37] paravoid: Ok, still failing, but failing differently! [17:06:22] paravoid: OK, I suspect that my problem now has to do with logical vs. primary partitions (something that, strangely, I never understood properly even in the 1990's.) Note, for example, that cp-varnish.cfg just marks all three partitions as primary, and I do not. [17:06:26] any thoughts? [17:07:05] sec [17:08:31] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8448 [17:10:03] andrewbogott_: I think you should mark them as primary [17:10:11] Sure, worth a try :) [17:10:31] you can't have more than 4 primary partitions [17:10:40] hence the need for logical, which are number 5 onwards [17:10:57] I rarely use logical partitions, I usually have 2 or 3 primary and then use lvm [17:11:00] Huh, I thought the rule was 'can't have more than two' for some reason. 
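paravoid's constraint above (an MBR partition table holds at most four primary partitions; logical ones live inside an extended partition and are numbered from 5 up) maps to the $primary{ } flag in partman recipes. An illustrative expert_recipe fragment, not the actual cp-varnish.cfg, with made-up sizes:

```
d-i partman-auto/expert_recipe string \
    boot-root :: \
        300 500 500 ext3 \
            $primary{ } $bootable{ } \
            method{ format } format{ } \
            use_filesystem{ } filesystem{ ext3 } \
            mountpoint{ /boot } \
        . \
        5000 10000 -1 ext3 \
            $primary{ } \
            method{ format } format{ } \
            use_filesystem{ } filesystem{ ext3 } \
            mountpoint{ / } \
        .
```

Omitting $primary{ } leaves a partition logical, which is fine until something downstream (a bootloader or a raid recipe) expects a primary device.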
[17:11:09] thankfully no :) [17:11:39] ok, let's see if I remember these other steps [17:11:43] New patchset: Andrew Bogott; "Make all the partitions primary because why not?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8467 [17:12:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8467 [17:12:19] New review: Andrew Bogott; "You're on a roll today!" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8467 [17:12:21] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8467 [17:12:47] you're changing the source (gerrit), deploying it on the puppetmasters (two at the moment) and then instruct the server to fetch changes from the master and apply them [17:14:19] ok... git is asking me for the root password on stafford. [17:14:30] I'm pretty sure I've never needed that before. Does that mean I'm doing something wrong? [17:14:51] * andrewbogott_ thinks this conversation should immediately move to mediawiki_security [17:15:27] analytics1001.eqiad.wmnet [17:15:33] 1001-1010 [17:15:39] oops wrong room [17:15:46] <^demon|away> andrewbogott_: "password" ;-) [17:16:15] It makes no sense that a 'git merge' should ask me for a password. And, the second time, it didn't. I'm baffled. [17:16:35] <^demon> If it was pulling it over ssh and you didn't have a key, it would. [17:17:10] But I did a 'fetch' beforehand. [17:17:14] So 'merge' should be totally local. [17:17:27] <^demon> hrm. [17:17:37] I am going to pretend that that didn't happen, for now. [17:18:07] <^demon> That's been my outlook on transient git failures :) [17:18:15] <^demon> Did it work the second time? Yes? Ok moving on :) [17:18:59] modifying sockpuppet should cause a consequent rsync (or something) on stafford. So maybe I've caused things to be out-of-sync now. mark, does this worry you? 
[17:21:14] hi guys [17:21:29] is there a generic place I can include the debconf-utils package? [17:21:37] i can put it in a standalone class and just include it [17:21:48] i'm just not sure where that class should live [17:22:52] or, is debconf utils pretty much always available? [17:30:32] woo segfault! [17:31:00] andrewbogott_: there's a post-merge hook that rsyncs to the other box [17:31:45] yeah, which didn't happen in my case. so I wonder if my config is coming from sockpuppet or stafford [17:31:53] cat .git/hooks/post-merge [17:31:57] and run that by hand [17:33:17] Getting lots of this "partman-auto-raid: mdadm: partition table exists on /dev/sdb2 but will be lost or..." [17:34:05] anyway, I have to run... back after lunch. [17:38:05] New patchset: Aaron Schulz; "Moved all wikis to use Swift thumb copy hook." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8472 [17:38:11] New review: jenkins-bot; "Build Successful " [operations/mediawiki-config] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/8472 [17:39:17] New patchset: Bhartshorne; "settting swift to write thumbs to no wikis; mediawiki will write thumbs for all wikis." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8473 [17:39:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8473 [17:42:05] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/8473 [17:42:07] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8473 [17:47:08] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8472 [17:47:10] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/8472 [18:12:10] New patchset: Ottomata; "Adding new java class to generically manage JRE and JDK installation." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/8477 [18:12:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8477 [18:12:35] yay! [18:12:43] let's see now, who can review that for me [18:12:45] its really cool! [18:12:48] mayyyybeeeeeee [18:12:56] mark would like it if he was around [18:13:05] maplebed would probably like it, but I think he is busy [18:13:31] maybe notpeter? [18:13:38] ^ [18:13:44] i dunno, who likes pretty puppet stuff? [18:15:14] New patchset: Ottomata; "Adding new java class to generically manage JRE and JDK installation." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8477 [18:15:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/8477 [18:17:51] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8477 [18:17:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8477 [18:18:31] <^demon> Ryan_Lane: Don't suppose you could poke those commits I asked about yesterday? [18:18:47] I have like 10 mins [18:18:51] if I can do so in that time, yes :) [18:18:55] best work fast then ;) [18:19:16] you want me to push through this gerrit change right before I get on a plane? [18:19:34] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6005 [18:20:33] ^demon: https://gerrit.wikimedia.org/r/#/c/6005/ fails to merge [18:20:38] <^demon> 6005 needs rebasing probably, not as urgent. 
[18:21:10] <^demon> 6578 and 8037 are the 2 most urgent and should merge [18:21:15] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/6578 [18:21:18] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6578 [18:22:12] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/8037 [18:22:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8037 [18:22:55] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7724 [18:22:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7724 [18:23:29] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7727 [18:23:32] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7727 [18:27:03] New patchset: Demon; "Re-attempting links for RT and CodeReview." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6005 [18:29:59] boarding. [18:30:03] * Ryan_Lane waves [18:36:54] New patchset: Asher; "first commit of dbtree" [operations/software] (master) - https://gerrit.wikimedia.org/r/8481 [18:39:13] dooodeee doooo, gonna poke again, then head to a cafe [18:39:15] ummmm [18:39:19] paravoid? [18:39:26] apergos maybe? [18:39:29] yes? [18:39:37] https://gerrit.wikimedia.org/r/#/c/8477/ :D [18:39:48] OH!!! [18:39:50] NM [18:39:54] Ryan_Lane already did it! [18:39:56] sorry bout that! [18:40:05] yes, thought so [18:40:12] thanks Ryan_Lane!!! 
[18:40:17] he left already [18:40:19] and to you for responding, faidon :) [18:40:21] aye [18:47:57] New patchset: Asher; "first commit of dbtree patch 2: fix ishmael links" [operations/software] (master) - https://gerrit.wikimedia.org/r/8481 [18:59:29] any of my fellow ops folks wanna review my dns changes before i commit? [19:00:53] RobH: sure. [19:05:09] RobH: I can't see the section of the file that determines the second and third octet of the IP address, but so long as the analytics, labstore, and osm folks are in 10.65.3 and the ms-be are in 10.65.5, we're good to go. +1 commit. [19:06:54] PROBLEM - Host db1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:06] RECOVERY - Host db1001 is UP: PING OK - Packet loss = 0%, RTA = 26.45 ms [19:17:49] cool [19:18:19] !log dns update for new servers mgmt ips [19:18:23] Logged the message, RobH [19:24:49] New patchset: Ottomata; "java.pp - need to accept license before installing sun-java6 packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8486 [19:29:55] New patchset: Ottomata; "java.pp - need to accept license before installing sun-java6 packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8486 [19:31:17] !lot reimaging db1001 and db1020 [19:31:43] !log reimaging db1001 and db1020 [19:31:46] Logged the message, notpeter [19:32:41] paravoid, Ryan is actually gone this time [19:32:49] can I bother you with an approval request? [19:32:53] https://gerrit.wikimedia.org/r/8486 [19:34:49] done [19:34:56] variable in the class definition? [19:34:57] PROBLEM - Host db1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:00] didn't even know you could do that [19:35:32] doesn't look very clean to me, although it may just be that I'm not used to it... 
[19:35:41] parameterized classes, ja [19:35:42] PROBLEM - Host db1020 is DOWN: PING CRITICAL - Packet loss = 100% [19:35:55] http://docs.puppetlabs.com/guides/parameterized_classes.html [19:36:50] oh I know about parameterized classes [19:37:14] I was referring to class java::jre($package_prefix) { [19:37:18] aaah [19:37:24] I thought I read jre::$package_prefix [19:37:38] ok, it's getting late, I really should stop [19:37:50] I had a major "wtf does this even work" [19:37:52] thanks! [19:40:30] RECOVERY - Host db1001 is UP: PING OK - Packet loss = 0%, RTA = 26.42 ms [19:41:15] RECOVERY - Host db1020 is UP: PING OK - Packet loss = 0%, RTA = 26.63 ms [19:43:30] PROBLEM - MySQL Recent Restart on db1001 is CRITICAL: Connection refused by host [19:43:48] PROBLEM - MySQL Slave Running on db1001 is CRITICAL: Connection refused by host [19:43:57] PROBLEM - MySQL disk space on db1001 is CRITICAL: Connection refused by host [19:44:06] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: Connection refused by host [19:44:15] PROBLEM - MySQL Idle Transactions on db1001 is CRITICAL: Connection refused by host [19:44:15] PROBLEM - MySQL Slave Running on db1020 is CRITICAL: Connection refused by host [19:44:15] PROBLEM - SSH on db1020 is CRITICAL: Connection refused [19:44:33] PROBLEM - Full LVS Snapshot on db1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:44:33] PROBLEM - SSH on db1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:33] PROBLEM - MySQL Slave Delay on db1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:44:51] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:00] PROBLEM - MySQL disk space on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:09] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
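The construct paravoid double-takes on, class java::jre($package_prefix), is an ordinary parameterized class. A stripped-down sketch, loosely modeled on the java.pp change under review; the names are illustrative, not the committed manifest:

```puppet
# A parameterized class: the caller supplies the package family.
# No default is given, so the parameter is mandatory -- matching the
# follow-up commit "ah, there is no default to these".
class java::jre($package_prefix) {
    package { "${package_prefix}-jre":
        ensure => present,
    }
}

# Parameterized classes are declared resource-style so the arguments
# can be passed explicitly:
class { 'java::jre':
    package_prefix => 'openjdk-6',
}
```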
[19:45:18] PROBLEM - MySQL Recent Restart on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:36] PROBLEM - Full LVS Snapshot on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:45:45] PROBLEM - MySQL Idle Transactions on db1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:47:15] RECOVERY - SSH on db1020 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:47:24] RECOVERY - SSH on db1001 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [19:47:48] New patchset: Ottomata; "java.pp - ah, there is no default to these. Just a comment doc change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8488 [19:52:18] hey sooooo, is anyone available to help me with this RT? [19:52:19] https://rt.wikimedia.org/Ticket/Display.html?id=2992 [19:52:21] RECOVERY - MySQL Recent Restart on db1001 is OK: OK seconds since restart [19:52:28] I need to commit something to the private puppet files repo [19:52:31] RECOVERY - MySQL Slave Running on db1001 is OK: OK replication [19:52:31] RECOVERY - MySQL disk space on db1001 is OK: DISK OK [19:52:39] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay seconds [19:52:48] RECOVERY - MySQL Idle Transactions on db1001 is OK: OK longest blocking idle transaction sleeps for seconds [19:53:06] RECOVERY - MySQL Slave Delay on db1001 is OK: OK replication delay seconds [19:53:06] RECOVERY - Full LVS Snapshot on db1001 is OK: OK no full LVM snapshot volumes [19:55:09] !log starting xtrabackup dump from db1033 to db1001 for new eqiad s1 slave [19:55:13] Logged the message, notpeter [20:00:36] RECOVERY - MySQL disk space on db1020 is OK: DISK OK [20:00:54] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay seconds [20:00:54] RECOVERY - MySQL Recent Restart on db1020 is OK: OK seconds since restart [20:01:12] RECOVERY - Full LVS Snapshot on db1020 is OK: OK no full LVM snapshot volumes [20:01:30] RECOVERY - 
MySQL Slave Running on db1020 is OK: OK replication [20:01:39] RECOVERY - MySQL Idle Transactions on db1020 is OK: OK longest blocking idle transaction sleeps for seconds [20:02:15] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay seconds [20:05:40] !log starting xtrabackup dump from db1004 to db1020 for new eqiad s4 slave [20:05:43] Logged the message, notpeter [20:24:06] !log powering up db1003 [20:24:10] Logged the message, notpeter [20:28:26] RECOVERY - Host db1003 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [20:29:07] New patchset: Pyoungmeister; "decom of db13 and storage2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8494 [20:31:48] paravoid, do you have access to private puppet repo? [20:32:11] PROBLEM - mysqld processes on db1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [20:35:42] maplebed, I know you are busy, but do you know who I should ask to get something into the private puppet repo? [20:38:06] woosters, do you know who I should ask to get a file committed to private puppet repo? [20:39:13] ben is interviewing, asher out for lunch ...let me try jeff_green [20:39:28] i'm here [20:39:48] hi Jeff, do you have access to the private puppet repo? [20:39:56] I need to check in a GeoIP.conf file that has some license keys in it [20:39:59] https://rt.wikimedia.org/Ticket/Display.html?id=2992 [20:40:08] i do [20:41:03] yay, got a sec to put this file in there somewhere? [20:41:09] i guess I want to be able to do [20:41:19] source => "puppet:///volatile/GeoIP.conf" [20:41:25] or whatever path it ends up being [20:42:32] looking [20:51:28] ottomata: sure, I can add it to the private repo [20:51:40] where can I fetch it? [20:52:05] the one attached to the ticket as-is? 
[20:52:05] should be attached to rt ticket [20:52:07] yeah [20:52:23] once it is in place [20:52:29] which of those two solutions do you think is better [20:52:49] the first one requires fewer changes, and keeps the license keys only on the puppetmaster [20:53:07] the 2nd is less complicated, but has the keys on each machine that uses geoip [20:53:13] ? [20:55:44] brb [20:56:01] i think we already do and dislike #2 somewhere [21:03:47] so #1 is better then? [21:03:49] i'm fine with that [21:04:00] i'm not sure [21:04:02] the geoipupdate script will install the files in /usr/share/GeoIP [21:04:23] if we are going to have puppet distribute the downloaded .dat files from puppetmaster [21:04:28] I vaguely remember a thread about this somewhere in the past few months, and people had opinions on both approaches, and I don't remember the outcome [21:04:36] I'll have to symlink that directory into somewhere that puppet fileserver can get at [21:04:40] right [21:04:43] or add a fileserver module for that dir [21:04:53] i kinda like #1 better [21:05:00] because then these files only have to be downloaded once [21:05:03] right [21:05:05] and then distributed internally [21:05:13] if for some reason we had a huge # of machines using GeoIP [21:05:16] then they'd all have to dl the same file [21:05:20] burning question though is whether puppet is sane for propagating huge files [21:05:53] it puts all that file serving pressure on puppetmaster [21:07:32] ottomata: I think it'd be wise to run #1 by ops@ and see if anyone has an objection [21:07:46] that's true [21:08:16] I think the City one is around 50M or 60M [21:08:16] so yeah [21:09:04] yeah huge [21:10:28] ok, ops@wikimedia.org [21:10:28] ? 
[21:11:25] ya [21:11:28] cool [21:11:30] ok, well either way [21:11:35] the GeoIP.conf file needs to be in puppet repo [21:11:46] ok I'll tweeze it now [21:11:48] so if you get that in there and let me know how to distribute it, I can implement either solution [21:11:48] cool [21:15:08] it'll be at puppet:///private/geoip/GeoIP.conf [21:15:56] danke! [21:15:58] thanks so much! [21:16:02] for #1 I think you need to pull the file to stafford now [21:16:27] ? [21:16:34] I'm a little confused about why we still have host sockpuppet [21:16:43] puppet points to stafford now [21:16:44] ha, I don't know anything about the puppetmaster setups [21:16:59] I don't have access to those machines, so if you could pull them for me, I'd be much obliged [21:17:07] we're in a funky in-between state [21:17:59] the geoip conf is on both stafford and sockpuppet now [21:18:41] I'm not sure what you're asking re. pulling files, are you hoping to get the dat files onto the active puppetmaster right now? [21:21:46] naw, i'm going to wait til they answer me [21:21:50] just emailed ops@ [21:22:07] um, you said [21:22:07] or #1 I think you need to pull the file to stafford now [21:22:25] and I was responding, maybe I misunderstood you [21:22:45] I will set up puppet stuff to put the file in place wherever it ends up being [21:22:50] i just needed it available, which you have done for me [21:22:52] so thank youuuuuuuuuu [21:25:21] ha ok [21:26:07] yeah, the conf file is available now, i just meant if you're doing #1 it's stafford that will need to retrieve the dat files and serve them to puppet clients, not sockpuppet [21:28:43] hmmm [21:28:46] virt0.wikimedia.org [21:28:47] ? [21:28:53] seems to be what is distributing them right now [21:29:02] maybe? [21:29:26] where do you see that? 
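Once the file is in the private repo, consuming it boils down to a file resource pointing at the path Jeff_Green quoted. A hedged sketch of what ottomata's puppet stuff might look like; everything here except the source URL is illustrative:

```puppet
# Install GeoIP.conf (which carries the MaxMind license keys) so that
# geoipupdate can fetch the .dat files into /usr/share/GeoIP.
# Under option #1 in the discussion, only the puppetmasters would
# include this class and then serve the downloaded .dat files on to
# clients; under option #2 every geoip client would include it.
class geoip::updater_config {
    file { '/etc/GeoIP.conf':
        ensure => file,
        owner  => 'root',
        group  => 'root',
        mode   => '0600',   # keep the license keys root-only
        source => 'puppet:///private/geoip/GeoIP.conf',
    }
}
```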
[21:30:13] ahh, I see the others too now [21:30:16] i betcha that is labs or something [21:30:17] um [21:30:24] openstack::controller [21:30:43] but yeah, both stafford and sockpuppet include puppetmaster class [21:30:53] which is where I would configure the GeoIP.conf installation for #1 [21:30:54] so yeah [21:32:01] yeah, that makes sense [21:34:33] !log updating dns for mgmt of new servers in eqiad [21:34:36] Logged the message, RobH [22:07:28] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [22:20:13] PROBLEM - udp2log processes for emery on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/local/bin/packet-loss, /var/log/squid/filters/india-filter, /usr/local/bin/sqstat, /var/log/squid/filters/latlongCountry-writer, [22:22:01] RECOVERY - udp2log processes for emery on emery is OK: OK: all filters present [22:31:20] hello opsen [22:31:36] im trying to create an rt ticket but i keep getting the following error: [22:31:36] Error 324 (net::ERR_EMPTY_RESPONSE): The server closed the connection without sending any data. [22:36:53] New patchset: Bhartshorne; "updating ring files to move container storage to two dedicated drives on ms-be1 to test container read speed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8512 [23:01:17] New patchset: Ryan Lane; "Fixing l10n user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8554 [23:04:44] New patchset: Ryan Lane; "Fixing l10n user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8554 [23:06:12] New patchset: Ryan Lane; "Fixing l10n user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/8554 [23:28:55] binasher: ping [23:29:16] Ryan_Lane: ping [23:29:23] pong [23:29:36] preilly: ? [23:29:48] Can someone invalidate the mobile caches!? 
[23:29:50] * Reedy guesses [23:31:50] !log flushed the varnish cache for mobile [23:31:53] Logged the message, Master [23:32:27] Reedy: good guess [23:32:40] :d [23:49:43] !log flushed the varnish cache for mobile again [23:49:46] Logged the message, Master