[00:04:33] notpeter: still about?
[00:06:01] RobH: I had officially stopped working, but is it important?
[00:06:15] just running into partman error on cp installs
[00:06:28] was wondering if you had tweaked squid-raid1.cfg or whatever
[00:06:40] seems we need some of the new cp servers online tonight to cover mobile
[00:06:44] spinning up mobile cps?
[00:06:49] yea
[00:07:24] RobH: I could try checking the partmans out -- though it sounds like by hand might be the quickest for now…
[00:07:42] by hand is awful, i want to try to fix this
[00:07:48] its not acceptable for it to remain broken
[00:08:08] when i did the cp1044/45 install, it wasnt broken
[00:08:16] so something since then has made it inoperable, which needs fixing.
[00:08:35] RobH: I spent a lot of time trying to make a working partman config. I'm not sure partman can do that. might need a post-install script
[00:08:47] I did it by hand
[00:08:49] it used to work though.
[00:09:01] not for mobile CPs
[00:09:03] not that long ago, on cp#
[00:09:23] the installer for this is no different for mobile versus non
[00:09:30] partman is same for all squids
[00:09:45] unless it was recently changed, and broken.
[00:10:27] notpeter: so when you went to do it, it was broken?
[00:10:28] RobH: I don't think I changed it
[00:10:43] was it giving an error during the installer?
[00:11:02] New patchset: Asher; "ugly hack to get varnish purges working while w3/wp sends broken purge reqs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1941
[00:11:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1941
[00:11:42] RobH: when I went to make mobile CPs, I didn't use squid-raid1.cfg
[00:11:46] I went to making a new one
[00:11:57] so mobile is using varnish?
[00:12:00] that has two xfs partitions
[00:12:00] yes
[00:12:06] because if so that directly conflicts with what woosters told me to install these for
[00:12:08] god damn it.
[00:12:27] well, nice to know i wasted a couple of hours on this already.
[00:12:37] ugh, I'm sorry
[00:12:44] so varnish needs a different partitioning setup than normal squid?
[00:12:47] want help on it?
[00:12:47] yes
[00:13:11] /dev/sda5 /a/sda xfs nobarrier,noatime 0 2
[00:13:11] /dev/sdb5 /a/sdb xfs nobarrier,noatime 0 2
[00:13:19] RobH: building as squids is fine.. the actual disk partitioning is the same, except that sda5 and sdb5 are mounted with xfs
[00:13:20] that's what it wants to look like
[00:13:32] instead of being used as raw devices as squid does for coss
[00:15:02] Ryan_Lane: https://gerrit.wikimedia.org/r/#patch,sidebyside,1941,1,templates/varnish/blog.inc.vcl.erb === lulz
[00:15:16] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1941
[00:15:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1941
[00:15:24] hahaha
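For later readers: the by-hand setup implied by the two fstab lines quoted above would be roughly the following. This is a minimal sketch, assuming the sda5/sdb5 partitions already exist; the mkfs invocation is my assumption, not from the log.

    # format the two cache partitions and mount them the way varnish expects
    mkfs.xfs -f /dev/sda5
    mkfs.xfs -f /dev/sdb5
    mkdir -p /a/sda /a/sdb
    cat >> /etc/fstab <<'EOF'
    /dev/sda5 /a/sda xfs nobarrier,noatime 0 2
    /dev/sdb5 /a/sdb xfs nobarrier,noatime 0 2
    EOF
    mount -a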
[00:36:25] New patchset: RobH; "added cp103X to mobile varnish range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1943
[00:36:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1943
[00:37:39] notpeter: check that please? ^
[00:37:43] being good, not checking my own shit
[00:38:13] notpeter: argh, it rebooted back into the installer
[00:38:16] so bios may not be set right
[00:38:34] yeah, but I had to flip mine too
[00:38:45] i didnt, so setting it on cp1040 now
[00:39:13] RobH: that will only do 31-34 and 41-44
[00:39:26] bugger =P
[00:39:33] decline it and i redo
[00:39:39] well, i have not had a mistake yet
[00:39:46] i guess i should do some odd command to append to that one?
[00:39:59] yeah, you can just append
[00:40:02] er
[00:40:02] amend
[00:40:16] New patchset: Lcarr; "tryuing to see where ganglia chokes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1944
[00:40:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1944
[00:40:33] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1944
[00:40:50] notpeter: so i did my change locally, how exactly do i amend it?
[00:41:05] git commit -a --amend
[00:41:09] then push again
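Spelled out, the amend-and-repush workflow notpeter is describing looks like this; the remote name and push ref follow the usual Gerrit convention for this repo's production branch and are my assumptions:

    # amend the local commit; the Change-Id stays the same, so Gerrit
    # attaches the result to the existing change as a new patchset
    git commit -a --amend
    git push origin HEAD:refs/for/production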
[00:41:10] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[00:41:31] notpeter: sounds like we dont need as many as i thought, so puppet change is fine for that many when i fix it
[00:41:56] but otherwise we will only install down to cp1036, thats 5 servers when they wanted 2
[00:42:11] down to 36 sounds good
[00:42:22] New patchset: RobH; "fixed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1946
[00:42:30] meh, abandon and redo is faster.
[00:42:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1946
[00:43:48] notpeter: i am doing the even installs, figure we will do post os install stuff in a moment after these are done
[00:43:50] sound good?
[00:44:22] RobH: sure
[00:44:40] I already started doing some crap on 39, so I'm going to finish that
[00:44:43] but after that, yes
[00:44:58] New patchset: Lcarr; "another ganglia test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1947
[00:45:00] thats cool
[00:45:11] wrong pfr- doh
[00:45:14] i have cp1038 installer started, once its into the software part i move on to 36
[00:45:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1947
[00:45:22] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1947
[00:47:06] New review: RobH; "once more with feeling" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1946
[00:47:53] notpeter: when you are in bios
[00:47:57] turn off the logical processor under cpu
[00:48:06] thats hyperthreading, misrepresents cpu
[00:48:11] its not end of world, but its crappy
[00:48:33] kk
[00:48:58] heh, it used to break ganglia back in the day
[00:49:07] it had no idea where to put the ton of cpu core apaches
[00:49:23] hah
[00:51:09] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time
[00:53:57] wtf cp1038
[00:54:10] RobH: while we're making filesystems and shit, we should also run puppet
[00:55:05] RobH: want me to look at it if you finish up changes to site.pp?
[00:55:26] site.pp is checked in, making live now
[00:58:18] Change abandoned: RobH; "redid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1943
[01:00:42] New patchset: Asher; "use mod_rpaf on blog server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1950
[01:00:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1950
[01:01:08] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1950
[01:01:09] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1950
[01:01:44] Change abandoned: RobH; "tired of dealing with it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1946
[01:12:41] ok, so cp1036 is in the installer now
[01:12:55] New patchset: Pyoungmeister; "this one is for comrad robh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1953
[01:13:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1953
[01:13:21] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1953
[01:13:32] Your change could not be merged due to a path conflict.
[01:13:32] Please merge (or rebase) the change locally and upload the resolution for review.
[01:13:34] wtf
[01:13:41] notpeter: it reads that on the change.
[01:14:12] RobH: woops...
[01:14:22] aint ever easy ;]
[01:14:27] can you abandon?
[01:14:35] wrong branch....
[01:14:51] Change abandoned: Pyoungmeister; "wrong brach" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1953
[01:15:31] heh, you beat me to it
[01:16:31] New patchset: Pyoungmeister; "for comrade robh" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1954
[01:16:37] ok, that one won't be fucked up
[01:16:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1954
[01:16:57] New review: RobH; "once more with feeling!" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1954
[01:17:26] New review: RobH; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1954
[01:17:26] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1954
[01:18:36] its live
[01:19:33] ok
[01:19:50] so, 37, 39, and 40 have file systems all squared away
[01:19:55] so I'm going to run puppet on them
[01:20:08] 38 should be ready for post install
[01:20:26] ugh. need to specify puppetmaster
[01:20:27] ugh
[01:21:40] yea, want me to do that
[01:21:44] and you do post install on 38
[01:22:08] 37, 39, and 40 all have keys waiting to be signed on sockpuppet now
[01:22:11] will do 38
[01:23:09] signed all three
[01:23:12] running on 37 now
[01:23:40] hrmm
[01:23:46] 37 must have hit the other server first
[01:23:51] cuz its not hitting right server now
[01:23:58] wiping its keys and trying again
[01:24:28] kk
[01:24:33] I might have fucked one up
[01:24:42] thought I fixed it, though
[01:24:49] if it runs and hits the other puppetmaster
[01:24:55] ya have to do the remove keys on the client
[01:25:09] I did so
[01:25:10] if you just run the test with the right server, it will hit it to sign
[01:25:15] huh
[01:25:17] eh
[01:25:18] well, trying again
[01:25:19] whatevs
[01:25:23] indeed
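The first-run dance they are fighting with here, written out. This is a sketch using the 2.x-era puppet CLI; the fully qualified hostnames and the ssl path are assumptions, only "sockpuppet" comes from the chat.

    # on the new host: the first run submits a cert request to the master
    puppetd --test --server sockpuppet.pmtpa.wmnet
    # on the puppetmaster (sockpuppet): sign the waiting key
    puppetca --sign cp1037.eqiad.wmnet
    # back on the host: run again to pick up the signed cert and the catalog
    puppetd --test --server sockpuppet.pmtpa.wmnet
    # if the host hit the wrong master first, wipe its local certs and redo
    rm -rf /var/lib/puppet/ssl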
[01:25:37] 36 good to go?
[01:26:10] its rebooting post install
[01:26:11] right now
[01:26:21] so will be in 30 seconds or so
[01:27:00] kk
[01:27:43] I signed cert for 38
[01:27:47] goin to 36 now
[01:29:41] puppet is odd
[01:29:50] its making me specify the server a second time after the initial one
[01:29:57] which returns with a different error
[01:30:01] then puppetd --test works normal
[01:30:05] so odd.
[01:30:37] going to run initial puppet run on 36
[01:30:56] i think you will see what i mean
[01:31:07] wtf?
[01:31:38] weird
[01:31:55] I'm glad m_ark emailed us good instructions. very counterintuitive
[01:33:18] !log cp1037, cp1038, cp1039 os installed, varnish partitions mounted, and puppet run
[01:33:21] Logged the message, RobH
[01:33:33] binasher: ^ those are ready for you I do think, notpeter and i are finishing up cp1036 and cp1040 now
[01:33:38] thank you!
[01:33:55] puppet run on 36
[01:33:57] quite welcome
[01:34:07] puppet is finishing run on cp1040
[01:34:10] RobH: want to run puppet on 40 and then we call it done?
[01:34:18] yep, near done
[01:35:01] !log cp1040 and cp1036 ready for use
[01:35:03] Logged the message, RobH
[01:35:05] notpeter: all done
[01:35:08] binasher: they are all yours
[01:35:59] they look good, thanks again
[01:36:11] that leaves you like an hour to make them work ;]
[01:36:13] heh
[01:37:13] wha? blackout doesn't start for 3 hours+
[01:37:28] although, at my rate of whiskey consumption, I might beat it =P
[01:37:32] I kid
[01:38:22] hah
[01:39:21] i wonder how many people will actually switch to the mobile site… and when. if it gets hammered, it might not be til tomorrow morning
[01:49:24] black the planet!!!
[02:04:55] binasher: hey
[02:05:09] sorry, my hands were full of dinner stuff, had my bf transcribe
[02:05:20] LeslieCarr: i think i just found the problem, but not why it exists
[02:05:38] what's the issue ?
[02:05:50] LeslieCarr: gmond.conf on the cp aggregator had deaf = yes
[02:05:57] oh
[02:06:03] i made a ticket for that and forgot
[02:06:13] for some reason puppet isn't parsing the names when they're in a variable
[02:06:22] so have to split that out
[02:06:35] i'll grab that since i should have gotten to it today anyways
[02:11:57] LeslieCarr: after staring at the puppet files and gmond template..
[02:11:59] New patchset: Lcarr; "fixing cp hosts so they will properly alert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1955
[02:12:08] binasher: want to approve/deny ?
[02:12:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1955
[02:12:42] for some reason it isn't always looking at the internal "if host =~ blah, then true" bits
[02:12:46] i think the problem is just that $ganglia_aggregator = "true" is after the includes
[02:12:53] ah
[02:12:57] that could be it
[02:13:47] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1955
[02:13:57] i'll try that first :)
[02:15:48] New patchset: Lcarr; ""fixing cp hosts so they will properly alert"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1956
[02:15:57] binasher: check that out ?
[02:16:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1956
[02:16:17] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1956
[02:16:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1956
[02:17:35] binasher: are you sockpuppeting or should i ?
[02:18:35] i'll give it a shot
[02:18:45] doh i hadn't heard you so i just did
[02:18:47] :-/
[02:19:05] well the fetch bit, looks like you did the merge bit :)
[02:19:18] i'm running a puppetd --test now on cp1044
[02:20:04] LeslieCarr: that fixed it, thanks!
[02:22:04] w00t
[02:22:19] text me if anything else comes up
[02:22:43] i'll be back online at 8pm
[02:22:43] will do, thanks for hopping on
[02:22:48] no prob
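Per the diagnosis above, the likely mechanics: the node-scope $ganglia_aggregator variable was assigned after the includes, so the gmond template saw it unset when the class was evaluated and rendered deaf = yes on the aggregator; moving the assignment before the includes fixes it. The deploy-and-verify loop they then run, sketched out ("the fetch bit" and "the merge bit" are from the chat; hostnames and paths are assumptions):

    # pull the merged change onto the puppetmaster
    ssh sockpuppet 'cd /var/lib/git/operations/puppet && git fetch && git merge origin/production'
    # then on an affected cache host, e.g. cp1044:
    puppetd --test                      # re-renders gmond.conf from the template
    grep deaf /etc/ganglia/gmond.conf   # an aggregator must end up with: deaf = no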
[02:25:03] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1799s
[02:29:03] New patchset: Asher; "this adds cp1039 and cp1040 to the varnish backend pool (add to pybal conf to also add frontend)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1957
[02:29:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1957
[02:29:33] PROBLEM - Memcached on marmontel is CRITICAL: Connection refused
[02:30:05] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1957
[02:30:05] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1957
[02:34:53] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[02:35:29] !log cp1039-40 are now in service for mobile wikipedia
[02:35:31] Logged the message, Master
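The commit message above notes the frontend half lives in the pybal conf. PyBal pool files are one Python-style dict per server line, so the matching addition would look something like this sketch; the file name and weights are assumptions, not from the log:

    # hedged sketch of the corresponding pybal frontend entries
    cat >> mobile-pool.conf <<'EOF'
    { 'host': 'cp1039.wikimedia.org', 'weight': 10, 'enabled': True }
    { 'host': 'cp1040.wikimedia.org', 'weight': 10, 'enabled': True }
    EOF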
[03:49:05] okay, i'm back online :)
[04:16:03] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:24:53] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:39:53] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[05:23:31] is it expected that hitting "stop" on your browser results in circumvention of the blackout?
[05:24:29] looks like the blackout was implemented in javascript, overwriting the page content
[05:24:38] so yes pressing stop will stop script execution
[05:24:50] sorta odd :)
[05:25:09] they didn't exactly have the luxury of a long time to prepare the code :)
[05:25:47] well, it's the thought that counts. :)
[05:52:58] Ryan_Lane: you might appreciate this: http://torrus.wikimedia.org/torrus/CDN?path=%2FSquids%2FTotals%2FAll_squid_client_requests
[05:55:57] heh
[05:56:03] that's a pretty quick jump
[06:13:53] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours
[07:27:42] RECOVERY - Squid on brewster is OK: TCP OK - 0.000 second response time on port 8080
[08:01:52] PROBLEM - DPKG on db43 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:35:43] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours
[09:52:21] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 452125 MB (3% inode=99%):
[09:59:41] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 423807 MB (3% inode=99%):
[10:34:22] https://bugzilla.wikimedia.org/33509 could use a look. reedy RT'd it at least several days ago
[10:40:38] RECOVERY - MySQL slave status on es1004 is OK: OK:
[12:48:13] PROBLEM - Puppet freshness on mw1096 is CRITICAL: Puppet has not run in the last 10 hours
[12:58:02] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[13:03:47] !log restarted pdns on ns0
[13:03:48] Logged the message, Master
[13:10:21] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 6.668 seconds response time. www.wikipedia.org returns 208.80.152.201
[13:21:20] New patchset: Hashar; "gallium: allow postgre restart" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1958
[13:30:01] !log Starting mailman migration
[13:30:02] Logged the message, Master
[13:30:44] !log Set hold_domains = lists.wikimedia.org on lily, to hold new lists mails on the queue
[13:30:45] Logged the message, Master
[13:34:35] !log Stopped mailman on lily and sodium
[13:34:36] Logged the message, Master
[13:35:15] !log Stopped lighttpd on lily
[13:35:16] Logged the message, Master
[13:37:19] !log Created /var LVM snapshot on lily
[13:37:20] Logged the message, Master
[13:37:29] !log Removed all test messages on the exim4 queue on sodium
[13:37:31] Logged the message, Master
[13:38:20] !log Started rsync of selected mailman directories under /var/lib/mailman from lily to sodium
[13:38:21] Logged the message, Master
[13:58:13] Hi! I have a press inquiry in OTRS. Is anyone from the technical staff online?
[13:59:34] <^demon> Lots of people are. What's up?
[14:01:03] I have an inquiry from a Norwegian journalist who wants to know what impact the blackout has on other language editions, especially the Norwegian one, in terms of hits.
[14:01:21] please do this in #wikimedia-tech
[14:02:11] :)
[14:02:15] Thx. Will do.
[14:02:44] thanks :)
[14:21:51] !log rsync complete. Running dpkg-reconfigure mailman on sodium
[14:21:52] Logged the message, Master
[14:28:25] !log Setup lily to route lists.wikimedia.org mails to sodium
[14:28:26] Logged the message, Master
[14:35:28] New patchset: Mark Bergsma; "Disable holding all mail on sodium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1959
[14:36:02] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1959
[14:36:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1959
[14:40:40] !log Disabled hold_domains on sodium and lily
[14:40:42] Logged the message, Master
[14:45:07] !log Changed service IP addresses of lists.wikimedia.org in DNS to US prefixes
[14:45:09] Logged the message, Master
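Pieced together from the !log entries above, the lily-to-sodium migration amounts to roughly this sequence. It is a reconstruction, not commands from the log; config locations, the snapshot size, the volume names, and the rsync details are all assumptions.

    # on lily: hold new list mail on the exim queue, then freeze services
    #   hold_domains = lists.wikimedia.org     (exim main configuration)
    /etc/init.d/mailman stop
    /etc/init.d/lighttpd stop
    lvcreate --snapshot --size 20G --name var-snap /dev/lily-vg/var
    # copy the mailman state across and reconfigure it on the new host
    rsync -avH /var/lib/mailman/{lists,archives,data,qfiles} sodium:/var/lib/mailman/
    ssh sodium dpkg-reconfigure mailman
    # then: route lists.wikimedia.org to sodium, drop hold_domains, update DNS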
[14:52:24] mark: did you bounce a msg to wikitech-l without changing the date?
[14:54:02] I guess I did
[14:54:11] oh well, done now
[14:55:48] mark: well "today" is relative. anyway, forwarding on to the same place i saw sumanah forward the original (wikitech-ambassadors)
[14:56:44] nice, google is now blocking some mail because it's coming from a new ip address
[14:56:56] grey?
[14:57:03] no, after data
[14:57:18] mark: also, i mailed the TS announce list yesterday and it was held for moderation and then dab mailed the same thing (independently) to the list. i think i'm still in the queue? can you reject me? and maybe river should not be the list admin any more?
[14:57:44] that's not for me to decide
[14:57:56] if the toolserver admins want that changed, they can file a request
[14:58:05] sure, i can tell them to do that
[14:58:09] thanks
[14:58:14] but in the mean time cna you reject me? :)
[14:58:17] can*
[14:58:22] i'll have a look
[14:59:30] mark: see bottom of http://mail.python.org/pipermail/mailman-i18n/2012-January/001765.html for some stats in case you care ;)
[14:59:52] I don't see your mail
[14:59:57] lunch
[15:00:45] hrmmmmm
[15:01:34] does this help? 17 Jan 2012 17:14:39
[15:01:37] UTC
[15:01:53] no, it's simply not there
[15:02:26] i guess someone got to it then. i never got a reject msg
[15:03:43] (i mailed another non wikimedia list a few months ago and was moderated ~10-15 days after sending the msg... i think they just moderate all new ppl. anyway it was about an event and the msg got through a week after the event ;( )
[15:15:03] New patchset: Mark Bergsma; "Google and possibly others are rate limiting our new ip, so use the old server(s) for delayed messages (for now)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1960
[15:15:33] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1960
[15:15:34] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1960
[15:17:38] PROBLEM - Host srv278 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[15:22:37] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[15:23:48] j #wikimedia-labs
[15:48:02] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[15:59:54] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time
[16:11:38] PROBLEM - Recursive DNS on 208.80.152.131 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[16:22:27] RECOVERY - Recursive DNS on 208.80.152.131 is OK: DNS OK: 6.189 seconds response time. www.wikipedia.org returns 208.80.152.201
[16:23:44] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours
[17:09:36] apergos: dataset1001 is here
[17:10:09] yay
[17:21:40] all kinds of shit came in today actually
[17:25:07] RECOVERY - Host dataset1 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[17:31:36] Can someone add DNS for wikimedia.pl ? Or should I just put it in RT?
[17:31:37] LeslieCarr: the sfp+ modules came in today
[17:31:45] yay RobH !!!
[17:31:49] now we can get that psw up
[17:31:50] and then nag you guys about RT?
[17:31:59] hexmode sure, gimme the rt #
[17:32:05] That's already in RT
[17:32:07] 2277
[17:32:22] LeslieCarr: the safest place for all the spares is in psw2 with the rest of the SFP?
[17:32:33] but it's not DNS they want
[17:32:35] figured its less likely to be stolen and the like
[17:32:36] ah that's much more difficult
[17:32:41] Reedy: please put rt tickets in the bz comment :)
[17:32:53] I was asked to directly log it on irc
[17:32:57] not on the bug
[17:33:05] I did reference the bug on RT though :p
[17:33:11] New patchset: Lcarr; "adding in curl package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1961
[17:33:12] heh
[17:33:13] k
[17:33:25] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1961
[17:34:53] LeslieCarr: so i need to replace every connection i made for you the other day right?
[17:35:14] yep
[17:36:45] (I'll be a little more excited about it tomorrow, I'm still mostly in sopa land today)
[17:37:04] New patchset: Lcarr; "adding in curl package" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1961
[17:38:13] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1961
[17:38:14] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1961
[17:38:18] LeslieCarr: think i can bump hurricanes port a moment?
[17:38:26] the sfp module lever isnt swung up properly
[17:38:29] blocking the port below it.
[17:38:40] RobH: let me see the traffic on it
[17:38:45] might have to drain the port, then you can bump
[17:38:58] if we could do that it would be great
[17:39:03] its not end of world, but annoying
[17:40:29] !log Draining HE to perform maintenance on the physical port
[17:40:30] Logged the message, Mistress of the network gear.
[17:42:32] RobH: go for it
[17:43:51] LeslieCarr: can pull he sfp?
[17:43:56] the other connections are swapped
[17:44:43] LeslieCarr: i assume ya meant go for it on pulling sfp, but gonna wait to confirm
[17:47:17] RobH: yes
[17:47:20] you can pull the sfp
[17:47:21] sorry
[17:47:25] ok, also, store these sfp+ in switch?
[17:47:36] where would you be able to find them best ?
[17:47:43] that's where to store them :)
[17:47:48] HE pulled and fixed
[17:47:58] well, we had some walk off a few years ago
[17:48:11] so storing them in switch means you can see if/when they are removed and walk off
[17:48:14] okay
[17:48:16] cool
[17:48:19] wasnt in this facility mind you
[17:48:22] but meh, its ok habit
[17:48:31] so should i shove the rest of these in psw1?
[17:51:46] sure
[17:51:56] we'll move everything off psw2 in the nearish future
[17:52:54] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Wed Jan 18 17:52:36 UTC 2012
[17:56:12] mutante: so the srv199 repair, chris is dropping a ticket for it to get reinstalled
[17:56:20] mainboard swap means it doesnt know the nic and such
[17:56:23] so reinstall is the best route
[17:56:36] normally i drop those tickets and resolve the repairs with him, but he is doing that now
[17:58:37] RobH: ok, i can re-install
[17:58:39] New patchset: Asher; "fix vg naming on db builds, include new server range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1962
[17:59:42] RobH: he can just move 2209 around
[18:08:21] PROBLEM - Recursive DNS on 91.198.174.6 is CRITICAL: CRITICAL - Plugin timed out while executing system call
[18:11:25] !log db1004 hard disk replaced per rt#2140, rebuilding
[18:11:27] Logged the message, RobH
[18:14:44] !log re-preffing tele2 routes
[18:14:46] Logged the message, Mistress of the network gear.
[18:15:52] traceroute as requested by apergos http://pastebin.com/00CNTVWd 87.5.17.85
[18:15:56] !log searchidx1001 memory being replaced
[18:15:58] Logged the message, RobH
[18:16:22] thanks Snowolf and to confirm, you're seeing ~40% packet loss ?
[18:16:58] LeslieCarr: that was anaconda saying, I'm getting somewhere in that range yes, 25%, 50%
[18:16:58] etc
[18:17:54] okay, well in a good coincidence, i was just about to switch around a transit provider so that should have a positive impact on your routing…. give me a few minutes and we'll see if that clears the issue
[18:18:54] RECOVERY - Recursive DNS on 91.198.174.6 is OK: DNS OK: 9.285 seconds response time. www.wikipedia.org returns 91.198.174.225
[18:19:13] LeslieCarr: for the record my average is 25%
[18:20:30] !log tried to PXE boot mw1108 but no DHCP offers received
[18:20:31] Logged the message, Master
[18:21:02] anyway it's strange, it seems that the packet loss is between the last and the preceding hop
[18:21:13] RobH: the "puppetize planet" ticket made progress. got "planet-venus" on a labs instance
[18:24:27] LeslieCarr: much better
[18:24:52] good, should be going via tele2 now
[18:24:55] at least out from our end
[18:25:20] from your end, asymmetric path
[18:25:32] the excitement of routing :)
[18:25:55] heh
[18:27:15] !log searchidx1001 memory replaced per rt 2208
[18:27:17] Logged the message, RobH
[18:28:22] notpeter: i know you last were poking search stuff
[18:28:29] so searchidx1001 is repaired now, can have os install
[18:28:36] New patchset: Asher; "fix vg naming on db builds, include new server range" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1962
[18:28:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1962
[18:29:01] RobH: woop! thanks
[18:29:11] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1962
[18:29:12] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1962
[18:32:49] PROBLEM - Host dataset1 is DOWN: CRITICAL - Host Unreachable (208.80.152.166)
[18:34:14] people working on ds1?
[18:35:28] apergos: i brought it up to check the pdu's and verify the fans were working
[18:35:53] ok
[18:36:00] have you heard anymore from SM?
[18:36:02] LeslieCarr: confirming that I don't see packet drops anymore
[18:36:09] I saw your email
[18:36:17] All is good now, thanks a lot!
[18:36:34] yay :)
[18:36:56] !log fixed DHCP config for mw1108 on brewster, had the string "Failed to connect to 10.65.1.108." where the MAC address should have been.
[18:36:57] nothing from them though
[18:36:57] Logged the message, Master
[18:37:32] !log pxe booting mw1108, OS install
[18:37:33] Logged the message, Master
[18:37:34] yeah...idk what else i can do here
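The fix described in the 18:36:56 !log would look something like this in ISC dhcpd terms. It is only a sketch: the file path, the hostname form, and the service name are assumptions, and the MAC shown is a placeholder since the real one is not in the log.

    # on brewster: restore a sane host stanza for mw1108
    cat >> /etc/dhcp3/dhcpd.conf <<'EOF'
    host mw1108 {
        hardware ethernet 00:00:00:00:00:00;  # was the pasted error string
        fixed-address mw1108.eqiad.wmnet;
    }
    EOF
    /etc/init.d/dhcp3-server restart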
[18:42:33] New patchset: Bhartshorne; "deploying a new SOPA filter and sending results to a new log file for Faulkner" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1963
[18:43:44] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1963
[18:43:44] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1963
[18:45:46] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours
[18:54:37] RECOVERY - RAID on db1004 is OK: OK: State is Optimal, checked 2 logical device(s)
[19:01:26] !log mw1108 - OS installed, added to puppet, finished catalog run, free for use
[19:01:28] Logged the message, Master
[19:16:08] New patchset: Asher; "preparing to upgrade two enwiki db's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1964
[19:17:09] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1964
[19:17:19] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1964
[19:23:15] binasher, would it be safe to add a column and index to the job table on enwiki? It's currently empty so adding both should be cheap... ;)
[19:23:37] Reedy: this seems like the perfect time
[19:23:57] I was just wondering if it was worth doing most of the 1.19 updates
[19:24:16] New patchset: Asher; "fix regex" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1965
[19:24:18] what else is there?
[19:24:31] There's a few other things that shouldn't be a problem
[19:24:33] Looking...
[19:24:55] 2 other indexes, couple of fields to add, 1 field to modify, 1 field to drop
[19:25:14] the others are on tables that obviously have more data in them
[19:25:20] i'm going to be doing some enwiki db upgrades as well, and probably rotate the master today
[19:25:28] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1965
[19:25:29] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1965
[19:26:16] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[19:26:19] I'll have a proper look over them in a little bit
[19:33:45] banisher - are u going to test out that flash drive?
[19:33:59] binasher i mean
[19:34:24] woosters: not today, i'm doing enwiki db upgrades.
[19:34:54] good time to do that :-)
[19:35:12] we need more strike days!
[19:35:44] hear ye! hear ye!
[19:40:20] are we no longer supposed to merge puppet changes on sockpuppet? they aren't getting synced to stafford
[19:44:50] nm, they are getting synced. was looking at /etc/puppet/manifests vs. /var/lib/git/operations/puppet
[20:00:22] binasher, 2 indexes to add, 3 tables with fields to add, 1 table with field to drop (needs some code merge, but that can happen beforehand), 1 table with a length increase
[20:01:14] one of the cols to add is on 12,000 rows, the increase length is on 14,000 rows
[20:01:30] spence is fubared right now
[20:01:30] 2 of the empty column additions are on archive and revision (huge tables)
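Since the job table is empty, the proposed change is effectively free; as a rough illustration of the kind of statement involved (the column and index names here are illustrative, not the actual 1.19 schema patch):

    mysql enwiki -e "
      ALTER TABLE job
        ADD COLUMN job_timestamp varbinary(14) DEFAULT NULL,
        ADD INDEX job_timestamp (job_timestamp);"

On the huge tables named above, the same ALTER would rebuild the table and block writes for the duration, which is presumably why those waited.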
[20:03:19] !log killing puppet processes on spence
[20:03:21] Logged the message, Mistress of the network gear.
[20:04:39] !log mw1102 coming down for mainboard replacement
[20:04:41] Logged the message, RobH
[20:06:08] New patchset: Asher; "puppet is being tricky about overlapping node defs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1966
[20:06:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1966
[20:06:37] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1966
[20:06:37] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1966
[20:11:36] !log pulled db38, rebooting for kernel and mysql upgrades
[20:11:38] Logged the message, Master
[20:14:12] and the sms daemon isn't working on nagios either (erroring out)
[20:14:40] i'm going to reboot
[20:17:47] !log rebooting spence as it's once again gone crazy
[20:17:48] Logged the message, Mistress of the network gear.
[20:19:18] i feel like spence is a windows box
[20:21:02] oh great, db38 dropped to busybox
[20:23:12] RobH: rebooted db38, got "ALERT! /dev/disk/by-uuid/28a52f0e-d7b4-4607-9dfe-13d43ec2b149 does not exist. Dropping to a shell!".. in the shell, /dev/disk/by-uuid/28a52f0e-d7b4-4607-9dfe-13d43ec2b149 *did* exist, as a symlink to /dev/sda1. exited without doing anything, system booted normally.
[20:23:19] ever seen that before?
[20:24:17] mayyybeee... it was... waiting for disks to come on line or something.. before it could... who knows
[20:24:44] hrm, spence is still insane, binasher could you jump on it as well and see if you can see what's wrong ?
[20:24:45] huh...
[20:24:57] binasher: nope, and we have rootdelay=90 on boot to prevent disk spinup issues
[20:25:04] well, attempt to prevent
[20:25:28] so like trying to do service nagios restart it just hung there for a minute
[20:25:37] ah, maybe it was just a bit slower.. seems totally fine now
[20:25:46] cool
[20:25:51] hrm, it seems to be doing something this time, maybe it just needed a minute....
[20:26:20] binasher: yea we found issues like that before on the r610s and how long they take to access the disks sometimes
[20:26:22] robh: can you check and verify db17 is out of rotation still
[20:26:26] !rt 1996
[20:26:27] https://rt.wikimedia.org/Ticket/Display.html?id=1996
[20:26:28] so the rootdelay was added in, we may wanna do more on those
[20:27:10] cmjohnson1: so what are you doing on it, you have a replacement battery?
[20:27:30] i believe i found 2 batteries on site hidden deep in the cabinet
[20:28:34] hrmm, its sun so its a larger pack if i recall
[20:28:41] its the battery off the raid card, not mainboard
[20:28:55] so you want to pull it offline and compare?
[20:29:02] correct
[20:30:51] !log shutting down db17, confirmed not in db rotation and has no mysql instance active
[20:30:52] Logged the message, RobH
[20:31:07] cmjohnson1: ok, when its down its all yours. you know how to check this stuff?
[20:31:24] you may want to try to go into the raid bios before swapping anything, to see if it shows you a battery error there
[20:31:32] or if its only accessible via the os, which may be the case
[20:31:35] !log db38 in service at a low weight with new lucid kernel and current mysql build
[20:31:36] Logged the message, Master
[20:31:38] then its the battery pack on the raid controller
[20:31:43] LeslieCarr: ct asked me to look at a report of slowness on upload, and the original bug report has traceroute output like this: http://screencast.com/t/iEi2VoT4
[20:31:49] so shouldnt be a small nickel size battery, but a pack
[20:32:02] I remember there was a routing loop and some other stuff you were working with yesterday - would any of those have this kind of effect?
[20:32:07] its also potentially not worth ordering replacements
[20:32:37] thanks, there was, it was fixed and changes reverted but lemme check and see what's up
[20:34:47] maplebed: got a source ip ?
[20:34:48] LeslieCarr: do you recall the times involved with the routing issues? to see if they correlate with the graph I got (http://screencast.com/t/iEi2VoT4)
[20:35:09] i changed everything over yesterday morning, i want to say by 11 or so it was fixed
[20:35:15] and it was via tele2
[20:35:17] LeslieCarr: to be clear, I don't think it's going on now, but asking for info about stuff happening this morning.
[20:35:24] though this morning telia had some problems
[20:35:33] possibly some oversaturation due to tele2
[20:35:41] when i moved some traffic back it got better
[20:35:42] this started at 05:00UTC (same time as our blackout)
[20:35:48] (from the point of view of a few italians)
[20:36:18] hm.
[20:36:36] ok, I'm going to keep digging but see if I can blame it on the blackout instead of the network.
[20:37:44] if it has been better since about 1830 UTC then it was the telia stuff
[20:38:21] PROBLEM - Host db17 is DOWN: PING CRITICAL - Packet loss = 100%
[20:38:26] sadly the graph only goes to about 14:00.
[20:38:53] huh.
[20:39:00] I think we're underprovisioned in esams.
[20:39:04] http://ganglia.wikimedia.org/2.2.0/?r=day&cs=&ce=&m=network_report&s=by+name&c=Bits+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[20:39:12] that looks suspiciously like a maxed out network connection.
[20:39:24] (on each host individually)
[20:46:41] RECOVERY - Puppet freshness on spence is OK: puppet ran at Wed Jan 18 20:46:29 UTC 2012
[20:52:01] i was about to say that's strange but actually, that's like 800mBytes, and with tcp especially with higher latency connections, that's not that unreasonable of a max, so yeah, i think an additional cache might be good
[20:52:09] or even better
[20:52:17] just add a 2nd link
[20:52:22] cuz the cpu looks way low
[20:52:30] and memory's not too bad
[20:52:31] you mean 80-100MB, aka 800Mb, right?
[20:52:36] yeah
[20:52:43] 800Mb on a gigabit link == saturated.
[20:52:44] doh s/bit/bytes/
[20:52:45] :)
[20:53:03] ok, thanks for the confirmation.
[20:56:19] RECOVERY - Host db17 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[20:56:20] RECOVERY - Host db17 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms
[20:57:23] maplebed: can you explain to me how you got to that conclusion (maxed out network conn on each host) from the graphs?
[20:57:28] (hen you have time)
[20:57:30] *when
[20:57:38] sure - take a look at http://ganglia.wikimedia.org/2.2.0/graph_all_periods.php?h=cp3002.esams.wikimedia.org&m=network_report&r=day&s=by%20name&hc=4&mc=2&g=network_report&z=large&c=Bits%20caches%20esams
[20:58:08] first indication - the day long graph has the classic look of a sine wave with the top lopped off
[20:58:09] PROBLEM - NTP on db17 is CRITICAL: NTP CRITICAL: Offset unknown
[20:58:10] PROBLEM - NTP on db17 is CRITICAL: NTP CRITICAL: Offset unknown
[20:58:20] so there's some resource bottleneck stopping the traffic.
[20:58:40] second, 100MBps == 800Mbps, which (with protocol overhead) is about the saturation point for a gigabit link.
[20:59:55] (to support the conclusion, the other resources (cpu and memory) are both pretty bored)
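The sanity check in that exchange, written out: ganglia plots bytes per second while links are rated in bits per second, so

    # 100 MB/s on the graph, times 8 bits per byte:
    echo "$(( 100 * 8 )) Mbit/s"   # => 800 Mbit/s, which with protocol
                                   # overhead is a saturated gigabit link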
[21:01:35] hmm ok, I would have just thought that it had peaked then and levelled off
[21:01:48] the second reason makes perfect sense
[21:02:01] Asher hasn't come back
[21:02:02] boo
[21:02:48] apergos: the resolution isn't great, but the curves from earlier in the week (where it stays further away from 80MB) look like they have a smoother top.
[21:02:56] ok
[21:03:02] I could definitely do that comparison
[21:03:10] thanks!
[21:03:26] yw!
[21:05:50] so i am about to be in my car for a bit
[21:05:51] I looked at a custom graph
[21:06:01] someone will need to help cmjohnson1 with the db he is on
[21:06:11] if he cannot confirm the battery in the bios, it has a os and will boot into it
[21:06:12] http://ganglia.wikimedia.org/2.2.0/?r=week&cs=1%2F15%2F2012+5%3A6&ce=1%2F19%2F2012+1%3A30&m=network_report&s=by+name&c=Bits+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[21:06:14] its not in service
[21:06:18] http://wikitech.wikimedia.org/view/Sun_Fire_X4240 has some info
[21:06:27] basically using arcconf to pull adapter info and show battery
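The arcconf check RobH is pointing at, spelled out; the binary path and controller number are assumptions, and the output it produces shows up verbatim later in the log at 21:23:

    /usr/StorMan/arcconf getconfig 1 ad | grep -A8 'Controller Battery'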
[21:06:43] http://ganglia.wikimedia.org/2.2.0/?r=custom&cs=1%2F17%2F2012+10%3A18&ce=1%2F19%2F2012+5%3A16&m=network_report&s=by+name&c=Bits+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[21:06:46] so I dunno now
[21:06:52] but the numbers are convincing
[21:08:47] binasher, http://etherpad.wikimedia.org/enwikiSOPADBUpgrades
[21:08:51] All the tables aren't small...
[21:09:04] Though, i'd guess the indexes will be worse
[21:09:09] cmjohnson1: Ok, maplebed knows what needs checking on db17 battery
[21:09:16] RobH, cmjohnson1 I'll do the battery testing on db17 when necessary.
[21:09:20] cmjohnson1: so if it doesnt work, he can confirm in OS (if raid bios wont let you)
[21:09:28] please ping me when you've got the system up (if necessary)
[21:09:33] and if the one you put in is bad, and you have a third one to try, he can shut it down for you
[21:09:49] robh thx
[21:09:54] cool
[21:10:00] back online shortly
[21:10:44] Reedy: i'm trying to get server upgrades done and rotate the master.. should probably do the alters later. also, are they all backwards compatible with 1.18? (i.e. dropping user.user_options)
[21:11:31] user_options was migrated away a while back on WMF... so it just needs a couple of revision merges to clean up the code using it
[21:11:36] Not doing that one isn't a big deal
[21:14:14] maplebed looks like I wont need you today...the new battery is charging and appears ok
[21:14:53] cmjohnson1: cool.
[21:15:13] if you want a second opinion, please let me know when the host is booted into the OS and I'll query the card that way.
[21:15:52] New patchset: Asher; "fix typo for db36" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1967
[21:16:35] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1967
[21:16:35] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1967
[21:17:42] New patchset: Ryan Lane; "Adding diederik to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1968
[21:17:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1968
[21:18:12] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1968
[21:18:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1968
[21:18:26] maplebed: yes a 2nd opinion is great....another minute or so and plz check
[21:18:34] k.
[21:20:20] all yours
[21:23:34]
[21:23:36] Controller Battery Information
[21:23:36] --------------------------------------------------------
[21:23:36] Status : Charging
[21:23:36] Over temperature : No
[21:23:36] Capacity remaining : 19 percent
[21:23:46] Time remaining (at current draw) : 0 days, 14 hours, 15 minutes
[21:23:47]
[21:23:49] looks good to me.
[21:24:45] okay...we'll see in 14hrs 15mins
[21:25:03] thx
[21:28:09] RECOVERY - NTP on db17 is OK: NTP OK: Offset -0.01419138908 secs
[21:28:10] RECOVERY - NTP on db17 is OK: NTP OK: Offset -0.01419138908 secs
[21:31:33] ok, lunch time.
[21:58:49] !log rebooting db36, upgrading kernel + mysql
[21:58:51] Logged the message, Master
[22:01:18] PROBLEM - Host db36 is DOWN: PING CRITICAL - Packet loss = 100%
[22:01:19] PROBLEM - Host db36 is DOWN: PING CRITICAL - Packet loss = 100%
[22:05:58] RECOVERY - Host db36 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[22:05:59] RECOVERY - Host db36 is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[22:25:26] !log swapping s1 master to db36
[22:25:27] Logged the message, Master
[22:27:40] !log enwiki master changed to db36 - MASTER_LOG_FILE='db36-bin.000599', MASTER_LOG_POS=15773827
[22:27:41] Logged the message, Master
[22:28:41] New patchset: Asher; "db36 is now the s1 master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1969
[22:28:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1969
[22:29:10] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1969
[22:29:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1969
[22:35:41] robh: db17 battery looks good...i was able to verify in raid bios and maplebed also checked.
[22:35:54] coolness
[22:37:47] we'll see how it looks tomorrow after the battery charges
[22:57:16] PROBLEM - Puppet freshness on mw1096 is CRITICAL: Puppet has not run in the last 10 hours
[22:57:16] PROBLEM - Puppet freshness on mw1096 is CRITICAL: Puppet has not run in the last 10 hours
[22:57:36] PROBLEM - DPKG on db32 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[22:57:36] PROBLEM - DPKG on db32 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[23:02:42] !log increased the size of db11's logical volume for /a from 500G to 800G.
[23:02:44] Logged the message, Master
[23:37:27] New patchset: Asher; "upgrading db32" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1970
[23:40:19] RECOVERY - DPKG on db32 is OK: All packages OK
[23:40:20] RECOVERY - DPKG on db32 is OK: All packages OK
[23:42:21] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1970
[23:42:22] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1970
[23:50:00] !log rebooting db32 for mysql/kernel upgrades
[23:50:02] Logged the message, Master
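For completeness, the two pieces of state recorded in the !logs above, as the commands they imply. Both are sketches: the master hostname, credentials, volume names, and filesystem are assumptions; only the binlog coordinates and sizes come from the log.

    # repoint an s1 replica at the new master, using the logged coordinates
    mysql -e "STOP SLAVE;
              CHANGE MASTER TO
                MASTER_HOST='db36.pmtpa.wmnet',
                MASTER_LOG_FILE='db36-bin.000599',
                MASTER_LOG_POS=15773827;
              START SLAVE;"

    # and the db11 /a volume grow from the 23:02 !log (LV path assumed;
    # use xfs_growfs /a instead of resize2fs if /a is xfs)
    lvextend -L 800G /dev/db11-vg/a
    resize2fs /dev/db11-vg/a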