[00:00:19] !log pulling cp1044 from lvs for testing [00:00:22] Logged the message, Master [00:05:02] preilly: the varnish config you deployed to prod is invalid [00:05:14] Message from VCC-compiler: [00:05:14] Symbol not found: 'reg.http.X-Carrier' (expected type BOOL): [00:09:34] PROBLEM - MySQL Idle Transactions on db38 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:09:57] enwiki is in trouble [00:10:18] double trouble? [00:10:45] (Cannot contact the database server: Unknown error (10.0.6.48)) [00:10:51] I guess so [00:11:08] there's a whole bunch of [00:11:09] UPDATE /* User::invalidateCache G fetcher */ `user` SET user_touched = '20120511000900' WHERE user_id = '9630782 [00:11:12] all for the same user [00:11:58] PROBLEM - LVS HTTP on m.wikimedia.org is CRITICAL: HTTP CRITICAL - pattern not found [00:12:03] spamming derrors.log [00:13:19] AaronSchulz: why would there be thousands of invalidateCache updates for the same user? [00:13:28] what invokes that function? [00:13:39] lots of things [00:13:45] why doesn't it have a ts condition [00:14:11] can we block this user? [00:14:13] PROBLEM - MySQL Slave Running on db38 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:14:49] RECOVERY - LVS HTTP on m.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 0.109 second response time [00:15:13] We can.. [00:15:34] RECOVERY - MySQL Slave Running on db38 is OK: OK replication [00:15:34] TimStarling: ^^ [00:15:52] Not a new account, no contribs, no log entries [00:16:44] look in xff.log [00:17:21] I hacked User [00:17:29] * AaronSchulz can view RC again [00:18:07] RECOVERY - MySQL Idle Transactions on db38 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:18:09] that was enwiki? [00:18:58] New patchset: preilly; "Partner IP Live testing - Thursday, May 10th, 10am - 12pm PDT" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7252 [00:19:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7252 [00:19:16] AaronSchulz: your hack did the trick [00:19:18] http://en.wikipedia.org/wiki/Special:Contributions/G_fetcher [00:19:21] TimStarling: yup, enwiki [00:19:22] not much there, hmm [00:19:33] yeah, i just tried checkuser [00:19:40] so, in case this was not noticed or reported yet, a few minutes ago I got "(Can't contact the database server: Unknown error (10.0.6.48))" on en.wikipedia.org [00:19:52] a refresh a few seconds later worked. [00:19:55] why did nobody flood #wikimedia-tech? heh [00:20:43] this would be easier if we had backend request logs [00:21:11] AaronSchulz: i was killing enough 'G fetcher' queries to generally keep some connection slots open [00:22:11] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7252 [00:22:13] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7252 [00:26:56] AaronSchulz: can you add the IP to that error message? [00:27:03] then we can correlate it with other logs [00:28:01] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:28:39] sure [00:28:45] who "rotated" the fatal.log on april 28?
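The VCC error quoted above suggests the deployed config referenced reg.http.X-Carrier where req.http.X-Carrier was almost certainly intended: request headers live under req.http.*, and a bare header in an if() is a presence test, which is what gives the compiler the BOOL it is asking for. A minimal sketch of what the corrected stanza might look like — an illustration, not the actual deployed mobile config:

    sub vcl_recv {
        # 'req' (the request object), not 'reg' -- a bare header test
        # compiles to a BOOL: true when the header is present.
        if (req.http.X-Carrier) {
            # hypothetical carrier-specific handling
            set req.http.X-Carrier-Matched = "1";
        }
    }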
[00:28:51] socat is still writing to the old file [00:29:20] Leslie did it for me [00:30:13] TimStarling: I will have to add a flag to wfBacktrace to force the cmd-mode output [00:30:22] * AaronSchulz puts that on his todo list [00:30:22] !log restarted socat on fenari so that fatal.log is reopened [00:30:24] And I've got https://gerrit.wikimedia.org/r/#/c/6061/ outstanding to logrotate it [00:30:26] Logged the message, Master [00:30:48] TimStarling: I think that was me [00:31:42] CentralAuthHooks.php line 463 calls User->invalidateCache() [00:31:55] (what kind of bot is "morebots" ?) [00:32:30] this isn't a loop is it? [00:34:11] AaronSchulz: you could have made it a fatal error, then backtraces would have appeared in fatal.log [00:34:32] I'm always afraid to fatal in the middle of execution, lest stuff be half done [00:34:36] binasher: https://gerrit.wikimedia.org/r/7254 [00:34:50] ideally everything would be one transaction, but that's not always the case [00:34:51] it doesn't look like a loop to me [00:34:51] but meh [00:34:55] yeah [00:35:09] if you select just one apache PID, you don't get a lot of entries [00:35:28] binasher: ^ [00:35:59] TimStarling: but it's also another the other processes, right? [00:36:07] *also on the other processes [00:37:30] AaronSchulz: in the processlist i grabbed while there were 856 of those queries running, they were coming from 127 different apache servers [00:37:44] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7254 [00:37:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7254 [00:40:37] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:40:46] I see api.php requests for this IP in xff.log, but I don't see anything in sampled-1000.log [00:41:31] we really need an API log, and a backend log [00:41:56] but either would have a very high rate, probably higher than the current POST log [00:42:58] because of android apps requesting action=parse [00:43:03] we need some dedicated logging server [00:43:35] OrgName: Google Inc. [00:43:54] binasher: wtf? [00:44:25] where did that come from? [00:44:33] binasher: yeah, I filed an RT ticket requesting one [00:44:47] but it was closed after the hardware was purchased, and left without an OS [00:45:46] http://rt.wikimedia.org/Ticket/Display.html?id=2400 [00:46:39] TimStarling: so this user is coming through the API? [00:46:45] don't know [00:47:08] don't your backtraces show? [00:47:39] ah, they are truncated [00:48:07] use wfErrorLog() instead [00:48:47] wfErrorLog( $msg, 'udp://10.0.5.8:8420/g_fetcher' ); [00:48:52] then you get 64KB [00:49:51] ok it is a mix of API and page views [00:50:01] maybe it is getting listings and then hitting each item [00:50:18] /w/api.php?action=query&format=xml&revids=491908036&cllimit=max&prop=categories%7Cinfo%7Crevisions&rvprop=comment%7Ccontent%7Cflags%7Cids%7Ctimestamp%7Cuser [00:53:54] binasher: :) [00:54:22] binasher: some of the errors are hacks that were on for a few seconds [00:56:16] I'm pushing out a header log [00:56:46] TimStarling: do we have a generic log group type for temporary things [00:57:27] this works [00:57:36] /home/wikipedia/logs/g_fetcher.log [00:58:00] definitely points to google, right? 
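On the "why doesn't it have a ts condition" question above: the idea is to make the touch conditional on the stored timestamp, so a client replaying the same request collapses into one effective write instead of thousands of identical UPDATEs. A rough sketch of that shape — illustrative only, not the actual MediaWiki fix; $db here is a hypothetical database handle:

    // Only bump user_touched if this call would actually move it
    // forward; concurrent or replayed invalidations become no-ops.
    function invalidateUserCache( $db, $userId ) {
        $now = gmdate( 'YmdHis' ); // TS_MW-style, e.g. 20120511000900
        $db->query(
            "UPDATE user SET user_touched = '$now'" .
            " WHERE user_id = " . intval( $userId ) .
            " AND user_touched < '$now'"
        );
    }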
[00:58:15] ah, we even have an email address in the user agent [00:58:28] see, we tell people to set their user agent but we have no way to actually see it [01:03:22] TimStarling: I just emailed them [01:03:36] saying what? [01:03:49] TimStarling: why are you crawling as a logged in user [01:04:30] it would probably be better for us if they were logged out, but they weren't to know that [01:06:19] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:06:48] maybe the session is corrupted or something [01:06:54] TimStarling: yeah [01:08:02] they always send the same token and the same session ID, so invalidateCache() shouldn't be hit [01:08:57] hmm [01:11:11] ah, but they don't send a *local* session ID [01:11:25] globalloggedin is in the local session [01:12:29] they will be getting a Set-Cookie header in every response, but they are ignoring it [01:13:24] one of the many ways to DoS wikipedia :) [01:13:30] * AaronSchulz should compile a list [01:13:45] there's an app for that [01:13:56] TimStarling: In the meantime, any tips on debugging a segfault issue? [01:14:09] There are certain pages that when you edit them make PHP segfault, but the edit still goes through [01:14:11] gdb [01:14:15] Yeah I figured [01:14:23] But I can't get my hands on a process that segfaults [01:14:27] or a core dump FTM [01:15:01] I've been sending requests to srv200 [01:15:18] Attached gdb to a single Apache worker, but I'd have to send lots of requests before it'd hit "my" worker [01:15:34] I tried to enable core dumps but that didn't work either [01:16:14] well, you can work out why core dumps are not working, there are a number of conditions that need to be satisfied [01:16:21] or you can use apache2 -X [01:16:23] RoanKattouw: limit the number of workers? [01:16:39] running apache with -X runs it as a single worker, without forking at all [01:16:43] ah [01:16:51] Aha, nice [01:17:06] !log Stopping Apache on srv200 so I can use it as my guinea pig for segfault debugging [01:17:07] usually it will hang if you point a browser at such an instance [01:17:12] Logged the message, Mr. Obvious [01:17:27] because the browser will try to open concurrent connections for load.php [01:18:01] it will probably work if load.php goes to bits and that isn't routed to your test server, I guess [01:18:11] alternatively you can use curl [01:18:46] maybe it won't hang on wikimedia, who knows, it's been a while since I tried it there [01:19:07] I often use that trick on localhost, but more often on production servers I use core dumps [01:19:39] I'm netcatting to srv200:80 [01:19:43] So that's not an issue [01:20:16] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:20:43] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:21:00] wtf now I just get a SIGPIPE when PHP tries to write output?
This doesn't seem right [01:21:17] handle SIGPIPE noprint nostop [01:21:26] put that in gdb, it's essential [01:21:34] then cont [01:22:07] probably you just had an unrelated pybal request come in, SIGPIPE is completely normal [01:22:43] it occurs when the remote side closes its connection [01:23:29] OK trying that now, thanks [01:23:53] Man running stuff in gdb is slow :D [01:24:06] Or, well, maybe it's not gdb [01:24:19] An untraced apache process also took like 30s to segfault [01:24:40] to get core dumps working, the three things you most often have to do are: [01:24:48] 1. set CoreDumpDirectory [01:24:55] Tried that yeah [01:25:06] There's one commented out in an Apache config file somewhere [01:25:20] 2. add a "ulimit -c" to a startup script [01:25:20] I also created the directory, chown apache, chmod 777 [01:25:26] Urgh right [01:25:33] 3. core dump directory permissions [01:27:39] "man core" has 5 different reasons why a core dump might not be created, if that doesn't work [01:36:05] TimStarling: google emailed me back [01:36:15] TimStarling: Hey Patrick, [01:36:16] We were observing some strange caching behavior on certain pages on Wikipedia and using a logged in user seemed to resolve some of these issues, so we reverted back to our old behavior of using a logged in user on our side. [01:36:16] It looks like that wasn't the best idea on our side though, so we'll stop it ASAP. [01:36:17] Sorry about that, [01:36:18] Shen [01:36:34] seeing a lot of these in the logs now [01:36:36] Fri May 11 1:36:10 UTC 2012 mw13 enwiki UserDailyContribsHooks::articleSaveComplete 10.0.6.48 1205 Lock wait timeout exceeded; try restarting transaction (10.0.6.48) UPDATE `user_daily_contribs` SET contribs=contribs+1 WHERE day = '20120511' AND user_id = '0' [01:36:40] what's up with user_id = 0? [01:37:04] bug [01:37:58] PROBLEM - MySQL Idle Transactions on db38 is CRITICAL: CRIT longest blocking idle transaction sleeps for 686 seconds [01:39:15] that's an AFT thing I think [01:39:19] RECOVERY - MySQL Idle Transactions on db38 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:39:31] RoanKattouw: who should we complain to about that? [01:40:09] Eww that's not supposed to happen [01:40:12] Not an AFT thing either [01:40:19] I've been trying to get rid of that ext [01:40:28] I think it was ClickTracking originally, but AFT references it [01:40:43] Oh, right, it's for ClickTracking data I believe [01:40:46] Anyway, I'll fix that user_id=0 bug [01:40:53] it stopped for now, but was coming in fast enough to impact db38 [01:41:43] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 251 seconds [01:42:20] TimStarling, binasher: https://gerrit.wikimedia.org/r/7259 [01:42:25] Approve that and I'll deploy it right away [01:43:04] merged [01:44:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:46:40] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 224 seconds [01:48:01] OK, lemme deploy that [01:48:10] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 3 seconds [01:52:29] That's not realted to the segfault though, right? [01:52:33] related [01:55:09] No [01:55:16] But we found that one too [01:57:03] New patchset: Catrope; "Set PCRE recursion limits" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7261 [01:57:13] TimStarling: ---^^ I can't approve or deploy that but I think you can? [01:57:25] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7261 [01:58:39] oh, no I can't approve that, the backtrack_limit should stay at 1M [01:58:49] Oh, sorry [01:58:53] probably nothing would work at all with it at 1000 [02:00:06] actually it was 100000 before, it can probably stay at 100000 [02:00:11] New patchset: Catrope; "Set PCRE recursion limit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7261 [02:00:29] yep, that's fine [02:00:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7261 [02:00:55] the backtrack limit prevents O(N^2) running time in certain cases [02:00:55] !log Started Apache back up on srv200, done debugging [02:00:58] Logged the message, Mr. Obvious [02:01:31] but I think it's a number of characters, not a number of levels [02:01:50] Ah ye [02:01:53] Then 1k would be bad [02:02:16] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [02:02:19] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7261 [02:02:23] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7261 [02:02:48] it's in puppet now [02:03:02] OK, good [02:03:18] do you need it forced out right now, or is it ok to wait for puppet to do it? [02:03:50] It's causing segfaults at a rate of ~5 per minute [02:04:16] From a user perspective that means you wait for a long time then get a Squid error page, but your edit goes through [02:06:55] We should be happy these segfaults are only on POST, not on GET, otherwise Squid would amplify them [02:08:34] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:12:46] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:22:56] Joan: If you're interested, the segfault was due to excessive recursion in PCRE, triggered by a regex in PageTriage. Offhand it looks like it would be triggered on pages where there are more than ~18k characters between a '{{' and its matching '}}' [02:23:29] We're fixing it by setting the PCRE recursion limit to 1k, apparently the default value (which it was set to previously) is 100k which is way too high [02:48:56] RoanKattouw: Cool. Thanks for fixing that. :-) [02:49:07] RoanKattouw: I'm gonna quote you to annotate that talk page's history. [02:49:13] heh thanks [02:49:17] Sorry for all the confusion there [02:49:27] Why is the regex recursing so much? [02:49:28] The IPs that made those edits are all remarkable [02:50:29] 0:0:0:0:0:0:0:1 is IPv6 for localhost [02:50:41] The 216. IP is the office [02:50:54] And the 208. IP is fenari (our bastion host) [02:50:59] As for why it was recursing so much [02:51:08] $text = preg_replace( '/\{\{[^\{]((?!\{\{).)*?\}\}/is', '', $text ); [02:51:14] They run that repeatedly to strip templates [02:51:17] It's kind of evil really [02:51:25] Why are they stripping templates like that? [02:51:44] What that regex does is find a template call that is not nested (i.e. contains no other template calls) and remove it, then they repeatedly run it [02:51:52] Right. 
[02:51:52] It's for the snippet they display in PageTriage [02:52:21] The code that generates the snippet is evil, it strips templates by executing that regex up to 5 times, then parses the remaining text, then truncates the result to 150 chars [02:52:35] To try to get a usable snippet. [02:52:41] Yeah [02:52:44] Sans infoboxes etc [02:52:44] Is the snippet generally usable? [02:52:52] I... don't know, haven't looked [02:52:58] Looked at Special:NewPagesFeed or whatever it's called to see [02:53:06] Anyway, the evil part of that regex is ((?!\{\{).)*? [02:53:50] Its job is to assert that there are no occurrences of '{{' between the opening '{{' and the closing '}}' , i.e. to find only template calls that don't contain any nesting [02:54:19] But it looks like that would probably also cause PCRE to enter a new recursion level (i.e. create a new stack frame) for every character between '{{' and the matching '}}' [02:54:51] So if you have a lot of stuff wrapped in {{ and }} and that lot of stuff doesn't contain '{{' , then you'd trip this bug [02:55:02] For values of 'a lot' of, say, 10-20k [02:55:21] I have no problem believing the database reports page matches that description :) [02:55:22] Heh. [02:55:28] I just got the error again. [02:55:32] So I guess the fix isn't live? [02:55:37] lol. [02:55:48] It's in puppet, so it can take up to 4(?) hours to be live on all machines [02:56:22] Unless Tim forces a manual run [02:57:24] RoanKattouw: what about getting some food while puppet works for ya ? :-D [02:58:11] heh [02:58:12] Yeah good idea [02:58:18] I'll wrap up, gimme ~5 mins [02:58:32] we need to choose a place too [03:02:03] Joan: Just added some more info to that section, thanks [03:02:10] Although ironically I just hit the bug too :) [03:05:54] I've stripped templates before. [03:06:03] But I just counted { and } until the number was even. [03:06:09] Not sure if that'd be more or less horrible. [03:06:16] I'd assume less. 
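Tim's three-step core-dump checklist from earlier, collected in one place as a sketch for a Debian/Ubuntu-style Apache box — the paths and the www-data user here are illustrative assumptions, not the production layout:

    # 1. In the Apache config, tell it where cores should land:
    #      CoreDumpDirectory /var/tmp/apache-cores
    # 2. In the startup script, lift the core size limit before apache
    #    starts (a default of 0 silently suppresses all dumps):
    ulimit -c unlimited
    # 3. Make sure the directory exists and the worker user can write it:
    mkdir -p /var/tmp/apache-cores
    chown www-data /var/tmp/apache-cores
    chmod 0700 /var/tmp/apache-cores
    # Still nothing? "man core" lists the remaining reasons a dump can
    # be skipped (setuid binaries, kernel core_pattern, fs limits, ...).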
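For the user_id = 0 flood earlier (the lock-wait timeouts on user_daily_contribs), the natural shape of the fix is to skip the per-day counter for anonymous users. The actual change went in as gerrit 7259, which is not reproduced here; this is just a sketch of the guard, with a simplified hook signature:

    // Rows keyed on user_id = 0 make every anonymous edit contend for
    // the same user_daily_contribs row; skip anons entirely.
    public static function articleSaveComplete( $article, $user ) {
        if ( !$user || $user->isAnon() ) {
            return true; // nothing to count for user_id = 0
        }
        // ... increment user_daily_contribs for $user->getId() ...
        return true;
    }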
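The PCRE knobs being discussed are ordinary php.ini settings; a sketch of the end state described above (the exact file layout is puppet's business, values per the conversation):

    ; Cap PCRE recursion far below the 100k default so a pathological
    ; match fails cleanly (preg_* returns NULL) instead of exhausting
    ; the C stack and segfaulting the worker.
    pcre.recursion_limit = 1000
    ; The backtrack limit guards against O(N^2) match blowups; per the
    ; discussion it stays at its previous value.
    pcre.backtrack_limit = 100000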
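Joan's brace-counting idea above can be made linear and recursion-free: track nesting depth with a counter and copy only the text outside {{...}} spans. A hedged sketch, not PageTriage's actual code — note an unbalanced '{{' will swallow the rest of the text, which real code would want to handle:

    function stripTemplates( $text ) {
        $out = '';
        $depth = 0;
        $len = strlen( $text );
        for ( $i = 0; $i < $len; $i++ ) {
            $pair = substr( $text, $i, 2 );
            if ( $pair === '{{' ) {
                $depth++;
                $i++; // consume both braces
            } elseif ( $pair === '}}' && $depth > 0 ) {
                $depth--;
                $i++;
            } elseif ( $depth === 0 ) {
                $out .= $text[$i]; // keep only text outside templates
            }
        }
        return $out;
    }

Constant work per character and no recursion, so the ~18k-character spans that blew PCRE's stack cost nothing special here.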
[03:06:39] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [03:08:00] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [03:17:45] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [03:30:12] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [03:48:48] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2363 [04:24:53] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:39:17] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:59:23] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:10:38] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:11:59] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [05:11:59] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [05:39:39] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [06:46:35] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:52:45] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:07:36] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:07:54] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [07:18:06] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:19:18] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2363 [07:23:10] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3518 [07:23:13] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/3518 [07:29:58] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:30:34] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:37:38] apergos: which system is pushing the datasets to the gluster storage? [07:37:53] dataset2 I think, lemme look [07:38:04] maybe it's 1001 [07:38:04] oh. I don't see gluster installed there [07:38:21] ah [07:38:22] it is [07:38:22] oh yeah it's 1001, I remember we were sad that there would be latency [07:38:23] I broke it [07:38:26] :-( [07:38:31] um [07:38:48] two questions: "how?" 
and "now what do we do?" [07:38:53] I'm fixing it :) [07:38:59] ok cool [07:39:03] fixed [07:39:08] I upgraded gluster recently [07:39:10] and it went poorly [07:39:20] so I had to rollback [07:39:24] I missed this system [07:39:28] oohhh [07:43:48] New patchset: ArielGlenn; "adding odysseus.fi.muni.cz as dumps mirror" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7271 [07:43:55] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:44:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7271 [07:46:39] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7271 [07:46:42] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7271 [07:47:22] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [07:48:38] o_O new mirror? [07:48:52] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:50:59] not yet [07:51:20] after they start pulling from us, yes :-P [07:52:13] nice work :) [07:52:25] though its only the last five dumps, but at least its something [07:52:52] most people won't want anything but the recent dumps [07:53:04] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2338 [07:53:44] and the media dumps, which cause chaos on the list haha [07:53:58] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:55:23] see, when I don't announce, there's a reason for it :-P :-D [07:57:06] * Hydriz apologises to apergos for causing the trouble [07:57:13] hahaha [07:57:21] I thought someone else started that thread [07:57:34] lol no [07:57:39] I told emijrp before that [07:57:42] boooo [07:57:56] we were in the beginning of the first test run [07:58:01] I was browsing through the ftpmirror site to find why rsync wasn't working [07:58:05] and then suddenly we had all this extra traffic from downloaders [07:58:06] then I chanced on this [07:58:10] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:58:30] oh, emijrp asked someone to download them :P [07:58:41] *eyeroll* [07:58:50] they might be full of crap. 
for example it's not clear that the files with unicode in the filenames are in the tarballs correctly [07:59:14] heh [07:59:28] at least its something worth saving for [07:59:53] no, it will be something worth throwing away in a week and redownloading :-P [08:05:22] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [08:30:13] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [08:33:04] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [08:41:28] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2363 [08:48:22] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [08:48:58] !log upgrading apache/mysql/kernel on marmontel (blog) [08:49:02] Logged the message, Master [08:49:05] meh [08:49:11] guest? [08:49:29] not anymore:) [08:51:13] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2338 [08:51:40] !log rebooting marmontel (blog) [08:51:43] Logged the message, Master [08:52:52] PROBLEM - Host marmontel is DOWN: CRITICAL - Host Unreachable (208.80.152.150) [08:55:25] RECOVERY - Host marmontel is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [08:58:07] !log package upgrades on ekrem (IRC server, WAP, Apple dict...) [08:58:10] Logged the message, Master [09:05:10] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:21:27] !log ekrem was close running out of disk again. logrotated apache logs, changed config to: size 512M,rotate 3 [09:21:31] Logged the message, Master [09:29:32] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:31:03] New patchset: ArielGlenn; "add other hostnames for muni.cz to dump rsync access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7278 [09:31:22] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7278 [09:31:58] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7278 [09:32:00] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7278 [09:41:32] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 183 seconds [09:42:44] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 203 seconds [09:46:56] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds [09:47:05] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [09:52:29] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2613* [10:02:14] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2625* [10:05:50] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:09:19] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [10:09:26] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:13:20] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:13:38] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:19:29] New patchset: Dzahn; "adding DHCP entries / MAC addresses for analytics1001 to 1010" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6919 [10:19:32] !wt is http://wikitech.wikimedia.org/view/$1 [10:19:35] Key was added! [10:19:42] !wt Cisco | mutante [10:19:42] mutante: http://wikitech.wikimedia.org/view/Cisco [10:19:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6919 [10:20:39] o.0 You guys have a cisco server [10:21:40] yeah, had to get into that mgmt shell, but it isnt that bad actually [10:22:21] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [10:23:06] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6919 [10:23:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6919 [10:31:48] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:33:09] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:39:18] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [10:45:34] New patchset: Dzahn; "add logrotate config for lighttpd on install-server (RT-2753), rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7167 [10:45:54] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7167 [10:46:21] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2350 [10:49:21] PROBLEM - udp2log processes for emery on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/local/bin/packet-loss, /var/log/squid/filters/india-filter, /usr/local/bin/sqstat, /var/log/squid/filters/latlongCountry-writer, [10:53:33] RECOVERY - udp2log processes for emery on emery is OK: OK: all filters present [10:54:36] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:00:36] PROBLEM - udp2log processes for emery on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/local/bin/packet-loss, /var/log/squid/filters/india-filter, /usr/local/bin/sqstat, /var/log/squid/filters/latlongCountry-writer, [11:06:18] RECOVERY - udp2log processes for emery on emery is OK: OK: all filters present [11:13:30] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:13:57] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [11:23:42] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active [11:24:56] !log upgrading packages/kernel on hooper, rebooting (Blog,Etherpad,Racktables) [11:24:59] Logged the message, Master [11:25:21] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [11:33:45] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [11:35:23] !log stat1 - installed new kernel, but waiting to reboot. 
schedule with aotto [11:35:26] Logged the message, Master [11:36:45] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:37:57] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2650* [11:39:27] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:39:28] !log starting ms-be swift-container-auditors every once in a while [11:39:31] Logged the message, Master [11:43:30] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2375 [11:48:54] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:50:15] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:04:30] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [12:07:48] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours [12:12:26] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2338 [12:20:50] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2638* [12:27:44] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2388 [12:33:26] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [12:40:38] New patchset: Dzahn; "PDF servers: add/change to role class, add font class, install indic fonts on pdf1-3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7282 [12:40:55] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/7282 [12:42:42] New patchset: Dzahn; "PDF servers: add/change to role class, add font class, install indic fonts on pdf1-3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7282 [12:43:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7282 [12:46:02] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [12:47:51] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7282 [12:47:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7282 [12:54:35] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [13:01:29] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [13:01:36] New patchset: Jgreen; "adjusting fundraising backups while storage3 is dead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7283 [13:01:54] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7283 [13:02:50] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7283 [13:02:52] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7283 [13:12:24] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [13:12:43] !log installing package upgrades on pdf1-3 (and installed requested indic fonts via new puppet role class) [13:12:46] Logged the message, Master [13:20:40] New patchset: Jgreen; "split to success/failure notification recipients so only failures go to root@ for reduced cronspam" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7284 [13:20:58] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7284 [13:20:58] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7284 [13:23:39] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [13:29:21] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [13:39:06] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [13:46:09] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [13:53:12] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2338 [13:53:55] thanks jeff ;) [13:58:09] no problem [13:58:16] i don't know why I didn't think of it earlier [14:00:15] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [14:02:30] New patchset: Ottomata; "{role,misc}/statistics.pp - installing generic mysqld on stat1." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7285 [14:02:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7285 [14:04:36] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [14:06:55] !log kernel upgrading / rebooting srv servers where uptime > 200 d order by uptime desc limit 1 [14:06:59] Logged the message, Master [14:16:30] PROBLEM - Apache HTTP on srv235 is CRITICAL: Connection refused [14:17:06] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [14:19:57] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2625* [14:22:48] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [14:27:08] ottomata: anything specific that should be done as a general rule before rebooting stat1? 
(tell $user / schedule ahead / stop/start something / just do it) [14:30:58] hmm [14:31:10] erik z and andre engels are the only ones that really use it right now [14:31:21] but i'm not sure if they are currently doing things there [14:31:25] i installed new kernel but did not reboot [14:31:26] scheduling with them would be good [14:31:28] oh! [14:31:37] precise? or just a kernel upgrade? [14:31:44] we are going to have it reinstalled with precise soon [14:31:47] no, just kernel re the bug with long uptime [14:31:52] oh right, ok [14:31:54] to prevent crashing [14:32:09] aengels is the only user logged in other than me right now [14:32:10] and general package upgrades by ubuntu [14:32:20] within lucid [14:32:36] i guess talk to drdee maybe? he can tell you if it is ok [14:32:40] or at least tell you who you need to ask [14:32:47] thanks, btw [14:32:58] ok,thanks, this is what i wanted. a rule who to ask (also next time) [14:33:50] ottomata: hold on [14:34:12] PROBLEM - Apache HTTP on srv226 is CRITICAL: Connection refused [14:35:33] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [14:37:05] drdee: shall i ask erik z? [14:37:14] ottomata: aengels will let us know when we can start the reinstall, i expect to start within a few hours [14:37:19] ask what? [14:37:24] to reboot stat1 [14:37:41] he is fine, we are waiting for aengels to give green light [14:37:45] ok [14:38:09] should be within next 2 hours (hopefully) then we can start make a backup and reinstall precise [14:38:24] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2625* [14:38:31] ok, i came across it while just doing kernel upgrades within lucids [14:39:05] ok [14:40:39] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.065 second response time [14:45:42] ok, here's where I ask a potentially dumb question [14:45:44] check this out [14:45:52] ahh, I will pastie one sec [14:47:39] https://gist.github.com/2660218 [14:48:21] oh maybe cause my home dir is not readable by www-data! [14:48:25] doh, i knew it was a dumb question [14:48:32] thank you ops room, for being my sounding box [14:49:39] RECOVERY - Apache HTTP on srv226 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [14:50:52] where is this? labs? [14:51:19] yeah [14:51:24] but i got it [14:51:26] that was why [14:51:35] www-data couldn't read the symlink because it was pointing inside of my home dir [14:51:37] which was 700 [14:52:08] oh.ok. i was going to say there are ongoing changes to the way /home is handled [14:52:25] but /var/www not ..sure [14:53:51] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [14:58:57] PROBLEM - Apache HTTP on srv257 is CRITICAL: Connection refused [14:59:24] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [15:00:09] PROBLEM - Host mw62 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:37] PROBLEM - Host srv257 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:39] !log mw62 -unless somebody was on that right now it died.
mgmt also just Create Instance Error [15:03:43] Logged the message, Master [15:05:42] RECOVERY - Host srv257 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:11:27] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2375 [15:13:24] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [15:13:24] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [15:14:18] PROBLEM - Apache HTTP on srv256 is CRITICAL: Connection refused [15:15:39] PROBLEM - Apache HTTP on srv254 is CRITICAL: Connection refused [15:26:00] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [15:29:54] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [15:33:45] !log adding DNS entries for analytics hosts in new vlan 1121 (10.64.21.0/24), hosts starting at .101 to match names analytics1001 = .101 and ++ [15:33:49] Logged the message, Master [15:40:24] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [15:41:28] ottomata, mutante: we can start backing up stat1 and begin reinstalling precise [15:42:57] please start the backup but im afraid im not gonna be here for reinstall right after that is done, (timezone) and still analytics hosts on priority [15:43:28] analytics hosts are nicely in the way of an analytics host. [15:43:58] :) [15:44:02] woot! ok [15:44:09] i will file an RT ticket with instructions to backup /s [15:44:14] /a * [15:46:00] http://rt.wikimedia.org/Ticket/Display.html?id=2946 [15:46:24] should assign or add people to it somehow? [15:47:42] usually no, just put in pool and use CC: for extra attention by specific people, then comment on it [15:48:18] or you can add multiple people into requestor [16:01:06] PROBLEM - Apache HTTP on srv256 is CRITICAL: Connection refused [16:02:49] hi [16:05:47] hi robla [16:06:01] hi aude...how goes? [16:06:53] robla: do you know who handles setting dns config at WMF? [16:07:43] the ops team generally. there's not a specific person that deals with that. what do you need? 
[16:07:56] robla: wikidata.org redirects to english wikipedia [16:08:16] we're setting up a landing page, with links to our metawiki pages and demos (soon as they are ready) [16:09:21] RECOVERY - Host mw62 is UP: PING OK - Packet loss = 0%, RTA = 2.83 ms [16:09:59] * robla waits patiently for someone from ops to weigh in [16:10:19] ok [16:10:41] aude: if waiting patiently for someone in ops doesn't work, you can always pester woosters when he comes online [16:11:00] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [16:11:01] robla: ok [16:11:19] on our end, we need to setup the A records but have an ip address [16:11:27] hi woosters [16:11:34] heh [16:11:37] :o [16:11:47] sorry, cant right now, but here's something for the moment [16:11:50] hi aude [16:11:51] RT-2919 [16:11:55] use that to refer to it [16:12:32] woosters: the wikidata team needs some help with the wikidata.org domain [16:13:07] what about [16:13:09] i think there is a RT ticket for it, but we have a simple landing page that links to meta wiki [16:13:31] we have an ip address and setting up a records here, and then want wikidata.org to point to it [16:13:38] it currently redirects to english wikipedia [16:13:51] PROBLEM - Apache HTTP on mw62 is CRITICAL: Connection refused [16:14:09] let me check and get back to u [16:14:13] woosters: he refers to 2919 [16:14:25] but i cant help right the second [16:14:26] woosters: thanks [16:15:21] she:) [16:15:22] we don't need help with this right this minute but sometime soonish would be nice :) [16:15:31] mutante: :) [16:16:28] there has been this comment, btw "using wikidata.org, rather than data.wikimedia.org? It's going to be a multilingual, single wiki, like commons, right" [16:17:58] aude .. there is a policy decision needed for it. Will run it up with the team here [16:18:16] woosters: ok [16:20:13] mutante: that's the idea but not sure it's totally decided how we'll handle language specific urls [16:21:43] aude: i see, yeah. i'll paste your comment to ticket, ok? [16:22:44] mutante: [16:22:45] ok [16:23:29] done [16:23:48] thanks [16:24:10] np [16:25:06] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.103 second response time [16:33:03] PROBLEM - Apache HTTP on srv254 is CRITICAL: Connection refused [16:38:36] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [16:45:30] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2375 [16:52:42] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [16:57:12] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:05:25] cmjohnson1: [17:05:37] yes? [17:05:46] cmjohnson1: do search21-36 have disks in them? [17:05:50] yes [17:05:54] ok, cool [17:05:55] thanks! [17:06:17] they are good to go...i am mounting new ssd's to install into 13-20 [17:06:40] ah, but they did not previously have disks? [17:06:43] will ping you when they need to go down [17:06:54] ok, cool [17:06:57] 21-36 had no disk....13-20 have 250GB disk [17:07:06] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:07:20] awesome!
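For the wikidata.org landing page discussed above, the eventual DNS change amounts to replacing the redirect with A records for the new host. The actual IP was never stated in channel, so this zone-file sketch uses a documentation placeholder (192.0.2.1) and a made-up TTL:

    ; illustrative only -- real IP and TTLs were not in the log
    wikidata.org.       600  IN  A  192.0.2.1
    www.wikidata.org.   600  IN  A  192.0.2.1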
[17:07:21] ok [17:07:34] I am less confused now [17:08:07] once, i am finished then they will have 2 300GB SSD in them [17:08:34] excellent! [17:14:15] oh! also, I can't ping/ssh to search34-36 mgmt. any idea as to what that would be due to? [17:14:28] cmjohnson1: ^ [17:14:38] occupy shut them down [17:14:56] occupy the colo would be sweet [17:14:59] and loud [17:15:03] no..but let me investigate [17:16:07] wasn't really for occupy disk space on ekrem [17:18:47] notpeter: you should be good to go [17:18:55] mgmt: cables not connected [17:18:58] cmjohnson1: thanks! [17:19:34] mutante: just to double check, have you been getting pages since we switched to the new system ? [17:19:43] mutante: actually do you mind me emailing to double check ? [17:19:51] New patchset: Bhartshorne; "urlescaping files to delete instead of skipping names with spaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7290 [17:20:00] LeslieCarr: no, good that you check again [17:20:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7290 [17:20:14] just sent you a "test123" page [17:20:24] LeslieCarr: well, i had another issue with my phone as well, but unrelated [17:20:45] PROBLEM - Host srv253 is DOWN: PING CRITICAL - Packet loss = 100% [17:20:46] let me try a second one with test456 [17:20:48] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7290 [17:20:50] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7290 [17:20:52] LeslieCarr: not yet [17:21:32] heh, might just put you on the smsglobal gateway (test789) since that seems the most reliable [17:21:35] LeslieCarr: got test789 [17:21:45] ok, putting you on that one :) [17:22:02] LeslieCarr: and i recently got the manual "ignore the search pages..." as well [17:22:06] RECOVERY - Host srv253 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [17:22:06] thanks [17:23:20] notpeter: search35 mgmt link is not working. i will be messing w/ it for a bit [17:23:30] yeah, 789 stays the only one i got [17:23:37] did you load anything on the server yet? [17:23:38] cmjohnson1: cool. thanks! [17:23:54] no... the drives aren't showing up with ubuntu 12.04 [17:23:57] which is annoying.... [17:24:04] the spindle drives showed up, though [17:24:27] a driver issue seems... possible but unlikely [17:24:45] cmjohnson1: have you put SSDs in any of 12-30 yet? [17:24:49] *13-20 [17:24:59] (want to test various things) [17:25:00] no...i am still putting the ssd's on their mounts [17:25:05] ok, cool [17:25:06] PROBLEM - Apache HTTP on srv253 is CRITICAL: Connection refused [17:31:24] notpeter: search34-36....all set. [17:31:47] cmjohnson1: cool! thanks! [17:35:00] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.080 second response time [17:36:12] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [17:45:12] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2388 [17:46:56] !log deleted wikipedia-de-local-thumb container from swift. the sharded version is currently being used.
[17:46:59] Logged the message, Master [17:50:45] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [17:54:05] New patchset: RobH; "added db61 and db62" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7293 [17:54:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7293 [17:54:51] New patchset: Pyoungmeister; "switching search21 to install lucid to see if it can see the disk" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7294 [17:55:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7294 [17:55:13] New review: RobH; "self review" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7293 [17:55:13] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7293 [17:55:38] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7294 [17:55:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7294 [17:56:15] robh: can you help me with ps1-b5-sdtpa...maybe increase threshold or ID servers to be moved. they're all apaches [17:57:03] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:57:44] cmjohnson1: so looking at the power, phase Z is high [17:57:51] the others are at 18 and 19 amps [17:57:54] and z is at 26 [17:58:16] you should be able to move some xz things to xy and some yz things to xy [17:58:39] try one from xz and one from yz to move each to xy [17:59:06] i suppose these are single psu eh? [17:59:29] they are..i will check ganglia and id regular apaches [17:59:38] so best for you to identify to me one server from xz and one from yz that will be easiest for you to relocate to xy [17:59:51] i can shut them down then, so you can relocate the power [18:03:11] cmjohnson1: So I rebooted db61 and it doesn't appear to be outputting to the serial console [18:03:21] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [18:03:22] I am checking db62 now, did you set console redirection in bios? [18:03:32] cuz it appears to not be set [18:04:02] db62 seems to not be redirecting either, they normally reboot faster than 5 minutes [18:04:19] cmjohnson1: so please check both of those as soon as you can, then we can go back to balancing power [18:04:24] yes I did but I can check it again [18:04:31] you tested the serial via drac? [18:05:00] !log swift: deleting the unsharded version of all sharded containers [18:05:03] Logged the message, Master [18:05:05] huh, power usage went down drastically in b5 [18:05:08] yes i did [18:05:15] hrmm, something is odd then [18:05:37] can i check through racadm? [18:05:51] RobHalsell: do you know if search21-36 have any kind of hardware? [18:05:58] perhaps a kind that doesn't support jbod?
*hardware raid [18:06:12] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2350 [18:06:17] cmjohnson1: nope, it has to be via physical console into bios if its off [18:06:21] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [18:06:25] notpeter: they dont have raid [18:06:29] RobHalsell: kk [18:06:38] just a sas/sata controller [18:06:50] AaronSchulz: just fyi; I just kicked off a delete for the unsharded versions of all sharded containers. [18:06:56] i will reboot db61 now and check settings [18:07:01] ok [18:07:08] cmjohnson1: there's something weird with the SSDs in search21-36. neither the lucid nor precise installer can see the disks. can you take a look at that? or have any ideas? [18:07:27] on the dells, the only way to set console redirection is via bios screen, which can only be reached via already setup redirection, or by actual crash cart [18:07:31] its a drawback of the dell stuff [18:07:58] cmjohnson1: so the controllers see them? [18:08:00] ack, notpeter [18:08:02] not chris [18:08:22] im looking at search21 [18:08:22] RobHalsell: uh... dunno [18:08:24] kk [18:08:35] no, the controllers probably don't see them [18:08:37] but I'm not sure [18:08:45] checkin [18:09:23] I'll get out of that console for you :) [18:09:25] notpeter: are you on search21? [18:09:32] just got out [18:09:42] k [18:11:33] so yea, if they just plug into the mainboard, the controller is off in bios [18:11:41] but checking if its a controller or the mainboard [18:12:10] since these were ordered without disks, having the onboard controller turned off is pretty common [18:12:27] notpeter: you may have to turn it on in each system, will know in a moment [18:13:08] RobHalsell: sounds about right [18:13:44] ok, try to do an install now on search21 [18:14:02] if it works, then you have to drop into the bios screen on each, and turn on the sata controller. and set port a and b from off to auto [18:14:35] ok, cool [18:14:37] thanks! [18:15:37] notpeter: i will try and fix that on 13-20 for you when i get new drives in [18:15:51] cool cool [18:15:54] oh, they should be ok [18:16:03] as they've already had disks in them [18:16:24] also, still can't hit search36 mgmt [18:17:10] New patchset: Pyoungmeister; "search21 back to precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7299 [18:17:29] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7299 [18:17:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7299 [18:17:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7299 [18:22:48] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:23:20] RobHalsell: worked! thanks! [18:23:51] Time to install ALL of the search servers [18:24:11] notpeter: cool, now you know what you have to do to fix the rest ;] [18:24:44] hurray. [18:27:35] RobHalsell: what mode should the controller be in?
[18:32:27] cmjohnson1: I'm just going to shut down search13-20 for you now, as they're not currently doing anything [18:33:30] !log shutting down search 13-20 for hd upgrades [18:33:33] Logged the message, notpeter [18:34:18] ok...thx [18:36:00] PROBLEM - Host search15 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:00] PROBLEM - Host search14 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:00] PROBLEM - Host search17 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:00] PROBLEM - Host search16 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:09] PROBLEM - Host search18 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:27] PROBLEM - Host search13 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:54] PROBLEM - Host search19 is DOWN: PING CRITICAL - Packet loss = 100% [18:40:30] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:41:06] PROBLEM - Host search-pool1.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:41:06] PROBLEM - Host search-pool2.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:41:06] PROBLEM - Host search-pool3.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:41:15] PROBLEM - Host search20 is DOWN: PING CRITICAL - Packet loss = 100% [18:42:18] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:42:45] PROBLEM - Host appservers.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:47:51] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:51:18] RECOVERY - Host appservers.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [18:51:45] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [18:52:03] RECOVERY - Host api.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [18:54:30] robh: figured out the console problem [18:54:43] you are good for db61....fixing db62 [18:59:02] New patchset: Pyoungmeister; "adding last new search node mac" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7301 [18:59:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7301 [19:00:29] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:00:42] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7301 [19:00:45] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7301 [19:07:03] RobHalsell: ready to move power on apaches...srv261 is one of the apaches that will spike so I want to move srv260 and 261 (they're on the same Y cable) [19:07:22] cmjohnson1: are the db61 and 62 working on console redirection? [19:07:31] yes...oh ...you are not robh [19:07:41] yes...there is a different default setting [19:07:57] ok, can you detail that and create a wikitech page on it later today/monday? [19:08:11] yep..np [19:08:13] link it to platform specific page i linked you [19:08:17] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:08:29] so do i need to do something other than 'console com2' or is it the same from a use standpoint? [19:09:14] no..same console com2
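The workflow being debugged here is roughly the following (mgmt hostname illustrative; as RobH notes above, the BIOS console-redirection setting itself can only be changed from the BIOS screen, not from the DRAC):

    ssh root@db61.mgmt.pmtpa.wmnet   # log in to the DRAC
    console com2                     # attach to the redirected serial console
    racadm getconfig -g cfgSerial    # inspect the DRAC's own serial settings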
[19:09:21] on the apaches, srv260/srv261 are on xz or yz (which?) and moving to yx? [19:10:06] they are on xz...and the only place i have to move them is xy [19:10:06] !log srv260 & srv261 shutting down for power rebalancing within the rack [19:10:10] Logged the message, RobH [19:10:14] cool [19:10:14] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [19:10:23] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [19:10:23] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [19:10:23] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [19:10:23] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [19:10:23] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [19:10:23] you will notice now that the power isnt overloaded [19:10:32] oh [19:10:35] ignore those [19:10:40] that's just the mgmt network [19:10:41] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.19) [19:10:41] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [19:10:41] PROBLEM - Host cr1-sdtpa is DOWN: CRITICAL - Network Unreachable (208.80.152.196) [19:10:41] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [19:10:41] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [19:10:41] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [19:10:42] PROBLEM - Host ps1-d2-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [19:10:50] LeslieCarr: is mgmt network down? [19:10:50] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [19:10:50] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [19:10:50] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [19:10:50] PROBLEM - Host ps1-b4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.9) [19:10:54] cuz i was using it =P [19:10:57] oh [19:10:58] sorry :( [19:11:09] we are in the middle of rebalancing power in a rack [19:11:11] i forgot that in sdtpa the mgmt network is single homed [19:11:16] how long will it be down? [19:11:29] next 5-7 minutes as cr1-sdtpa reloads [19:11:49] cmjohnson1: we can resume once mgmt is back up [19:11:53] sigh, i hate sdtpa [19:11:57] sorry RobH and cmjohnson1 [19:12:13] pmtpa is no better ;] [19:12:32] though I suppose it would be ideal for us to have two multiplexing feeds, one for each fiber pair [19:12:44] so each pair carried both mgmt and production traffic, so if a single pair dies... [19:13:08] but the network in tampa is shitty. [19:13:21] the ideal thing to do would be to throw that dc away and see if chris wants to move somewhere :) [19:13:23] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [19:13:45] well, we should still fix it for what it is now [19:13:57] cuz we won't be moving away from it for a minimum of two fiscal years [19:14:03] i know :( [19:14:22] which makes it four fiscal years longer than planned. [19:14:26] PROBLEM - Host srv261 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:34] i didnt take that offline... [19:14:35] PROBLEM - Host srv260 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:48] hrmm, cmjohnson1 did you do that? [19:14:57] robh: i thought you had...yes [19:15:01] it looks like they powered down...
[19:15:05] shit timing [19:15:06] did you hit the button? [19:15:17] cuz if i shut them down, they turn off, i had not turned them off yet [19:15:23] since we didnt have mgmt network. [19:15:34] did you hit the button, or simply unplug them? [19:15:38] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:15:47] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:47] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:47] RECOVERY - Host srv261 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:15:50] i unplugged them..... [19:15:59] so they were still on and spun up. [19:16:03] thats not good [19:16:05] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:05] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:05] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:05] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:05] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:05] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:15] if they are powered on, you shouldnt just unplug them [19:16:16] not clean at all..i assumed you had them down..saw your log...my mistake [19:16:25] i know [19:16:28] yea but you have to check physically that its powered down properly [19:16:32] RECOVERY - Host srv260 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:17:07] these are apaches, so they don't have nearly so much to mess up, but they do run jobs that just offlined without cleanly shutting down [19:17:08] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [19:17:08] RECOVERY - Apache HTTP on mw6 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [19:17:15] its not good, but it shouldnt be a problem [19:17:26] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [19:17:26] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [19:17:26] RECOVERY - Apache HTTP on mw14 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [19:17:26] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [19:17:26] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [19:17:26] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [19:17:49] yep..i know it's not good..it was 100% my error. i usually verify they're off but didn't this time
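Checking that a box is really off before pulling the plug can also be done out-of-band once mgmt is reachable; a sketch with an illustrative hostname:

    ssh root@srv260.mgmt.pmtpa.wmnet
    racadm serveraction powerstatus   # expect: Server power status: OFF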
[19:18:47] PROBLEM - Apache HTTP on srv261 is CRITICAL: Connection refused [19:18:53] fyi, mgmt net should come back soon [19:18:56] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [19:19:02] cr1-sdtpa is coming back up [19:19:14] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.71 ms [19:19:23] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [19:19:23] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.75 ms [19:19:23] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.96 ms [19:19:23] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.59 ms [19:19:23] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.79 ms [19:19:26] and now it no longer is flipping out about its tfeb [19:19:26] so double awesome [19:19:32] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.51 ms [19:19:32] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.81 ms [19:19:32] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.16 ms [19:19:41] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.09 ms [19:19:41] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.73 ms [19:19:41] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.18 ms [19:19:41] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms [19:19:41] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.03 ms [19:19:46] binasher: once i have db61 and db62 confirmed ready for OS install (in a few minutes) did you want to do the OS install or have me do it? [19:19:51] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.32 ms [19:19:59] These are larger disks, so I assume you have a different partman you wanna use [19:20:08] different partition setup that is [19:20:53] RECOVERY - Host cr1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [19:20:59] RobH: mgmt net should be happy now [19:21:04] yep [19:21:21] cmjohnson1: ok, those came off xz? and are now on xy? [19:21:35] that seems right, just confirming [19:21:47] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.29 ms [19:21:48] i think we want to move a y cable from yz to xy as well [19:22:04] if you look now, its x low, y mid, z high [19:22:24] if we move a yz it will move them all closer to midline [19:22:56] cmjohnson1: so now find two servers sharing a y cable in yz for us to move [19:23:03] !log srv260 and srv261 back in business [19:23:06] Logged the message, RobH [19:24:38] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:24:51] RobH: same partman recipe is fine for now at least. the amount reserved for lvm snapshots will be bigger than needed but /a can always be grown. i want to put precise on one and lucid on the other [19:24:57] is fenari dead? [19:25:07] uhoh, are you seeing an issue awjr ? [19:25:12] im timing out trying to ssh in [19:25:20] pings are timing out too [19:25:24] im in fine [19:25:26] hrm [19:25:31] from inside or outside the world ? [19:25:35] try bast1001.wikimedia.org [19:25:36] outside the world [19:25:39] hrm [19:25:48] got a traceroute for me ?
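What LeslieCarr is asking awjr for is along these lines (target assumed to be the bastion he was ssh'ing to):

    traceroute fenari.wikimedia.org
    mtr -rw fenari.wikimedia.org   # or a one-shot mtr report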
[19:25:51] cmjohnson1: did you get my info before the d/c [19:25:59] nope [19:26:05] http://bast1001.wikimedia.org/ works fine [19:26:08] for me [19:26:26] interesting [19:26:29] yea traceroute it [19:26:37] oh now im in [19:26:40] most folks in the US route to ashburn to go to either of them [19:26:52] dunno what happened - i tracerouted, it completed fine, tried ssh'ing again, no problem [19:26:53] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:26:53] but you may have been routing direct to tampa some other way, a shitty way. [19:26:53] only d/c i had was irc....didn't lose internet [19:27:15] cmjohnson1: so you got my info to find a y cable in yz to move to yx? [19:27:39] in addition to 260 and 261 [19:27:55] yea, if you look at the power there now [19:28:00] x is low, y mid, z high [19:28:09] ok [19:28:14] looking now [19:28:18] the ideal is 3 balanced phases. if we move a y cable from yz to yx it should balance them out [19:28:36] woosters: are you around ? [19:28:39] just an approximation from the results we saw from you moving the one y cable from xz to xy [19:29:53] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [19:30:22] robh: srv286 and 287 [19:31:17] !log shutting down srv286 and srv287 for power rebalancing [19:31:20] Logged the message, RobH [19:31:25] cmjohnson1: when they shutdown, they can move [19:31:28] and power back up [19:31:35] ok [19:32:58] binasher: seems that the 720 has the 24 disks and all, but the dell raid controller can only assign 16 to a single raid10 [19:33:20] PROBLEM - Host srv286 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:26] so two raid10 arrays, will have to lvm across to the second post install [19:33:56] PROBLEM - Host srv287 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:58] also strip 256k, adaptive read ahead, and write back for options [19:35:17] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.37:11000 (Connection timed out) 10.0.8.36:11000 (Connection timed out) [19:35:44] RECOVERY - Host srv287 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [19:36:11] RECOVERY - Host srv286 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [19:36:47] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [19:37:41] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:38:05] cmjohnson1: so they are moved? [19:38:18] the power didnt seem to change. [19:38:21] yes...they are moved [19:38:30] ahh, mine's timed out [19:38:44] PROBLEM - Apache HTTP on srv287 is CRITICAL: Connection refused [19:38:45] uhhh [19:38:52] it looks like you moved it from yz to xz [19:38:59] not from yz to yx [19:39:13] please confirm? [19:39:36] y has not changed much, z has gone much higher, and x is higher [19:39:37] i moved to xy.... [19:39:41] hrmm [19:39:42] ok [19:39:43] 3 outlets down [19:40:05] PROBLEM - Apache HTTP on srv286 is CRITICAL: Connection refused [19:40:05] i dont recall the entire layout of the power strip in my head, just looking at the readouts [19:41:05] srv267 268 and srv269 have heavy utilization [19:41:09] at the moment [19:41:28] ok, well, if its moved then we should be ok [19:41:37] Z is just at its peak now and its under the ceiling [19:41:39] so we are ok.
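Putting the 19:24 and 19:32-19:33 pieces together: the controller exposes two RAID10 virtual disks, LVM joins them after install, and the oversized snapshot reserve means /a can be grown later. A sketch only; device, VG and size names are illustrative, and the MegaCli property syntax for the cache options is an assumption, not copied from a runbook:

    pvcreate /dev/sdb /dev/sdc        # the two RAID10 virtual disks
    vgcreate tank /dev/sdb /dev/sdc   # one VG spanning both arrays
    lvcreate -n a -L 2T tank
    mkfs.ext4 /dev/tank/a && mount /dev/tank/a /a
    # growing /a later out of the unused snapshot reserve:
    lvextend -L +100G /dev/tank/a && resize2fs /dev/tank/a
    # cache options from 19:33 (strip size is fixed at array creation):
    MegaCli -LDSetProp WB -LAll -aAll     # write back
    MegaCli -LDSetProp ADRA -LAll -aAll   # adaptive read ahead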
[19:42:01] both x and y are higher than they were when we started, so we are good [19:42:23] !log apache restarted by puppet run on srv286 [19:42:25] Logged the message, RobH [19:42:56] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [19:43:21] New patchset: RobH; "wrong adapter info inputted for db61/62, corrected" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7307 [19:43:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7307 [19:44:00] New review: RobH; "normal change, wrong mac info was there, simple" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7307 [19:44:02] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7307 [19:47:08] RECOVERY - Apache HTTP on srv287 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.611 second response time [19:47:53] srv295: 11 May 19:46:54 ntpdate[2668]: no server suitable for synchronization found [19:47:54] srv295: Error: unable to contact NTP server [19:47:54] mw38: 11 May 19:46:54 ntpdate[4958]: no server suitable for synchronization found [19:47:54] mw38: Error: unable to contact NTP server [19:47:54] srv289: 11 May 19:46:54 ntpdate[5377]: no server suitable for synchronization found [19:47:54] srv289: Error: unable to contact NTP server [19:47:57] srv273: 11 May 19:46:54 ntpdate[7802]: no server suitable for synchronization found [19:47:59] srv273: Error: unable to contact NTP server [19:48:01] srv284: 11 May 19:46:54 ntpdate[5898]: no server suitable for synchronization found [19:48:03] srv284: Error: unable to contact NTP server [19:54:44] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:03:17] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:11:50] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:22:43] New patchset: RobH; "db62 to be precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7309 [20:23:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7309 [20:23:17] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7309 [20:23:19] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7309 [20:54:42] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:56:02] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:10:11] New patchset: Pyoungmeister; "making nrpe check for udp2log procs retry 10 times" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7313 [21:10:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7313 [21:11:21] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7313 [21:11:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7313 [21:12:23] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [21:29:04] New patchset: RobH; "precise for db61" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7316 [21:29:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7316 [21:29:39] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7316 [21:29:41] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7316 [21:35:21] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:40:45] PROBLEM - Lucene on search21 is CRITICAL: Connection refused [21:42:24] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:04:02] RobH: hey so I remember you saying the c switches are hooked up - which serial console are they on ? [22:04:46] scs-c1-eqiad [22:04:56] not all the power strips are connected to mgmt [22:05:08] and those switches are stacked and fibered, but not connected to mgmt [22:05:11] though they are all on serial [22:05:29] actually, i may have lied... I may have plugged them in to the mgmt switches [22:05:43] i ran out of the proper length cables, not sure if it was before or after those got connected [22:05:51] im sure you can tell if there is a link so its all good [22:06:29] oh [22:06:36] hehe, well serial is the important bit :) [22:07:29] if they arent already, i expect to connect the mgmt port on each one to the mgmt switch in the rack [22:07:41] as each mgmt switch in the rack is run back to msw1 [22:08:19] when chris gets here we will be racking all the new row c servers, and retrofitting all the racks with cable mgmt 2u stuff [22:08:21] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours [22:08:26] eqiad is going to be purdy. [22:10:21] :) [22:24:06] RobH are you at the DC today ? [22:24:09] physically ? [22:24:24] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:24:35] LeslieCarr: nope, was yesterday [22:24:39] trying to figure out which switch is which (why can't they learn what rack they are on by themselves?) [22:24:48] the scs is labeled with them [22:25:05] yeah, however when you login to the console, they all route you to the master RE [22:25:06] ports 1-8 are ps1c1-ps1c8 [22:25:15] oh. [22:25:23] which is what they are supposed to do, however doesn't help me figure out which is which :) [22:25:40] no way to tell whose console port is active huh? [22:26:11] yeah, but i can set on the led's a message on each of them [22:26:20] and have you say "oh this one thinks it's number 5 [22:26:22] ahh, i see what yer getting at [22:26:23] and then figure it out [22:26:26] yeah [22:26:42] well next time :) [22:27:17] but config is all there :) as soon as we straighten that out, row c network will be happy! 
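The swift-container-auditor flaps peppering this log come from an NRPE process check; run by hand it looks roughly like this (plugin path and thresholds assumed):

    /usr/lib/nagios/plugins/check_procs -c 1:1 \
        --ereg-argument-array '^/usr/bin/python /usr/bin/swift-container-auditor'

CRITICAL means zero matching processes; the auditor restarting itself a few minutes later is what produces the PROBLEM/RECOVERY pairs.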
[22:31:02] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:46:38] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:57:44] PROBLEM - Puppet freshness on pdf2 is CRITICAL: Puppet has not run in the last 10 hours [23:06:33] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7070 [23:06:35] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7070 [23:06:44] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:11:50] PROBLEM - Puppet freshness on pdf1 is CRITICAL: Puppet has not run in the last 10 hours [23:15:53] PROBLEM - Puppet freshness on pdf3 is CRITICAL: Puppet has not run in the last 10 hours [23:18:16] LeslieCarr: adding myself to contacts.cfg is enough? [23:29:58] dschoon: do you know about the puppet statistics::mediawiki/git::clone class? [23:30:01] it's failing on stat1. [23:30:08] i do not. [23:30:11] sorry :( [23:30:16] ::sigh:: [23:30:27] i only draw pretty pictures [23:30:29] i'm decorative. [23:30:41] maybe you can decorate the class. [23:49:35] New patchset: Reedy; "Add php5-memcached to apaches for change of memcached client" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7347 [23:49:56] New patchset: Reedy; "Point xinetd at /home/wikipedia/common/wmf-config/extdist/svn-invoker.php rather than /home/wikipedia/common/php/extensions/ExtensionDistributor/svn-invoker.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7127 [23:50:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7347 [23:50:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7127 [23:50:58] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7347 [23:57:43] Change abandoned: Reedy; "Bah. Screw git branches!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7347 [23:57:48] Change abandoned: Reedy; "Bah. Screw git branches!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7127 [23:59:11] :D [23:59:19] exactly how I was feeling yesterday Reedy :P [23:59:27] New patchset: Reedy; "Point xinetd at /home/wikipedia/common/wmf-config/extdist/svn-invoker.php rather than /home/wikipedia/common/php/extensions/ExtensionDistributor/svn-invoker.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7348 [23:59:35] I committed twice to the same branch it seems [23:59:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7348
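The usual way out of the state Reedy hits at 23:59 (two unrelated commits stacked on one local branch, so gerrit chains the changes together) is to rebase them apart; branch names assumed:

    git log --oneline origin/production..HEAD   # see the stacked commits
    git rebase -i origin/production             # keep one, drop the other
    git push origin HEAD:refs/for/production    # resubmit, one commit per change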