[00:00:19] !log pulling cp1044 from lvs for testing [00:00:22] Logged the message, Master [00:05:02] preilly: the varnish config you deployed to prod is invalid [00:05:14] Message from VCC-compiler: [00:05:14] Symbol not found: 'reg.http.X-Carrier' (expected type BOOL): [00:09:34] PROBLEM - MySQL Idle Transactions on db38 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:09:57] enwiki is in trouble [00:10:18] double trouble? [00:10:45] (Cannot contact the database server: Unknown error (10.0.6.48)) [00:10:51] I guess so [00:11:08] there's a whole bunch of [00:11:09] UPDATE /* User::invalidateCache G fetcher */ `user` SET user_touched = '20120511000900' WHERE user_id = '9630782 [00:11:12] all for the same user [00:11:58] PROBLEM - LVS HTTP on m.wikimedia.org is CRITICAL: HTTP CRITICAL - pattern not found [00:12:03] spamming derrors.log [00:13:19] AaronSchulz: why would there be thousands of invalidateCache updates for the same user? [00:13:28] what invokes that function? [00:13:39] lots of things [00:13:45] why doesn't it have a ts condition [00:14:11] can we block this user? [00:14:13] PROBLEM - MySQL Slave Running on db38 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:14:49] RECOVERY - LVS HTTP on m.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 0.109 second response time [00:15:13] We can.. [00:15:34] RECOVERY - MySQL Slave Running on db38 is OK: OK replication [00:15:34] TimStarling: ^^ [00:15:52] Not a new account, no contribs, no log entries [00:16:44] look in xff.log [00:17:21] I hacked User [00:17:29] * AaronSchulz can view RC again [00:18:07] RECOVERY - MySQL Idle Transactions on db38 is OK: OK longest blocking idle transaction sleeps for 0 seconds [00:18:09] that was enwiki? [00:18:58] New patchset: preilly; "Partner IP Live testing - Thursday, May 10th, 10am - 12pm PDT" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7252 [00:19:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7252 [00:19:16] AaronSchulz: your hack did the trick [00:19:18] http://en.wikipedia.org/wiki/Special:Contributions/G_fetcher [00:19:21] TimStarling: yup, enwiki [00:19:22] not much there, hmm [00:19:33] yeah, i just tried checkuser [00:19:40] so, in case this was not noticed or reported yet, a few minutes ago I got "(Can't contact the database server: Unknown error (10.0.6.48))" on en.wikipedia.org [00:19:52] a refresh a few seconds later worked. [00:19:55] why did nobody flood #wikimedia-tech? heh [00:20:43] this would be easier if we had backend request logs [00:21:11] AaronSchulz: i was killing enough 'G fetcher' queries to generally keep some connection slots open [00:22:11] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7252 [00:22:13] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7252 [00:26:56] AaronSchulz: can you add the IP to that error message? [00:27:03] then we can correlate it with other logs [00:28:01] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:28:39] sure [00:28:45] who "rotated" the fatal.log on april 28?
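The VCC error quoted above suggests the deployed config referenced reg.http.X-Carrier where req.http.X-Carrier was almost certainly intended: request headers live under req.http.*, and a bare header in an if() is a presence test, which is what gives the compiler the BOOL it is asking for. A minimal sketch of what the corrected stanza might look like — an illustration, not the actual deployed mobile config:

    sub vcl_recv {
        # 'req' (the request object), not 'reg' -- a bare header test
        # compiles to a BOOL: true when the header is present.
        if (req.http.X-Carrier) {
            # hypothetical carrier-specific handling
            set req.http.X-Carrier-Matched = "1";
        }
    }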
[00:28:51] socat is still writing to the old file [00:29:20] Leslie did it for me [00:30:13] TimStarling: I will have to add a flag to wfBacktrace to force the cmd-mode output [00:30:22] * AaronSchulz puts that on his todo list [00:30:22] !log restarted socat on fenari so that fatal.log is reopened [00:30:24] And I've got https://gerrit.wikimedia.org/r/#/c/6061/ outstanding to logrotate it [00:30:26] Logged the message, Master [00:30:48] TimStarling: I think that was me [00:31:42] CentralAuthHooks.php line 463 calls User->invalidateCache() [00:31:55] (what kind of bot is "morebots" ?) [00:32:30] this isn't a loop is it? [00:34:11] AaronSchulz: you could have made it a fatal error, then backtraces would have appeared in fatal.log [00:34:32] I'm always afraid to fatal in the middle of execution, lest stuff be half done [00:34:36] binasher: https://gerrit.wikimedia.org/r/7254 [00:34:50] ideally everything would be one transaction, but that's not always the case [00:34:51] it doesn't look like a loop to me [00:34:51] but meh [00:34:55] yeah [00:35:09] if you select just one apache PID, you don't get a lot of entries [00:35:28] binasher: ^ [00:35:59] TimStarling: but it's also another the other processes, right? [00:36:07] *also on the other processes [00:37:30] AaronSchulz: in the processlist i grabbed while there were 856 of those queries running, they were coming from 127 different apache servers [00:37:44] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7254 [00:37:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7254 [00:40:37] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:40:46] I see api.php requests for this IP in xff.log, but I don't see anything in sampled-1000.log [00:41:31] we really need an API log, and a backend log [00:41:56] but either would have a very high rate, probably higher than the current POST log [00:42:58] because of android apps requesting action=parse [00:43:03] we need some dedicated logging server [00:43:35] OrgName: Google Inc. [00:43:54] binasher: wtf? [00:44:25] where did that come from? [00:44:33] binasher: yeah, I filed an RT ticket requesting one [00:44:47] but it was closed after the hardware was purchased, and left without an OS [00:45:46] http://rt.wikimedia.org/Ticket/Display.html?id=2400 [00:46:39] TimStarling: so this user is coming through the API? [00:46:45] don't know [00:47:08] don't your backtraces show? [00:47:39] ah, they are truncated [00:48:07] use wfErrorLog() instead [00:48:47] wfErrorLog( $msg, 'udp://10.0.5.8:8420/g_fetcher' ); [00:48:52] then you get 64KB [00:49:51] ok it is a mix of API and page views [00:50:01] maybe it is getting listings and then hitting each item [00:50:18] /w/api.php?action=query&format=xml&revids=491908036&cllimit=max&prop=categories%7Cinfo%7Crevisions&rvprop=comment%7Ccontent%7Cflags%7Cids%7Ctimestamp%7Cuser [00:53:54] binasher: :) [00:54:22] binasher: some of the errors are hacks that were on for a few seconds [00:56:16] I'm pushing out a header log [00:56:46] TimStarling: do we have a generic log group type for temporary things [00:57:27] this works [00:57:36] /home/wikipedia/logs/g_fetcher.log [00:58:00] definitely points to google, right? 
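On the "why doesn't it have a ts condition" question above: the idea is to make the touch conditional on the stored timestamp, so a client replaying the same request collapses into one effective write instead of thousands of identical UPDATEs. A rough sketch of that shape — illustrative only, not the actual MediaWiki fix; $db here is a hypothetical database handle:

    // Only bump user_touched if this call would actually move it
    // forward; concurrent or replayed invalidations become no-ops.
    function invalidateUserCache( $db, $userId ) {
        $now = gmdate( 'YmdHis' ); // TS_MW-style, e.g. 20120511000900
        $db->query(
            "UPDATE user SET user_touched = '$now'" .
            " WHERE user_id = " . intval( $userId ) .
            " AND user_touched < '$now'"
        );
    }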
[00:58:15] ah, we even have an email address in the user agent [00:58:28] see, we tell people to set their user agent but we have no way to actually see it [01:03:22] TimStarling: I just emailed them [01:03:36] saying what? [01:03:49] TimStarling: why are you crawling as a logged in user [01:04:30] it would probably be better for us if they were logged out, but they weren't to know that [01:06:19] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:06:48] maybe the session is corrupted or something [01:06:54] TimStarling: yeah [01:08:02] they always send the same token and the same session ID, so invalidateCache() shouldn't be hit [01:08:57] hmm [01:11:11] ah, but they don't send a *local* session ID [01:11:25] globalloggedin is in the local session [01:12:29] they will be getting a Set-Cookie header in every response, but they are ignoring it [01:13:24] one of the many ways to DoS wikipedia :) [01:13:30] * AaronSchulz should compile a list [01:13:45] there's an app for that [01:13:56] TimStarling: In the meantime, any tips on debugging a segfault issue? [01:14:09] There are certain pages that when you edit them make PHP segfault, but the edit still goes through [01:14:11] gdb [01:14:15] Yeah I figured [01:14:23] But I can't get my hands on a process that segfaults [01:14:27] or a core dump FTM [01:15:01] I've been sending requests to srv200 [01:15:18] Attached gdb to a single Apache worker, but I'd have to send lots of requests before it'd hit "my" worker [01:15:34] I tried to enable core dumps but that didn't work either [01:16:14] well, you can work out why core dumps are not working, there are a number of conditions that need to be satisfied [01:16:21] or you can use apache2 -X [01:16:23] RoanKattouw: limit the number of workers? [01:16:39] running apache with -X runs it as a single worker, without forking at all [01:16:43] ah [01:16:51] Aha, nice [01:17:06] !log Stopping Apache on srv200 so I can use it as my guinea pig for segfault debugging [01:17:07] usually it will hang if you point a browser at such an instance [01:17:12] Logged the message, Mr. Obvious [01:17:27] because the browser will try to open concurrent connections for load.php [01:18:01] it will probably work if load.php goes to bits and that isn't routed to your test server, I guess [01:18:11] alternatively you can use curl [01:18:46] maybe it won't hang on wikimedia, who knows, it's been a while since I tried it there [01:19:07] I often use that trick on localhost, but more often on production servers I use core dumps [01:19:39] I'm netcatting to srv200:80 [01:19:43] So that's not an issue [01:20:16] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [01:20:43] PROBLEM - Apache HTTP on srv200 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:21:00] wtf now I just get a SIGPIPE when PHP tries to write output?
This doesn't seem right [01:21:17] handle SIGPIPE noprint nostop [01:21:26] put that in gdb, it's essential [01:21:34] then cont [01:22:07] probably you just had an unrelated pybal request come in, SIGPIPE is completely normal [01:22:43] it occurs when the remote side closes its connection [01:23:29] OK trying that now, thanks [01:23:53] Man running stuff in gdb is slow :D [01:24:06] Or, well, maybe it's not gdb [01:24:19] An untraced apache process also took like 30s to segfault [01:24:40] to get core dumps working, the three things you most often have to do are: [01:24:48] 1. set CoreDumpDirectory [01:24:55] Tried that yeah [01:25:06] There's one commented out in an Apache config file somewhere [01:25:20] 2. add a "ulimit -c" to a startup script [01:25:20] I also created the directory, chown apache, chmod 777 [01:25:26] Urgh right [01:25:33] 3. core dump directory permissions [01:27:39] "man core" has 5 different reasons why a core dump might not be created, if that doesn't work [01:36:05] TimStarling: google emailed me back [01:36:15] TimStarling: Hey Patrick, [01:36:16] We were observing some strange caching behavior on certain pages on Wikipedia and using a logged in user seemed to resolve some of these issues, so we reverted back to our old behavior of using a logged in user on our side. [01:36:16] It looks like that wasn't the best idea on our side though, so we'll stop it ASAP. [01:36:17] Sorry about that, [01:36:18] Shen [01:36:34] seeing a lot of these in the logs now [01:36:36] Fri May 11 1:36:10 UTC 2012 mw13 enwiki UserDailyContribsHooks::articleSaveComplete 10.0.6.48 1205 Lock wait timeout exceeded; try restarting transaction (10.0.6.48) UPDATE `user_daily_contribs` SET contribs=contribs+1 WHERE day = '20120511' AND user_id = '0' [01:36:40] what's up with user_id = 0? [01:37:04] bug [01:37:58] PROBLEM - MySQL Idle Transactions on db38 is CRITICAL: CRIT longest blocking idle transaction sleeps for 686 seconds [01:39:15] that's an AFT thing I think [01:39:19] RECOVERY - MySQL Idle Transactions on db38 is OK: OK longest blocking idle transaction sleeps for 0 seconds [01:39:31] RoanKattouw: who should we complain to about that? [01:40:09] Eww that's not supposed to happen [01:40:12] Not an AFT thing either [01:40:19] I've been trying to get rid of that ext [01:40:28] I think it was ClickTracking originally, but AFT references it [01:40:43] Oh, right, it's for ClickTracking data I believe [01:40:46] Anyway, I'll fix that user_id=0 bug [01:40:53] it stopped for now, but was coming in fast enough to impact db38 [01:41:43] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 251 seconds [01:42:20] TimStarling, binasher: https://gerrit.wikimedia.org/r/7259 [01:42:25] Approve that and I'll deploy it right away [01:43:04] merged [01:44:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:46:40] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 224 seconds [01:48:01] OK, lemme deploy that [01:48:10] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 3 seconds [01:52:29] That's not realted to the segfault though, right? [01:52:33] related [01:55:09] No [01:55:16] But we found that one too [01:57:03] New patchset: Catrope; "Set PCRE recursion limits" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7261 [01:57:13] TimStarling: ---^^ I can't approve or deploy that but I think you can? [01:57:25] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7261 [01:58:39] oh, no I can't approve that, the backtrack_limit should stay at 1M [01:58:49] Oh, sorry [01:58:53] probably nothing would work at all with it at 1000 [02:00:06] actually it was 100000 before, it can probably stay at 100000 [02:00:11] New patchset: Catrope; "Set PCRE recursion limit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7261 [02:00:29] yep, that's fine [02:00:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7261 [02:00:55] the backtrack limit prevents O(N^2) running time in certain cases [02:00:55] !log Started Apache back up on srv200, done debugging [02:00:58] Logged the message, Mr. Obvious [02:01:31] but I think it's a number of characters, not a number of levels [02:01:50] Ah ye [02:01:53] Then 1k would be bad [02:02:16] RECOVERY - Apache HTTP on srv200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.023 second response time [02:02:19] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7261 [02:02:23] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7261 [02:02:48] it's in puppet now [02:03:02] OK, good [02:03:18] do you need it forced out right now, or is it ok to wait for puppet to do it? [02:03:50] It's causing segfaults at a rate of ~5 per minute [02:04:16] From a user perspective that means you wait for a long time then get a Squid error page, but your edit goes through [02:06:55] We should be happy these segfaults are only on POST, not on GET, otherwise Squid would amplify them [02:08:34] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:12:46] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:22:56] Joan: If you're interested, the segfault was due to excessive recursion in PCRE, triggered by a regex in PageTriage. Offhand it looks like it would be triggered on pages where there are more than ~18k characters between a '{{' and its matching '}}' [02:23:29] We're fixing it by setting the PCRE recursion limit to 1k, apparently the default value (which it was set to previously) is 100k which is way too high [02:48:56] RoanKattouw: Cool. Thanks for fixing that. :-) [02:49:07] RoanKattouw: I'm gonna quote you to annotate that talk page's history. [02:49:13] heh thanks [02:49:17] Sorry for all the confusion there [02:49:27] Why is the regex recursing so much? [02:49:28] The IPs that made those edits are all remarkable [02:50:29] 0:0:0:0:0:0:0:1 is IPv6 for localhost [02:50:41] The 216. IP is the office [02:50:54] And the 208. IP is fenari (our bastion host) [02:50:59] As for why it was recursing so much [02:51:08] $text = preg_replace( '/\{\{[^\{]((?!\{\{).)*?\}\}/is', '', $text ); [02:51:14] They run that repeatedly to strip templates [02:51:17] It's kind of evil really [02:51:25] Why are they stripping templates like that? [02:51:44] What that regex does is find a template call that is not nested (i.e. contains no other template calls) and remove it, then they repeatedly run it [02:51:52] Right. 
[02:51:52] It's for the snippet they display in PageTriage [02:52:21] The code that generates the snippet is evil, it strips templates by executing that regex up to 5 times, then parses the remaining text, then truncates the result to 150 chars [02:52:35] To try to get a usable snippet. [02:52:41] Yeah [02:52:44] Sans infoboxes etc [02:52:44] Is the snippet generally usable? [02:52:52] I... don't know, haven't looked [02:52:58] Looked at Special:NewPagesFeed or whatever it's called to see [02:53:06] Anyway, the evil part of that regex is ((?!\{\{).)*? [02:53:50] Its job is to assert that there are no occurrences of '{{' between the opening '{{' and the closing '}}' , i.e. to find only template calls that don't contain any nesting [02:54:19] But it looks like that would probably also cause PCRE to enter a new recursion level (i.e. create a new stack frame) for every character between '{{' and the matching '}}' [02:54:51] So if you have a lot of stuff wrapped in {{ and }} and that lot of stuff doesn't contain '{{' , then you'd trip this bug [02:55:02] For values of 'a lot' of, say, 10-20k [02:55:21] I have no problem believing the database reports page matches that description :) [02:55:22] Heh. [02:55:28] I just got the error again. [02:55:32] So I guess the fix isn't live? [02:55:37] lol. [02:55:48] It's in puppet, so it can take up to 4(?) hours to be live on all machines [02:56:22] Unless Tim forces a manual run [02:57:24] RoanKattouw: what about getting some food while puppet works for ya ? :-D [02:58:11] heh [02:58:12] Yeah good idea [02:58:18] I'll wrap up, gimme ~5 mins [02:58:32] we need to choose a place too [03:02:03] Joan: Just added some more info to that section, thanks [03:02:10] Although ironically I just hit the bug too :) [03:05:54] I've stripped templates before. [03:06:03] But I just counted { and } until the number was even. [03:06:09] Not sure if that'd be more or less horrible. [03:06:16] I'd assume less. 
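Tim's three-step core-dump checklist from earlier, collected in one place as a sketch for a Debian/Ubuntu-style Apache box — the paths and the www-data user here are illustrative assumptions, not the production layout:

    # 1. In the Apache config, tell it where cores should land:
    #      CoreDumpDirectory /var/tmp/apache-cores
    # 2. In the startup script, lift the core size limit before apache
    #    starts (a default of 0 silently suppresses all dumps):
    ulimit -c unlimited
    # 3. Make sure the directory exists and the worker user can write it:
    mkdir -p /var/tmp/apache-cores
    chown www-data /var/tmp/apache-cores
    chmod 0700 /var/tmp/apache-cores
    # Still nothing? "man core" lists the remaining reasons a dump can
    # be skipped (setuid binaries, kernel core_pattern, fs limits, ...).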
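For the user_id = 0 flood earlier (the lock-wait timeouts on user_daily_contribs), the natural shape of the fix is to skip the per-day counter for anonymous users. The actual change went in as gerrit 7259, which is not reproduced here; this is just a sketch of the guard, with a simplified hook signature:

    // Rows keyed on user_id = 0 make every anonymous edit contend for
    // the same user_daily_contribs row; skip anons entirely.
    public static function articleSaveComplete( $article, $user ) {
        if ( !$user || $user->isAnon() ) {
            return true; // nothing to count for user_id = 0
        }
        // ... increment user_daily_contribs for $user->getId() ...
        return true;
    }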
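The PCRE knobs being discussed are ordinary php.ini settings; a sketch of the end state described above (the exact file layout is puppet's business, values per the conversation):

    ; Cap PCRE recursion far below the 100k default so a pathological
    ; match fails cleanly (preg_* returns NULL) instead of exhausting
    ; the C stack and segfaulting the worker.
    pcre.recursion_limit = 1000
    ; The backtrack limit guards against O(N^2) match blowups; per the
    ; discussion it stays at its previous value.
    pcre.backtrack_limit = 100000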
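Joan's brace-counting idea above can be made linear and recursion-free: track nesting depth with a counter and copy only the text outside {{...}} spans. A hedged sketch, not PageTriage's actual code — note an unbalanced '{{' will swallow the rest of the text, which real code would want to handle:

    function stripTemplates( $text ) {
        $out = '';
        $depth = 0;
        $len = strlen( $text );
        for ( $i = 0; $i < $len; $i++ ) {
            $pair = substr( $text, $i, 2 );
            if ( $pair === '{{' ) {
                $depth++;
                $i++; // consume both braces
            } elseif ( $pair === '}}' && $depth > 0 ) {
                $depth--;
                $i++;
            } elseif ( $depth === 0 ) {
                $out .= $text[$i]; // keep only text outside templates
            }
        }
        return $out;
    }

Constant work per character and no recursion, so the ~18k-character spans that blew PCRE's stack cost nothing special here.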
[03:06:39] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [03:08:00] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [03:17:45] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [03:30:12] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [03:48:48] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2363 [04:24:53] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:39:17] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:59:23] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:10:38] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:11:59] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [05:11:59] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [05:39:39] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [06:46:35] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:52:45] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:07:36] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:07:54] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [07:18:06] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:19:18] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2363 [07:23:10] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3518 [07:23:13] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/3518 [07:29:58] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:30:34] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:37:38] apergos: which system is pushing the datasets to the gluster storage? [07:37:53] dataset2 I think, lemme look [07:38:04] maybe it's 1001 [07:38:04] oh. I don't see gluster installed there [07:38:21] ah [07:38:22] it is [07:38:22] oh yeah it's 1001, I remember we were sad that there would be latency [07:38:23] I broke it [07:38:26] :-( [07:38:31] um [07:38:48] two questions: "how?" 
and "now what do we do?" [07:38:53] I'm fixing it :) [07:38:59] ok cool [07:39:03] fixed [07:39:08] I upgraded gluster recently [07:39:10] and it went poorly [07:39:20] so I had to rollback [07:39:24] I missed this system [07:39:28] oohhh [07:43:48] New patchset: ArielGlenn; "adding odysseus.fi.muni.cz as dumps mirror" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7271 [07:43:55] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:44:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7271 [07:46:39] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7271 [07:46:42] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7271 [07:47:22] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [07:48:38] o_O new mirror? [07:48:52] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:50:59] not yet [07:51:20] after they start pulling from us, yes :-P [07:52:13] nice work :) [07:52:25] though its only the last five dumps, but at least its something [07:52:52] most people won't want anything but the recent dumps [07:53:04] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2338 [07:53:44] and the media dumps, which cause chaos on the list haha [07:53:58] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:55:23] see, when I don't announce, there's a reason for it :-P :-D [07:57:06] * Hydriz apologises to apergos for causing the trouble [07:57:13] hahaha [07:57:21] I thought someone else started that thread [07:57:34] lol no [07:57:39] I told emijrp before that [07:57:42] boooo [07:57:56] we were in the beginning of the first test run [07:58:01] I was browsing through the ftpmirror site to find why rsync wasn't working [07:58:05] and then suddenly we had all this extra traffic from downloaders [07:58:06] then I chanced on this [07:58:10] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:58:30] oh, emijrp asked someone to download them :P [07:58:41] *eyeroll* [07:58:50] they might be full of crap. 
for example it's not clear that the files with unicode in the filenames are in the tarballs correctly [07:59:14] heh [07:59:28] at least its something worth saving for [07:59:53] no, it will be something worth throwing away in a week and redownloading :-P [08:05:22] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [08:30:13] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [08:33:04] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [08:41:28] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2363 [08:48:22] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [08:48:58] !log upgrading apache/mysql/kernel on marmontel (blog) [08:49:02] Logged the message, Master [08:49:05] meh [08:49:11] guest? [08:49:29] not anymore:) [08:51:13] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2338 [08:51:40] !log rebooting marmontel (blog) [08:51:43] Logged the message, Master [08:52:52] PROBLEM - Host marmontel is DOWN: CRITICAL - Host Unreachable (208.80.152.150) [08:55:25] RECOVERY - Host marmontel is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [08:58:07] !log package upgrades on ekrem (IRC server, WAP, Apple dict...) [08:58:10] Logged the message, Master [09:05:10] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:21:27] !log ekrem was close running out of disk again. logrotated apache logs, changed config to: size 512M,rotate 3 [09:21:31] Logged the message, Master [09:29:32] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:31:03] New patchset: ArielGlenn; "add other hostnames for muni.cz to dump rsync access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7278 [09:31:22] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7278 [09:31:58] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7278 [09:32:00] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7278 [09:41:32] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 183 seconds [09:42:44] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 203 seconds [09:46:56] RECOVERY - MySQL Slave Delay on db1033 is OK: OK replication delay 0 seconds [09:47:05] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds [09:52:29] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2613* [10:02:14] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2625* [10:05:50] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:09:19] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2400 [10:09:26] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:13:20] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:13:38] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:19:29] New patchset: Dzahn; "adding DHCP entries / MAC addresses for analytics1001 to 1010" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6919 [10:19:32] !wt is http://wikitech.wikimedia.org/view/$1 [10:19:35] Key was added! [10:19:42] !wt Cisco | mutante [10:19:42] mutante: http://wikitech.wikimedia.org/view/Cisco [10:19:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6919 [10:20:39] o.0 You guys have a cisco server [10:21:40] yeah, had to get into that mgmt shell, but it isnt that bad actually [10:22:21] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [10:23:06] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6919 [10:23:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6919 [10:31:48] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:33:09] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:39:18] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [10:45:34] New patchset: Dzahn; "add logrotate config for lighttpd on install-server (RT-2753), rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7167 [10:45:54] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7167 [10:46:21] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2350 [10:49:21] PROBLEM - udp2log processes for emery on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/local/bin/packet-loss, /var/log/squid/filters/india-filter, /usr/local/bin/sqstat, /var/log/squid/filters/latlongCountry-writer, [10:53:33] RECOVERY - udp2log processes for emery on emery is OK: OK: all filters present [10:54:36] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:00:36] PROBLEM - udp2log processes for emery on emery is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/local/bin/packet-loss, /var/log/squid/filters/india-filter, /usr/local/bin/sqstat, /var/log/squid/filters/latlongCountry-writer, [11:06:18] RECOVERY - udp2log processes for emery on emery is OK: OK: all filters present [11:13:30] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:13:57] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [11:23:42] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active [11:24:56] !log upgrading packages/kernel on hooper, rebooting (Blog,Etherpad,Racktables) [11:24:59] Logged the message, Master [11:25:21] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [11:33:45] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [11:35:23] !log stat1 - installed new kernel, but waiting to reboot. 
schedule with aotto [11:35:26] Logged the message, Master [11:36:45] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:37:57] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2650* [11:39:27] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:39:28] !log starting ms-be swift-container-auditors every once in a while [11:39:31] Logged the message, Master [11:43:30] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2375 [11:48:54] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:50:15] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:04:30] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [12:07:48] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours [12:12:26] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2338 [12:20:50] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2638* [12:27:44] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2388 [12:33:26] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [12:40:38] New patchset: Dzahn; "PDF servers: add/change to role class, add font class, install indic fonts on pdf1-3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7282 [12:40:55] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/7282 [12:42:42] New patchset: Dzahn; "PDF servers: add/change to role class, add font class, install indic fonts on pdf1-3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7282 [12:43:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7282 [12:46:02] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [12:47:51] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7282 [12:47:53] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7282 [12:54:35] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [13:01:29] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [13:01:36] New patchset: Jgreen; "adjusting fundraising backups while storage3 is dead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7283 [13:01:54] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7283 [13:02:50] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7283 [13:02:52] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7283 [13:12:24] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [13:12:43] !log installing package upgrades on pdf1-3 (and installed requested indic fonts via new puppet role class) [13:12:46] Logged the message, Master [13:20:40] New patchset: Jgreen; "split to success/failure notification recipients so only failures go to root@ for reduced cronspam" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7284 [13:20:58] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7284 [13:20:58] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7284 [13:23:39] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [13:29:21] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [13:39:06] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [13:46:09] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [13:53:12] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2338 [13:53:55] thanks jeff ;) [13:58:09] no problem [13:58:16] i don't know why I didn't think of it earlier [14:00:15] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [14:02:30] New patchset: Ottomata; "{role,misc}/statistics.pp - installing generic mysqld on stat1." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7285 [14:02:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7285 [14:04:36] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [14:06:55] !log kernel upgrading / rebooting srv servers where uptime > 200 d order by uptime desc limit 1 [14:06:59] Logged the message, Master [14:16:30] PROBLEM - Apache HTTP on srv235 is CRITICAL: Connection refused [14:17:06] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [14:19:57] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2625* [14:22:48] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [14:27:08] ottomata: anything specific that should be done as a general rule before rebooting stat1? 
(tell $user / schedule ahead / stop/start something / just do it) [14:30:58] hmm [14:31:10] erik z and andre engels are the only ones that really use it right now [14:31:21] but i'm not sure if they are currently doing things there [14:31:25] i installed new kernel but did not reboot [14:31:26] scheduling with them would be good [14:31:28] oh! [14:31:37] precise? or just a kernel upgrade? [14:31:44] we are going to have it reinstalled with precise soon [14:31:47] no, just kernel re the bug with long uptime [14:31:52] oh right, ok [14:31:54] to prevent crashing [14:32:09] aengels is the only user logged in other than me right now [14:32:10] and general package upgrades by ubuntu [14:32:20] within lucid [14:32:36] i guess talk to drdee maybe? he can tell you if it is ok [14:32:40] or at least tell you who you need to ask [14:32:47] thanks, btw [14:32:58] ok,thanks, this is what i wanted. a rule who to ask (also next time) [14:33:50] ottomata: hold on [14:34:12] PROBLEM - Apache HTTP on srv226 is CRITICAL: Connection refused [14:35:33] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [14:37:05] drdee: shall i ask erik z? [14:37:14] ottomata: aengels will let us know when we can start the reinstall, i expect to start within a few hours [14:37:19] ask what? [14:37:24] to reboot stat1 [14:37:41] he is fine, we are waiting for aengels to give green light [14:37:45] ok [14:38:09] should be within next 2 hours (hopefully) then we can start make a backup and reinstall precise [14:38:24] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2625* [14:38:31] ok, i came across it while just doing kernel upgrades within lucids [14:39:05] ok [14:40:39] RECOVERY - Apache HTTP on srv235 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.065 second response time [14:45:42] ok, here's where I ask a potentially dumb question [14:45:44] check this out [14:45:52] ahh, I will pastie one sec [14:47:39] https://gist.github.com/2660218 [14:48:21] oh maybe cause my home dir is not readable by www-data! [14:48:25] doh, i knew it was a dumb question [14:48:32] thank you ops room, for being my sounding box [14:49:39] RECOVERY - Apache HTTP on srv226 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [14:50:52] where is this? labs? [14:51:19] yeah [14:51:24] but i got it [14:51:26] that was why [14:51:35] www-data couldn't read the symlink because it was pointing inside of my home dir [14:51:37] which was 700 [14:52:08] oh.ok. i was going to say there are ongoing changes to the way /home is handled [14:52:25] but /var/www not ..sure [14:53:51] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588* [14:58:57] PROBLEM - Apache HTTP on srv257 is CRITICAL: Connection refused [14:59:24] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [15:00:09] PROBLEM - Host mw62 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:37] PROBLEM - Host srv257 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:39] !log mw62 -unless somebody was on that right now it died.
mgmt also just Create Instance Error [15:03:43] Logged the message, Master [15:05:42] RECOVERY - Host srv257 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:11:27] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2375 [15:13:24] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [15:13:24] PROBLEM - Puppet freshness on search20 is CRITICAL: Puppet has not run in the last 10 hours [15:14:18] PROBLEM - Apache HTTP on srv256 is CRITICAL: Connection refused [15:15:39] PROBLEM - Apache HTTP on srv254 is CRITICAL: Connection refused [15:26:00] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [15:29:54] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.025 second response time [15:33:45] !log adding DNS entries for analytics hosts in new vlan 1121 (10.64.21.0/24), hosts starting at .101 to match names analytics1001 = .101 and ++ [15:33:49] Logged the message, Master [15:40:24] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [15:41:28] ottomata, mutante: we can start backing up stat1 and begin reinstalling precise [15:42:57] please start the backup but im afraid im not gonna be here for reinstall right after that is done, (timezone) and still analytics hosts on priority [15:43:28] analytics hosts are nicely in the way of an analytics host. [15:43:58] :) [15:44:02] woot! ok [15:44:09] i will file an RT ticket with instructions to backup /s [15:44:14] /a * [15:46:00] http://rt.wikimedia.org/Ticket/Display.html?id=2946 [15:46:24] should assign or add people to it somehow? [15:47:42] usually no, just put in pool and use CC: for extra attention by specific people, then comment on it [15:48:18] or you can add multiple people into requestor [16:01:06] PROBLEM - Apache HTTP on srv256 is CRITICAL: Connection refused [16:02:49] hi [16:05:47] hi robla [16:06:01] hi aude...how goes? [16:06:53] robla: do you know who handles setting dns config at WMF? [16:07:43] the ops team generally. there's not a specific person that deals with that. what do you need? 
[16:07:56] robla: wikidata.org redirects to english wikipedia [16:08:16] we're setting up a landing page, with links to our metawiki pages and demos (soon as they are ready) [16:09:21] RECOVERY - Host mw62 is UP: PING OK - Packet loss = 0%, RTA = 2.83 ms [16:09:59] * robla waits patiently for someone from ops to weigh in [16:10:19] ok [16:10:41] aude: if waiting patiently for someone in ops doesn't work, you can always pester woosters when he comes online [16:11:00] RECOVERY - Apache HTTP on srv256 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.026 second response time [16:11:01] robla: ok [16:11:19] on our end, we need to setup the A records but have an ip address [16:11:27] hi woosters [16:11:34] heh [16:11:37] :o [16:11:47] sorry, cant right now, but here's something for the moment [16:11:50] hi aude [16:11:51] RT-2919 [16:11:55] use that to refer to it [16:12:32] woosters: the wikidata team needs some help with the wikidata.org domain [16:13:07] what about [16:13:09] i think there is a RT ticket for it, but we have a simple landing page that links to meta wiki [16:13:31] we have an ip address and setting up a records here, and then want wikidata.org to point to it [16:13:38] it currently redirects to english wikipedia [16:13:51] PROBLEM - Apache HTTP on mw62 is CRITICAL: Connection refused [16:14:09] let me check and get back to u [16:14:13] woosters: he refers to 2919 [16:14:25] but i cant help right the second [16:14:26] woosters: thanks [16:15:21] she:) [16:15:22] we don't need help with this right this minute but sometime soonish would be nice :) [16:15:31] mutante: :) [16:16:28] there has been this comment, btw "using wikidata.org, rather than data.wikimedia.org? It's going to be a multilingual, single wiki, like commons, right" [16:17:58] aude .. there is a policy decision needed for it. Will run it up with the team here [16:18:16] woosters: ok [16:20:13] mutante: that's the idea but not sure it's totally decided how we'll handle language specific urls [16:21:43] aude: i see, yeah. i'll paste your comment to ticket, ok? [16:22:44] mutante: [16:22:45] ok [16:23:29] done [16:23:48] thanks [16:24:10] np [16:25:06] RECOVERY - Apache HTTP on mw62 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.103 second response time [16:33:03] PROBLEM - Apache HTTP on srv254 is CRITICAL: Connection refused [16:38:36] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [16:45:30] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2375 [16:52:42] RECOVERY - Apache HTTP on srv254 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [16:57:12] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:05:25] cmjohnson1: [17:05:37] yes? [17:05:46] cmjohnson1: do search21-36 have disks in them? [17:05:50] yes [17:05:54] ok, cool [17:05:55] thanks! [17:06:17] they are good to go...i am mounting new ssd's to install into 13-20 [17:06:40] ah, but they did not previously have disks? [17:06:43] will ping you when they need to go down [17:06:54] ok, cool [17:06:57] 21-36 had no disk....13-20 have 250GB disk [17:07:06] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:07:20] awesome!
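For the wikidata.org landing page discussed above, the eventual DNS change amounts to replacing the redirect with A records for the new host. The actual IP was never stated in channel, so this zone-file sketch uses a documentation placeholder (192.0.2.1) and a made-up TTL:

    ; illustrative only -- real IP and TTLs were not in the log
    wikidata.org.       600  IN  A  192.0.2.1
    www.wikidata.org.   600  IN  A  192.0.2.1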
[17:07:21] ok [17:07:34] I am less confused now [17:08:07] once, i am finished then they will have 2 300GB SSD in them [17:08:34] excellent! [17:14:15] oh! also, I can't ping/ssh to search34-36 mgmt. any idea as to what that would be due to? [17:14:28] cmjohnson1: ^ [17:14:38] occupy shut them down [17:14:56] occupy the colo would be sweet [17:14:59] and loud [17:15:03] no..but let me investigate [17:16:07] wasn't really for occupy disk space on ekrem [17:18:47] notpeter: you should be good to go [17:18:55] mgmt: cables not connected [17:18:58] cmjohnson1: thanks! [17:19:34] mutante: just to double check, have you been getting pages since we switched to the new system ? [17:19:43] mutante: actually do you mind me emailing to double check ? [17:19:51] New patchset: Bhartshorne; "urlescaping files to delete instead of skipping names with spaces" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7290 [17:20:00] LeslieCarr: no, good that you check again [17:20:10] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7290 [17:20:14] just sent you a "test123" page [17:20:24] LeslieCarr: well, i had another issue with my phone as well, but unrelated [17:20:45] PROBLEM - Host srv253 is DOWN: PING CRITICAL - Packet loss = 100% [17:20:46] let me try a second one with test456 [17:20:48] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7290 [17:20:50] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7290 [17:20:52] LeslieCarr: not yet [17:21:32] heh, might just put you on the smsglobal gateway (test789) since that seems the most reliable [17:21:35] LeslieCarr: got test789 [17:21:45] ok, putting you on that one :) [17:22:02] LeslieCarr: and i recently got the manual "ignore the search pages..." as well [17:22:06] RECOVERY - Host srv253 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [17:22:06] thanks [17:23:20] notpeter: search35 mgmt link is not working. i will be messing w/ it for a bit [17:23:30] yeah, 789 stays the only one i got [17:23:37] did you load anything on the server yet? [17:23:38] cmjohnson1: cool. thanks! [17:23:54] no... the drives aren't showing up with ubuntu 12.04 [17:23:57] which is annoying.... [17:24:04] the spindle drives showed up, though [17:24:27] a driver issue seems... possible but unlikely [17:24:45] cmjohnson1: have you put SSDs in any of 12-30 yet? [17:24:49] *13-20 [17:24:59] (want to test various things) [17:25:00] no...i am still putting the ssd's on their mounts [17:25:05] ok, cool [17:25:06] PROBLEM - Apache HTTP on srv253 is CRITICAL: Connection refused [17:31:24] notpeter: search34-36....all set. [17:31:47] cmjohnson1: cool! thanks! [17:35:00] RECOVERY - Apache HTTP on srv253 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.080 second response time [17:36:12] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2563* [17:45:12] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2388 [17:46:56] !log deleted wikipedia-de-local-thumb container from swift. the sharded version is currently being used.
[17:46:59] Logged the message, Master [17:50:45] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2600* [17:54:05] New patchset: RobH; "added db61 and db62" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7293 [17:54:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7293 [17:54:51] New patchset: Pyoungmeister; "switching search21 to install lucid to see if it can see the disk" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7294 [17:55:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7294 [17:55:13] New review: RobH; "self review" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7293 [17:55:13] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7293 [17:55:38] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7294 [17:55:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7294 [17:56:15] robh: can you help me with ps1-b5-sdtpa...maybe increase threshold or ID servers to be moved. they're all apaches [17:57:03] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:57:44] cmjohnson1: so looking at the power, phase Z is high [17:57:51] the others are at 18 and 19 amps [17:57:54] and z is at 26 [17:58:16] you should be able to move some xz things to xy and some yz things to xy [17:58:39] try one from xz and one from yz to move each to xy [17:59:06] i suppose these are single psu eh? [17:59:29] they are..i will check ganglia and id regular apaches [17:59:38] so best for you to identify to me one server from xz and one from yz that will be easiest for you to relocate to xy [17:59:51] i can shut them down then, so you can relocate the power [18:03:11] cmjohnson1: So I rebooted db61 and it doesn't appear to be outputting to the serial console [18:03:21] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2575* [18:03:22] I am checking db62 now, did you set console redirection in bios? [18:03:32] cuz it appears to not be set [18:04:02] db62 seems to not be redirecting either, they normally reboot faster than 5 minutes [18:04:19] cmjohnson1: so please check both of those as soon as you can, then we can go back to balancing power [18:04:24] yes I did but I can check it again [18:04:31] you tested the serial via drac? [18:05:00] !log swift: deleting the unsharded version of all sharded containers [18:05:03] Logged the message, Master [18:05:05] huh, power usage went down drastically in b5 [18:05:08] yes i did [18:05:15] hrmm, something is odd then [18:05:37] can i check through racadm? [18:05:51] RobHalsell: do you know if search21-36 have any kind of hardware? [18:05:58] perhaps a kind that doesn't support jbod?
*hardware raid [18:06:12] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2350 [18:06:17] cmjohnson1: nope, it has to be via physical console into bios if its off [18:06:21] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [18:06:25] notpeter: they dont have raid [18:06:29] RobHalsell: kk [18:06:38] just a sas/sata controller [18:06:50] AaronSchulz: just fyi; I just kicked off a delete for the unsharded versions of all sharded containers. [18:06:56] i will reboot db61 now and check settings [18:07:01] ok [18:07:08] cmjohnson1: there's something weird with the SSDs in search21-36. neither the lucid nor precise installer can see the disks. can you take a look at that? or have any ideas? [18:07:27] on the dells, the only way to set console redirection is via bios screen, which can only be reached via already setup redirection, or by actual crash cart [18:07:31] its a drawback of the dell stuff [18:07:58] cmjohnson1: so the controllers see them? [18:08:00] ack, notpeter [18:08:02] not chris [18:08:22] im looking at search21 [18:08:22] RobHalsell: uh... dunno [18:08:24] kk [18:08:35] no, the controllers probably don't see them [18:08:37] but I'm not sure [18:08:45] checkin [18:09:23] I'll get out of that console for you :) [18:09:25] notpeter: are you on search21? [18:09:32] just got out [18:09:42] k [18:11:33] so yea, if they just plug into the mainboard, the controller is off in bios [18:11:41] but checking if its a controller or the mainboard [18:12:10] since these were ordered without disks, having the onboard controller turned off is pretty common [18:12:27] notpeter: you may have to turn it on in each system, will know in a moment [18:13:08] RobHalsell: sounds about right [18:13:44] ok, try to do an install now on search21 [18:14:02] if it works, then you have to drop into the bios screen on each, and turn on the sata controller. and set port a and b from off to auto [18:14:35] ok, cool [18:14:37] thanks! [18:15:37] notpeter: i will try and fix that on 13-20 for you when i get new drives in [18:15:51] cool cool [18:15:54] oh, they should be ok [18:16:03] as they've already had disks in them [18:16:24] also, still can't hit search36 mgmt [18:17:10] New patchset: Pyoungmeister; "search21 back to precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7299 [18:17:29] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7299 [18:17:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7299 [18:17:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7299 [18:22:48] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:23:20] RobHalsell: worked! thanks! [18:23:51] Time to install ALL of the search servers [18:24:11] notpeter: cool, now you know what you have to do to fix the rest ;] [18:24:44] hurray. [18:27:35] RobHalsell: what mode should the controller be in?
[18:32:27] cmjohnson1: I'm just going to shut down search13-20 for you now, as they're not currently doing anything [18:33:30] !log shutting down search 13-20 for hd upgrades [18:33:33] Logged the message, notpeter [18:34:18] ok...thx [18:36:00] PROBLEM - Host search15 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:00] PROBLEM - Host search14 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:00] PROBLEM - Host search17 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:00] PROBLEM - Host search16 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:09] PROBLEM - Host search18 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:27] PROBLEM - Host search13 is DOWN: PING CRITICAL - Packet loss = 100% [18:36:54] PROBLEM - Host search19 is DOWN: PING CRITICAL - Packet loss = 100% [18:40:30] PROBLEM - Host rendering.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:41:06] PROBLEM - Host search-pool1.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:41:06] PROBLEM - Host search-pool2.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:41:06] PROBLEM - Host search-pool3.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:41:15] PROBLEM - Host search20 is DOWN: PING CRITICAL - Packet loss = 100% [18:42:18] PROBLEM - Host api.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:42:45] PROBLEM - Host appservers.svc.pmtpa.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [18:47:51] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:51:18] RECOVERY - Host appservers.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [18:51:45] RECOVERY - Host rendering.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.85 ms [18:52:03] RECOVERY - Host api.svc.pmtpa.wmnet is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms [18:54:30] robh: figured out the console problem [18:54:43] you are good for db61....fixing db62 [18:59:02] New patchset: Pyoungmeister; "adding last new search node mac" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7301 [18:59:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7301 [19:00:29] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:00:42] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7301 [19:00:45] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7301 [19:07:03] RobHalsell: ready to move power on apaches...srv261 is one of the apaches that will spike so I want to move srv260 and 261 (they're on the same Y cable) [19:07:22] cmjohnson1: are the db61 and 62 working on console redirection? [19:07:31] yes...oh ...you are not robh [19:07:41] yes...there is a different default setting [19:07:57] ok, can you detail that and create a wikitech page on it later today/monday? [19:08:11] yep..np [19:08:13] link it to platform specific page i linked you [19:08:17] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:08:29] so do i need to do something other than 'console com2' or is it the same from a use standpoint? [19:09:14] no..same console com2
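The workflow being debugged here is roughly the following (mgmt hostname illustrative; as RobH notes above, the BIOS console-redirection setting itself can only be changed from the BIOS screen, not from the DRAC):

    ssh root@db61.mgmt.pmtpa.wmnet   # log in to the DRAC
    console com2                     # attach to the redirected serial console
    racadm getconfig -g cfgSerial    # inspect the DRAC's own serial settings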
[19:09:21] on the apaches, srv260/srv261 are on xz or yz (which?) and moving to yx? [19:10:06] they are on xz...and the only place i have to move them is xy [19:10:06] !log srv260 & srv261 shutting down for power rebalancing within the rack [19:10:10] Logged the message, RobH [19:10:14] cool [19:10:14] PROBLEM - Host ps1-c1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.11) [19:10:23] PROBLEM - Host ps1-b5-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.10) [19:10:23] PROBLEM - Host ps1-b1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.6) [19:10:23] PROBLEM - Host ps1-c3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.13) [19:10:23] PROBLEM - Host ps1-d3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.16) [19:10:23] PROBLEM - Host ps1-a1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.1) [19:10:23] you will notice now that the power isnt overloaded [19:10:32] oh [19:10:35] ignore those [19:10:40] that's just the mgmt network [19:10:41] PROBLEM - Host ps1-d3-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.19) [19:10:41] PROBLEM - Host ps1-b3-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.8) [19:10:41] PROBLEM - Host cr1-sdtpa is DOWN: CRITICAL - Network Unreachable (208.80.152.196) [19:10:41] PROBLEM - Host ps1-c2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.12) [19:10:41] PROBLEM - Host ps1-b2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.7) [19:10:41] PROBLEM - Host ps1-d1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.17) [19:10:42] PROBLEM - Host ps1-d2-sdtpa is DOWN: PING CRITICAL - Packet loss = 100% [19:10:50] LeslieCarr: is mgmt network down? [19:10:50] PROBLEM - Host ps1-a2-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.2) [19:10:50] PROBLEM - Host ps1-d1-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.14) [19:10:50] PROBLEM - Host ps1-d2-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.18) [19:10:50] PROBLEM - Host ps1-b4-sdtpa is DOWN: CRITICAL - Network Unreachable (10.1.5.9) [19:10:54] cuz i was using it =P [19:10:57] oh [19:10:58] sorry :( [19:11:09] we are in the middle of rebalancing power in a rack [19:11:11] i forgot that in sdtpa the mgmt network is single homed [19:11:16] how long will it be down? [19:11:29] next 5-7 minutes as cr1-sdtpa reloads [19:11:49] cmjohnson1: we can resume once mgmt is back up [19:11:53] sigh, i hate sdtpa [19:11:57] sorry RobH and cmjohnson1 [19:12:13] pmtpa is no better ;] [19:12:32] though I suppose it would be ideal for us to have two multiplexing feeds, one for each fiber pair [19:12:44] so each pair carried both mgmt and production traffic, so if a single pair dies... [19:13:08] but the network in tampa is shitty. [19:13:21] the ideal thing to do would be to throw that dc away and see if chris wants to move somewhere :) [19:13:23] PROBLEM - Host mr1-pmtpa is DOWN: CRITICAL - Network Unreachable (10.1.2.3) [19:13:45] well, we should still fix it for what it is now [19:13:57] cuz we won't be moving away from it for a minimum of two fiscal years [19:14:03] i know :( [19:14:22] which makes it four fiscal years longer than planned. [19:14:26] PROBLEM - Host srv261 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:34] i didnt take that offline... [19:14:35] PROBLEM - Host srv260 is DOWN: PING CRITICAL - Packet loss = 100% [19:14:48] hrmm, cmjohnson1 did you do that? [19:14:57] robh: i thought you had...yes [19:15:01] it looks like they powered down...
[19:15:05] shit timing [19:15:06] did you hit the button? [19:15:17] cuz if i shut them down, they turn off, i had not turned them off yet [19:15:23] since we didnt have mgmt network. [19:15:34] did you hit the button, or simply unplug them? [19:15:38] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:15:47] PROBLEM - Apache HTTP on mw10 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:47] PROBLEM - Apache HTTP on mw6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:15:47] RECOVERY - Host srv261 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:15:50] i unplugged them..... [19:15:59] so they were still on and spun up. [19:16:03] thats not good [19:16:05] PROBLEM - Apache HTTP on mw14 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:05] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:05] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:05] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:05] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:05] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:16:15] if they are powered on, you shouldnt just unplug them [19:16:16] not clean at all..i assumed you had them down..saw your log...my mistake [19:16:25] i know [19:16:28] yea but you have to check physically that its powered down properly [19:16:32] RECOVERY - Host srv260 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [19:17:07] these are apaches, so they don't have nearly so much to mess up, but they do run jobs that just offlined without cleanly shutting down [19:17:08] RECOVERY - Apache HTTP on mw10 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [19:17:08] RECOVERY - Apache HTTP on mw6 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [19:17:15] its not good, but it shouldnt be a problem [19:17:26] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [19:17:26] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [19:17:26] RECOVERY - Apache HTTP on mw14 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time [19:17:26] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time [19:17:26] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time [19:17:26] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [19:17:49] yep..i know it's not good..it was 100% my error. i usually verify they're off but didn't this time
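Checking that a box is really off before pulling the plug can also be done out-of-band once mgmt is reachable; a sketch with an illustrative hostname:

    ssh root@srv260.mgmt.pmtpa.wmnet
    racadm serveraction powerstatus   # expect: Server power status: OFF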
[19:18:47] PROBLEM - Apache HTTP on srv261 is CRITICAL: Connection refused [19:18:53] fyi, mgmt net should come back soon [19:18:56] RECOVERY - Host mr1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms [19:19:02] cr1-sdtpa is coming back up [19:19:14] RECOVERY - Host ps1-c1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.71 ms [19:19:23] RECOVERY - Host ps1-b5-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.54 ms [19:19:23] RECOVERY - Host ps1-a1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.75 ms [19:19:23] RECOVERY - Host ps1-c3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.96 ms [19:19:23] RECOVERY - Host ps1-d3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.59 ms [19:19:23] RECOVERY - Host ps1-b1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 1.79 ms [19:19:26] and now it no longer is flipping out about its tfeb [19:19:26] so double awesome [19:19:32] RECOVERY - Host ps1-c2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.51 ms [19:19:32] RECOVERY - Host ps1-d1-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 2.81 ms [19:19:32] RECOVERY - Host ps1-b2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.16 ms [19:19:41] RECOVERY - Host ps1-d1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.09 ms [19:19:41] RECOVERY - Host ps1-b3-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.73 ms [19:19:41] RECOVERY - Host ps1-a2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.18 ms [19:19:41] RECOVERY - Host ps1-b4-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 2.65 ms [19:19:41] RECOVERY - Host ps1-d3-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.03 ms [19:19:46] binasher: once i have db61 and db62 confirmed ready for OS install (in a few minutes) did you want to do the OS install or have me do it? [19:19:51] RECOVERY - Host ps1-d2-pmtpa is UP: PING OK - Packet loss = 0%, RTA = 3.32 ms [19:19:59] These are larger disks, so I assume you have a different partman you wanna use [19:20:08] different partition setup that is [19:20:53] RECOVERY - Host cr1-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [19:20:59] RobH: mgmt net should be happy now [19:21:04] yep [19:21:21] cmjohnson1: ok, those came off xz? and are now on xy? [19:21:35] that seems right, just confirming [19:21:47] RECOVERY - Host ps1-d2-sdtpa is UP: PING OK - Packet loss = 0%, RTA = 3.29 ms [19:21:48] i think we want to move a y cable from yz to xy as well [19:22:04] if you look now, its x low, y mid, z high [19:22:24] if we move a yz it will move them all closer to midline [19:22:56] cmjohnson1: so now find two servers sharing a y cable in yz for us to move [19:23:03] !log srv260 and srv261 back in business [19:23:06] Logged the message, RobH [19:24:38] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:24:51] RobH: same partman recipe is fine for now at least. the amount reserved for lvm snapshots will be bigger than needed but /a can always be grown. i want to put precise on one and lucid on the other [19:24:57] is fenari dead? [19:25:07] uhoh, are you seeing an issue awjr ? [19:25:12] im timing out trying to ssh in [19:25:20] pings are timing out too [19:25:24] im in fine [19:25:26] hrm [19:25:31] from inside or outside the world ? [19:25:35] try bast1001.wikimedia.org [19:25:36] outside the world [19:25:39] hrm [19:25:48] got a traceroute for me ?
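What LeslieCarr is asking awjr for is along these lines (target assumed to be the bastion he was ssh'ing to):

    traceroute fenari.wikimedia.org
    mtr -rw fenari.wikimedia.org   # or a one-shot mtr report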
[19:25:51] cmjohnson1: did you get my info before the d/c [19:25:59] nope [19:26:05] http://bast1001.wikimedia.org/ works fine [19:26:08] for me [19:26:26] interesting [19:26:29] yea traceroute it [19:26:37] oh now im in [19:26:40] most folks in the US route to ashburn to go to either of them [19:26:52] dunno what happened - i tracerouted, it completed fine, tried ssh'ing again, no problem [19:26:53] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:26:53] but you may have been routing direct to tampa some other way, a shitty way. [19:26:53] only d/c i had was irc....didn't lose internet [19:27:15] cmjohnson1: so you got my info to find a y cable in yz to move to yx? [19:27:39] in addition to 260 and 261 [19:27:55] yea, if you look at the power there now [19:28:00] x is low, y mid, z high [19:28:09] ok [19:28:14] looking now [19:28:18] the ideal is 3 balanced phases. if we move a y cable from yz to yx it should balance them out [19:28:36] woosters: are you around ? [19:28:39] just an approximation from the results we saw from you moving the one y cable from xz to xy [19:29:53] RECOVERY - Apache HTTP on srv261 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.039 second response time [19:30:22] robh: srv286 and 287 [19:31:17] !log shutting down srv286 and srv287 for power rebalancing [19:31:20] Logged the message, RobH [19:31:25] cmjohnson1: when they shutdown, they can move [19:31:28] and power back up [19:31:35] ok [19:32:58] binasher: seems that the 720 has the 24 disks and all, but the dell raid controller can only assign 16 to a single raid10 [19:33:20] PROBLEM - Host srv286 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:26] so two raid10 arrays, will have to lvm across to the second post install [19:33:56] PROBLEM - Host srv287 is DOWN: PING CRITICAL - Packet loss = 100% [19:33:58] also strip 256k, adaptive read ahead, and write back for options [19:35:17] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.37:11000 (Connection timed out) 10.0.8.36:11000 (Connection timed out) [19:35:44] RECOVERY - Host srv287 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [19:36:11] RECOVERY - Host srv286 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [19:36:47] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [19:37:41] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:38:05] cmjohnson1: so they are moved? [19:38:18] the power didnt seem to change. [19:38:21] yes...they are moved [19:38:30] ahh, mine's timed out [19:38:44] PROBLEM - Apache HTTP on srv287 is CRITICAL: Connection refused [19:38:45] uhhh [19:38:52] it looks like you moved it from yz to xz [19:38:59] not from yz to yx [19:39:13] please confirm? [19:39:36] y has not changed much, z has gone much higher, and x is higher [19:39:37] i moved to xy.... [19:39:41] hrmm [19:39:42] ok [19:39:43] 3 outlets down [19:40:05] PROBLEM - Apache HTTP on srv286 is CRITICAL: Connection refused [19:40:05] i dont recall the entire layout of the power strip in my head, just looking at the readouts [19:41:05] srv267 268 and srv269 have heavy utilization [19:41:09] at the moment [19:41:28] ok, well, if its moved then we should be ok [19:41:37] Z is just at its peak now and its under the ceiling [19:41:39] so we are ok.
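Putting the 19:24 and 19:32-19:33 pieces together: the controller exposes two RAID10 virtual disks, LVM joins them after install, and the oversized snapshot reserve means /a can be grown later. A sketch only; device, VG and size names are illustrative, and the MegaCli property syntax for the cache options is an assumption, not copied from a runbook:

    pvcreate /dev/sdb /dev/sdc        # the two RAID10 virtual disks
    vgcreate tank /dev/sdb /dev/sdc   # one VG spanning both arrays
    lvcreate -n a -L 2T tank
    mkfs.ext4 /dev/tank/a && mount /dev/tank/a /a
    # growing /a later out of the unused snapshot reserve:
    lvextend -L +100G /dev/tank/a && resize2fs /dev/tank/a
    # cache options from 19:33 (strip size is fixed at array creation):
    MegaCli -LDSetProp WB -LAll -aAll     # write back
    MegaCli -LDSetProp ADRA -LAll -aAll   # adaptive read ahead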
[19:42:01] both x and y are higher than they were when we started, so we are good [19:42:23] !log apache restarted by puppet run on srv286 [19:42:25] Logged the message, RobH [19:42:56] RECOVERY - Apache HTTP on srv286 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [19:43:21] New patchset: RobH; "wrong adapter info inputted for db61/62, corrected" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7307 [19:43:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7307 [19:44:00] New review: RobH; "normal change, wrong mac info was there, simple" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7307 [19:44:02] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7307 [19:47:08] RECOVERY - Apache HTTP on srv287 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.611 second response time [19:47:53] srv295: 11 May 19:46:54 ntpdate[2668]: no server suitable for synchronization found [19:47:54] srv295: Error: unable to contact NTP server [19:47:54] mw38: 11 May 19:46:54 ntpdate[4958]: no server suitable for synchronization found [19:47:54] mw38: Error: unable to contact NTP server [19:47:54] srv289: 11 May 19:46:54 ntpdate[5377]: no server suitable for synchronization found [19:47:54] srv289: Error: unable to contact NTP server [19:47:57] srv273: 11 May 19:46:54 ntpdate[7802]: no server suitable for synchronization found [19:47:59] srv273: Error: unable to contact NTP server [19:48:01] srv284: 11 May 19:46:54 ntpdate[5898]: no server suitable for synchronization found [19:48:03] srv284: Error: unable to contact NTP server [19:54:44] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:03:17] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:11:50] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:22:43] New patchset: RobH; "db62 to be precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7309 [20:23:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7309 [20:23:17] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7309 [20:23:19] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7309 [20:54:42] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [20:56:02] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:10:11] New patchset: Pyoungmeister; "making nrpe check for udp2log procs retry 10 times" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7313 [21:10:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7313 [21:11:21] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7313 [21:11:24] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7313 [21:12:23] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [21:29:04] New patchset: RobH; "precise for db61" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7316 [21:29:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7316 [21:29:39] New review: RobH; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7316 [21:29:41] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7316 [21:35:21] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:40:45] PROBLEM - Lucene on search21 is CRITICAL: Connection refused [21:42:24] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:04:02] RobH: hey so I remember you saying the c switches are hooked up - which serial console are they on ? [22:04:46] scs-c1-eqiad [22:04:56] not all the power strips are connected to mgmt [22:05:08] and those switches are stacked and fibered, but not connected to mgmt [22:05:11] though they are all on serial [22:05:29] actually, i may have lied... I may have plugged them in to the mgmt switches [22:05:43] i ran out of the proper length cables, not sure if it was before or after those got connected [22:05:51] im sure you can tell if there is a link so its all good [22:06:29] oh [22:06:36] hehe, well serial is the important bit :) [22:07:29] if they arent already, i expect to connect the mgmt port on each one to the mgmt switch in the rack [22:07:41] as each mgmt switch in the rack is run back to msw1 [22:08:19] when chris gets here we will be racking all the new row c servers, and retrofitting all the racks with cable mgmt 2u stuff [22:08:21] PROBLEM - Puppet freshness on dobson is CRITICAL: Puppet has not run in the last 10 hours [22:08:26] eqiad is going to be purdy. [22:10:21] :) [22:24:06] RobH are you at the DC today ? [22:24:09] physically ? [22:24:24] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:24:35] LeslieCarr: nope, was yesterday [22:24:39] trying to figure out which switch is which (why can't they learn what rack they are on by themselves?) [22:24:48] the scs is labeled with them [22:25:05] yeah, however when you login to the console, they all route you to the master RE [22:25:06] ports 1-8 are ps1c1-ps1c8 [22:25:15] oh. [22:25:23] which is what they are supposed to do, however doesn't help me figure out which is which :) [22:25:40] no way to tell whose console port is active huh? [22:26:11] yeah, but i can set on the led's a message on each of them [22:26:20] and have you say "oh this one thinks it's number 5 [22:26:22] ahh, i see what yer getting at [22:26:23] and then figure it out [22:26:26] yeah [22:26:42] well next time :) [22:27:17] but config is all there :) as soon as we straighten that out, row c network will be happy! 
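The swift-container-auditor flaps peppering this log come from an NRPE process check; run by hand it looks roughly like this (plugin path and thresholds assumed):

    /usr/lib/nagios/plugins/check_procs -c 1:1 \
        --ereg-argument-array '^/usr/bin/python /usr/bin/swift-container-auditor'

CRITICAL means zero matching processes; the auditor restarting itself a few minutes later is what produces the PROBLEM/RECOVERY pairs.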
[22:31:02] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:46:38] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:57:44] PROBLEM - Puppet freshness on pdf2 is CRITICAL: Puppet has not run in the last 10 hours [23:06:33] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7070 [23:06:35] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7070 [23:06:44] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:11:50] PROBLEM - Puppet freshness on pdf1 is CRITICAL: Puppet has not run in the last 10 hours [23:15:53] PROBLEM - Puppet freshness on pdf3 is CRITICAL: Puppet has not run in the last 10 hours [23:18:16] LeslieCarr: adding myself to contacts.cfg is enough? [23:29:58] dschoon: do you know about the puppet statistics::mediawiki/git::clone class? [23:30:01] it's failing on stat1. [23:30:08] i do not. [23:30:11] sorry :( [23:30:16] ::sigh:: [23:30:27] i only draw pretty pictures [23:30:29] i'm decorative. [23:30:41] maybe you can decorate the class. [23:49:35] New patchset: Reedy; "Add php5-memcached to apaches for change of memcached client" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7347 [23:49:56] New patchset: Reedy; "Point xinetd at /home/wikipedia/common/wmf-config/extdist/svn-invoker.php rather than /home/wikipedia/common/php/extensions/ExtensionDistributor/svn-invoker.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7127 [23:50:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7347 [23:50:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7127 [23:50:58] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7347 [23:57:43] Change abandoned: Reedy; "Bah. Screw git branches!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7347 [23:57:48] Change abandoned: Reedy; "Bah. Screw git branches!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7127 [23:59:11] :D [23:59:19] exactly how I was feeling yesterday Reedy :P [23:59:27] New patchset: Reedy; "Point xinetd at /home/wikipedia/common/wmf-config/extdist/svn-invoker.php rather than /home/wikipedia/common/php/extensions/ExtensionDistributor/svn-invoker.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7348 [23:59:35] I committed twice to the same branch it seems [23:59:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7348
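The usual way out of the state Reedy hits at 23:59 (two unrelated commits stacked on one local branch, so gerrit chains the changes together) is to rebase them apart; branch names assumed:

    git log --oneline origin/production..HEAD   # see the stacked commits
    git rebase -i origin/production             # keep one, drop the other
    git push origin HEAD:refs/for/production    # resubmit, one commit per change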