[00:01:49] Ryan_Lane: you around ?
[00:16:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:23:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.843 seconds
[00:23:23] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.24693041322 (gt 8.0)
[00:58:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:05:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.035 seconds
[01:38:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:45:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.028 seconds
[01:47:50] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.8184787931
[02:07:22] RoanKattouw_away: job_queue: frwiktionary (17857)
[02:13:06] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 3.31827025
[02:18:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:24:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.518 seconds
[02:38:06] Jeff_Green: 20 02:37:40 -!- grunny [~grunny@wikia/vstf/countervandalism.user.Grunny] has joined #mediawiki
[02:38:10] ;-)
[02:43:25] https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=dbrepllag&sishowalldb=
[02:43:40] Any chance db36 can be depooled while the schema change is happening?
[02:43:50] The lag is screwing up my bots pretty badly. :-(
[02:45:36] mutante: ^
[02:57:46] Joan: hmm not sure i wanna do that without confirming with the people involved in the migration since lag has been announced and been told to ignore , at least for the weekend. I will forward your request though
[02:58:12] mutante: I dont' understand how any bots are supposed to be able to edit.
[02:58:21] All my scripts are getting hung on the crazy lag.
[02:58:25] don't
[02:59:35] Joan: happen to be on wikitech-l ?
[02:59:40] Yes.
[02:59:43] I saw your message.
[02:59:58] I'm saying that I'm not sure what the response is supposed to be other than writing scripts to ignore the lag.
[03:00:08] s/to ignore/that ignore/
[03:00:29] Joan: that wasnt me, i was about to ask you if you could reply to the "[Wikitech-l] enwiki revision schema migration in progress"
[03:00:47] Aren't you Asher?
[03:00:49] Which one are you?
[03:00:51] no:)
[03:00:54] Daniel
[03:00:54] Oh.
[03:01:00] Oh, you're Daniel Zahn?
[03:01:09] and with "the people involved" i meant basically asher:)
[03:01:14] yep
[03:01:18] Fair enough.
[03:01:24] Sorry, you're not the target of my ire, then.
[03:01:29] no worries
[03:01:31] What obscure nick does Asher use...
[03:02:21] I'll reply to the mailing list.
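The siteinfo query Joan pastes above is the standard way to read per-replica lag out of the MediaWiki API. Below is a minimal sketch of the kind of lag check a bot could run before editing; it assumes the Python requests library, and the user agent is illustrative rather than anything prescribed in the channel.

    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def most_lagged_replica():
        """Return (host, lag_seconds) for the worst replica via meta=siteinfo."""
        resp = requests.get(API, params={
            "action": "query",
            "meta": "siteinfo",
            "siprop": "dbrepllag",
            "sishowalldb": "",   # list every database server, not just the worst one
            "format": "json",
        }, headers={"User-Agent": "lag-check-sketch/0.1 (example)"})
        resp.raise_for_status()
        replicas = resp.json()["query"]["dbrepllag"]
        worst = max(replicas, key=lambda db: db["lag"])
        return worst["host"], worst["lag"]

    if __name__ == "__main__":
        host, lag = most_lagged_replica()
        # A bot would typically sleep and retry while the lag exceeds its tolerance.
        print("most lagged server: %s (%s s behind)" % (host, lag))

The API also accepts a maxlag parameter on requests, which makes the server reject the request outright while replication lag exceeds the given threshold; that is the usual way for scripts to back off automatically rather than ignore the lag.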
[03:02:45] thanks
[03:07:42] http://lists.wikimedia.org/pipermail/wikitech-l/2012-March/059053.html
[03:44:54] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[03:46:52] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[03:54:48] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[03:54:49] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[05:21:33] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours
[06:56:17] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 13.9469902676 (gt 8.0)
[06:58:23] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.84471624
[07:01:05] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[07:10:41] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[07:22:33] hi apergos
[07:22:40] hello
[07:22:49] what's up?
[07:23:21] this time i got a DNS change
[07:23:27] really simple.. but still
[07:24:28] which file?
[07:24:32] /tmp/dns-education.diff
[07:25:33] huh it sure is
[07:25:36] looks ok to me
[07:25:41] it's just supposed to be there as a redirect, VirtualHost in redirect
[07:25:45] thx
[07:25:45] ok
[07:25:49] sure
[07:35:26] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue Mar 20 07:35:20 UTC 2012
[07:41:09] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[07:42:09] !log running authdns-update after adding education.wm for redirect [[RT:2634]]
[07:42:13] Logged the message, Master
[08:57:53] !log apache-graceful-all to deploy changed redirects.conf
[08:57:56] Logged the message, Master
[08:59:48] !log several srv's said they were unable to contact NTP server
[08:59:52] Logged the message, Master
[09:12:19] !log new URL pointing to Wikipedia Education Program - http://education.wikimedia.org
[09:12:22] Logged the message, Master
[09:13:59] New review: Dzahn; "per RT-2512" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/3238
[09:15:41] Change abandoned: Dzahn; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3239
[09:19:29] New patchset: ArielGlenn; "allow ms1001 to rsync from ms1002" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3290
[09:19:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3290
[09:20:24] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3290
[09:20:27] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3290
[09:46:55] New patchset: Dzahn; "add the nagios-nrpe-server init.d file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3291
[09:47:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3291
[09:48:00] New review: Dzahn; "the file itself is already there but i got the indentation wrong in the definition" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/3291
[09:51:41] New review: Dzahn; "alright, gotcha. that was just supposed to be a quick fix to stop flooding of Nagios CRITs as the no..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3260
[09:52:16] New patchset: ArielGlenn; "and change ms1001's hostname in the rsync file to the correct one" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3292
[09:52:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3292
[09:52:57] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3292
[09:52:59] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3292
[12:33:47] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[12:46:32] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[12:56:44] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 179 MB (2% inode=61%): /var/lib/ureadahead/debugfs 179 MB (2% inode=61%):
[13:05:17] RECOVERY - Disk space on srv219 is OK: DISK OK
[13:11:17] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[13:46:32] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[13:48:38] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[13:56:26] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[13:56:26] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[14:09:47] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:20:15] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:38:19] mark: have a minute?
[14:39:59] ok
[14:40:05] i'm in a datacenter myself
[14:40:09] can you run a hardware log for me on virt1. I don't have root
[14:40:35] dell needs one b4 they'll send me a new hdd
[14:41:06] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 18.9332980342 (gt 8.0)
[14:41:12] !rt 2649
[14:41:12] https://rt.wikimedia.org/Ticket/Display.html?id=2649
[14:55:39] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.10398981982 (gt 8.0)
[14:59:06] sorry chris, can't do that now
[14:59:09] perhaps apergos can help you?
[14:59:28] what do you need?
[14:59:32] chrismcmahon:
[14:59:34] rats
[14:59:40] sorry for the gratuitous ping
[14:59:45] cmjohnson1:
[15:00:21] apergos: can you run a hardware log for virt1
[15:00:30] see Dell's comment
[15:00:35] !rt 2649
[15:00:35] https://rt.wikimedia.org/Ticket/Display.html?id=2649
[15:08:48] that should have gone
[15:11:58] thx apergos
[15:12:05] sure
[15:12:18] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.44617946429
[15:13:46] woosters: shell access to bz? --- robla has approved, who do I need to talk to?
[15:13:55] shell rt bug https://rt.wikimedia.org/Ticket/Display.html?id=2584
[15:17:03] fuck those EX4200s are heavy
[15:17:07] hex mode .. looking into it
[15:17:15] mark - new toys arrived!
[15:17:56] are u going to put them into production today?
[15:22:57] RECOVERY - Disk space on search1015 is OK: DISK OK
[15:23:06] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours
[15:26:54] woosters: no
[15:26:56] no time for that
[15:27:14] i may not even get them online today
[15:27:23] i'm happy if I get everything in the racks and out of the way
[15:27:28] i'll have to come back anyway
[15:31:36] apergos: that log was from virt1 right? I don't see any hdd errors but yet disk5 is amber and Ryan's last log show a bad HDD.
[15:31:43] virt1
[15:32:03] yes...the dell guy didn't see an error either
[15:32:06] I ran the command they gave me (well using "megacli" cause that's what we have installed)
[15:32:26] we have about 30 servers that need to go away first
[15:32:27] working on that
[15:32:36] k. do you think it could be RAID related?
[15:32:55] i can wait for Ryan to get on if you are busy
[15:34:34] I have no idea about the error, this is the first I'm involved with it
[15:41:40] ok I see a bad one, it lists the physical disk as state: failed
[15:41:48] Firmware state: Failed
[15:41:56] Enclosure Device ID: 32
[15:41:57] Slot Number: 4
[15:41:57] Device Id: 4
[15:41:57] Sequence Number: 2
[15:48:40] apergos: i am not sure why that is not showing on the log you sent me.
[15:48:53] different log
[15:49:18] I looked at the event log, don't see any errors there either besides the usual pd sense stuff
[15:49:27] sorry unexpected sense stuff
[15:50:05] Event Description: Enclosure PD 20(c None/p0) element (SES code 0x17) status changed
[15:50:08] hmm there are a few of these
[15:50:28] Device ID: 32
[15:52:04] not sure those are worth anything
[15:56:00] there
[15:56:09] rack OE11 is now one big pile of old iron :P
[15:56:12] I've just looked at all the stuff in the event log for that device
[15:56:13] i think that is raid related
[15:56:14] nothing
[15:56:24] nothing to indicate an error or a problem
[15:57:24] odd...can you check RAID and see if slot 4 is in a foreign state?
[15:57:49] i can't do it w/out taking the server down
[15:58:41] Foreign State: None
[15:58:55] well just for that decide right
[15:58:57] device
[15:59:34] anyways I did a global check, it shows that for all of them
[15:59:44] right...but ok
[15:59:45] firmware failed is not going to be a config issue
[16:00:40] no...i wouldn't think so but I don't what else to check
[16:02:42] apergos: the problem is the error is a logical driver error which has not yet resulted in a hard drive error.
[16:03:09] but will eventually so I need to get on Dell and send me a new hdd.
[16:03:31] um no
[16:03:39] I get my error output looking at
[16:03:41] MegaCli -PDList -aALL | more
[16:04:19] I'll send them this output
[16:05:00] ok
[16:05:36] sent
[16:06:19] you're talking with them right?
[16:06:51] yes
[16:06:56] ok
[16:07:07] let me know if this output is sufficient for them
[16:07:16] k...i will
[16:21:16] * apergos gets twitchy
[16:22:23] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.0887963964 (gt 8.0)
[16:24:32] cmjohnson1 .. did those swift servers arrive?
[16:26:35] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.0953002702703
[16:47:35] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.32344088496 (gt 8.0)
[16:52:56] anything from dell?
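The output apergos is reading above comes from MegaCli's physical-disk listing, which prints one block of "key: value" pairs per drive; the failed drive only shows up there ("Firmware state: Failed"), not in the controller's event log. A rough sketch of filtering that output down to the fields quoted in the channel follows; it assumes the binary is invoked as "megacli" (as installed on these hosts; elsewhere it may be MegaCli or MegaCli64) and that each per-disk block starts with an "Enclosure Device ID" line, as in the paste.

    import subprocess

    # Only the fields quoted in the channel; everything else MegaCli prints is skipped.
    FIELDS = ("Enclosure Device ID", "Slot Number", "Firmware state", "Foreign State")

    def physical_disks(binary="megacli"):
        """Yield one dict per physical disk from `megacli -PDList -aALL`."""
        out = subprocess.check_output([binary, "-PDList", "-aALL"],
                                      universal_newlines=True)
        disk = {}
        for line in out.splitlines():
            key, sep, value = line.partition(":")
            if not sep:
                continue
            key, value = key.strip(), value.strip()
            if key == "Enclosure Device ID" and disk:
                yield disk        # a new per-disk block starts; emit the previous one
                disk = {}
            if key in FIELDS:
                disk[key] = value
        if disk:
            yield disk

    if __name__ == "__main__":
        for d in physical_disks():
            if d.get("Firmware state", "").lower().startswith("failed"):
                print("failed drive:", d)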
[16:57:52] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.718140810811
[17:02:40] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[17:11:11] so, um.
[17:11:22] out of curiosity. what host does graphite run on?
[17:11:51] fenari. of course.
[17:12:00] the high-king bastard of servers.
[17:12:33] welp. i hope nothing important is going on there! (i am sorry i requested everything ever.)
[17:13:00] That's weird, I thought it was a seperate server
[17:13:14] perhaps dig lies to me.
[17:17:16] graphite doesn't run on fenari
[17:18:43] i'm kinda tired, but
[17:18:44] http://pastebin.com/fFnrxhkd
[17:19:03] am i reading that wrong, somehow? sure looks like graphite -> noc -> fenari
[17:19:18] but maybe that's due to some proxy/LB shenanigans?
[17:20:15] i am led to understand our LB are made of unicorn magic and fresh sunshine
[17:24:10] fenari... it's only one of our two bastion hosts where everyone logs in to do any work on the cluser :-P
[17:25:05] it's true that graphite.wikimedia.org = fenari
[17:25:28] :(
[17:25:32] i am penitent.
[17:25:43] i may or may not have accidently sent it several rather rude queries in a row.
[17:25:47] next question is whether something is "running" over there
[17:29:26] wifi is down in san francisco
[17:31:14] looks like it is back on
[17:42:43] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[17:49:15] !log restarting lsearchd on search10
[17:49:19] Logged the message, and now dispaching a T1000 to your position to terminate you.
[17:54:51] New patchset: Bhartshorne; "dropping the number of backend workers maintaining the swift cluster to improve overall performance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3294
[17:55:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3294
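On the graphite/fenari question earlier in this hour: what the pastebin presumably showed can be reproduced by walking the CNAME chain by hand, which only tells you where the name points, not what is actually serving it behind any proxying. A small sketch, assuming the third-party dnspython package (dnspython 2.x spells the call dns.resolver.resolve; older releases call it query):

    import dns.resolver  # third-party dnspython package

    def cname_chain(name, max_hops=10):
        """Follow CNAME records from `name` until there are no more to follow."""
        chain = [name]
        for _ in range(max_hops):
            try:
                answer = dns.resolver.resolve(chain[-1], "CNAME")
            except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
                break
            chain.append(str(answer[0].target).rstrip("."))
        return chain

    if __name__ == "__main__":
        # Per the conversation above, at the time: graphite -> noc -> fenari.
        print(" -> ".join(cname_chain("graphite.wikimedia.org")))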
[17:55:16] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3294
[17:55:18] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3294
[18:02:31] PROBLEM - swift-account-server on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[18:04:19] PROBLEM - swift-container-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:05:22] PROBLEM - swift-account-server on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[18:07:10] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 10.4438136036 (gt 8.0)
[18:12:03] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.917548660714
[18:15:48] RECOVERY - swift-account-server on ms-be4 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[18:16:15] RECOVERY - swift-account-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[18:17:09] RECOVERY - swift-container-server on ms-be3 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:20:36] PROBLEM - swift-container-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:20:54] PROBLEM - swift-account-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[18:26:54] RECOVERY - swift-container-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[18:27:12] RECOVERY - swift-account-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[18:33:46] New patchset: Bhartshorne; "bumping full scan speed again. Current rate would take about 1mo to do a full scan. This should drop that to ~3 weeks." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3295
[18:33:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3295
[18:56:00] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 11.0074254955 (gt 8.0)
[18:57:57] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.967665929204
[19:00:57] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 221 MB (3% inode=61%): /var/lib/ureadahead/debugfs 221 MB (3% inode=61%):
[19:07:15] RECOVERY - Disk space on srv223 is OK: DISK OK
[19:16:08] !log pulling disk 5 on virt1 for reseating
[19:16:12] Logged the message, Master
[19:25:02] notpeter: has the paging problem been fixed?
[19:25:21] cmjohnson1: I don't believe so
[19:25:36] want to try the ol' umplug and plug?
[19:26:09] sure..i need to get the manual out first for that tricky maneuver ;]
[19:26:20] unplugging now
[19:26:30] heh
[19:26:30] cool
[19:27:05] ok
[19:27:58] lemme know if that works
[19:28:03] it did
[19:28:27] cool!
[19:28:33] yes
[19:28:37] I know it worked
[19:28:43] becuase it started spamming ;)
[19:34:11] hm. I got paged on db13
[19:34:20] Ryan_Lane: false alarm.
[19:34:24] ah ok
[19:34:28] it was from notpeter fixing the SMS sender thingy.
[19:34:43] cool
[20:01:00] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 112 MB (1% inode=61%): /var/lib/ureadahead/debugfs 112 MB (1% inode=61%):
[20:15:42] RECOVERY - Disk space on srv220 is OK: DISK OK
[20:17:50] !log killed enwiki.revision sha1 migrator (upgrade-1.19wmf1-2.php). after db36 completes, will run the rest by hand
[20:17:54] Logged the message, Master
[20:23:32] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3295
[20:23:34] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3295
[20:33:57] maplebed: any chance copper will get nuked and reconfigured again?
[20:34:08] AaronSchulz: sure, if you want it so.
[20:35:33] GET/HEAD seems OK in terms of speed, but anything that involves writes is often so slow it's not even funny
[20:35:38] maplebed: if it's not a pain
[20:35:58] 80.14% 10.174888 1 - FileBackendStore::deleteInternal
[20:36:28] why would wiping and rebuilding the cluster change its performance?
[20:37:02] I wasn't sure if there was some wonk configuration? Is the hardware just really weak?
[20:37:57] I think I'm the only one using it, one guy doing a delete shouldn't be that painful, and it doesn't require much network traffic between my box and copper
[20:38:30] If the issue was my connection I'd expect that GET/HEAD would be hella slow too, but they are not
[20:40:22] sure, why not.
[20:40:39] anything you want saved before I nuke it?
[20:40:55] meh, I'll just add more data in, don't bother saving
[20:41:12] btw, http://wikitech.wikimedia.org/view/Swift/How_To#Nuke_a_swift_cluster
[20:48:17] AaronSchulz: magnesium is flipping out for some reason. I'd bet that's why everything's slow.
[20:49:43] RobH: "Mar 20 20:49:15 magnesium kernel: [5158942.661841] sd 1:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed" <--- sound like a dead disk?
[20:53:31] AaronSchulz: I'm going to nuke the cluster anyways, but we may need robh to do a disk replacement before its happy.
[20:53:39] yup
[20:55:47] * RobH doesnt care today
[20:55:49] in class ;]
[20:55:56] i will be onsite tomorrow
[20:56:39] what class?
[20:58:45] tomorrow's good enough. I'll drop you a ticket. Thanks!
[20:59:37] juniper routing essentials
[21:00:28] RobH: RT
[21:00:32] RT-2669
[21:00:48] if its in eqiad i review the entire queue daily (not today or yesterday, but usually)
[21:01:04] so will snag it tomorrow and if the disk is gone, get a replacement in to dell, so it would show up on thursday
[21:01:18] cool.
[21:07:41] PROBLEM - Host magnesium is DOWN: PING CRITICAL - Packet loss = 100%
[21:11:44] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.0
[21:16:53] New patchset: Bhartshorne; "increasing the multicast range to allow both pmtpa and eqiad configs to work. using memcached on all three eqiad swift hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3301
[21:17:06] New patchset: Bhartshorne; "increasing the multicast range to allow both pmtpa and eqiad configs to work. using memcached on all three eqiad swift hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3302
[21:17:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3301
[21:17:19] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3301
[21:17:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3302
[21:17:21] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3301
[21:17:34] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3302
[21:17:37] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3302
[21:24:58] udplogging on emery is losing data like mad
[21:33:57] New patchset: Bhartshorne; "updated AUTH token for rebuilt eqiad-test cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3303
[21:34:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3303
[21:34:15] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3303
[21:34:17] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3303
[21:39:38] RECOVERY - RAID on virt1 is OK: OK: State is Optimal, checked 2 logical device(s)
[21:41:17] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.52236754386 (gt 8.0)
[21:46:06] !log stopped eqiad bits servers from udplogging to emery, packet loss is back to zero
[21:46:10] Logged the message, Master
[21:46:12] RoanKattouw: ^^^
[21:46:47] lol
[21:46:48] i think mark enabled that by accident and then commented it out in puppet, but the processes were still running
[21:46:56] Was eqiad bits logging every req to emery?
[21:46:59] yep
[21:47:06] Yeah that's a problem
[21:47:20] i think that == about all of our other traffic combined
[21:47:27] I know how many req/s bits did 9 months ago because I gave a talk with those numbers
[21:47:34] Pretty much yeah
[21:47:45] In July, torrus was telling me that bits=40k/s and total=90k/s
[21:47:56] (at peak)
[21:48:21] and probably a nice bit more now
[21:53:53] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[21:55:59] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 1.02894858407
[22:10:41] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[22:11:53] maplebed: https://graphite.wikimedia.org/dashboard/temporary-7 <- I wonder what was going on over the weekend?
[22:12:55] that thing always nukes firefox.
[22:13:00] ::sigh::
[22:13:29] does it correlate with http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=load_one&s=by+name&c=Image+scalers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4?
[22:14:33] quite well actually
[22:14:51] that was the image scalers getting overloaded or something.
[22:15:08] what does the graph represent?
[22:15:42] (i.e. what does FileBackendStore.storeInternal do?)
[22:15:44] the temp-7 graph is wall-time for NFS store operations via FSFileBackend
[22:16:00] storeInternal() copies one file from tmp/ to NFS
[22:17:13] when the image scalers flip out it's almost always because NFS is a bottleneck. The image hosts graphs show the bumps more clearly, but you can see them on ms5 and ms7 as well.
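The temporary-7 dashboard discussed above is fed wall-time numbers for FileBackendStore operations. For illustration only, here is a sketch of how a single timing measurement can be pushed into a Graphite install over Carbon's plaintext protocol (one "path value timestamp" line per metric, TCP port 2003 by default); the hostname and metric path are placeholders rather than the production setup, and the log does not show how the real numbers are shipped.

    import socket
    import time

    CARBON_HOST = "graphite.example.org"   # placeholder, not the production host
    CARBON_PORT = 2003                     # Carbon's default plaintext listener

    def send_metric(path, value, host=CARBON_HOST, port=CARBON_PORT):
        """Send one 'path value timestamp' line over Carbon's plaintext protocol."""
        line = "%s %f %d\n" % (path, value, int(time.time()))
        sock = socket.create_connection((host, port), timeout=5)
        try:
            sock.sendall(line.encode("ascii"))
        finally:
            sock.close()

    if __name__ == "__main__":
        start = time.time()
        time.sleep(0.1)   # stand-in for the store/copy operation being timed
        send_metric("filebackend.storeInternal.wall_time", time.time() - start)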
[23:19:35] !log fixing the zero redirect
[23:19:40] Logged the message, Master
[23:40:08] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 199 seconds
[23:40:08] PROBLEM - MySQL Slave Delay on db1033 is CRITICAL: CRIT replication delay 199 seconds
[23:48:50] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[23:50:38] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[23:57:50] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[23:57:50] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours