[00:59:01] !log experimentally setting net.ipv4.tcp_tw_recycle=0 on cp1004 [00:59:04] Logged the message, Master [01:00:15] !log reverted after client-side TIME_WAIT connections rose rapidly from 367 to 9000 [01:00:18] Logged the message, Master [01:16:50] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7577 [01:16:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7577 [01:24:15] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [01:32:57] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is -0.525353360656 [01:34:56] It's gaining packets? [01:37:18] PROBLEM - Packetloss_Average on oxygen is CRITICAL: XML parse error [01:37:31] !log on cp1004: trying tcp_tw_reuse=1 instead of tcp_tw_recycle [01:37:35] Logged the message, Master [01:38:48] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 211 seconds [01:40:46] !log on cp1004: reverted after TIME_WAIT client connections reached 38k with no sign of a plateau [01:40:50] Logged the message, Master [01:44:30] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:52:18] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [02:15:02] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:19:19] * jeremyb wonders if these are retroactive or if there's some way to reapply them retroactively. https://gerrit.wikimedia.org/r/4796 (custom linkifications within commit msgs on gerrit) [02:22:26] Read from DB, parse, update, write to DB [02:30:38] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:42:21] RECOVERY - Puppet freshness on cp1004 is OK: puppet ran at Thu May 17 02:42:03 UTC 2012 [02:45:02] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [02:47:53] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [03:09:28] how are dns changes made and logged? i'm wondering how it all fits in the new world of gerrit [03:10:08] if i need to add/change/remove a record i assume i can't just add that to a diff. yet. or maybe it depends on the zone [03:11:26] * jeremyb goes digging some [03:17:24] It's in the/a private svn repo [03:17:40] I'm not sure if that's destined to change any time soon [03:20:53] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6005 [03:22:23] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [03:22:49] Reedy: ;-( [03:24:46] Reedy: hi :) what was your IRC client? [03:24:56] was? [03:25:01] it still is Quassel ;) [03:25:15] yeah past time sorry [03:25:22] heh [03:25:31] I'll forgive you, it's only 05:25 :p [03:25:36] I thought about it in French and then translated word by word (somehow) [03:25:43] hehe [03:25:45] thanks! [03:25:46] i don't even remember where now but I saw someone mention a jobs.wm.o cert mismatch. 
the first step to fixing that would be sending it to the right IP [03:25:53] my night schedules are totally screwed up :-( [03:26:01] which i guess i can't change ;( [03:26:25] j.wm.o redirects to foundation [03:26:29] the first step is to reproduce the issue, take traces and open a bug report :-D [03:26:32] it does so [03:26:40] Oh, I see [03:26:44] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [03:26:59] lol, pointless [03:27:08] hashar: well the place where i saw the problem might be someone else's bug report? [03:27:29] report another [03:27:34] then we can close it as a dupe [03:27:56] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is -0.183695245902 [03:28:55] it was bug 36884 comment 1. but that's not the subject of the bug [03:29:57] ARHGH [03:30:07] Quassel uses french by default :-( [03:30:23] that's a bad thing? [03:30:48] hashar: no, it uses System Default by default [03:32:17] PROBLEM - Packetloss_Average on oxygen is CRITICAL: XML parse error [03:34:12] Reedy: I am discarding the client :-D Could not figure out how to setup core to be started automatically nor the user/pass to connect to it :-D [03:34:26] on ubuntu it just starts :p [03:34:34] yeah [03:34:41] I guess it is not Mac friendly yet [03:34:53] I will open bug reports [03:34:57] hashar: you could ask harej about it [03:35:04] honestly? no :-D [03:35:07] i think he might be on mac [03:35:12] I am too lazy to do that kind of stuff nowadays [03:35:32] I just want to click the app, fill my name, click connect then /join my chans ahaha [03:35:42] i assume harej is equally picky [03:36:02] though I can try again later on [03:36:07] aka not at 5:30am [03:36:11] hah [03:36:27] that must has set me in a bad mood [03:36:33] (has/have?) [03:36:45] I exist! [03:36:56] I live an hour away from Ashburn, Virginia! [03:37:04] I was summoned here by a jeremyb summons. [03:37:06] Hello James glad to meet you [03:37:11] is it really that far? [03:37:16] It's probably closer. [03:37:41] the computernets tell me it's more like a half-hour away. [03:38:25] so basically jeremy told me you are using Quassel [03:38:29] the IRC client [03:38:31] on a mac [03:38:38] correct [03:38:56] (sorry 5am, my brain is slow so I can't make long sentences) <-- that one already took me way too long [03:39:14] so I got a QT something client and some core shell script [03:39:25] do you happen to know how to get Core to start automatically? :) [03:39:40] no [03:39:51] so we need to launch core first then the client correct? [03:40:55] I just launch the client. I don't even know what core is. [03:41:33] there's a standalone client [03:41:40] but the core is for running an always on esk proxy [03:42:37] huh, isn't most of the reason to use quassel for the alwaysonness? [03:42:47] I use it because it's the least shitty client. [03:43:34] "Core" looks like a local bouncer [03:43:47] aka you install Core on some server and connect your client to it [03:45:07] that will be for another day [03:45:16] thanks for showing up harej :) [03:49:07] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7724 [03:50:28] seriously [03:50:37] my daughter is waking up! [03:50:39] at 6am!!!! [03:50:46] she basically prevents me from working :-( [03:50:50] or sleeping [03:50:51] argh [03:51:01] * hashar waits [03:51:21] you could take shifts. you have half the day and you're off half the day. 
;-P [03:51:51] the thing that kill me off is that she has woke up at 3am since I am back from SF [03:52:07] and jet lag made me get to bed late in evening [03:52:19] so basically made me 3am to be fully awake :-D [03:52:25] and having to take care of her [03:52:32] just to find out totally screwed for the rest of the day [03:52:34] damn kid [03:52:36] ;D [03:52:39] see you all later [03:52:46] au revoir [04:19:57] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [04:40:02] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [04:50:05] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:11:51] mornin [05:13:42] yo [05:20:35] New patchset: ArielGlenn; "skip verify (instead of whine) if no tarballs for wiki" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7847 [05:22:10] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7847 [05:22:12] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7847 [05:29:49] Hm.. there is a wmgPFEnableStringFunctions section in CommonSettings.php [05:29:52] I wonder.. why ? [05:30:32] O_O - there is a wiki that has it enabled [05:30:33] 'donatewiki' => true, [05:30:34] just donatewiki [05:30:35] pfew [05:30:36] TimStarling: ^_^ [05:33:37] Krinkle-away: i don't see where wgPFEnableStringFunctions is used then? (the thing that's set inside the block) [05:33:59] jeremyb: wmgPFEnableStringFunctions is the cluster conditional [05:34:06] set from InitialiseSettings.php [05:34:16] then in CommonSettings.php, if wmgPFEnableStringFunctions -> wgPFEnableStringFunctions [05:34:23] right... [05:34:29] * jeremyb saw all of that [05:34:34] i guess maybe it's just in core... [05:35:03] it's part of Extension:ParserFunctions [05:44:42] Krinkle-away: do you ever sleep ? :-D [05:44:55] Sure, when other people work [05:50:59] :D [06:49:59] !log WMFLabs dieing out, I/O latency raised constantly over the last 2 hours and eventually lead to situation where system (via ssh) is not usable anymore [06:50:04] Logged the message, Master [06:51:04] hashar: elaborate? [06:51:37] prompt takes age to show up? :-D [06:51:45] and I can't edit files remotely using vim hehe [06:51:45] i could log in fine to bastion (not restricted) [06:51:49] but http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=load_one&s=by+name&c=Virtualization+cluster+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 looks bad [06:51:52] jeremyb: load is at 500 or so? [06:51:53] :) [06:52:02] jeremyb@bastion1:~$ uptime 06:50:54 up 16 days, 4:38, 3 users, load average: 0.49, 0.61, 0.46 [06:52:19] ahhh thanks for the ganglia link [06:52:20] why isn't nagios-wm speaking? 
[06:52:27] for the virt* nodes that is [06:52:30] no idea [06:53:17] hashar: i just typed "ganglia virt" and that was the first hit in my local browser history ;) [06:53:39] virt2 has like 20% time waiting for IO [06:54:44] yeah for bots ;-) [06:54:45] http://ganglia.wmflabs.org/latest/?c=bots&m=load_one&r=hour&s=by%20name&hc=4&mc=2 [06:54:46] PROBLEM - udp2log log age for emery on emery is CRITICAL: CRITICAL: log files /var/log/squid/teahouse.log, have not been written to in 24 hours [06:54:47] it's not just virt2 that's problematic though [06:55:15] hashar: I'm not sure if this is a cause or an effect [06:55:23] most probably an effet [06:55:25] ryan was looking at labs being wonky just last night [06:55:25] effect [06:55:30] lemme check the scrollback [06:55:41] I guess the cause is the NFS / some hard drive array [06:55:48] it shouldn't [06:56:05] one of them had degraded raid recently too. not sure what the resolution of that was [06:56:12] oh and good morning to the Greek ones :-] [06:56:15] nothing yet, I opened a ticket just yesterday [06:57:01] http://ganglia.wmflabs.org/latest/graph_all_periods.php?c=puppet&m=load_one&r=hour&s=by%20name&hc=4&mc=2&st=1337237804&g=cpu_report&z=large&c=puppet [06:57:09] this is me trying to :wq a simple file [06:58:09] maybe it's not simple? [06:58:11] ;P [06:59:58] http://dpaste.org/DGmWM/ [07:00:07] I have time to write stuff before having the prompt to show up :-] [07:00:47] hashar: tried mosh? [07:01:03] what is that? [07:01:11] mosh rocks [07:01:12] iirc it's mosh.mit.edu [07:01:15] yes [07:02:18] wow (I just saw the ganglia graph) that is no good [07:03:19] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:04:18] hmm it was an nfs instance a couple days ago that was the problem (reading the backlogs) [07:04:40] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:05:14] apergos: it rebooted (idk if anyone knows why) and then came back up broken so then it was rebooted again. eventually it started working [07:05:40] ugh. sounds just peachy [07:09:28] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:10:22] breakfast time [07:12:01] I am out of computer [07:13:12] my phone number is in the contact file on fenari [07:13:13] if needed [07:13:23] ++ [07:17:01] hashar: nothing works anyway :-) [07:18:40] 102400 bytes (102 kB) copied, 30.6156 s, 3.3 kB/s [07:19:55] I am wondering if it can be due to a specific instance trying to do a ton of IO [07:20:03] so this means it's not a good day to try to set up my exim test instance? [07:20:05] or just to one of the virt machine going wild [07:20:48] nope, it's not just one [07:23:34] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:24:52] I did ton of mediawiki configuration change this morning on deployment-prep [07:24:58] maybe one of them caused the issue :( [07:26:37] anyway breakfast for real now [07:27:36] hearing only crickets, I will at least try to prep what I would do, and then see if labs is stable enough to do it today or not [07:29:13] apergos: what for if I may ask? [07:29:26] is it for staging the IT changes that we were discussing at some point? 
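A rough sketch of the kind of spot checks being traded above while labs I/O degraded — these are stock tools rather than the exact commands that were run, and the hostname is illustrative:

    uptime                      # load averages, as pasted above for bastion1
    iostat -x 5 3               # per-device await/%util over a few samples (needs the sysstat package)
    dd if=/dev/zero of=/tmp/io-probe bs=1K count=100 oflag=direct   # crude 100 KB write probe; the 3.3 kB/s figure above came from a test of this shape
    rm -f /tmp/io-probe
    mosh jeremyb@bastion1.pmtpa.wmflabs   # mosh only masks the laggy interactive session; it needs mosh-server on the far end and open UDP ports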
[07:29:30] I guess [07:29:44] I got handed a ticket, I think that's what it is [07:29:51] ah [07:30:07] yeah, I volunteered for that but they wanted me to spend time with hashar instead :-) [07:30:13] :-D [07:30:40] I've spent virtually (pun intended) no time inlabs so this owuld at least get me familiar with it [07:30:53] hahaha [07:31:00] or with it's breakage ;) [07:31:10] already been there [07:31:38] settign up my first instance, I had the worst experience ever... besides the session bug, that is [07:32:09] oh the session bug is ooooooold news [07:41:52] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:01:46] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:15:21] !log WMFLabs seems to have recovered now [08:15:25] Logged the message, Master [08:16:32] so, i just tested jobs.wikimedia.org with the IP for wikimedia-lb... (in my hosts file). the redirect works fine. someone want to just switch the DNS? or you want a bug for it? [08:16:50] (to fix the cert mismatch) [08:17:53] jeremyb: ping apergos / paravoid ^^^ [08:18:06] I can't do DNS stuff nor I know what the procedure is to change DNS entry [08:18:10] ? [08:18:24] hello? [08:18:46] I have no idea what this is about [08:19:22] 17 03:28:55 < jeremyb> it was bug 36884 comment 1. but that's not the subject of the bug [08:19:26] see the end of that comment [08:20:03] hashar: here too? [08:20:10] hmm rt I guess [08:20:30] I don't know what the "right" answer is for this [08:20:32] jeremyb: ahh I can't get op on -operations and -dev :-( [08:20:40] hashar: ;) [08:20:56] he's in at least a dozen other channels [08:21:13] yeah [08:21:18] going to contact freenode staff so [08:22:28] looks like he got klined [08:22:30] oh no [08:23:31] heh [08:30:13] oh [08:30:20] both K-Lined [08:31:15] so folks who use labs... if I want to make some new class appear in the list of "Special:NovaPuppetGroup", what's the trick for that? [08:31:30] manage puppet groups [08:31:36] and add it there [08:31:48] the last link in the list in the side bar [08:31:58] em [08:32:07] I'm at that page. I want in the list of available classes, [08:32:16] to have the exim-related classes from mail.pp [08:32:39] is your project listed on that page? [08:32:52] if not, update the filter in the top corner [Show project filter] to include your project [08:33:02] right [08:33:03] yes it's there [08:33:08] great [08:33:17] and then next to the project there should be an "add group" [08:33:20] yes [08:33:27] click!!! [08:33:36] I'm asking a different q [08:33:36] haha [08:34:05] can I just arbitrarily give the full classname of anything that appears in any puppet file in manifests? [08:34:12] or does it need to be in some special list first? [08:34:19] AFAIK you can enter just whatever you know [08:34:23] even a totally wrong class [08:34:38] but you should put a class which exit in 'test' branch [08:35:02] so yes, any arbitrarily class should work [08:35:06] that was themissing piece of info. I thought it needed to be displayed already in the list below "all projects" [08:35:10] thanks [08:39:02] New patchset: Hashar; "convert hardcoded 10.0.5.8:8420 to $wmfUdp2logDest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [08:39:25] New review: Hashar; "Patchset 5 is a rebase / solve conflicts." 
[operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7702 [08:39:27] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [08:39:29] going to deploy that [08:39:40] this says "operations" but probably isn't for me to review :-) [08:41:05] !log Deploying https://gerrit.wikimedia.org/r/7702 which abstract out the udp2log destination [08:41:08] Logged the message, Master [08:42:39] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:44:35] !log running scap to apply https://gerrit.wikimedia.org/r/7702 [08:44:38] Logged the message, Master [08:47:07] hashar: did you do it yet? logmsgbot should have said something in #-tech [08:47:19] it is still running [08:47:23] scap is awfully slow nowadays [08:47:31] ahh [08:47:36] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:47:51] it loads all messages from both branches [08:47:58] rebuild all language localization caches [08:48:04] copy them around several time [08:48:05] etc [08:48:14] oh my god [08:48:27] it dumps me a list of the 400 or so server that have synced! [08:48:59] i guess that will teach you to sync just the one file ;-P [08:49:06] next time I will just sync-file the files I need :-] [08:49:17] definitely [08:49:20] well i guess it was 2 [08:49:37] over time `scap` seems to have became a hugeee pile of slow scripts [08:50:12] * hashar watches boxes compiling texvc one at a time [08:50:14] needs a little salt and other flavors [08:50:35] btw, salt's SEO really sucks [08:56:14] what are {news,todo}.dblist? [08:56:42] oh, outage (see #-tech) [09:02:45] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:07:57] New review: Hashar; "That change caused a short outage because $wmfUdp2logDest was not available in wfLogXFF() :-(" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [09:10:24] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:02:18] so given that every lab instance seems tohave exim set up for basic mail sending, I wonder what an exim test instance needs in addition [10:02:23] all the packages should already be there [10:14:47] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:17:03] New patchset: Hashar; "warning message about wmfUdp2logDest format" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7848 [10:17:23] config? [10:17:29] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7848 [10:17:31] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7848 [10:18:46] hashar: whooooops. i changed deployment-prep to use a hostname there. a week or two ago [10:19:27] where ? [10:20:53] jeremyb: ahh I see :) [10:20:57] mutante: is the analytics subnet all working now ? 
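Since the full scap above took ages for what was effectively a one-file config change, the lighter path hashar mentions ("next time I will just sync-file the files I need") looks roughly like this — the commit message is illustrative, and the tcpdump line is the same check quoted above (port 514 on the labs instance; production sends to 8420 per the r7702 commit message):

    sync-file wmf-config/CommonSettings.php 'use $wmfUdp2logDest for the udp2log destination (gerrit 7702)'
    tcpdump -A -n -v -s0 udp port 514 | grep PHP    # on the udp2log box: confirm PHP log lines still arrive after the change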
[10:21:05] sorry , i had to run and do server racking [10:21:05] jeremyb: not a big trouble though ;) [10:21:16] hashar: http://wikitech.wikimedia.org/index.php?title=Server_admin_log&action=historysubmit&diff=46576&oldid=46574 [10:21:20] jeremyb: maybe you can indeed use a hostname there afterall [10:21:31] hashar: i never actually tested that it works [10:21:57] i assumed people didn't care if it was just going to prod anyway [10:22:33] (i told people what i did directly and I'd already logged it verbosely) [10:22:39] lets try again [10:24:14] I will remove the warning from 7848 [10:24:57] * jeremyb doesn't know one way or the other... [10:26:17] New patchset: Hashar; "Revert "warning message about wmfUdp2logDest format"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7849 [10:26:44] New review: Hashar; "It can indeed use a hostname. I have reverted that message with https://gerrit.wikimedia.org/r/#/c/7..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7848 [10:26:52] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7849 [10:26:54] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7849 [10:28:08] hmm [10:29:47] tcpdump -A -n -v -s0 udp port 514 | grep PHP <-- got nothing now :) [10:29:53] daughter duty have fun [10:30:29] (that is on labs btw) [10:33:14] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:36:50] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:48:23] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:55:41] LeslieCarr: well, + DCHP works and gets an IP ,but - i do not get an installer yet, screen stays blank with blinking cursor. looking at DHCP config looks to me i should get a lucid installer. option pxelinux.pathprefix "lucid-installer/"; [11:00:02] ok, i can switch to Ubuntu BusyBox shell from there, so it started somehow, gotta try debug from therehelp [11:12:56] weird [11:15:03] i can see the partman process running, i currently just dont see any installer output.. trying again with the "Legacy OS redirection" option for console [11:19:05] oh these are ciscos right ? [11:19:13] yes [11:19:15] i seem to remember some weird stuff with them and installing - robh would know [11:19:27] i think he was the one dealing with them [11:19:31] anything in wikitech ? [11:19:45] well, he already told me about that legacy option a while ago ... [11:20:09] wikitech, ehm, yea, i am editing on wikitech:) [11:23:27] ah [11:23:52] debconf is running and partman-auto/init_automatically_partition [11:23:56] i should just wait longer first [11:24:40] "0 questions will be asked" "GO" [11:25:35] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [11:53:09] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [12:22:51] New review: Lcarr; "fyi this broke searchidx2 (which should be decommissioned soon i believe bu t is not yet)." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/7126 [12:23:48] New patchset: Lcarr; "removing old classes from searchindexer" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7852 [12:24:07] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/7852 [12:45:27] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:51:09] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:54:28] wow, our puppet is pain to work with :/ [12:57:10] hahaha [12:57:17] it's been worse [13:09:44] * paravoid cries [13:14:52] paravoid: you mean like there are no modules, there will be no modules? [13:16:01] it's not just modules [13:16:07] everything's entangled with each other [13:16:13] inheritance and overrides are basically absent [13:16:34] if/then/else and globally scoped variables is the standard way of overriding things [13:16:49] I'm working on something that is especially hurt by this [13:17:01] so it's not the current status quo that makes me cry [13:17:15] it's the hacks *I* am doing to work around things :-) [13:17:51] yeah... [13:21:02] ugh [13:38:23] RECOVERY - udp2log log age for emery on emery is OK: OK: all log files active [13:44:01] !log shutting down bellin for troubleshooting [13:44:06] Logged the message, Master [14:03:12] New patchset: Pyoungmeister; "removing old search classes from searchidx2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7855 [14:03:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7855 [14:04:31] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7855 [14:04:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7855 [14:05:37] New patchset: Dzahn; "Ciscos uses com1, Dells use com2 for console, wrong DHCP config file, thanks RobH" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7856 [14:05:57] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7856 [14:06:29] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7856 [14:06:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7856 [14:09:08] RECOVERY - Puppet freshness on searchidx2 is OK: puppet ran at Thu May 17 14:08:55 UTC 2012 [14:20:32] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [14:22:09] !log adding gerrit project analytics/udplog parent analytics [14:22:12] Logged the message, Master [14:23:23] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:37:56] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:51:56] New patchset: ArielGlenn; "option to include top level html/txt files in rsync list" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7857 [14:52:16] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7857 [14:52:18] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7857 [15:06:10] notpeter: r u around? [15:07:23] cmjohnson1: so the raid controller shipped, should get there tomorrow [15:07:25] Jeff_Green: ^ [15:08:07] RobHalsell: yayyyyyyy [15:08:09] the hard disks for grosley should also have arrived [15:08:22] ok. we're still waiting for RAM though? [15:08:25] its the 256mb versino of the controller, but oh well [15:08:30] that's fine [15:08:31] yes waiting on the RAM [15:08:43] i think i placed that order, let me confirm [15:08:55] we don't expect much from storage3 even during the fundraiser, performance wise [15:09:20] also most of the reads are heavy--db dumps, giant gz files etc, so the cache is probably not that useful anyway [15:09:21] yea, i ordered the ram upgrade on the 10th [15:09:36] they have to go back to 2008 to source it [15:10:33] cmjohnson1: so the memory order on the 10th shows delivered on the 11th [15:10:42] cmjohnson1: so it should already be there [15:11:35] okay...i got it...thought it was one of the 32 ssd's coming in one at a time [15:12:08] cool, so that memory will go in grosley along with the additional hard disks in the slot 3 and 4 hdd areas [15:12:18] jeff_green: what time do you want to do update grosley [15:12:36] robhalsell: cool [15:13:03] oh everything is here? I think we can do it anytime really, it's the redundant box and aluminium is the active one [15:13:22] cmjohnson1: what works for you? [15:13:24] Jeff_Green: everything should be there, chris can confirm if he has the hard disks to add. [15:13:58] would 11 PST work? [15:14:05] sure [15:16:25] cmjohnson1: what's up? [15:16:44] you decomm'd db15...do you remember why? [15:17:50] are you in eiqad? [15:17:55] eqiad [15:18:33] notpeter: was it memory related? [15:19:27] cmjohnson1: I'm not sure. [15:19:33] I remember it not booting [15:19:51] but I do not know why [15:20:02] ok...cool...thx [15:20:37] sorry could be more helpful :( [15:21:50] cmjohnson1: RT-345 [15:22:35] cmjohnson1: and 526 [15:24:23] mutante: thx...that was exactly what I needed to know [15:25:31] mutante: can you run the interwiki map update script please? 
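The interwiki refresh requested above comes down to two steps — mutante's !log entries a few lines below show them, and the wikitech page linked there has the exact invocation, so this is only the shape of it:

    php dumpInterwiki.php           # regenerate interwiki.cdb from the interwiki map; flags per wikitech: Update_the_interwiki_cache
    sync-common-file interwiki.cdb  # push the fresh cache out to the cluster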
[15:25:46] I requested it twice yesterday in here but doesn't seem to have been done - probably should have stuck it in an RT [15:26:17] would you have a doc link? [15:26:42] cmjohnson1: in which DC are you again ?:p [15:27:13] i am in tampa [15:27:28] ah,ok, nevermind then:) [15:27:40] mutante: sure let me have a look [15:27:53] http://wikitech.wikimedia.org/view/Update_the_interwiki_cache [15:30:28] !log adding DNS records to wikimedia.org for RT #2960 [15:30:32] Logged the message, Master [15:30:40] !log creating fresh interwiki.cdb from dumpInterwiki.php [15:30:43] Logged the message, Master [15:30:50] !log sync-common-file interwiki.cdb [15:30:53] Logged the message, Master [15:31:17] Thehelpfulone: ^ [15:31:28] thanks [15:31:34] hmm [15:31:42] still not working, does it take a few minutes? [15:31:56] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:32:53] ah there we go :) [15:32:56] great thanks mutante [15:33:14] ah, cool, i was about to get suspicious over the "cache is tracked in subversion" part [15:47:49] uh oh [15:47:53] power is flaky [15:53:15] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:13:04] hey dzahn [16:35:33] hi guys [16:35:43] could someone approve/merge this? [16:35:44] https://gerrit.wikimedia.org/r/#/c/7285/ [16:35:58] maplebed maybe, since we were talking about it yesterday? [16:36:45] sure. [16:41:20] ottomata: this doesn't do things like set up the data directory, is that ok? [16:41:48] it'll just be the default (/var/lib/mysql?) which won't be on a separate partition or anaything like that. [16:42:24] given / is only 9,2G capacity on stat1, that's maybe not best. [16:47:48] yeah that's fine [16:48:00] i can puppetize whatever is best once I figure it out [16:48:10] ok. [16:48:19] i'll move it to /a/mysql or something [16:48:47] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7285 [16:48:50] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7285 [16:50:04] damn! [16:50:11] my reviewing sucks. [16:51:48] oh? [16:51:56] New patchset: Bhartshorne; "typoed stat class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7864 [16:51:59] ^^^ [16:52:10] doh [16:52:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7864 [16:52:19] my typing sucks [16:53:55] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7864 [16:53:57] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7864 [16:55:49] hmmmm [16:55:49] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate definition: Package[mysql-client-5.1] is already defined in file /var/lib/git/operations/puppet/manifests/mysql.pp at line 438; cannot redefine at /var/lib/git/operations/puppet/manifests/generic-definitions.pp:659 on node stat1.wikimedia.org [16:55:53] yeah. [16:56:17] but i'm not including it…maybe puppet doesn't check for conflicts unless somehow both classes are included by someone? [16:57:11] hm. I'm using the generic mysql server class on iron (in site.pp) [16:57:22] and the client in singer [16:57:27] I wonder what's different. [16:58:02] hm [16:58:40] oh! 
[16:58:52] role/statistics.pp incrludes both mysql::client and generic:mysql:client. [16:59:23] (the latter by way of statistics::db) [17:01:34] ottomata: up to you if you'de prefer to remove mysql::client from role::statistics or remove generic::mysql::client from statistics::db. There's no difference between the two classes. [17:14:39] maplebed: https://gerrit.wikimedia.org/r/7867 [17:15:10] this is the same as last time? [17:15:22] maplebed: basically [17:16:02] you want it out now? [17:16:09] maplebed: yes please [17:16:12] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7867 [17:16:17] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7867 [17:16:17] k. on its way. [17:17:26] !log deploying config change to mobile - more zero IP addresses. gerrit r7867 [17:17:30] Logged the message, Master [17:18:46] preilly: as last time, is there no need to purge the cache? [17:19:07] maplebed: actually you probably should for this change [17:19:39] ok. I'll do so as soon as puppet is done. [17:19:45] ... which is now. [17:20:43] !log flushing the mobile cache post-deploy [17:20:46] Logged the message, Master [17:21:30] preilly: can you test and confirm the change worked and things aren't broken? [17:21:58] Change abandoned: Reedy; "https://gerrit.wikimedia.org/r/#/c/7820/ dupes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3815 [17:26:08] maplebed: looks good [17:26:18] sweet. thanks for the check. [17:27:19] New patchset: Ottomata; "role/statistics.pp - don't need to include mysql::client." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7869 [17:27:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7869 [17:27:45] maplebed, let's try that [17:27:45] https://gerrit.wikimedia.org/r/#/c/7869/ [17:28:01] k. [17:28:23] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7869 [17:28:25] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7869 [17:29:29] it's working. [17:30:11] ottomata: do you know if there were other changes queued? I see libcairo-dev getting installed, but that's not part of the mysql stuff. [17:30:38] the error with Misc::Statistics::Mediawiki/Git::Clone is still there too. [17:30:53] but mysql server was installed, so this change at least has happened. [17:31:28] aye, i'm actually waiting for someone to reinstall stat1! [17:31:37] before i bother sleuthing the git clone prob [17:31:45] as for libcairo: no idea [17:32:06] /Stage[main]/Misc::Statistics::Plotting/Package[libcairo-dev]/ensure: ensure changed 'purged' to 'present' [17:32:06] if you have any idea how to push this [17:32:07] https://rt.wikimedia.org/Ticket/Display.html?id=2946 [17:32:10] would be much appreciated [17:32:23] afaik, that should have been installed a while ago [17:32:23] hm [17:32:34] who is in charge of reinstalling the machine? [17:32:39] mark was the one who wanted us to do it [17:32:46] Could someone approve and merge https://gerrit.wikimedia.org/r/#/c/7820/ and https://gerrit.wikimedia.org/r/#/c/7831/ ? Thanks! [17:32:49] ottomata: can we do it ourselves? 
[17:33:09] i don't hitnk so, mark (or someone) said they'd find a place for /a while they reinstall [17:33:18] or at aleast, i have no idea how to do it from here [17:33:41] but it shouldn't be so hard to move stuff temporarily away, or is it? [17:33:51] if you know of a place to put it [17:33:52] then no [17:33:57] but I asked and they just said they would do it [17:34:56] Reedy: I can look, but I lack context. [17:35:32] first comment though, you should add -oConnectTimeout=30 in addition to -oSetupTimeout=30. [17:35:42] (though I always set them to 5 or 10 rather than 30...) [17:35:55] i was just normalising against the other scripts [17:36:08] for 7820 the script wont run on remote hosts without the sudo for non root [17:36:34] should probably go through and add connecttimeout to all the scripts then [17:37:49] while talking about normalizing the scripts... there is a variety in the rsync options too. [17:38:41] specifically one of them has --no-perms and the rest don't. [17:38:55] (ignoring that some have -v and some don't) [17:39:30] lol [17:39:39] sounds about right [17:40:13] ottomata, can we move stat1:/a temporarily to bayes? [17:40:23] or isn't there enough space? [17:40:36] reedy, do you want to patch and then I'll look again? [17:41:10] Patch what, adding the connect timeout? [17:41:14] 205G [17:41:37] Touching the rsync options has the ability to cause some issues [17:41:43] 37G avail on bayes [17:42:28] !log temporarily turning off puppet on brewster for preseed hackz [17:42:31] Logged the message, notpeter [17:44:25] maplebed, I made a pretty awesome mysql_instance define for CouchSurfing [17:44:35] included pretty much anything you could ever want to puppetize in my.cnf [17:45:04] would it be useful to try to adapt it and use it for Wmf? [17:45:19] or should I just commit a stat1.my.cnf file (blagh) to files/ [17:45:19] ? [17:46:00] you should ask binasher that question [17:47:36] I'm not sure how much of the rest of our mysql configs are done via puppet or something else. In principle, that's probably a good idea. [17:48:47] I just uploaded an example of how it worked [17:48:47] https://github.com/ottomata/cs_puppet_mysql [17:48:53] it would need tweaked of course [17:49:15] my setup used supervisor to run multiple instances on one machine (if you wanted) [17:49:39] and binary installs rather than packages, so I could easily switch between versions just by changing a symlink, or by changing the mysqld that supervisor used [17:54:04] this is the relevant mysql.conf.erb file [17:54:04] https://github.com/ottomata/cs_puppet_mysql/blob/master/mysql.conf.erb [17:59:26] jeff_green: are you ready to bring down grosley for memory upgrade and and add hdd's [17:59:42] yep, lemme just check that nobody's on it [18:00:16] ok. I'll shut it down [18:00:19] ok [18:00:40] log: shutting down grosley for disk and RAM upgrades [18:00:42] err [18:00:49] !log shutting down grosley for disk and RAM upgrades [18:00:52] Logged the message, Master [18:01:13] i can't help but read that as not!-log [18:01:23] New patchset: Catrope; "Explicitly pass --wiki= in foreachwiki*" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7877 [18:01:29] cmjohnson1: it's yours once you see it down [18:01:47] cool thx [18:01:47] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7877 [18:01:50] PROBLEM - Host grosley is DOWN: CRITICAL - Host Unreachable (208.80.152.164) [18:11:15] jeff_green: question...the dimm that rob bought is the wrong type but somewhere along the line I found stored dimm here that will fit [18:11:22] I am going to add it [18:11:22] New patchset: Aaron Schulz; "Changed purge hook to use doQuickOperations()." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7878 [18:11:41] i will look for errors during post [18:11:44] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7878 [18:11:44] ok [18:11:46] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7878 [18:18:38] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:24:29] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:26:08] RECOVERY - Host grosley is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms [18:28:59] PROBLEM - Host grosley is DOWN: CRITICAL - Host Unreachable (208.80.152.164) [18:33:11] RECOVERY - Host grosley is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [18:33:41] jeff_green: added new memory...not hdd...chk to see if everything is working okay [18:33:49] k [18:44:13] !log restarting puppet on brewster [18:44:17] Logged the message, notpeter [18:44:41] binasher: db61/62 should be set up to your specifications. let me know [18:45:11] thank you! [18:46:15] !log stopped replication on es1002 [18:46:18] Logged the message, Master [18:49:05] !log syncing cluster23 tables from es1002 to es1004 [18:49:08] Logged the message, Master [19:00:26] hey maplebed [19:00:33] what would XML Parse Error in a ganglia notice mean? [19:00:47] referring to the PacketLossLogTailer [19:00:49] umm... [19:01:24] the value of the check being 'XML Parse Error' would indicate a faulty check. [19:01:35] the check reports whatever the checking script tells it to. [19:01:48] if the XML parse error is trying to interpret the check, then maybe there's some weird quoting going on [19:02:05] !log running securepoll_votes.vote_ip schema migration on all s7 dbs [19:02:08] Logged the message, Master [19:02:12] where are you seeing that error? [19:02:16] hmm [19:02:35] from a notice about packetloss on oxygen, from a couple of days ago [19:02:54] ***** Nagios  ***** [19:02:55] Notification Type: PROBLEM [19:02:55] Service: Packetloss_Average [19:02:55] Host: oxygen [19:02:56] Address: 208.80.154.15 [19:02:56] State: CRITICAL [19:02:57] Date/Time: Thu May 17 03:32:18 UTC 2012 [19:02:58] Additional Info: [19:02:59] XML parse error [19:03:16] or yesterday I guess [19:03:27] ok, so it's nagios spitting out the error. [19:03:41] you want to find out whether the error is in nagios or in the check. [19:04:04] you might be able to see what the check was (if it's recent enough) by looking at the RRD file for the check. [19:04:14] recent enough is less than an hour, I think. [19:04:16] hm, wait so that has notthing to do with ganglia? [19:04:22] maybe! [19:04:41] hm, ok, i thought it was ganglia only because the PacketLossLogtailer uses GangliaMetricObject in python [19:04:55] does nagios get info from ganglia and then send notices? 
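The RRD and gmond pointers maplebed gives here (the gmond suggestion comes just below) translate to something like the following — gmond's default XML port is 8649 and gmetad usually keeps RRDs under /var/lib/ganglia/rrds/, but neither the host layout nor the path is confirmed in the log:

    nc oxygen.wikimedia.org 8649 | grep -A 2 'packet_loss_average'   # ask gmond directly for the value it is currently reporting
    rrdtool lastupdate /var/lib/ganglia/rrds/<cluster>/oxygen.wikimedia.org/packet_loss_average.rrd   # last value gmetad stored; handy for a transient glitch like the XML parse error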
[19:04:56] the question is whether ganglia or nagios spit out garbage [19:05:03] all you know is there's garbage at the end of the chain. [19:05:09] growl [19:05:23] nagios RRD file......... [19:05:24] hm [19:05:30] ganglia RRD file. [19:05:37] (lets you see what the value was before now()) [19:05:49] if you want the current value you ask gmond instead. [19:06:22] ah monitor_service does it (i have never used nagios or ganglia before, so pardon the ignorance :) ) [19:06:28] hmm. it's also worth while to verify that nagios is actually getting that data from ganglia. It looks that way, but ... [19:06:54] yeah [19:06:55] it is [19:07:00] monitor_service is nagios, right? [19:07:34] monitor_service { "packetloss": description => "Packetloss_Average", check_command => "check_packet_loss_ave!4!8", contact_group => "admins,analytics" } [19:07:48] maybe it is separate! [19:07:51] ottomata: you were asking about my.cnf templates. there is currently one for prod that's at templates/mysql/prod.my.cnf.erb - but it'd rather you not modify that to fit misc instances. you can check in a generic template there for use on stats1 etc. it should probably have more defaults than the couchsurf template in github though. [19:07:53] i don't know what check_packet_loss_ave is [19:07:54] it isn't a file in puppet [19:08:24] binasher: more defaults? the define sets tooons of defaults, no? [19:08:54] i only looked at the template tbh [19:09:17] ah [19:09:17] yeah, the define uses the template [19:09:17] https://github.com/ottomata/cs_puppet_mysql/blob/master/mysql.pp [19:09:17] scroll down [19:09:17] oh cool i can likn to a line :) [19:09:17] https://github.com/ottomata/cs_puppet_mysql/blob/master/mysql.pp#L87 [19:12:31] something like that would be fine [19:12:36] cool! [19:12:51] i'll work on it and tweak it to make it fit wmf, and use most of the defaults from the .deb my.cnf [19:12:59] maplebed, how are nagios_service check_commands created? [19:13:09] what is this? [19:13:09] check_command => "check_packet_loss_ave!4!8" [19:13:19] there's a file that defines it in puppet [19:13:25] OHOHOH yes [19:13:26] found it [19:13:29] thoguht I grepped for that already [19:13:31] k reading... [19:13:33] checkcommands.cfg [19:13:50] check_ganglios_generic_value [19:13:53] it is getting it from ganglia. [19:15:00] ok [19:15:09] can I run that command manually and see result from ganglia? [19:15:10] $USER3$/check_ganglios_generic_value [19:15:12] what's $USER3$? [19:15:23] ahhhh [19:15:28] more grepping answers my qs [19:15:30] that's what's in between the !!s [19:15:31] i should do that before asking :p [19:15:32] ottomata: there's also a desire to use mariadb in some places, so support for adding an array of key/value pairs to the conf might be nice - to pass in options that aren't supported by stock mysql. also make sure the naming of everything is obviously distinct from the production core db defintions, though i don't care what it's actually called [19:15:50] ok cool, i've done that before too [19:15:58] i might make a generic_mysql.pp file? [19:16:07] that sounds good [19:16:31] !log running securepoll_votes.vote_ip schema migration on all s6 dbs [19:16:34] Logged the message, Master [19:17:12] !log running securepoll_votes.vote_ip schema migration on all s5 dbs [19:17:15] Logged the message, Master [19:18:10] hmm, where is this installed? 
[19:18:10] check_ganglios_generic_value [19:18:22] should be here /usr/lib/nagios/plugins/check_ganglios_generic_value [19:18:31] but it isn't on oxygen, so I assume there is a nagios host somewhere that it is running on? [19:19:02] !log running securepoll_votes.vote_ip schema migration on all s4 + s3 dbs [19:19:03] spence! [19:19:05] Logged the message, Master [19:19:09] ottomata: yes. [19:19:22] wahh, no access to spence :( [19:19:23] spence has a cached copy of the gangila data that it uses to feed checks. [19:19:41] ottomata: was the error transient or is it still complaining? [19:19:47] transient, but i've seen it happen before [19:19:53] and robla wants me to make it not happen again [19:21:07] the easiest way to make it not happen is to make the nagios retry intervals such that they're greater than the duration of the transient time. [19:21:52] i got the OK notice about 5 minutes after the PROBLEM notice [19:21:59] hmm... [19:22:13] the ganglia-logtailer runs every 5 minutes [19:22:16] that usually happens when the metric is reported at a time interval that's too long. [19:23:02] ? [19:24:37] if ganglia has a hiccup and misses a metric and it times out you've got a 10m difference there. if nagios's retry is such that it expects it to be ok with a 5m threshold, it'll trigger. [19:25:02] probably best to increase the retry count for the nagios check [19:25:12] i.e. make it check again for >5m before alerting. [19:26:55] cmjohnson1: how's it going with grosley? [19:27:23] hmmm ok [19:27:38] default retries is 3 [19:27:52] bump that to 6 for this check? [19:28:11] I don't know how the newer nagios stuff works (paravoid does though) to know whether that's an easy change to make in puppet for one check. [19:28:25] it's no big deal in nagios [19:28:29] changing retries looks easy [19:28:32] i can pass that as a param [19:28:42] but, tryign to figrue out how long the nagios check interval is... [19:28:50] I think retry interval is 1m [19:29:07] ah so if not fixed in 3 minutes [19:29:07] yeah [19:29:13] and ganglia cron is only running every 5 [19:29:14] ok [19:29:37] jeff_green: the memory has been updated ..you now have 16Gb of RAM...the HDD could not be added the R300 only has 2 HDD slots [19:29:57] ahhhHAHhahHA [19:30:11] ottomata: I gotta bail for lunch. you good for a bit? [19:30:31] I thought grosley was the same flavor of host as aluminium, or so suggested racktables [19:30:37] * Jeff_Green cries, then dies. [19:31:50] yep..apparently racktables was wrong [19:32:13] happens! r300 r310 [19:32:17] very similar [19:32:28] the horror. [19:32:43] do we have a spare r310 around? [19:33:50] not in tampa...as far as I know..but I will look around [19:33:59] ok [19:34:11] Jeff_Green: console does not work on Grosley but I am assuming that is a security setting [19:34:26] no idea, i didn't set it up originally [19:34:33] okay [19:34:50] the RAM is showing up though [19:35:02] http://ganglia.wikimedia.org/latest/graph.php?r=2hr&z=xlarge&h=grosley.wikimedia.org&m=cpu_report&s=descending&mc=2&g=mem_report&c=Miscellaneous+pmtpa [19:35:06] that is a bizarre graph [19:35:07] yep...which is nice [19:35:36] that is bizarre [19:36:18] New patchset: Ottomata; "logging.pp - Upping number of retries for packetloss notices to 6." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7884 [19:36:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7884 [19:36:55] maplebed, would love a review of that when you get a chance [19:37:03] https://gerrit.wikimedia.org/r/#/c/7884/ [19:37:43] i know what is going on...at first I left (2) 2GB sticks and added (2) 4Gb....i decided to remove the 2GB sticks and replace with 4GB each [19:37:59] ah [19:38:05] i was thinking it was a ganglia artifact [19:38:08] so it went from 4 to 12 to 16 [19:38:57] cmjohnson1: what's the host erzurumi? it's in racktables as a R310 [19:39:40] no idea [19:39:56] can you check when you have a chance? [19:40:07] that is a r300 as well [19:40:15] gar stab. [19:40:49] the whole rack is mislabeled r310...gonna fix that now [19:41:00] ok [19:43:26] no 310's at this site [19:45:27] ok. thanks for checking./ [19:46:20] i noted all this on the RT ticket, we can give it more thought next week I guess [19:47:47] ok [19:56:21] New patchset: Aaron Schulz; "Enabling new purge hook on testwikis again." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7886 [20:01:35] !log running securepoll_votes.vote_ip schema migration on all s2 dbs [20:01:39] Logged the message, Master [20:01:49] New patchset: Aaron Schulz; "Enabling new purge hook on testwikis again." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7886 [20:02:13] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7886 [20:02:15] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7886 [20:03:24] !log running securepoll_votes.vote_ip schema migration on s1 [20:03:27] Logged the message, Master [20:04:16] ohhh great [20:04:33] securepoll_vote is totally inconsistent across enwiki dbs [20:04:34] 4003 rows on db36 [20:04:35] 4145 rows on db32 [20:04:36] 4259 rows on db59 [20:04:37] 4032 rows on db12 [20:04:48] New patchset: Ottomata; "generic-definitions.pp - install libmysqlclient-dev with mysql-client-5.1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7887 [20:05:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7887 [20:08:56] New patchset: Aaron Schulz; "Moved mediawikiwiki to new thumb purge hook; enabled concurrent ops." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7888 [20:09:18] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7888 [20:09:20] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7888 [20:09:52] * maplebed looks [20:12:46] oh good, percona tool is lying [20:14:30] !log completed securepoll_votes.vote_ip and all ipv6 schema migration [20:14:33] Logged the message, Master [20:16:32] AaronSchulz: my tests look like it's working ok on testwiki [20:16:39] I take it yours do too? [20:17:04] so far [20:17:21] hmm. with one exception. [20:17:33] it didn't purge the 450px version from squid [20:17:36] (I think) [20:18:57] it did purge it from swift, but not from squid. [20:20:01] sounds like that bug report [20:21:01] yaeh, that would have nothing to do with this change, right? [20:21:16] right [20:21:32] of course I didn't look at ms5 before the purge and see that the 450px version existed there... [20:21:45] anyway, I'll save that for later. [20:21:59] shall we move on to the second change? 
[20:22:06] or do you want to push that to all wikis first? [20:22:06] sure [20:24:32] New patchset: Aaron Schulz; "Enabled new transform hook for testwikis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7890 [20:24:46] AaronSchulz: we need to coordinate when it actually goes live, right? [20:24:52] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7890 [20:24:54] I need to turn off writes for those same wikis? [20:24:54] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7890 [20:26:51] AaronSchulz: what's the container name / URL format for thumbnails on mediawiki? [20:26:52] maplebed: sure, testwiki, test2wiki, and mediawikiwiki [20:28:00] I think wikipedia-mediawiki. do you agree? [20:28:06] yup [20:28:50] New patchset: Bhartshorne; "disabling thumbnail writes for test wikis and mediawiki to go with gerrit change r7890" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7891 [20:28:54] AaronSchulz: would you review ^^^ for typos etc.? [20:29:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7891 [20:30:37] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7891 [20:31:15] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7891 [20:31:18] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7891 [20:32:29] !log deployed parallel thumbnail purging for test, test2, and mediawiki with aaron [20:32:32] Logged the message, Master [20:32:52] AaronSchulz: I'm running puppet now (which will stage it but not enact the change) [20:33:01] ok [20:33:04] at the right time I'll kick the proxies and they'll pick up the change. [20:33:34] so if this works, what do we do next? [20:33:40] we test it! [20:33:43] :P [20:33:51] ok, I'm ready. [20:33:58] well, that's how you know "it works" :) [20:34:01] say when and I'll kick the proxies. [20:34:06] I mean should we do any more wikis? [20:34:13] maplebed: you can push it now [20:34:55] !log deployed change to swift and mediawiki for MW to write thumbnails to swift instead of rewrite.py with aaron [20:34:56] done. [20:34:58] Logged the message, Master [20:35:37] I don't think it's working [20:36:32] I'm loading new thumbnails in test (and I'm seeing them in my browser) but I don't see them in the container. [20:38:07] verified that new objects are still getting written to the commons containers. [20:38:59] heh [20:42:21] hooray!!! [20:42:23] it works! [20:43:01] and it's got a sha1 hash too! [20:43:07] heh [20:43:14] υαυ [20:43:14] er [20:43:14] yay! [20:43:23] maplebed: now when do we schedule commons? [20:43:53] should we do a few larger wikis today and let it sit for the weekend, then do commons on Tuesday? [20:44:17] ok, which wikis? nlwiki? plwiki? [20:44:51] umm... i dunno... how do you usually do tiered deploys like this? [20:46:05] pl and nl only have about 50 objects each. [20:46:19] maybe they only use commons [20:46:33] yeah [20:46:35] we can check the shard list! [20:46:54] I haven't tested my change on a sharded container, though I think it'll work. [20:47:07] but I'm game. [20:47:53] maybe itwiki, ruwiki, and zhwiki? [20:48:01] ok. [20:48:04] let's start with one [20:48:07] so I can test teh sharding stuff [20:48:11] then do the others. 
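For the "do I actually see it in the container" check above, the stock swift client works — the auth URL, credentials and the -local-thumb container suffix here are assumptions about the setup, and the big wikis shard their thumb containers, so the name may need the shard appended:

    swift -A <auth-url> -U <account:user> -K <key> stat wikipedia-mediawiki-local-thumb    # object count should tick up as new thumbs land
    swift -A <auth-url> -U <account:user> -K <key> list wikipedia-mediawiki-local-thumb | grep 'Some_file.jpg'    # or grep for the file just requested (name illustrative)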
[20:48:45] * AaronSchulz wishes pushing code was faster [20:48:48] sigh, ok [20:49:00] itwiki first [20:49:05] ok. [20:49:43] New patchset: Bhartshorne; "disabling thumbnail writes for it wiki (tests passed) to go with gerrit change r7890" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7894 [20:50:05] New patchset: Aaron Schulz; "Made itwiki use new thumb hooks." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7895 [20:50:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7894 [20:50:47] maplebed: I'll wait for you to be ready to press the last ENTER [20:50:51] ok. [20:51:51] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7894 [20:51:53] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7894 [20:54:50] AaronSchulz: go ahead. I've turned off writing in swift for it. [20:55:28] and verified that writes aren't happening. [20:55:28] * chrismcmahon surfs itwiki looking at pictures [20:55:31] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7739 [20:55:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7739 [20:55:37] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7895 [20:55:41] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7895 [20:55:46] chrismcmahon: our change should be totally transparent [20:55:55] the only effect might be that images are a bit slower [20:56:04] but you shouldn't see any actual errors. [20:56:07] please let us know if you do. [20:56:08] :D [20:56:19] I did notice a slightly slow first load, but really minor [20:56:31] first load is always slow; [20:56:44] our change would make second and third load slow (while the change is happening) [20:56:50] I guess I could do this in IE7 for laughs [20:57:21] !log several package updates on payments* and silicon [20:57:24] Logged the message, Master [20:57:29] although I think I lose a little more of my soul every time I use IE7 [20:57:38] New patchset: Ottomata; "files/nagios/check_udp2log_log_age - adding slow_log_files list." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7896 [20:57:58] New patchset: Aaron Schulz; "Revert "Made itwiki use new thumb hooks."" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7897 [20:57:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7896 [20:58:06] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7897 [20:58:08] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7897 [20:58:29] AaronSchulz: you want me to roll back the swift change for itwiki too? [20:58:44] yeah [20:59:07] itwiki isn't on wmf3, and I was calling a wmf3 function there [20:59:21] doh! [20:59:37] New patchset: Bhartshorne; "reverting disabling thumbnail writes for it wiki (tests passed) to go with gerrit change r7890" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7898 [20:59:41] which wiki should we do instead? 
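Each of these per-wiki flips follows the pattern maplebed describes above: merge the puppet change that disables rewrite.py thumbnail writes for the wiki, let the proxies pick it up, then Aaron syncs the MediaWiki config. On one proxy box that looks roughly like this — the service handling is the stock swift-init form and may not match the exact init setup in use here:

    puppet agent --test        # pull the merged change; the new proxy config lands on disk but the running workers keep the old one
    swift-init proxy reload    # "kick the proxy": graceful reload so the workers re-read the middleware config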
[20:59:54] something on wmf3 with some decent file count
[20:59:57] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7898
[20:59:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7898
[20:59:57] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7898
[21:00:16] commonswiki is like the only option, heh
[21:00:30] well, we can wait I guess
[21:00:39] we won't get any bug reports on test.
[21:00:59] yeah...
[21:02:27] well, any reason not to do commons?
[21:02:47] I guess we may as well
[21:03:42] Fee-fi-fo-fum!
[21:03:51] * maplebed preps the commons change.
[21:04:42] New patchset: Aaron Schulz; "Made commonswiki use new thumb hooks." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7900
[21:05:05] New patchset: Bhartshorne; "disabling thumbnail writes for commons wiki (tests passed) to go with gerrit change r7890" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7901
[21:05:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7901
[21:06:42] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7901
[21:06:44] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7901
[21:07:05] so AaronSchulz - I'll push the swift side live, verify writes aren't happening, then tell you to push the MW side live, then we verify writes are happening.
[21:07:07] is that right?
[21:07:37] ok
[21:08:41] AaronSchulz: pushed and verified. go go go!!!
[21:08:43] ;)
[21:08:52] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7900
[21:08:54] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7900
[21:09:41] it's writing again.
[21:10:53] purging works
[21:16:38] AaronSchulz: I think we're good!
[21:16:43] I'll post something to the commons VP.
[21:17:03] PROBLEM - MySQL Idle Transactions on db61 is CRITICAL: Connection refused by host
[21:17:03] PROBLEM - MySQL Slave Running on db61 is CRITICAL: Connection refused by host
[21:17:03] PROBLEM - mysqld processes on db61 is CRITICAL: Connection refused by host
[21:17:21] PROBLEM - MySQL Recent Restart on db61 is CRITICAL: Connection refused by host
[21:17:21] PROBLEM - MySQL disk space on db61 is CRITICAL: Connection refused by host
[21:17:48] PROBLEM - MySQL Replication Heartbeat on db61 is CRITICAL: Connection refused by host
[21:18:06] PROBLEM - MySQL Slave Delay on db61 is CRITICAL: Connection refused by host
[21:18:06] PROBLEM - Full LVS Snapshot on db61 is CRITICAL: Connection refused by host
[21:19:34] @info db61
[21:19:34] jeremyb: Unknown identifier (db61
[21:26:42] AaronSchulz: I'm noticing that there are two HEAD requests to the proxy server for each image generated. Is that expected?
[21:28:32] Maybe he just wants some head.
[21:30:17] * Reedy beats Damianz
[21:30:39] * Damianz returns Reedy a 404
[21:31:47] maplebed: I think that's right
[21:32:13] 19 log lines for one image request.
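[Editor's note: the verification step maplebed describes at 21:07 (swift side off, confirm writes stop; MW side on, confirm writes resume) amounts to watching the container object count across the flip. A minimal sketch with python-swiftclient; endpoint, credentials, and the commons thumbnail container name are placeholders, and a sharded wiki would need this per shard.]

```python
# Sketch: poll a container's object count around the config flip.
# Endpoint, credentials, and container name are placeholder assumptions.
import time
import swiftclient

conn = swiftclient.Connection(
    authurl='http://ms-fe.example.wmnet/auth/v1.0',  # assumed endpoint
    user='mw:thumb', key='SECRET')                   # assumed credentials

CONTAINER = 'wikipedia-commons-local-thumb'          # assumed container name

def object_count():
    headers = conn.head_container(CONTAINER)
    return int(headers['x-container-object-count'])

baseline = object_count()
for _ in range(10):
    time.sleep(30)
    delta = object_count() - baseline
    # Expect roughly flat while only the puppet/rewrite.py side is deployed,
    # then climbing again once the MediaWiki config change is live.
    print('%+d objects since baseline' % delta)
```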
[21:32:14] \o/
[21:32:19] while we are at it, maybe we can look at:
[21:32:21] 2012-05-17 21:10:20 srv222 commonswiki: Could not store thumbnail.Site: `wikipedia` Lang: `commons` src: `/tmp/transform_9d421e-1.jpg` dst: `mwstore://local-swift/local-thumb/4/44/Point_State_Park_in_Fall.jpg/50px-Point_State_Park_in_Fall.jpg`
[21:32:31] these have been around for quite a while
[21:32:45] Reedy: would have been more appropriate to DELETE Damianz...
[21:32:47] mostly in the 4/44 - 4/48 range
[21:33:17] swift-thumb.log
[21:34:52] AaronSchulz: I'm gonna post to the VP first, then we can dive in. sound ok?
[21:35:02] sure
[21:35:03] PROBLEM - NTP on db61 is CRITICAL: NTP CRITICAL: No response from NTP server
[21:35:08] maplebed: I bet I can cut that down to one HEAD btw
[21:35:21] AaronSchulz: just commenting out the code is cheating
[21:35:22] by using doQuickOperations()
[21:39:46] !log resumed replication to es1002
[21:39:49] Logged the message, Master
[21:41:05] binasher: ^^ db61
[21:41:39] RECOVERY - MySQL Slave Running on es1004 is OK: OK replication
[21:41:55] db61's not even on /dbtree/ ?
[21:42:12] @info db36
[21:42:12] jeremyb: [db36: s1] 10.0.6.46
[21:42:19] !log es1004 is replicating again
[21:42:22] Logged the message, Master
[21:43:24] disabled notifications for db61
[21:43:41] that works too
[21:44:06] what was up with db12?
[21:44:41] New patchset: Ryan Lane; "In a slot based system we need to ignore the slots" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7906
[21:44:57] i'm not seeing anything about db12 in my scrollback.. what was up with it?
[21:46:04] binasher: was persistently ~10-60 secs behind according to API for a significant period of time
[21:46:42] around 10:20ish UTC today
[21:47:04] also see http://ganglia.wikimedia.org/latest/?r=day&cs=05%2F15%2F2012+17%3A45+&ce=05%2F17%2F2012+17%3A45+&m=&c=MySQL+pmtpa&h=db12.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS
[21:47:14] yeah, db12 is a totally underpowered dumping ground that gets all of the enwiki watchlist and recentchange queries
[21:47:38] there's a lot of graphs that level out or otherwise change dramatically about midway through the interval
[21:47:53] indeed
[21:48:12] shutting down mysql and rebooting to run a new kernel will do that.
[21:48:34] hah
[21:49:32] binasher: and the mysql_slave_running graph on that page?
[21:50:13] maplebed: https://gerrit.wikimedia.org/r/7867
[21:51:33] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7907
[21:51:35] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7907
[21:53:32] AaronSchulz: I'm all done. Wanna look through that bug?
[21:53:55] sure
[21:53:56] !log reverted mobile change from this morning - testing completed.
[21:54:00] Logged the message, Master
[21:54:15] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours
[21:55:49] jeremyb: i restarted gmond on db12. the mysql module has a pretty big bug in that if it can't connect to mysql when its first started, it doesnt retry
[21:58:08] * jeremyb reloads ganglia
[22:00:39] !log migrated centralauth.global_group to innodb
[22:00:42] Logged the message, Master
[22:01:33] !log migrating centralauth.spoofuser to innodb via osc (13.5mil rows)
[22:01:37] Logged the message, Master
[22:02:28] huh?! why not already innodb?
[22:03:07] binasher: i see more than a few graphs now have data, thanks!
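[Editor's note: binasher's observation at 21:55 is that the gmond mysql module connects once at startup and never retries if that first connect fails. A minimal sketch of the obvious fix, a lazily reconnecting handler written in the style of a gmond Python metric module; the chosen metric, connection parameters, and function names are illustrative assumptions, not the module's actual code.]

```python
# Sketch: (re)connect inside the metric handler instead of only at module
# init, so a MySQL outage at gmond startup does not blank the metric forever.
# Connection parameters are placeholders.
import MySQLdb

_conn = None

def _get_conn():
    """Return a live connection, retrying the connect on every poll if needed."""
    global _conn
    if _conn is None:
        try:
            _conn = MySQLdb.connect(host='localhost', user='ganglia',
                                    passwd='SECRET', connect_timeout=2)
        except MySQLdb.Error:
            _conn = None              # leave unset; the next poll retries
    return _conn

def threads_connected_handler(name):
    """Example gmond-style handler: connected MySQL threads, 0 if MySQL is down."""
    global _conn
    conn = _get_conn()
    if conn is None:
        return 0
    try:
        cur = conn.cursor()
        cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
        row = cur.fetchone()
        return int(row[1]) if row else 0
    except MySQLdb.Error:
        _conn = None                  # drop the broken handle; retry next poll
        return 0
```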
[22:03:31] it might take a few more minutes for them all
[22:03:44] really?
[22:03:45] i should figure out how to fix that :/
[22:03:57] can you make it happen in labs?
[22:07:04] that would probably be a better place to hack on it
[22:07:15] is there a labs ganglia?
[22:10:07] i think so
[22:10:21] well there's a ganglia for labs that is
[22:10:29] amazingly http://ganglia.wmflabs.org/latest/
[22:10:35] is there a labs project for ganglia? i think so too
[22:11:14] https://labsconsole.wikimedia.org/wiki/Nova_Resource:Ganglia
[22:13:43] ooh
[22:13:51] ok, i need to get more immersed in labs
[22:14:13] labs is really crippled by not having per project puppet
[22:14:46] unless ya'll want to give out +2/merge/submit access to the test branch liberally
[22:15:08] but anyway that's being worked on. at least when labs itself is not freaking out
[22:18:26] !log migrated centralauth.wikiset to innodb
[22:18:29] Logged the message, Master
[22:18:49] binasher: 17 22:02:28 < jeremyb> huh?! why not already innodb?
[22:19:32] there's stuff all over that was never moved
[22:19:45] ;-(
[22:19:54] Reedy: there's more? :(
[22:20:09] None that I know of
[22:20:17] i haven't done an exhaustive search, i was pretty sad to find the centralauth tables
[22:20:19] But wouldn't suprise me if there was somewhere
[22:20:38] are the old CA tables still laying around on one of the other clusters?
[22:20:51] that's possible
[22:21:05] I seem to recall finding them somewhere else
[22:21:07] I may have RT'd it
[22:22:19] can't you just `find /a -name '*.MYI'` or something?
[22:22:26] yup
[22:23:08] Nope, you droped them already
[22:23:09] ugh
[22:23:15] I wish RT would notify you of more stuff by email
[22:23:59] there are recent tables that are myisam.. moodbar_feedback
[22:24:19] filejournal
[22:24:31] AaronSchulz: ^^ didn't you just create filejournal? grr
[22:24:38] lol
[22:24:42] yes, it's relatively new
[22:25:05] i never even saw the sql for that
[22:25:28] binasher: how many tables?
[22:25:46] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=blob;f=maintenance/archives/patch-filejournal.sql;h=114297c6d88f82da2511f2906f6f1635a70e2a52;hb=HEAD
[22:25:51] );
[22:26:00] aft_article_filter_count
[22:26:11] but not other aft tables
[22:26:13] sigh
[22:26:20] it has not table options comment
[22:26:23] *has no
[22:26:46] there are 13 myisam tables in enwiki right now
[22:26:56] archive_old = droppable?
[22:26:58] it's to easy to miss this stuff
[22:27:42] I'd suspect so
[22:28:22] Yeah, it's empty
[22:28:33] an OLD version of archive
[22:29:33] i finally get to run DROP TABLE :D
[22:30:00] user_old seems equally suspect
[22:30:06] but for now..
[22:30:08] binasher: yeah, all those journal table must be myisam then, so they need to be changed to innodb
[22:30:08] * binasher goes afk
[22:30:21] old 310163, "current" 16948031
[22:32:24] And querycache_old is probably good to go..
[22:32:55] There's load of old tables..
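[Editor's note: the MyISAM audit jeremyb suggests with `find /a -name '*.MYI'` can also be done per server from information_schema, which catches the engine directly rather than inferring it from files on disk. A minimal sketch; the host and credentials are placeholders.]

```python
# Sketch: list every MyISAM table on a database host via information_schema.
# Host and credentials are placeholder assumptions.
import MySQLdb

conn = MySQLdb.connect(host='db-host.example.wmnet', user='ops', passwd='SECRET')
cur = conn.cursor()
cur.execute("""
    SELECT table_schema, table_name, table_rows
      FROM information_schema.tables
     WHERE engine = 'MyISAM'
       AND table_schema NOT IN ('mysql', 'information_schema')
     ORDER BY table_schema, table_name
""")
for schema, table, rows in cur.fetchall():
    print('%s.%s (~%s rows)' % (schema, table, rows))
```

Each table this turns up is a candidate for either `ALTER TABLE ... ENGINE=InnoDB` (via an online schema change tool for big ones, as with centralauth.spoofuser above) or `DROP TABLE` if it is an abandoned *_old copy.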
[22:55:26] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 214 seconds
[22:55:26] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 213 seconds
[23:11:02] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds
[23:11:11] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[23:12:23] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[23:16:44] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[23:17:47] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.0094112195122
[23:22:08] PROBLEM - Packetloss_Average on oxygen is CRITICAL: XML parse error
[23:47:16] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 196 seconds
[23:47:34] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 201 seconds
[23:50:58] what's the difference between efRaiseThrottle() (CommonSettings.php:1841) and wgRateLimitsExcludedIPs ?
[23:54:19] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 17 seconds
[23:54:55] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds