[00:06:09] observium > torrus
[00:18:46] maplebed: TimStarling: check your email. msg from Ralf
[00:19:24] "since we noticed in the last few days that some of the images we fetch
[00:19:24] are corrupt" <-- nice of him to report it...
[00:19:29] grumblegrumblegrumble.
[00:20:04] I'm asking him to join the channel now
[00:20:46] who is ralf ?
[00:21:00] he's from PediaPress
[00:21:05] ah
[00:22:48] ping robla
[00:23:01] hi schmir...thanks for joining
[00:23:31] New patchset: Tim Starling; "Limit fanout like in scap, to avoid overloading the NFS server." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2570
[00:24:07] New patchset: Lcarr; "Generating initcwnd.erb with both default gateway and default interface fact" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2571
[00:24:29] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2570
[00:24:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2570
[00:24:29] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2570
[00:24:29] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2570
[00:24:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2571
[00:24:52] schmir: we probably shouldn't reenable collections before we have a good plan of attack here
[00:26:07] maplebed and TimStarling are the main people that have been working on this
[00:26:41] schmir: in ralf's email, he says "since we noticed the last few days ..."
[00:26:47] do you have more detail on exactly when that started?
[00:26:55] schmir == ralf :)
[00:27:16] specifically, did it coincide with http://blog.wikimedia.org/2012/02/09/scaling-media-storage-at-wikimedia-with-swift/
[00:27:26] ah. hooray for multiple names.
[00:27:37] (says the dude with at least three identifiers in mediawiki)
[00:27:45] * robla was just observing that :)
[00:28:47] robla: http://wikitech.wikimedia.org/view/User:Bhartshorne/pdf_thumbnail_issue
[00:28:52] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2571
[00:28:53] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2571
[00:28:58] https://github.com/pediapress/mwlib.rl/commit/b1ead48e94f32c302a711160d3544141c4679928 is one of the first commits that tries to handle broken images
[00:28:59] fwiw, swift does have a truncated version of the 1200px thumbnail.
[00:29:04] that's from january 20
[00:29:07] so there may be a legit bug here.
[00:30:47] robla: I already changed back the default to 1200px on the render servers.
[00:32:08] so you changed the default without telling any of us or logging it in the server admin log?
[00:32:42] TimStarling: yes. I didn't expect it to cause problems like this.
[00:32:59] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2556
[00:33:00] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556
[00:33:31] besides I don't even know what the server admin log is.
[00:34:14] well, join #wikimedia-tech, and then type "!log ...the log entry..."
[00:34:31] then a bot will add your message here: https://wikitech.wikimedia.org/view/Server_admin_log
[00:34:59] do you think you could do that every time you change something on the pdf servers, regardless of whether you think it will break something?
[00:35:34] only if I automate it
[00:36:08] schmir: it's going to be pretty important for us to have a protocol for changing production services that works for all of us here
[00:36:56] I'm going to guess that Tomasz or someone from the WMF never outlined that as a hard and fast requirement for you all, but it's definitely a requirement on our end
[00:37:56] automation or no, it would be useful to just be in #wikimedia-tech or this channel so that if there's a flurry of several people spending a few hours banging their head against something you can speak up.
[00:39:56] well, I didn't know about the flurry of people...
[00:40:50] you'd like to use puppet for administration anyway...but I'm still waiting for someone to get us a pdf render machine on labs with some basic puppet config
[00:42:05] * maplebed sees a labs project titled "pediapress" in existence already.
[00:42:38] * maplebed will stop being grumpy now.
[00:44:12] New patchset: Lcarr; "temp commenting out config file until new facter script propogates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2572
[00:44:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2572
[00:45:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2572
[00:45:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2572
[00:47:20] schmir: you have the ability to make the instance yourself
[00:47:37] Ryan_Lane: please read my last mail on the subject.
[00:47:54] you mean the "that's not my job, it's yours"? email?
[00:47:59] s/?//
[00:48:11] I want to have a basic puppet template as a jump start
[00:48:29] I'd imagine someone from the ops team will work with you on puppetization
[00:48:49] I think RobH is supposed to be working with you on this
[00:50:11] anyway, you don't need to start by puppetizing it. we recommend: 1. installing everything manually, and documenting the steps. 2. Puppetizing it. 3. Make an instance configured from scratch using the puppet configuration and see if it builds properly
[00:50:29] err 2. Puppetizing it based on the documentation
[00:50:58] if it doesn't work based on the documentation, the documentation should be updated. doing it this way ensures the documentation is accurate
[00:51:09] the docs are fine: http://mwlib.readthedocs.org/en/latest/installation.html#installation-instructions-for-ubuntu-10-04-lts
[00:51:12] schmir: in all of my puppet efforts i just look at existing machines and copy over the information :) there's a lot of random machines out there to check out
[00:52:13] LeslieCarr: I don't even know where to start. I have no experience with it.
[00:52:32] again, I don't think you need to do the puppetization
[00:52:45] but, we aren't going to puppetize something with pip
[00:52:54] we discussed that already, though
[00:54:01] Ryan_Lane: there are no debian packages!
[00:54:12] fyi, nfs1 is not very responsive - i am trying to connect to its mgmt now
[00:54:40] nfs1 is sending alerts
[00:54:46] I'd imagine that's for LDAP
[00:54:55] actually it appears that nfs1 is completely unresponsive
[00:55:03] schmir: well, that's going to be a problem
[00:55:29] it's fine for your software. we didn't want to manage that via puppet anyway
[00:55:36] but the dependencies need to be packaged
[00:55:52] why don't you start doing that then?
[00:56:04] I'm not even working on this
[00:56:05] * Ryan_Lane shrugs
[00:56:21] None of us is working on it. so who is?
[00:56:21] but I do need to ensure it's being done in a way that we can manage
[00:56:22] guys, nfs1 looks completely fubar
[00:56:24] rebooting ?
[00:56:28] LeslieCarr: sounds good
[00:57:02] !log rebooted nfs1 as it was unresponsive on console and via IP
[00:57:04] Logged the message, Mistress of the network gear.
[00:57:36] schmir: I don't know. I'm busy with Labs, otherwise I'd do it
[00:57:53] but realistically, it's generally the developer who makes packages
[00:58:14] if RobH is working with you on this project, he may be able to help
[00:58:52] Ryan_Lane: having debian packages is *your* requirement, not ours.
[00:59:51] well, we often don't use software that isn't packaged
[01:00:25] nfs1 was totally segfaulting
[01:00:38] especially when the software isn't terribly easy to work with
[01:00:47] not saying the software is bad. just complex.
[01:01:10] and yes, we'll likely package the software if you are unwilling to do it, but it's going to take a while
[01:02:20] LeslieCarr: is it coming back up ok?
[01:02:27] yeah, came back ok
[01:02:29] cool
[01:02:50] we can certainly do it if we're getting paid for it...but it's a lot of work...for very little gain.
[01:03:26] it's the kind of thing that can be automated into your builds
[01:03:37] so, it's a bit of upfront work, but that's it
[01:03:48] anyway, it looks like it's moving nowhere.
[01:05:18] I'm not sure who's working with me on it, not sure if I am allowed to use pip in order to install into a git repository, not sure what to do now
[01:05:44] <^demon|away> schmir: You don't have to use git-review, it just makes it easier.
[01:06:12] schmir: again, for your own software, manage it how you want
[01:06:12] <^demon|away> http://www.mediawiki.org/wiki/Git/Workflow#Manual_setup
[01:06:18] the dependencies need to be packaged
[01:06:18] it looks like you'd rather turn off the collection extension because it's complex...
[01:06:31] eh? that had nothing to do with me
[01:06:38] Tim turned that off, I believe
[01:06:38] schmir: we turned off the extension because it BROKE WIKIPEDIA
[01:07:11] we will turn off anything that breaks wikipedia
[01:07:17] our job is to keep wikipedia up
[01:07:22] yes, I got that. Ryan said you often do not use software that isn't packaged and is complex
[01:09:31] yes. we often don't
[01:09:45] because it takes too much effort to manage it
[01:12:01] and to me it looks like you're threatening to turn the thing off
[01:12:11] o.O
[01:12:20] I don't understand what you are talking about
[01:12:50] heya folks...let's not try to hash out the longer term stuff. I'd like to have Tomasz and Kul around for that
[01:12:53] looks like.
[01:13:14] for the short term issue, it'd be nice to get Collections back up and running
[01:13:16] I'm not threatening anything. I'm saying that we want to use software that is properly packaged, rather than in an unmanageable way.
[01:13:38] schmir: and I'm only talking about the new service, not the current one
[01:14:05] robla: well, this is a continued conversation that's been taking place for months
[01:14:10] via email
[01:14:33] Ryan_Lane: sure...let's *make sure* it happens when we've got the right people around
[01:15:12] * Ryan_Lane shrugs
[01:16:09] TimStarling: maplebed: you guys feel comfortable that we can turn this back on now?
[01:16:20] * maplebed does
[01:16:59] maplebed and I just discussed this problem: http://wikitech.wikimedia.org/view/User:Bhartshorne/pdf_thumbnail_issue ...and he's got some ideas about how to fix that
[01:17:54] schmir: we think we have an idea about what the root cause of the image corruption problem is, and we're working out a plan for fixing that
[01:18:04] robla: thanks
[01:18:35] schmir: if you have or can create a list of truncated 1200px images, I'd love a copy.
[01:19:32] maplebed: sorry, I think that's not possible without changing the software...
[01:20:03] you don't have them in logs or something perhaps?
[01:20:39] (i.e. wherever the russia china locator example came from)
[01:21:00] if you don't, no big deal, but it would help confirm or refute my hypothesis on why they're truncated.
[01:21:08] (by giving me a larger dataset)
[01:23:45] volker may be able to provide some more. I'll ask him.
[01:28:30] !log resuming 1.19 schema migrations after fenari reboot (on first s4 commons slave, db22)
[01:28:31] Logged the message, Master
[01:42:15] maplebed: btw volker reported the issue in some irc channel...without getting an answer. we changed the default size after that.
[01:42:44] you don't know which channel that was, do you?
[01:43:08] (just curious)
[01:43:12] I don't know. I guess the tech channel
[01:43:40] I posted announcements about swift in the en tech village pump, the commons village pump, the tech mailing list and commons mailing list, and the blog.
[01:44:04] if there was another spot that I missed that might have allowed you to see it, I'd like to add it to the list for the next time.
[01:45:15] maplebed: me see it?
[01:46:00] the post included instructions "if you see anything weird going on with thumbnails, ..." and how to get ahold of me.
[01:46:20] well, I'm not reading any of that stuff...
[01:46:30] basically, I just want to avert future recurrences,
[01:46:41] so am asking for help and advice on how to broadcast about changes that might affect folks,
[01:46:50] so that they can help us avoid situations where wikipedia breaks.
[01:48:17] so you broke it with prior announcement? :)
[01:48:21] schmir: is there any way we can inform you of changes that might affect your service?
[01:48:37] mail
[01:48:50] do you read any mailing lists?
[01:48:54] mwlib-l?
[01:49:13] if we send it there, will you see it?
[01:49:25] is there a mwlib-l mailing list?
[01:49:26] I guess mwlib doesn't have a -l. heh
[01:49:47] there's a mwlib one, for sure
[01:50:01] mwlib@googlegroups.com
[01:50:39] at any rate, I gotta bail for the evening.
[01:51:03] maplebed: seeya
[01:52:39] !log started indexer on searchidx2 with /home/rainman/scripts/search-restart-indexer per docs
[01:52:41] Logged the message, Master
[01:52:47] yes, same here. good night.
[01:53:25] Ryan_Lane: I would prefer mail to my @brainbot.com address...
[01:53:59] well, it's easier to send to a list, that way your coworkers also get the info
[01:54:15] I'll take that into consideration, though
[01:54:24] I can setup an alias on our mail server.
[01:54:36] that would be great
[01:54:49] thanks
[01:56:20] RobH is Rob Halsell?
[01:56:34] yep
[01:57:10] ok. good night.
[02:10:11] Hi - CT asked me to report here… I have two independent reports of 502 errors when saving edits on wikis. I'm not sure if it's important, or normal :)
[03:13:31] New patchset: Tim Starling; "Fixed sync-l10nupdate again." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2573
[03:13:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2573
[03:22:56] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2573
[03:22:57] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2573
[03:43:48] New patchset: Tim Starling; "Added sudoers rule for l10nupdate -> mwdeploy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2574
[03:44:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2574
[03:44:18] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2574
[03:44:39] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2574
[03:44:58] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2574
[04:26:39] New patchset: Demon; "Revert 682b27, was a stupid change. Just adding something like" [operations/software] (master) - https://gerrit.wikimedia.org/r/2575
[06:57:47] when did they make that default size change (what time and what day) I wonder
[07:29:43] New patchset: Asher; "graphite stats retention" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2576
[07:30:06] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2576
[07:30:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2576
[07:30:07] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2576
[07:36:13] New patchset: Asher; "fix lower-precision longer term storage of stats data" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2577
[07:36:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2577
[07:37:42] New patchset: Asher; "fix lower-precision longer term storage of stats data" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2577
[07:38:05] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2577
[07:38:05] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2577
[07:38:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2577
[13:07:28] do we have ipv6 interfaces in prod?
[13:07:41] I ask b/c of https://bugzilla.wikimedia.org/34362
[13:08:58] Reedy: ^^
[13:11:34] Reedy: Can you make me admin on test2?
[13:12:45] we have them only for a limited list
[13:12:55] lemme think about that
[13:13:03] that's true for upload maybe, don't remember about the ret
[13:13:04] rest
[13:13:47] apergos: can you make me admin on test2?
[13:13:52] nm
[13:13:56] :-D
[14:16:23] New patchset: Catrope; "Fix MIME type for .woff" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2578
[14:16:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2578
[15:15:14] New review: Mark Bergsma; "Any reason not to deploy this in base.pp, i.e., on all servers? :)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2556
[15:16:32] New patchset: Mark Bergsma; "Working upstart job varnishncsa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2579
[15:17:03] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2579
[15:17:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2579
[15:30:59] New patchset: Mark Bergsma; "Pass the environment as arguments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2580
[15:32:19] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2580
[15:32:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2580
[15:33:46] New patchset: Mark Bergsma; "Syntax error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2581
[15:34:28] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2581
[15:34:29] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2581
[15:36:09] New patchset: Mark Bergsma; "Apparently Puppet doesn't do string concatenation with +" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2582
[15:36:53] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2582
[15:36:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2582
[15:45:34] hexmode: http://wikitech.wikimedia.org/view/IPv6_deployment#Current_IPv6_Deployment_status
[16:02:38] !log spence lost /home, mount was "Stale NFS file handle", causing outage of stats.wikimedia.org, fixed by remounting
[16:02:39] Logged the message, Master
[16:11:52] hi robh
[16:12:39] New review: Catrope; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2264
[16:12:48] i have a general ip notation question (not for robh specifically), what does the following notation mean 232.53.21.35/32 (it's the /32 part I am curious about)
[16:13:00] It's an IP range
[16:13:07] a range over what?
[16:13:26] Well if I remember correctly, a /32 range is a range of 1 IP address :D
[16:13:32] :)
[16:13:48] what would /18 mean?
[16:13:54] (for example)
[16:14:13] Well, a /31 is a range of 2 addresses, a /30 is 4, a /29 is 8, etc etc
[16:14:21] the first 32 bits of the address are the network mask. that's all the bits in this case
[16:14:25] So a /18 contains 2^(32-18) = 2^14 addresses
[16:14:34] = 16384
[16:15:09] so for a given ip address, what is then the lower and upper bound
[16:15:15] ?
[16:15:27] To figure that out you have to convert it to binary
[16:15:33] ugggh
[16:15:44] it's not hard. you do each octet separately
[16:15:48] Then the lower bound is that address with the last N bits zeroed out, and the upper bound is that address with the last N bits set to one
[16:15:49] Yeah
[16:16:02] ok
[16:16:09] Esp. in C this is easy
[16:16:14] :D
[16:16:22] if you know your bitwise operators
[16:16:43] i guess this is a good learning case
[16:17:16] here, this is not bad: http://www.cisco.com/en/US/tech/tk365/technologies_tech_note09186a00800a67f5.shtml
[16:17:25] except no one uses "class a" etc anymore
[16:17:33] it's all subnetting now
[16:18:06] thanks apergos!
[16:18:09] much appreciatee
[16:18:10] d
[16:18:11] sure
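(Editor's note: the bitwise arithmetic described above, as a minimal C sketch. The address 232.53.21.35 and the /18 prefix are the examples quoted in the discussion; everything else is illustrative, not code from the channel.)

```c
/* CIDR range arithmetic: zero the low (32 - prefix) bits for the lower
 * bound, set them for the upper bound. A sketch, not production code. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t o[4] = {232, 53, 21, 35};   /* the example address from the chat */
    int prefix = 18;                    /* the example prefix from the chat  */

    uint32_t ip   = (uint32_t)o[0] << 24 | (uint32_t)o[1] << 16
                  | (uint32_t)o[2] << 8  | (uint32_t)o[3];
    /* /0 would shift by 32 (undefined behavior), so special-case it */
    uint32_t mask = prefix == 0 ? 0 : 0xFFFFFFFFu << (32 - prefix);

    uint32_t lo = ip & mask;    /* network address: low bits zeroed out */
    uint32_t hi = ip | ~mask;   /* broadcast address: low bits set to one */

    printf("range: %u.%u.%u.%u - %u.%u.%u.%u (%u addresses)\n",
           (unsigned)(lo >> 24), (unsigned)((lo >> 16) & 255),
           (unsigned)((lo >> 8) & 255), (unsigned)(lo & 255),
           (unsigned)(hi >> 24), (unsigned)((hi >> 16) & 255),
           (unsigned)((hi >> 8) & 255), (unsigned)(hi & 255),
           (unsigned)(hi - lo + 1));   /* for /18 this prints 16384 */
    return 0;
}
```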
[16:19:17] oh apergos, could you maybe take care of this ticket: http://rt.wikimedia.org/Ticket/Display.html?id=2436
[16:20:30] hmm my rt ticket filter must not be narrow enough, it didn't end up in my inbox
[16:36:34] so drdee
[16:36:50] yes
[16:37:01] instead of me doing that, I helped out a little with the nfs mount issue, and mutante is setting up andre :-D
[16:37:14] excellent!
[16:46:10] New patchset: Dzahn; "add account for aengels, add to stat1, fix last UID counter" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2583
[16:46:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2583
[16:56:50] New patchset: Mark Bergsma; "Make start-stop-daemon work with multiple instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2584
[16:59:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2584
[16:59:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2584
[17:05:43] New patchset: Mark Bergsma; "Sigh." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2585
[17:06:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2585
[17:10:24] New patchset: Mark Bergsma; "Add job name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2586
[17:10:47] New review: Dzahn; "approved now in RT 2436" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2583
[17:10:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2583
[17:10:48] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2585
[17:11:35] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2586
[17:11:35] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2586
[17:19:03] I need jenkins on "gallium" to ssh to "formey" so it can execute some gerrit commands there.
[17:19:36] so I could create a jenkins user on formey with an ssh key but I am not sure it is such a good idea to have jenkins able to do anything it wants on formey
[17:20:13] did we ever set up something like that previously? (software sshing between hosts)
[17:30:00] New patchset: Hashar; "swp files are now ignored" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2587
[17:30:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2587
[17:49:15] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/2264
[17:49:45] New patchset: Dzahn; "enhance page_all - area code API lookup one-liner :p - option to skip an area" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2264
[18:05:22] Now where the hell am I going to fit 4 new swift servers.
[18:06:55] typical answer: under the floor
[18:07:08] nobody will notice, it is cold and close to power plugs
[18:09:25] RobH: yeah. raid-6 with 12 disks it is
[18:10:37] 12?
[18:10:47] Ryan_Lane: so we used raid6 on dataset2
[18:11:13] yeah, why not 12?
[18:11:20] checking what we did on dataset2
[18:12:03] well god damn
[18:12:10] seems its a huge 15 disk raid6.
[18:12:12] i dont like.
[18:12:17] but meh.
[18:12:20] why not?
[18:12:27] two disks out of twelve is a lot
[18:12:34] true, and its mirrored to another system
[18:12:38] so indeed, i guess thats good
[18:12:42] netapp, for instance, defaults to 15 disk raid-6
[18:12:54] yea that is not concerning
[18:12:57] netapp does it ;P
[18:12:58] * Ryan_Lane nods
[18:13:04] well, by default
[18:13:09] you can make it larger, and many people do
[18:13:10] cuz those netapps have been totally great to work with
[18:13:14] heh
[18:13:17] and /sarcasm off
[18:13:30] maplebed: I am tracking down where I can shove these
[18:13:37] its a one here, another there kind of thing
[18:13:45] all I ask is 3 racks.
[18:13:49] well, I ask for 5.
[18:13:51] but I'll take 3
[18:14:02] hm
[18:14:09] why does this have one unconfigured disk?
[18:15:38] crap. I accidentally just made a raid-
[18:15:39] 0
[18:15:43] hehe
[18:15:47] damn netsplits
[18:16:21] maplebed: trying to get you more than three
[18:16:37] hey RoanKattouw_away
[18:16:55] maplebed: i have you 4 so far, trying to find one more.
[18:17:10] chris is going to hate this, its all mostly full racks and he has to rack in top ;]
[18:17:32] well he's netsplitted away, so he won't know our evil plans, mwhahaha
[18:17:42] ah, there we go
[18:19:05] ah. there's two arrays. I forgot about that
[18:19:28] RobH: so, I should do two raid-6 and LVM them?
[18:19:43] thats what i would suggest yea
[18:19:50] * Ryan_Lane nods
[18:19:59] though you wanna leave some space on first raid6 for os /
[18:20:16] that's a lot of wasted space :(
[18:20:23] that's 4 drives
[18:20:26] oh well
[18:20:40] ?
[18:20:45] Ryan_Lane: put the OS in the raid6
[18:20:49] yeah
[18:20:49] just dont LVM partition it
[18:20:50] I am
[18:21:13] whats the wasted space?
[18:21:29] nothing is. I'm crazy. ignore me :)
[18:21:57] that Ryan_Lane, he done goned crazy
[18:22:03] maplebed: i now hate ms-b3
[18:22:05] be
[18:22:11] i wish we had just called it msbe
[18:22:14] i hate the - ;]
[18:24:00] hm
[18:24:03] it seems we have a bad disk
[18:24:26] 01:00:02: Rebuild: 1862.50 GB
[18:24:38] it keeps cycling between that and missing
[18:24:40] hey, so db46 has a very bad replication lag -- can anyone help/show me how to troubleshoot this ?
[18:25:11] LeslieCarr: is it one of the ones that is having the schema change done?
[18:25:42] binasher's email said we'd see some dbs with replag
[18:25:45] not that i know of ? (quick email search doesn't show that)
[18:26:04] which slice is currently being done?
[18:26:16] does SAL mention anything about which slice is being done right now?
[18:26:26] ah
[18:26:39] RobH: should I enter a ticket about this drive?
[18:26:45] is there an easy way to see which slice a db belongs to ?
[18:27:14] asher made some tool for this
[18:27:51] This is the room for serious operations updates, right?
[18:28:01] If so: the espresso machine is now fixed.
[18:28:03] That is all.
[18:28:07] heh
[18:29:53] Ryan_Lane: yea drop ticket in pmtpa for cmjohnson1 to get it replaced
[18:29:58] ok
[18:30:05] though ya may not be able to set it up today then
[18:30:12] * Ryan_Lane nods
[18:30:14] not sure if you can force it to build the array with a bad disk.
[18:30:18] I'll do the rest of them for a raid
[18:30:19] you can't
[18:30:22] I tried. heh
[18:30:43] I'll make sure the rest are OK
[18:30:52] I can make a two-node gluster cluster
[18:30:57] then add the other two later
[18:33:08] Ryan_Lane: so checking the delay i do see a huge delay ( http://wikitech.wikimedia.org/view/Checking_MySQL_replication ) - just not sure what to do about it :)
[18:33:29] you may not want to do anything
[18:33:38] if it is the one doing a schema migration
[18:33:56] i think i will wait until binasher gets in…
[18:33:56] http://noc.wikimedia.org/~hashar/db.php
[18:34:15] that's s6
[18:35:41] cool :)
[18:36:43] well, seems that's not the slice he was doing
[18:39:02] ryan_lane: any idea which disk you think could be bad?
[18:39:08] in the ticket
[18:39:20] 01:00:02
[18:40:12] got it thks
[18:40:19] I'm not even getting a console on labstore2 :(
[18:40:23] ah. now I am
[18:40:31] just took a while
[18:40:43] ah so if that isn't the slice he's doing, something else is wrong...
[18:41:24] I looked at the process list
[18:41:30] looks like a bunch of processes waiting on the master
[18:43:34] Jeff_Green: ?
[18:45:00] cmjohnson1: https://rt.wikimedia.org/Ticket/Display.html?id=2442 is for the swift servers arriving tomorrow
[18:45:55] Ryan_Lane: so we have an offer from the author of OTRS to help us with upgrading to the current stable version
[18:46:04] robh: cool
[18:46:19] sounds great!
[18:46:45] is he going to take the security issues into consideration? or are they already fixed in the newer version?
[18:46:50] he reviewed our existing build including the patches Tim deployed, and he's willing to port those patches to v3
[18:47:00] ah. great.
[18:48:14] I'm not really sure what staging that upgrade will look like, but I'm guessing the 'hard' parts are porting the patches, puppetizing, testing, and the data-modification
[18:48:28] data modification will likely be the hardest part
[18:48:40] and a lot of that data contains private information, which is tricky in labs
[18:48:47] exactly
[18:49:01] I'm not opposed to having private data in labs, as long as it's short lived, and we track who has access to that project
[18:49:03] also it's a large data footprint, I wasn't sure if we're set up for that in labs yet
[18:49:08] how much?
[18:49:11] 100GB?
[18:49:14] more
[18:49:17] hm
[18:49:20] 300GB? checking
[18:49:34] hell which db is it now . . .
[18:49:36] well, I'm building the gluster cluster right now ;)
[18:49:44] !log running sync-apache to fix office redirect
[18:49:45] Logged the message, Master
[18:49:53] so, we'll have like 60+TB
[18:50:08] otrs is currently 227GB on disk
[18:50:22] I would assume we need at least double that for a reasonable conversion process
[18:50:25] ok, we'll need to wait till I can share a gluster volume with the project, then
[18:50:48] current storage is really limited, and I'd worry about running out of space
[18:50:49] roughly when do you anticipate that happening?
[18:50:55] I'm working on it right now :)
[18:51:00] hopefully in a week or so
[18:51:08] oh that's plenty of time
[18:51:41] it won't be set up in the way I'd like it to work by then, but it'll be usable storage
[18:51:42] his proposed schedule doesn't even have him needing a test system until after 3/5
[18:51:51] oh. yeah. we should be totally fine, then
[18:52:11] we just need to inform everyone that only he is allowed to be in the project
[18:52:13] and ops people
[18:52:18] ok
[18:52:27] is he singing some form of non-disclosure?
[18:52:35] signing*
[18:52:42] I'm in the process of asking that myself
[18:52:54] probably good to set up a meeting with legal
[18:52:57] I have no idea how we usually do things around here, but I would certainly want that
[18:53:00] ok
[18:53:04] actually...
[18:53:11] can you attend a meeting on friday?
[18:53:19] we are having a labs and privacy meeting with legal
[18:53:25] we can ask there. heh
[18:53:28] if you can bring me along on a laptop :-P
[18:53:35] yeah. I'll skype you in
[18:53:37] sure, skype power go.
[18:53:43] it's at 12:30 PDT
[18:53:45] arr, that apparently did not work
[18:53:55] maybe we should bring Philippe in on that too?
[18:53:57] broke office.wm
[18:54:13] he can't attend :(
[18:54:29] we'll follow up with him, though
[18:54:40] mutante: :D
[18:54:53] how'd you break it?
[18:55:02] ah. redirect loop
[18:55:03] Ryan_Lane: adding a redirect to redirects.conf
[18:55:13] Ryan_Lane: ok
[18:55:13] we wanted forced https
[18:55:18] * Ryan_Lane nods
[18:55:35] mutante: forced https ++
[18:55:47] arr, but why is it circular
[18:55:59] removing it
[18:56:17] mutante: where's the redirects.conf in question?
[18:56:26] ah. yeah. that's not going to work
[18:56:27] fenari:/home/wikipedia/conf/httpd
[18:56:38] mutante: apache has no clue that it is in https mode
[18:57:05] well, that's not totally true
[18:57:18] you need to read a request header?
[18:57:27] the X-Forwarded-Proto header says which protocol is in use
[18:57:42] so, you need to configure the redirect to only redirect if X-Forwarded-Proto is http
[18:57:55] ah that seems pretty doable
[18:57:58] should be doable via RewriteCond
[18:57:59] !log reverting the (circular) office redirect, syncing..
[18:58:00] Logged the message, Master
[18:58:02] yup
[18:59:06] it's safe to trust the header, since we strip it from any request that doesn't come from the ssl servers
[18:59:09] Ryan_Lane: do you think we would need an NDA on the labs part excluding the data transform itself?
[18:59:14] RewriteCond %{HTTP:X-Forwarded-Proto} !https (?)
[18:59:19] or something pretty close to that
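(Editor's note: a hedged sketch of the rule being described, assuming a mod_rewrite block in redirects.conf — the actual production file isn't shown in this log, only the RewriteCond line quoted above.)

```apache
# Sketch, not the real redirects.conf: redirect to HTTPS only when the
# request did not already come through the SSL terminators, which tag
# their traffic with X-Forwarded-Proto. A bare Redirect would loop,
# since Apache itself only ever sees plain HTTP.
RewriteEngine On
RewriteCond %{HTTP:X-Forwarded-Proto} !https
RewriteRule ^/(.*)$ https://office.wikimedia.org/$1 [R=301,L]
```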
[18:59:26] Jeff_Green: probably an NDA just for the data
[18:59:26] good
[18:59:38] Jeff_Green: though we might have to have ID checks for all labs users. :(
[18:59:53] which saddens me a little
[19:00:19] ok
[19:00:21] Ryan_Lane: noooooooo, anons are great!
[19:00:25] and real people are assholes too
[19:00:26] thanks all!
[19:00:31] notpeter: +1
[19:00:45] so i reverted the change, synced, graceful.. but ..hmm
[19:00:51] but, there's some board policy requiring identification for anyone who may have access to user data
[19:00:52] also.... how check? cuz, like, photoshop is a thing....
[19:01:21] Ryan_Lane: I'm not sure I think that's such a bad idea, but then I may be a fascist.
[19:01:37] * apergos eyes Jeff_Green
[19:01:55] * Jeff_Green stares back, sternly, and calmly
[19:02:07] Jeff_Green: fascist, hitler, godwin's law, etc.
[19:02:13] well, it's probably a good idea, yeah
[19:02:21] it makes things a little more painful, though
[19:02:32] notpeter: troll! troll!
[19:02:33] it definitely makes it harder to give people accounts
[19:02:38] yeah
[19:02:38] I just think that we won't be able to implement it in any realistic way, and I think that it will be a barrier to entry
[19:02:47] * apergos doesn't blink
[19:02:47] Jeff_Green: oh, most definitely :)
[19:02:47] exactly
[19:02:49] I have cats.
[19:03:28] perhaps we could have tiers of access
[19:03:34] ok, i have a packaged version of lucene-search-2 running on a box in eqiad. without nfs.
[19:03:44] clear definitions of what you can do without being 'formally' identified
[19:03:45] it's spewing out some warnings. but this is massively good progress
[19:03:59] like clearance levels, we can model ourselves on the CIA :-P
[19:04:03] Jeff_Green: yeah. likely
[19:04:06] Jeff_Green: can we call it the wedding cake access model?
[19:04:08] that's very likely what we'll do
[19:04:58] we can have people who call all your friends and ask if they can account for where you were between 12/2/1973 and 4/14/2003, and whether or not you're a compulsive gambler
[19:06:02] Jeff_Green: so no one younger than 39?
[19:06:08] that's pretty ageist
[19:06:19] "not born yet" is an option
[19:06:25] but we'll need proof.
[19:06:42] Jeff_Green: what if they were doing messed up stuff in a past life?
[19:06:58] or in a virtual one?
[19:07:04] see, the cost of employing all of these psychics to do past-life background checks just isn't going to scale...
[19:07:21] oh, yeah, we should get in touch with linden and blizzard....
[19:10:09] and just hire all of linden labs ?
[19:10:31] can't they build us a labs module in their own world?
[19:10:56] we could save ourselves a lot of work that way
[19:11:13] RoanKattouw: it's virtualization all the way down....
[19:11:28] misping?
[19:11:55] RoanKattouw: I was going to ping you about patching scap to push things out to search boxes
[19:12:08] Oh, right
[19:12:22] I was gonna theoretically do that for you when I was in SF, but I forgot
[19:12:26] And I'm sick today
[19:12:31] yeah, it's still no hurry
[19:12:38] I'm still in a testing phase
[19:12:43] so I can keep pushing stuff out by hand
[19:12:46] just wanted to check in
[19:12:50] OK
[19:13:02] my need for this will be real... hopefully within the next week
[19:14:25] Hmm, how about next Wednesday?
[19:14:33] (as in in 8 days, not tomorrow)
[19:14:40] sure
[19:14:49] arr, so office.wm still not looking good for you?
[19:14:57] tried to purge URLs from squid now
[19:15:04] RoanKattouw: by then I should be done reimaging boxes regularly
[19:15:14] hehehe
[19:15:25] and I made a dsh group called search-transition, I believe
[19:15:27] office.wm also has broken JS/CSS
[19:15:34] Because Varnish also got the redirect loops, and is caching them
[19:15:51] how to clear that ?
[19:15:54] that is for migrating all search boxes to a fully puppetized state
[19:16:31] hmph, there is a purge-varnish script but it's broken
[19:19:55] !log used purgeList.php on office.wm URLs, but it appears to be in varnish cache (broken redirect)
[19:19:57] Logged the message, Master
[19:20:13] well, you can run it via dsh directly on the varnish servers
[19:20:20] using ban.url
[19:20:24] RoanKattouw: all scripts have a 25% chance of being broken it seems
[19:20:38] it's like a slot machine ;)
[19:20:46] Ryan_Lane: The current script uses purge.url, maybe this broke due to a Varnish upgrade or something?
[19:20:50] varnishadm ban.url
[19:20:52] Aha
[19:20:54] * RoanKattouw edits script
[19:20:57] well, purge likely works too
[19:21:28] No it doesn't
[19:21:36] * Ryan_Lane nods
[19:21:39] ban.url seemed to work
[19:21:47] great
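(Editor's note: the manual purge being described, as an assumed sketch — the dsh group name and varnishadm admin port are placeholders, not values from the channel. In Varnish 3.x the CLI command ban.url replaced the older purge.url, which is why the script broke; it takes a regex matched against req.url, i.e. the path, not the full URL.)

```sh
# Run varnishadm on each cache host via dsh and ban the cached root page.
# "varnishes" and localhost:6082 are illustrative placeholders.
dsh -g varnishes -- "varnishadm -T localhost:6082 ban.url '^/$'"
```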
[19:21:51] !log Ran Varnish purge for 'office
[19:21:53] Logged the message, Mr. Obvious
[19:22:09] OK, officewiki JS/CSS is back
[19:22:31] You will probably need to do a hard refresh (Ctrl+F5) because the 301 response will be cached by the browser too
[19:23:10] thanks Roan
[19:23:40] still not working :-(
[19:24:29] people are in too many channels
[19:24:54] New patchset: Catrope; "Fix purge-varnish, wants ban.url now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2590
[19:25:16] apergos: yes it is a little overwhelming
[19:25:35] wtf
[19:25:39] ?
[19:25:48] one of these boxes is missing a ton of disks
[19:25:54] ?!
[19:25:55] missing 8
[19:26:06] Ryan_Lane: failed controller?
[19:26:14] maybe
[19:26:21] where's robh when you need him? :)
[19:26:55] the bot needs an APB function that broadcasts to every known channel, email, and pagers
[19:27:07] it says 12 physical disks, but only lists 4
[19:28:51] what does megacli tell you, anything? I assume it's some sort of perc(s)?
[19:29:23] (I have no idea what hardware you are on)
[19:30:42] no OS installed yet
[19:30:49] ah
[19:30:50] this is at the hardware raid level
[19:30:50] even funner
[19:30:58] so something is definitely wrong
[19:31:06] well, at least two of them are working properly :)
[19:31:10] that's all I need to get started
[19:42:03] ryan_lane: did you notice anything different on the 2 that went well. regarding post, etc. I am wondering if it has anything to do with UEFI
[19:42:21] I dunno
[19:42:24] this one seemed normal
[19:42:46] let me reboot and check it out again
[19:44:47] same problem
[19:45:08] oh wait
[19:45:15] it sees the other disks as "foreign"
[19:45:38] wtf does that mean?
[19:46:19] idk...i am getting that too...i think that is the problem. just need to figure out what it means
[19:47:54] hello, daniel?
[19:48:27] "Disks cannot be migrated back to previous PERC RAID controllers. When a controller detects a physical disk with an existing configuration, it flags the physical disk as foreign, and it generates an alert indicating that a foreign disk was detected."
[19:49:15] fixed
[19:49:29] cmjohnson1: in foreign view you can tell it to clear the foreign config
[19:49:44] or import it
[19:49:48] I chose to clear it
[19:49:50] ok..i see that we need to clear it
[19:49:52] andrew_wmf_: hi
[19:49:53] now all of the disks are available
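(Editor's note: here the clear was done through the PERC controller BIOS; for reference, the same operation from a booted OS with the MegaCli utility asked about earlier would look roughly like this — the adapter number is an assumption.)

```sh
MegaCli -CfgForeign -Scan -a0    # report how many foreign configs adapter 0 sees
MegaCli -CfgForeign -Clear -a0   # discard them so the disks show as unconfigured-good
```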
[19:51:25] rhalsell here?
[19:52:04] i dont think he is back on yet
[19:52:11] andrew_wmf_: i can see that you sent a message, but cant see the content
[19:52:36] test
[19:53:06] now it works
[19:53:40] ryan_lane: on labstore4...are the disks showing up as foreign or missing?
[19:53:53] now they are showing as available
[19:53:57] I cleared the foreign config
[19:54:22] okay..i think the same needs to be done for labstore1
[19:54:59] Ryan_Lane: hey
[19:55:07] so is all of the labs gear going to be on row b ?
[19:55:17] umm
[19:55:19] I dunno
[19:55:38] cmjohnson1: nah. on labstore1 I think the disk is actually broken
[19:55:48] yeah..i am getting missing disk errors
[19:55:52] yeah
[19:55:53] same
[19:56:00] but it cycles between missing and rebuilding
[19:57:33] well so far it is, so i'll make the labs subnet there for now and hope :)
[19:57:37] heh
[19:57:46] yeah. I have no clue how that stuff is planned out
[19:57:49] :)
[19:57:51] ma rk and robh would know
[19:58:10] well the big issue is that if it's cross subnet, we'd have to do some spanning tree, which mark has worked really really hard to avoid
[19:59:35] hmm I have a small issue: https://office.wikimedia.org/ has a 301 to …. self! :)
[19:59:45] is that a known issue on the cluster?
[19:59:59] hashar, yes
[20:00:05] hashar, try to force-reload
[20:00:10] the issue should be fixed now
[20:00:22] I use curl so that is not fixed for me
[20:00:28] Expires: Thu, 15 Mar 2012 18:53:04 GMT
[20:01:01] full trace http://dpaste.org/oc8Tx/
[20:09:14] mutante: is this your first site outage? :)
[20:09:17] of course the purge varnish requires root ...
[20:09:25] I have left a private message for roan
[20:11:30] the varnish purge already went around I thought
[20:13:00] !log built two raid6 arrays per labstore host. raid sets are initializing.
[20:13:02] Logged the message, Master
[20:13:29] Ryan_Lane: kind of, there was a minor one, but that didn't really count
[20:13:54] heh
[20:14:00] mutante: welcome to the team!
[20:14:01] :)
[20:14:02] hashar: yes, confirmed, that is what woosters still reported
[20:14:25] Ryan_Lane: thank you!
[20:14:43] it could have been worse, the change could have affected all wikis
[20:14:47] mutante: looks like the root URL was not properly purged :/
[20:14:53] then we would have had to purge all of our caches
[20:15:00] hashar: i am looking at it.. but the change should have been reverted all this time..and just browser caching
[20:15:07] hashar: yea
[20:15:31] does it need to be purged in squid?
[20:15:46] and didn't the redirect redirect everything that wasn't http?
[20:15:46] thats what i just did
[20:15:51] so it could have all kinds of crap in there
[20:15:58] is amsterdam a mix of varnish / squids ? :(
[20:16:09] I got it from amssq41.esams.wikimedia.org
[20:16:11] we are using squid for upload and text
[20:16:15] and varnish for mobile and bits
[20:16:28] we will eventually move to all varnish
[20:16:44] echo 'https://office.wikimedia.org/' | mwscript purgeList.php --wiki aawiki
[20:16:48] Purging 0 urls
[20:16:54] \o/
[20:19:02] mutante: the script does not accept https URL purging 8-)))))))
[20:21:28] https://www.mediawiki.org/wiki/Special:Code/MediaWiki/111478
[20:22:31] hmm..what can we do about it. the change itself is long reverted
[20:22:58] if that change goes around
[20:23:29] hi RobH
[20:23:40] Heyas
[20:23:44] I have merged it in 1.18wmf1 https://www.mediawiki.org/wiki/Special:Code/MediaWiki/111480
[20:23:52] updating fenari now
[20:23:54] test please? (I'm looking at
[20:24:03] SquidPurgeClient right now)
[20:24:26] would you have some time this week to get multicast logging enabled on Locke?
[20:24:29] we have sooo many purge scripts
[20:25:03] wasnt someone else working on multicast and ran into a problem (just recalling irc chatter)
[20:25:07] ?
[20:25:19] hashar: i see now, lagging a bit, thanks
[20:27:43] you should be able to purge it now
[20:28:13] done. it still said "0 urls" though
[20:28:39] ah, now! :)
[20:29:14] !log purged https://office link, using modified purgeList.php that accepts https urls, thanks hashar
[20:29:16] Logged the message, Master
[20:29:21] Purging 1 urls
[20:29:46] does not solve the issue though :-)))
[20:29:53] :7
[20:31:10] maybe that purge script does not do anything nowadays
[20:33:17] it's nginx sending back the redirect
[20:33:55] so we need a purge nginx script?
[20:34:05] (we need a purge all script :D )
[20:34:09] Server: nginx/0.7.65
[20:34:14] I don't know who that is
[20:35:31] nginx doesn't cache
[20:35:34] it's a transparent proxy
[20:35:46] you don't need to purge https, since we aren't varying on it
[20:35:47] better than before !
[20:35:56] hashar: try again
[20:36:23] as far as everything except for nginx is concerned, everything is http. nothing is https :)
[20:36:32] (we are doing ssl termination)
[20:37:10] mutante: works for me now! thanks
[20:37:28] hashar: :) thanks for fixing the script!
[20:37:40] well according to Ryan the script is useless
[20:37:58] but it worked after using it :p
[20:38:03] really?
[20:38:03] heh
[20:38:05] that would be odd
[20:38:07] well
[20:38:10] oh. wait
[20:38:11] i'm going to do master swaps shortly for s7, s2, and s3, starting with s7. there will likely be nagios heartbeat alerts until i update dns (i.e. s7-master cname), ignore those.. (just those)
[20:38:15] no, that makes sense
[20:38:20] do the squids know about HTTPS URLs ?
[20:38:26] a wget of https://office.wikimedia.org/ produced a redirect to the same url
[20:38:27] only redirects
[20:38:31] and only ones served by apache
[20:38:32] and the back end claimed to be nginx
[20:38:33] so
[20:38:34] which this one was
[20:38:41] aah
[20:38:50] totally forgot about that
[20:38:54] that's on the agenda to fix ;)
[20:38:57] heh
[20:39:13] ok well now when we break it again in a few minutes we know how to fix it :-P
[20:39:16] binasher: LeslieCarr was mentioning we have really bad replag on db46
[20:39:22] at least it is good to know that the cluster is still a bit messy :-))
[20:39:30] we weren't sure if this was related to the schema migrations
[20:39:35] see email about migrations
[20:39:37] that would spoil the fun if anyone could fix any issue by themselves
[20:39:38] hashar: it'll always be :)
[20:39:49] binasher: yeah, didn't know s6 was under migration
[20:40:06] binasher: can you log which ones you are starting, so we can reference the SAL?
[20:40:13] Ryan_Lane: it's automatic
[20:40:14] ryan_lane did you load anything on to labstore1? i am going to move a couple of disks around to see if it is the disk or something else.
[20:40:15] see the email
[20:40:18] oh?
[20:40:31] cmjohnson1: nope. go for it
[20:40:43] cmjohnson1: until the disk issue is fixed I can't do anything on that box
[20:40:50] please read the email with a subject of "Please Read:.."
[20:40:51] so do whatever you need to :)
[20:40:57] binasher: I read it.
[20:41:20] mutante: thanks for the fix :)
[20:41:48] binasher: ah ok. reading it again, I understand better
[20:41:56] it's doing all of them, one at a time
[20:42:11] can I request a script enhancement? :)
[20:42:29] asher@fenari:~/db/switch$ tail -1 /home/asher/db/119-migration/coredbs-1.out
[20:42:29] db46 ruwiki 1.19wmf1-1 page_redirect_namespace_len index ar_sha1 rev_sha1
[20:42:33] yes.. but for 1.20 :)
[20:42:36] * Ryan_Lane nods
[20:42:38] yeah
[20:42:41] that's a good idea
[20:42:50] easy to do since it runs on fenari too
[20:42:50] I had assumed that db46 was doing a schema migration
[20:42:55] when I looked at the process list
[20:43:07] so we just waited till you were here to ask :)
[20:43:26] yep. that would make it easy for us to look at the SAL and know the lag is definitely normal
[20:44:19] this script seems pretty awesome. were all of these done manually before?
[20:46:10] it borrows from stuff tim did for 1.16, not sure about before that though
[20:48:56] * Ryan_Lane nods
[20:49:30] thanks everybody who helped with the office issue, logging out for now
[20:49:36] night
[20:50:13] mutante: night
[20:54:39] ryan_lane: can you go into labstore1 and clear any foreign configs please...thanks
[20:54:50] sure
[20:54:53] I hadn't seen any before
[20:54:56] but lemme check
[20:55:22] New patchset: Asher; "upgrading mysql on db16, db37 is new s7 master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2591
[20:55:46] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2591
[20:55:46] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2591
[20:55:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2591
[20:56:01] oh wow
[20:56:03] look at that
[20:56:07] there were a few
[20:56:24] cmjohnson1: are you also in the interface?
[20:56:47] I was just going to delete and recreate the virtual disk
[21:16:44] cmjohnson1: looks like it's good now :)
[21:16:54] I spoke too soon. heh
[21:17:20] ryan_lane: all 12 disks are showing up now...should be good to go
[21:17:35] it's showing in the rebuild state, though, eh?
[21:17:54] if there is no os data, you guys should quick init them to kill any old syncs
[21:17:55] i pulled that disk out
[21:18:09] though at least it isn't showing as missing
[21:18:24] I wonder how long the rebuild takes
[21:18:25] New patchset: Asher; "upgrading db34, db39 new s3 master" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2592
[21:18:29] RobH: I'm initializing all of them
[21:18:35] quick init?
[21:18:39] i never do full init.
[21:18:40] full
[21:18:43] meh.
[21:18:43] why not?
[21:18:46] why?
[21:18:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2592
[21:18:46] it stresses the disks
[21:18:55] so does the OS install ;]
[21:19:03] not nearly as much as an initialize
[21:19:07] not quite as much, but meh
[21:19:18] it's a good way to see if you are going to get a failed disk early on
[21:19:32] it has been rebuilding for about 15-20 mins
[21:19:45] I'll take a look later
[21:20:01] as long as I start initializing some time today it'll all be ready for tomorrow
[21:20:15] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2592
[21:20:16] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2592
[21:26:52] RobH: virt's are in b3, yah ?
[21:27:00] and their new interfaces also in b3 :)
[21:27:25] they are in differing racks
[21:28:23] too close to alt tab.....
[21:28:56] LeslieCarr: labstore1 and 2 are in c3-sdtpa, the other two are in d1-pmtpa
[21:32:12] New patchset: Lcarr; "reenabling ifup script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2593
[21:32:43] virt1001-1008
[21:32:48] in eqiad
[21:33:03] oooo. are those racked now?
[21:33:08] robh: what is the story with dataset1
[21:33:14] that's the ciscos, right?
[21:33:38] LeslieCarr: how's the networking set up on those? two interfaces, one for management, and another for guests?
[21:33:51] or three, where two interfaces are bonded?
[21:34:02] I saw something about bonding configuration in tickets
[21:34:07] two interfaces bonded
[21:34:31] i'm unsure of the other networking requirements….
[21:34:36] did we want them in two vlans ?
[21:34:42] two different vlans, that is ?
[21:35:45] yeah
[21:36:11] eth0 should be management of the hosts, and eth1 should be the guest network
[21:36:50] based on the network statistics, it probably isn't necessary to bond the guest, but when migrating instances from host to host, it's definitely possible to saturate that network
[21:36:55] hm
[21:37:00] though it likely uses eth0 for that
[21:37:10] so, yeah. probably don't need bonding
[21:37:53] these will be configured just like the ones in pmtpa
[21:38:01] okay
[21:38:09] http://wikitech.wikimedia.org/view/OpenStack#Network_design
[21:38:24] well then i need to do a lot less work :) yay
[21:38:27] heh
[21:38:39] there's a small chance we'll need to do bonding in the future, but I doubt it
[21:39:00] the guest network is weirdly named
[21:39:12] it should be named something like "labs internal"
[21:39:15] if someone is using that much network traffic, we need to figure out what the hell they are doing :)
[21:39:22] what's it named now?
[21:39:25] guest makes it sound like anyone can hook up their laptop
[21:39:26] guest
[21:39:29] heh
[21:39:33] instance, then?
[21:39:54] instance
[21:39:57] labs instance network
[21:40:03] that sounds good to me :)
[21:40:03] guest is a virtualization term that's really common
[21:40:05] host/guest
[21:40:27] instance is an ec2/nova term
[21:40:48] cool
[21:41:10] i see where guest could come from, just so used to stuff having a "guest" network for when random people drop by
[21:41:19] heh
[21:41:26] yeah. I can see how that would bug a networking person
[21:41:43] I prefer instance over guest as well
[21:41:59] we can use domU if you really want ;)
[21:42:19] dom0/domU
[21:42:27] doesn't virtualization just suck?
[21:42:31] haha
[21:42:51] three sets of terms, all for the same damn thing
[21:42:54] yes :) whole new set of names to invent
[21:46:19] ryan_lane that rebuild is only at 13%....it will take just a few more minutes ;p
[21:46:26] heh
[21:46:39] yeah. gonna be a while I see
[21:46:46] it's frightening that a rebuild takes this long
[21:47:13] extremely
[21:47:15] if a disk goes bad, it'll take a while to replace
[21:47:23] this is life with giant disks, though
[21:47:35] this is one of the reasons raid5 is basically worthless now
[21:48:18] that and silent write errors
[21:49:54] i am not an expert with raid but doesn't raid 5 have faster write speed?
[21:50:20] yes
[21:50:31] raid6 is double parity
[21:51:26] but a 12 disk raid5 with 2TB disks worries me
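(Editor's note, for concreteness on the trade-off being weighed: with twelve 2 TB disks, RAID-5 spends one disk's worth of capacity on parity — 11 × 2 TB = 22 TB usable, surviving one failure — while RAID-6 spends two: 10 × 2 TB = 20 TB usable, but a second disk can fail during the long rebuild such large drives need. RAID-5 writes are faster because each stripe computes one parity block instead of two.)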
[21:52:32] New patchset: Asher; "upgrading db31, new s2+s4 masters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2594
[21:53:17] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2593
[21:53:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2593
[21:55:10] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2594
[21:55:11] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2594
[21:55:25] why does the nagios bot no longer alert the channels?
[21:55:35] hrm, sorry re: page
[21:55:42] no worries
[21:55:55] !log restarting ircecho on spence
[21:55:57] Logged the message, Master
[21:56:25] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[21:56:29] that's better
[21:56:36] New patchset: Lcarr; "Fixing default interface to default_gateway_interface" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2595
[21:56:58] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2595
[21:56:59] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2595
[21:57:01] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:57:41] binasher: i merged your change
[21:58:04] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 138 MB (1% inode=57%): /var/lib/ureadahead/debugfs 138 MB (1% inode=57%):
[21:58:22] thanks, i was just wondering what happened to it.. "did i forget to merge it?" hah
[21:59:52] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.309 seconds
[21:59:54] LeslieCarr: did your merge on sockpuppet succeed at syncing to the real puppet host?
[22:00:21] I thought the mobile WAP was now just hitting the normal mobile site
[22:00:29] as far as i know ?
[22:00:38] is ekrem even doing anything anymore? why are we still monitoring it?
[22:00:40] the change i merged with it synced to a host
[22:01:02] Ryan_Lane: is that the custom apple gateway thing?
[22:01:10] I thought apple is using the api now
[22:01:49] i have no idea
[22:01:51] I'll ask tomasz :)
[22:03:09] no they're not
[22:03:16] ekrem is mainly used for apple dictionary search
[22:03:28] lame
[22:03:30] and for the wap redirect
[22:03:35] can someone tell them to stop sucking?
[22:03:37] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 221 MB (3% inode=58%): /var/lib/ureadahead/debugfs 221 MB (3% inode=58%):
[22:03:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.942 seconds
[22:03:41] en.wap.* redirects to the mobile site
[22:04:40] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 1 MB (0% inode=58%): /var/lib/ureadahead/debugfs 1 MB (0% inode=58%):
[22:04:40] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=57%): /var/lib/ureadahead/debugfs 199 MB (2% inode=57%):
[22:06:19] RECOVERY - mysqld processes on db31 is OK: PROCS OK: 1 process with command name mysqld
[22:06:28] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:07:13] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay seconds
[22:07:13] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 75 MB (1% inode=58%): /var/lib/ureadahead/debugfs 75 MB (1% inode=58%):
[22:07:13] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 123 MB (1% inode=57%): /var/lib/ureadahead/debugfs 123 MB (1% inode=57%):
[22:07:32] RECOVERY - MySQL Replication Heartbeat on db51 is OK: OK replication delay seconds
[22:07:32] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay seconds
[22:07:32] RECOVERY - MySQL Replication Heartbeat on db22 is OK: OK replication delay 0 seconds
[22:07:40] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:07:49] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay seconds
[22:10:13] RECOVERY - Disk space on srv223 is OK: DISK OK
[22:10:39] !log restarted the 1.19 schema migration script - it's going to hit the just rotated s3 (db34), s2 (db30), s7 (db16), and s4 (db31) ex-masters before resuming s5 (db55) and all s6/s1 slaves
[22:10:41] Logged the message, Master
[22:11:07] RECOVERY - Disk space on srv219 is OK: DISK OK
[22:11:25] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%): /var/lib/ureadahead/debugfs 0 MB (0% inode=57%):
[22:11:25] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 274 MB (3% inode=58%): /var/lib/ureadahead/debugfs 274 MB (3% inode=58%):
[22:14:07] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 100 MB (1% inode=58%): /var/lib/ureadahead/debugfs 100 MB (1% inode=58%):
[22:14:16] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 135 MB (1% inode=58%): /var/lib/ureadahead/debugfs 135 MB (1% inode=58%):
[22:16:22] RECOVERY - Disk space on srv220 is OK: DISK OK
[22:16:31] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 134 MB (1% inode=58%): /var/lib/ureadahead/debugfs 134 MB (1% inode=58%):
[22:16:58] New patchset: Hashar; "adding .gitreview (again)" [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2596
[22:17:52] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=57%): /var/lib/ureadahead/debugfs 0 MB (0% inode=57%):
[22:18:23] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 0; - https://gerrit.wikimedia.org/r/2596
[22:18:23] New review: Hashar; "(no comment)" [test/mediawiki/core2] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2596
[22:18:23] Change merged: Hashar; [test/mediawiki/core2] (master) - https://gerrit.wikimedia.org/r/2596
[22:20:25] PROBLEM - Puppet freshness on aluminium is CRITICAL: Puppet has not run in the last 10 hours
[22:23:25] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 266 MB (3% inode=58%): /var/lib/ureadahead/debugfs 266 MB (3% inode=58%):
[22:23:42] RobH: hey the virt100X machines tubes are ready
[22:23:57] Ryan_Lane: ^
[22:24:06] the bonded interfaces for the cisco servers
[22:24:12] LeslieCarr: thanks =]
[22:24:13] sweet
[22:24:20] not bonded
[22:24:25] oh
[22:24:32] i thought they were bonded, i may have requested wrong
[22:24:34] LeslieCarr: ?
[22:24:38] oh
[22:24:40] not bonded
[22:24:41] Ryan_Lane knows best
[22:24:43] ok, cool
[22:24:43] good
[22:24:44] I just talked with her about it
[22:24:44] hehe yep
[22:24:55] put eth0 into "labs" vlan that's accessible outside
[22:25:06] and eth1 is in a vlan that is not ip'ed
[22:25:12] so is stuck on asw-b-eqiad
[22:25:38] fine you folks who can actually talk to one another ;p
[22:25:40] RECOVERY - Disk space on srv224 is OK: DISK OK
[22:25:40] RECOVERY - Disk space on srv221 is OK: DISK OK
[22:25:41] rub it in
[22:25:46] eth0 has public addresses, and 10.4.0 addresses
[22:25:49] RECOVERY - Disk space on srv223 is OK: DISK OK
[22:25:50] err
[22:25:58] 10.0 addresses
[22:25:58] RECOVERY - Disk space on srv222 is OK: DISK OK
[22:26:18] networking for labs is awkward, to say the least ;)
[22:26:54] at least it's spiffy enough that it's not rack dependent.
[22:31:18] it kind of is
[22:31:23] well, it's row dependent
[22:31:51] row dependent, yeah :(
[22:32:07] well RobH actually Ryan_Lane is not in the office today...
[22:32:23] yeah, taking care of "things"
[22:32:47] i'll be back in the office tomorrow
[22:35:36] mutante: still around ?
[22:38:25] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours
[22:38:25] PROBLEM - Puppet freshness on grosley is CRITICAL: Puppet has not run in the last 10 hours
[22:40:33] RobH: hey, can I unallocate ganglia1001 and ganglia1002 to being misc hosts ?
[22:40:45] we arent using them anymore?
[22:40:58] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.946 seconds
[22:41:21] going to put them with other services
[22:41:36] so sort of ? :)
[22:42:08] so the servers should rename back to original names and be wiped clean right?
[22:42:42] yeah
[22:42:50] then i'll use them again but they can be allocated to more stuff
[22:43:02] shit, i didnt list the old name...
[22:43:06] god damn it.
[22:43:39] time for http://en.wikipedia.org/wiki/Periodic_table
[22:44:28] hehe
[22:44:35] PROBLEM - MySQL Slave Delay on db34 is CRITICAL: CRIT replication delay 247 seconds
[22:44:41] just please not the unn… names
[22:44:53] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:45:42] !log deployed squid config to upload to send all thumbnail traffic to ms5 instead of swift
[22:45:44] Logged the message, Master
[22:46:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.516 seconds
[22:46:40] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 281 MB (3% inode=62%): /var/lib/ureadahead/debugfs 281 MB (3% inode=62%):
[22:46:44] LeslieCarr: Ok, so we need to drop a few tickets and update a few things. If you want, drop a ticket for the master renaming of them in core-ops
[22:46:48] then i can link the tasks from that.
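(The 22:45:42 !log entry is the rollback for the truncated-thumbnail problem: the upload squids stop forwarding thumbnail requests to the swift proxies and send them back to ms5, the pre-swift thumbnail server. The real change is squid cache_peer/acl configuration; the Python below only sketches the before/after routing decision, and both backend hostnames are placeholders.)

    # Sketch of the routing change; hostnames are hypothetical placeholders.
    SWIFT_BACKEND = "swift-frontend.example"  # swift proxy pool
    MS5_BACKEND = "ms5.example"               # pre-swift thumbnail server

    def pick_backend_before(path: str) -> str:
        # Pre-rollback: thumbnail URLs went to swift, everything else to ms5.
        return SWIFT_BACKEND if "/thumb/" in path else MS5_BACKEND

    def pick_backend_after(path: str) -> str:
        # Post-rollback (the deployed change): swift is out of the request path.
        return MS5_BACKEND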
[22:46:56] ok
[22:46:58] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=62%): /var/lib/ureadahead/debugfs 0 MB (0% inode=62%):
[22:48:20] ok, ganglia1001=neon and ganglia1002=cobalt
[22:48:30] and next time you need a name change you are gonna get harassed for it ;]
[22:48:56] RobH done
[22:48:57] haha
[22:49:00] yes
[22:49:14] woosters: robla: swift is now out of service.
[22:49:43] ok.
[22:49:47] maplebed: ok, thanks. looking now
[22:49:49] RECOVERY - MySQL Slave Delay on db34 is OK: OK replication delay 0 seconds
[22:50:15] (of course, squid will still have caches of partial images)
[22:50:17] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:50:34] RECOVERY - Disk space on srv224 is OK: DISK OK
[22:50:51] Is it worth logging an RT ticket to get these apaches fixed/reinstalled?
[22:50:52] RECOVERY - Disk space on srv223 is OK: DISK OK
[22:51:35] heh.. we're actually not much better off than we were before since the squids are caching all the broken images anyways; they could be cached for as long as a week.
[22:51:46] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 31 MB (0% inode=62%): /var/lib/ureadahead/debugfs 31 MB (0% inode=62%):
[22:51:46] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 178 MB (2% inode=62%): /var/lib/ureadahead/debugfs 178 MB (2% inode=62%):
[22:52:22] LeslieCarr: you will be reinstalling these systems right?
[22:52:33] or you will handle the hostname changes that is ;]
[22:52:43] (i vote reinstall)
[22:52:49] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.724 seconds
[22:53:41] sure i can reinstall them
[22:54:22] RobH: did you get the dns ?
[22:54:35] nope, updated ticket saying i didnt wanna if you werent reinstalling
[22:54:37] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 209 MB (2% inode=62%): /var/lib/ureadahead/debugfs 209 MB (2% inode=62%):
[22:54:42] didnt wanna interrupt anything you may have tied to it.
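(The worry at 22:50:15 and 22:51:35 is that pulling swift out of service doesn't help readers who hit a frontend squid that already cached a truncated thumbnail; those objects can live for up to a week. One remedy is to purge the affected URLs explicitly: squid supports an out-of-band PURGE request method when an acl permits it. A minimal sketch, assuming PURGE is enabled and allowed from this client; the host and URL in the example are hypothetical.)

    #!/usr/bin/env python3
    # Sketch of evicting a cached object from squid via the PURGE method.
    import http.client

    def purge(squid_host: str, url: str) -> int:
        conn = http.client.HTTPConnection(squid_host, 80, timeout=10)
        # Squid expects the full URL as the request target for PURGE.
        conn.request("PURGE", url)
        status = conn.getresponse().status  # 200 = purged, 404 = not in cache
        conn.close()
        return status

    # Example call (hypothetical squid host and thumbnail URL):
    # purge("sq41.example",
    #       "http://upload.wikimedia.org/wikipedia/commons/thumb/x/xy/Foo.jpg/1200px-Foo.jpg")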
[22:54:52] naw, nothing active right now
[22:54:56] you can make any changes ya want
[22:55:31] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.770 seconds
[22:55:40] RECOVERY - Disk space on srv219 is OK: DISK OK
[22:55:49] RECOVERY - Disk space on srv220 is OK: DISK OK
[22:55:49] RECOVERY - Disk space on srv221 is OK: DISK OK
[22:56:52] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:59:34] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:59:34] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=62%): /var/lib/ureadahead/debugfs 199 MB (2% inode=62%):
[23:00:46] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.270 seconds
[23:00:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.016 seconds
[23:00:55] RECOVERY - Disk space on srv219 is OK: DISK OK
[23:01:06] New patchset: Pyoungmeister; "some loggin for the lsearchz" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2597
[23:02:52] PROBLEM - MySQL Slave Delay on db34 is CRITICAL: CRIT replication delay 217 seconds
[23:04:13] RECOVERY - MySQL Slave Delay on db34 is OK: OK replication delay 0 seconds
[23:04:49] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:04:58] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 85 MB (1% inode=62%): /var/lib/ureadahead/debugfs 85 MB (1% inode=62%):
[23:07:13] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 4.593 seconds
[23:07:40] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 154 MB (2% inode=62%): /var/lib/ureadahead/debugfs 154 MB (2% inode=62%):
[23:08:06] New review: Pyoungmeister; "manually verifying" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2597
[23:08:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2597
[23:08:43] PROBLEM - Host ganglia1001 is DOWN: PING CRITICAL - Packet loss = 100%
[23:10:52] RECOVERY - Disk space on srv221 is OK: DISK OK
[23:12:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:13:25] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=62%): /var/lib/ureadahead/debugfs 0 MB (0% inode=62%):
[23:15:31] RECOVERY - Lucene on search1002 is OK: TCP OK - 0.027 second response time on port 8123
[23:17:19] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 84 MB (1% inode=62%): /var/lib/ureadahead/debugfs 84 MB (1% inode=62%):
[23:17:19] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 29 MB (0% inode=62%): /var/lib/ureadahead/debugfs 29 MB (0% inode=62%):
[23:17:37] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.043 seconds
[23:18:40] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 139 MB (1% inode=62%): /var/lib/ureadahead/debugfs 139 MB (1% inode=62%):
[23:21:06] !log modifying "martian" blocks on cr2-eqiad to allow newly allocated ip ranges
[23:21:08] Logged the message, Mistress of the network gear.
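(Context for the 23:21:06 !log entry: routers drop traffic from "martian" source ranges, addresses that should never legitimately appear on the wire. When a new public range is allocated, it has to be removed from that filter or the router silently discards its traffic. The change itself is router configuration on cr2-eqiad; the sketch below only illustrates the membership test, using the standard RFC 1918/bogon ranges.)

    #!/usr/bin/env python3
    # Sketch of a martian/bogon membership test.
    import ipaddress

    MARTIANS = [ipaddress.ip_network(n) for n in (
        "0.0.0.0/8", "10.0.0.0/8", "127.0.0.0/8", "169.254.0.0/16",
        "172.16.0.0/12", "192.168.0.0/16", "224.0.0.0/4",
    )]

    def is_martian(addr: str) -> bool:
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in MARTIANS)

    assert is_martian("10.4.0.12")         # labs-internal, rightly filtered at the edge
    assert not is_martian("208.80.154.1")  # public address space passes the filter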
[23:21:22] RECOVERY - Disk space on srv221 is OK: DISK OK
[23:21:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:21:40] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:24:40] PROBLEM - MySQL Slave Delay on db34 is CRITICAL: CRIT replication delay 201 seconds
[23:24:40] RECOVERY - Host ganglia1001 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms
[23:25:16] RECOVERY - Disk space on srv223 is OK: DISK OK
[23:25:25] RECOVERY - Disk space on srv219 is OK: DISK OK
[23:26:02] RECOVERY - MySQL Slave Delay on db34 is OK: OK replication delay NULL seconds
[23:26:37] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 5.203 seconds
[23:26:46] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 5.130 seconds
[23:27:40] PROBLEM - MySQL Slave Running on db34 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Unknown column rev_sha1 in field list on query. Default d
[23:29:19] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 38 MB (0% inode=62%): /var/lib/ureadahead/debugfs 38 MB (0% inode=62%):
[23:30:31] RECOVERY - Disk space on srv223 is OK: DISK OK
[23:30:40] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:32:01] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 8.607 seconds
[23:32:10] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:34:34] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 199 MB (2% inode=62%): /var/lib/ureadahead/debugfs 199 MB (2% inode=62%):
[23:34:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 6.309 seconds
[23:39:40] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 58 MB (0% inode=62%): /var/lib/ureadahead/debugfs 58 MB (0% inode=62%):
[23:39:49] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 155 MB (2% inode=62%): /var/lib/ureadahead/debugfs 155 MB (2% inode=62%):
[23:41:10] RECOVERY - Disk space on srv220 is OK: DISK OK
[23:42:31] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:43:22] !log allowed ipv6 pim on edge routers in the US
[23:43:25] Logged the message, Mistress of the network gear.
[23:43:52] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 7.961 seconds
[23:46:16] RECOVERY - Disk space on srv223 is OK: DISK OK
[23:46:25] RECOVERY - Disk space on srv224 is OK: DISK OK
[23:50:46] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:51:58] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.093 seconds
[23:55:43] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:56:01] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:58:25] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 9.696 seconds
[23:58:43] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 9.619 seconds
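(The 23:27:40 alert explains itself once tied to the 22:10:39 !log entry: db34, a just-rotated ex-master, started receiving replicated statements that reference revision.rev_sha1, a column added by the MediaWiki 1.19 schema migration, before that ALTER had been applied to it, so its SQL thread stopped. The usual fix is to apply the schema change on the lagging slave and restart the SQL thread. A sketch using the MySQLdb DB-API driver; host and credentials are placeholders, and in production the change is applied by the migration script via MediaWiki's patch-rev_sha1.sql rather than by hand.)

    #!/usr/bin/env python3
    # Sketch of repairing a slave that is missing the 1.19 rev_sha1 column.
    import MySQLdb

    conn = MySQLdb.connect(host="db34", user="root", passwd="...", db="enwiki")
    cur = conn.cursor()
    # Same shape as MediaWiki 1.19's patch-rev_sha1.sql: NOT NULL with an
    # empty default, so existing rows stay valid until backfilled.
    cur.execute(
        "ALTER TABLE revision ADD rev_sha1 varbinary(32) NOT NULL DEFAULT ''"
    )
    cur.execute("START SLAVE SQL_THREAD")  # resume replication once the column exists
    conn.close()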