[00:00:58] !log readded /dev/sda2 partition on streber, it was somehow deleted, borking the raidset [00:01:07] Logged the message, Master [00:01:13] which is stupid, because it's a raid1. wtf is the point of a raid1 that doesn't allow access when a disk is down? [00:01:55] where the hell is the bot? [00:01:56] morebots: poke poke [00:04:12] did someone upgrade streber recently? [00:04:17] it's pretty royally fucked [01:47:27] Error connecting to 10.0.6.47: Lost connection to MySQL server at 'reading authorization packet', system error: 0 [01:47:49] that's an odd one. [01:48:02] db37, s7 slave [01:48:10] ganglia only shows a drop in api and application traffic but not squid. This makes me think it's not something out there on the internet (eg routing issue) but something internal to us. [01:48:16] !replag [01:48:19] does that work? [01:48:26] isn't there a bot that gives us replag? [01:48:30] anon is quick, logged-in is slow [01:48:33] It did [01:49:47] Yeah... Squid load is constant [01:49:52] look on the daily graph [01:50:03] TimStarling: which daily graph? [01:50:15] of network [01:50:34] I'm sorry, we have so many graphs. could you link the one you mean? [01:50:40] http://ganglia.wikimedia.org/graph.php?g=network_report&z=medium&c=Application%20servers%20pmtpa&m=network_report&r=day&s=descending&hc=3&mc=3&st=1324432189 [01:51:01] oh, you mean that it's not a drop but the end of a spike. [01:51:02] interesting. [01:51:05] yeah [01:51:27] I can buy that. [01:51:55] http://ganglia.wikimedia.org/?m=bytes_out&r=day&s=descending&c=Application+servers+pmtpa&h=&sh=1 [01:52:28] it was just on srv229 apparently [01:53:16] TimStarling: you should start playing with ganglia3-tip.wikimedia.org. way better features. [01:53:43] ganglia.wikimedia.org is bookmarked and in my history a million times [01:53:55] you should move ganglia3-tip to ganglia if you think it's better [01:54:18] TimStarling: leslie's currently packaging the 2.2.0 release; we'll move to that as soon as it's puppetized. [01:54:42] but back to site suckage... [01:54:47] I don't see a candidate yet, still poking. [01:56:02] is there actually any problem on the site? [01:56:20] to me it just looks like there was a single fast download [01:56:31] just "[5:45 PM] People reporting editing is slow" [01:56:51] let's profile then [01:57:06] apparently the conversation is happening in -tech, not here. [01:57:08] ::sigh:: [02:55:16] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: PROCS CRITICAL: 1 process with args varnishncsa [02:55:38] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Dec 21 02:55:20 UTC 2011 [04:41:43] New patchset: tstarling; "Attempting to fix l10nupdate on the image scalers. Everything in the mediawiki-installation dsh node group should be able to get LU updates. Hume is also broken and should probably be in applicationserver::home-no-service, but I'll leave that for another " [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1653 [04:41:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1653 [04:42:44] New review: tstarling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1653 [04:42:44] Change merged: tstarling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1653 [04:48:16] New review: tstarling; "Tested." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/1653 [05:47:53] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [08:18:31] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [08:19:31] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [08:21:51] PROBLEM - Disk space on hume is CRITICAL: DISK CRITICAL - free space: / 341 MB (5% inode=79%): /a/static/uncompressed 23167 MB (2% inode=99%): [08:29:02] Good evening TimStarling [08:29:16] What was going on with LocalisationUpdate earlier today? [09:06:35] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1606 [09:13:14] PROBLEM - Disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10754 MB (3% inode=99%): [10:17:50] anyone keeping track: [10:17:56] ds1 kernel panics, log ful of em [10:18:08] so *no one close the ticket* this means you robh [10:22:24] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:24:14] 1200 GiB transferred [10:54:44] RoanKattouw: some servers didn't have the l10nupdate user [10:55:32] https://gerrit.wikimedia.org/r/#change,1653 [10:56:46] also LocalisationCache::recache() was showing up in profiling for a while, probably triggered by LU [10:57:27] I'd set up manualRecache if our sync scripts didn't suck so much [11:13:36] Oh, I've long known not all servers have that user, but I figured it wasn't harmful [11:14:12] Were there any "real" issues? [11:14:33] (Like increased CPU usage, downtime, etc) [11:27:08] can someone please have a look at gallium puppet run? [11:27:31] change https://gerrit.wikimedia.org/r/#change,1644 was merged yesterday but is not yet applied [11:28:10] it is supposed to copy a bunch of html / css files in /srv/org/mediawiki/integration/WikipediaMobile/nightly/ [11:28:41] that then make http://integration.mediawiki.org/WikipediaMobile/nightly/ available (nightly builds for the wikipedia mobile application) [11:39:13] hashar: ok, checking [11:39:41] good morning David [11:39:58] there must be an error somewhere in the puppet file [11:40:15] morning [11:40:19] Duplicate definition: File[/srv/org/mediawiki/integration/WikipediaMobile/nightly] is already defined [11:40:29] \o/ [11:40:31] in file /var/lib/git/operations/puppet/manifests/misc/contint.pp at line 160; cannot redefine at /var/lib/git/operations/puppet/manifests/misc/contint.pp:160 [11:41:11] fixing it. Thanks [11:41:17] k:) [11:44:51] New patchset: Hashar; "nightly mobile build dir was duplicated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1654 [11:45:01] mutante: ^^^ that one should fix it [11:45:01] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1654 [11:45:08] well it does not lint of course [11:46:01] New patchset: Hashar; "nightly mobile build dir was duplicated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1654 [11:46:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1654 [11:46:34] that one is better [11:47:23] New patchset: Dzahn; "make the process check on mobile traffic loggers a bit more relaxed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1655 [11:47:35] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1655 [11:48:02] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1655 [11:48:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1655 [11:49:33] New review: Dzahn; "looks good. should fix gallium. checking" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1654 [11:49:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1654 [11:50:31] ask for a few secs [11:50:32] afk [11:50:34] err [11:50:37] well will be back soon [11:51:07] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Wed Dec 21 11:50:55 UTC 2011 [11:51:13] hashar: done:) [11:51:45] great! [11:51:48] it created several files in /integration/WikipediaMobile/nightly .. [11:52:01] I can access them at http://integration.mediawiki.org/WikipediaMobile/nightly/ [11:52:07] but / give a 403 error [11:52:18] I forgot an apache directive osomewhere [11:52:28] or add an index.html? [11:52:34] na [11:52:40] the idea is to list the files [11:52:57] and add HTML header & footer to format the default Apache directory listing [11:55:56] I don't get it [11:56:00] the apache conf says: [11:56:01] [11:56:02] Options +Indexes [11:56:15] +Indexes should allow directory browsing as I understand it [11:56:27] Did you graceful Apache after changing the config? [11:56:33] lol [11:56:36] that must be it [11:56:41] I have forgot add a subscribe [11:56:48] let me do it for you, i am still on gallium [11:56:52] \O/ [11:57:11] gracefulled [11:57:21] and there they are:) [11:57:30] yeah https://integration.mediawiki.org/WikipediaMobile/nightly/ [11:57:44] please applaud my wonderful design :D [11:58:00] thanks again mutante ! [11:58:05] nice! [11:58:12] * mutante claps hands [11:59:17] heh, i guess i will have to install one of those .apk files now:) [12:14:19] heading lunch :D [12:14:45] will bug you this afternoon to get the testswarm class enabled on gallium :) [12:14:45] https://gerrit.wikimedia.org/r/#change,1646 [12:14:55] but that will be for after the lunch [12:18:56] does anyone happen to know how to use the pdus to powercycle a server? [12:19:08] first time I've needed to and there's no rob here [12:19:48] I don't have the faintest idea what make they are or anything, so while I could try to ssh to the right ip, after that I would be stuck (doesn't seem right to monkey around on one of those guessing) [12:24:29] it doesnt have drac nor lom? [12:25:06] no, it's ds1, no ilom, [12:25:40] you can get onto the serial console via pmshell but here it does us no good, I'm on the host but it won't shut down cause of the kernel issue [12:25:51] (SM hardware) [12:26:44] maybe you can still send a magic sysrequest sequence? [12:27:12] http://en.wikipedia.org/wiki/Magic_SysRq_key#Alternate_ways_to_invoke_Magic_SysRq [12:27:15] I need to hard pwoer it down [12:27:27] I can type commands on the box, that's not the isssue [12:28:09] it just won't complete the shutdown sequence, see [12:29:03] hmm,ok, it doesn't shutdown "-h" [12:29:17] if I shutdown -h then I can't bring it back up [12:29:22] no ilom! 
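
For reference, the magic-SysRq route linked above boils down to a handful of echoes; a rough sketch, assuming root on ds1's console and a kernel still healthy enough to service the trigger (which, wedged in kswapd, it may not be):

    echo 1 > /proc/sys/kernel/sysrq    # make sure the magic SysRq interface is enabled
    echo s > /proc/sysrq-trigger       # s: sync all mounted filesystems
    echo u > /proc/sysrq-trigger       # u: remount everything read-only
    echo b > /proc/sysrq-trigger       # b: reboot immediately, skipping the hung shutdown scripts
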
[12:32:20] i can tell you more about the PDUs, from the snmp traps, OIDs we use [12:32:28] $servertech_tree = ".1.3.6.1.4.1.1718" [12:32:28] great, I'll take it [12:32:35] I was looking at the snmp stuff [12:32:45] so they are "Servertech" [12:32:49] and I saw those but didn't know how to get anything more oout of them [12:32:49] servertech.com [12:32:56] going to look em up now [12:34:02] http://wikitech.wikimedia.org/view/ServerTech_CDU [12:34:28] that says almost nothing about them :-D [12:40:56] ok, found a manual for their firmware generally, going to have a look at that [12:41:17] also I am *starving* [12:41:43] was stuck somewhere around http://www.servertech.com/products/remotepwrmgmtconsoleportaccess/sentry-commander-pt40 [12:45:36] ah [12:45:45] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [12:46:09] http://www.servertech.com/products/smart-pdus/smart-pdu-cs-48vd I look ed at this one randomly and found the firmware pdfs [12:48:27] apergos: yes, yours is probably better, they look more like something from that category (CDUs), not like "sentry-commander" [12:48:46] I guess I'll poke at this after I eat [12:55:26] Happy holidays guys. Try not to work too hard ^^. [13:03:42] apergos powercycling dataset1? [13:03:53] yes [13:03:56] don't do it please [13:04:01] did you figure it out? [13:04:11] you login to powerstrip via browser [13:04:13] but if you can tell me (since I just got done eating lunch and am now looking for the username :-P) [13:04:18] via a browser? [13:04:25] I was gonna ssh in [13:04:27] no, huh? [13:04:41] I am not aware of a way to do via command line [13:04:54] so you have to have the proxy setup for lan browsing the internal vlan [13:04:59] so do I assume rigiht, first off, that it would be ps1-a1-sdtpa? [13:05:08] so i do a ssh to fenari with -D 8080 [13:05:17] well, dataset1 is in b1 i think [13:05:28] so ps1-b1-sdtpa.mgmt.pmtpa.wmnet [13:05:45] username? [13:06:09] root with mgmt info [13:06:22] I'm gonna try ssh :-P [13:06:32] just to see, might learn something! (if it's been set up [13:06:32] ) [13:07:37] nice! I'm in :-D [13:07:48] Sentry Switched CDU Version 6.0h (090310) [13:08:26] heh and now we try a fee "show me some info" commands.... [13:08:54] there is a group setup for dataset1 [13:08:54] t [13:08:54] hat [13:08:55] [13:09:00] that will cycle the array and chassis [13:11:06] hrmm, my foxyproxy stuff isnt working ... [13:13:11] wtf. [13:14:18] so when you say there's a group set up, what do you mean? (I'm gonna try to translate what you say about the web interface tothe right command line stuff) [13:16:07] hmm I am looking at show traps (very interesting), there are a couple lines: [13:16:13] .AC6 dataset1_a:xz:6 ON OFF ON 0 A 12 A [13:16:14] and [13:16:20] .BC6 dataset1_b:xz:6 ON OFF ON 0 A 12 A [13:16:24] ah and one for the array also [13:16:29] .BC7 dataset1-array1_b:xz:7 ON OFF ON 0 A 12 A [13:16:42] two for the array, sorry [13:16:44] .AC7 dataset1-array1_z:xz:7 ON OFF ON 0 A 12 A [13:20:46] yea, and the array and chassis are in the dataset1 group in the software [13:20:54] so it can do a power cycle on all 4 ports at the same time [13:21:04] but sudddenly my tunnel via -D 8080 isnt working [13:21:11] i cannot bring up anything on lan =P [13:21:16] the dataset1 group... 
not sure what you mean by that [13:21:21] * apergos continues to poke around [13:21:33] i have no idea how to do in command line, not sure if you can [13:21:38] so no idea how to help you there [13:22:08] wtf =P [13:23:22] apergos: so its my config locally, but i can get in now kinda [13:23:27] huh, I am on 6.0 of the firmware, better get that manual :-D [13:23:35] so in the web interfect its on outlet control group [13:23:54] so if you want i can do it here, but not touching unless you say. [13:25:28] re [13:25:34] ooohh I see a group [13:25:37] dataset1-all [13:25:46] sounds right [13:25:53] plus some groups for storage1 and 2 :-P [13:26:13] (I am totally going to document this plus a link to the manual when I get done :-P) [13:27:12] cool [13:28:23] +1 [13:29:11] so, did you try installing the wikipedia app on your android yet;) [13:29:31] the "nightly builds" link works since a couple hours [13:32:15] http://integration.mediawiki.org/WikipediaMobile/nightly/ [13:32:51] RECOVERY - mobile traffic loggers on cp1043 is OK: PROCS OK: 3 processes with args varnishncsa [13:32:51] RECOVERY - mobile traffic loggers on cp1044 is OK: PROCS OK: 1 process with args varnishncsa [13:33:10] going to try it on a tablet [13:38:01] PROBLEM - NFS on dataset1 is CRITICAL: Connection refused [13:40:52] New patchset: Hashar; "enable testswarm on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1646 [13:41:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1646 [13:41:54] mutante: can you ping me when you have time to unleash testswarm on gallium? [13:44:00] "latest" installs and starts on HTC Desire Z aka. T-mobile G2 - check. - installs and starts on Sony Tablet S - check [13:45:08] hashar: yea, will do, just a little while to finish initial commit for my labs instance [13:45:24] take your time [13:45:30] I am doing MediaWiki code review meanwhile [13:53:21] PROBLEM - SSH on dataset1 is CRITICAL: Connection refused [14:04:23] you saw the backread RobH that dataset1 kernel panic etc etc same old same old? [14:04:42] unfortunately [14:04:51] only we have the first instance of the panic, it was kswapd, then the resst are all scp broking [14:04:53] *borking [14:05:08] so this could be [14:05:29] still hardware, some obscure kernel bug, an issue with the kernel and the particular boards, [14:05:34] whoe th fsck knows [14:05:51] that was "who the" [14:09:39] the shadow knows [14:09:54] none of these kids will get that. [14:29:06] http://wikitech.wikimedia.org/view/PDU [14:29:08] no they won't. [14:29:19] so please feel free to add anything you like, oh just a sec actually [14:30:07] chris will be updating the firmware on all of them [14:30:22] cuz the newest row in sdtpa has new firmware, and its interface is a lot nicer for balacing the power [14:30:40] and he wanted something more sysadmin to do [14:30:53] and this was something that would be nice, but isnt time sensitive, so he can take his time and get it right [14:31:15] but i doubt the command line options will have changed, as its going grom 6.0 to 6.2 [14:31:37] well the 6.1 manual is there, I'm pretty sure these command will be good through it [14:31:53] so now you can add whatever you like, I just finished the one thing I wantd to put in [14:32:08] for example how do I know which pdu really is the right one for some host? 
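
Pulling the two access paths from the exchange above into one place; the hostnames are the ones quoted in the log, but the CLI verbs come from the Sentry firmware manual rather than being verified on these units, so treat this as a sketch:

    # web UI, as RobH describes it: tunnel a SOCKS proxy through a bastion first
    ssh -D 8080 fenari.wikimedia.org        # then point the browser's SOCKS proxy at localhost:8080
    # and browse to http://ps1-b1-sdtpa.mgmt.pmtpa.wmnet/

    # or talk to the Sentry Switched CDU's own CLI over ssh (what worked here)
    ssh root@ps1-b1-sdtpa.mgmt.pmtpa.wmnet
    # at the CDU prompt, roughly (double-check against the 6.x manual):
    #   status                  show all outlets and their on/off state
    #   reboot dataset1-all     power-cycle every outlet in the dataset1 group
    #   off .AC6 / on .AC6      drive a single outlet by its ID
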
[14:32:15] which I sort ofhandwaved over [14:32:47] for second example ds1 will automagically come back up when we do that, if it was powered on when we turned off the outlet [14:32:51] what about other hosts? [14:32:54] well, there are only 3 racks with the switching [14:32:59] the other hosts cannot be powercycled [14:33:05] ok [14:33:17] wanna scribble something about that stuff onthe page? [14:33:21] so only the network racks (a1-sdtpa, a1/8-eqiad) [14:33:25] and b1 have them [14:33:33] b1 was for old servers like storage1/2 [14:33:36] ds1 [14:33:40] sure [14:33:44] thanks. [14:34:11] ahh shit [14:34:19] so i used keepassx to redo my passwords [14:34:23] and now im locked out of wikitech [14:34:37] cuz i tried to set it to an invlaid one, it trimmed off the excess, and i have no idea what characters were tossed [14:34:44] apergos: you an admin on wikitech? [14:34:46] the "excess" eh? [14:34:50] pretty sure its email is borked [14:34:57] so someone has to reset it for me. [14:35:00] I think I am an admin over there [14:35:42] but I don't know how I reset your password [14:35:58] hrmm [14:36:04] I'm not even sure that's a feature we have in mw [14:36:09] it may have to be done via php console, in which case i can do [14:36:18] yea, i was being stupid thinking mediawiki could do something like that ;P [14:36:42] try the email though first, you never know [14:36:43] of course mediawiki expects the server to actually work [14:36:51] with email and basic services ;] [14:37:24] heh, im ssh'd into ps1-b1 now [14:37:37] apergos: i normally do this to do the initial setup, then never again [14:37:50] which should *also* be documented someplace *cough* [14:37:52] since most of our power strips are stupid, there is no reason to login to them, so never bothered to mess with this [14:37:56] feel free to add it.... [14:38:01] i cannot login! [14:38:03] heh [14:38:05] hahahaha [14:38:12] it's funny cause it's true [14:38:23] maybe we can find some other thing for you to document while we're at it [14:39:44] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [14:41:38] apergos: so yea, i just used php console to do that [14:41:42] ok [14:41:43] you know how to do it, wanna learn? [14:42:17] its interesting, cuz this is the only way we used to have to scramble user accounts, so back in the day I would do a bunch of these for projects that were internal [14:42:17] changePassword.php out of maintenance directory? [14:42:38] using eval.php with dbdefined [14:42:46] so php eval.php wikidbname [14:43:05] then you have the php console up and can just directly update entries [14:43:15] ok [14:43:27] its a root level access thing but its decent, if you want i can email you my notes [14:43:32] so changePassowrd just takes a username and a password [14:43:52] in the past i have used this to null passwords and email fields [14:43:57] thus preventing accounts from logging in [14:44:21] I think we lock accounts now for that [14:44:31] yep [14:44:42] when i did this stuff mediawiki didnt allow users to do that stuff [14:44:51] but it was handy for this, heh [14:44:51] feel free to send em, I just think I'll probably use these other tools since we have em [14:45:28] which reminds me that we'll prolly have eval.php a bit longer since we aren't on the php-hiphop train right now [14:47:05] ah so... please add stuff about the racks that have these smart pdus, how to know which one is the one you want, and how to set 'em up... 
now that your account is working :-P [14:48:33] updated page with notes on limited deployment of the switched cdus [14:49:01] yay! [14:49:08] how to setup from scratch is already on both my and chris's todo on whoever sets up a new one next [14:49:20] since i walked him through how to do just that the other day [14:49:40] ok awesome [14:49:42] New patchset: Dzahn; "small fixes to make the "nightly builds"-page validate as XHTML 1.0 Strict" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1657 [14:49:44] if he doesnt write the page before me, we should be building out row c next month [14:49:50] since we are paying for it already [14:49:53] just gotta point him to the wikitech page so he can add em [14:49:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1657 [14:50:05] oops, yeah well let's use it then for sure! [14:50:14] so I look ed at ds1's wikitech page today [14:50:16] it was depressing [14:50:19] huh, my wikitech password reset finally showed up [14:50:22] guess email is owrking on it [14:50:25] heh [14:50:29] =P [14:50:33] I should add all the rt stuff there but it's waaaay too depressing [14:50:41] (nice. snail mail eh?) [14:50:51] New review: Dzahn; "this makes it look nice on validator.w3.org" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1657 [14:51:10] i hate that we dont have a better solitoin for our inventory =P [14:51:31] i just want a mediawiki extension that does all of what racktables does, and can flag specific fields private for only speciifc groups to see [14:51:39] and that will visually lay out racks [14:51:57] and that ties into RT custom field to pull all info tied to an asset tag, which we would populate in RT tickets [14:52:12] so pulling up the server on the inventory mgmt can at minimum list all tickets involving the server. [14:52:20] thats all...... [14:52:24] yeah I would love it [14:52:54] !change 1657 | hashar [14:52:54] hashar: https://gerrit.wikimedia.org/r/1657 [14:53:14] but by rt I meant rt, not racktables [14:53:28] the last entry on the page is from late 2010. [14:53:38] no, jan 2011. [14:53:39] hashar: re.. we can look at testswarm too now [14:53:40] anyways... [14:54:00] mutante: great! maybe in private to let apergos/RobH works ? [14:54:03] work [14:54:08] sure, lets do that [14:54:30] ? [14:54:48] we dont have an outage, you guys dont have to not work in here on our account [14:54:51] ;] [14:55:44] if we're cluttering up your work, let us know [14:56:51] so here we ware with dataset1. I could... try upgrading to amore recent kernel and see if tht does anything. we could... [14:56:58] call sm back. we could... [14:56:59] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1657 [14:57:07] get a frkickin sledgehammer [14:57:21] any thoughts? [14:57:30] s/ware/aree/ [14:57:32] -e [14:57:32] hehe,ok [14:57:55] that was about "dont have to not work in here" [15:00:18] mutante: let s make it public :D [15:00:27] alright [15:00:34] apt-get remove testswarm [15:00:44] Deconfigure database for testswarm with dbconfig-common? 
Y|N [15:00:49] N [15:00:52] we can keep it [15:00:55] it should be fine [15:01:12] the only issue was with the testswarm system user not being created by the package [15:01:18] ok, done,merging the change to include it then [15:01:24] yep [15:01:26] merge + puppet run [15:02:02] most usefull linux command: watch [15:02:06] ex: [15:02:11] New review: Dzahn; "re-enabling after manual package removal and fix for user account being created" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1646 [15:02:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1646 [15:02:13] watch grep /etc/passwd [15:02:46] RobH: ? [15:03:13] everything should work on this one though [15:03:26] i would email back sm saying there are still issues and include all the info [15:03:49] hashar: merged on sockpuppet, running puppet on gallium..applying config ..NOW [15:03:54] if they want us to run some diagnostics they can provide thats fine [15:04:08] apergos: though in all honesty i wish we could just give up on this [15:04:14] * hashar grabs a coffee while puppet install stuff :D [15:04:15] ok. It's just that it's possible it's a kernel issue [15:04:16] its just been a stupid amount of time [15:04:23] notice: /Stage[main]/Misc::Contint::Test::Testswarm/Package[testswarm]/ensure: create [15:04:26] apergos: Ok, well, lets try a different kernel then [15:04:43] and hopefully it also doesnt work, so we can throw this thing away and just use the hard disks elsewhere =P [15:04:45] hashar: done, no errors [15:04:51] ok, I can put 2.6.37 on there I guess, it's not in the standard repos but whatever, this is not a production machine right now [15:04:56] the disk shelf in its entirely could just plug into a dell ;] [15:05:06] mutante: but no testswarm user :/ [15:05:11] looking at production branch [15:05:57] hashar: but last time we saw an error about the user missing ..uh? [15:06:13] mutante: yeah the .deb does not create a testswarm user. But puppet should [15:06:29] ok,yea [15:06:40] oh yeah [15:06:45] is the user creation stuff still just in test? [15:06:47] THE STUPID CHANGES WERE NOT MERGED [15:06:51] abahabhaa [15:07:27] sorry also need to merge https://gerrit.wikimedia.org/r/#change,1647 :) [15:07:56] that is the change I have been working on in 'test'. They got merged by Roan and fully reviewed yesterday by me. [15:08:02] aarr.:p [15:08:10] i was hoping we dont need that one [15:08:22] the change actually implements the contint::test::testswarm class [15:08:24] as per the discussion yesterday and Roan trying to fix it [15:08:33] ? [15:08:48] the whole thing about merging from test to production for hashar [15:09:00] where you could not push merges [15:09:04] and then went on to "plan b" [15:09:09] Oh [15:09:12] Yeah I executed plan B [15:09:17] And it led to 1647 [15:09:40] then it modify the apache conf for integration.mediawiki.org to make testswarm public and finally give some sudo privileges to the platformeng people [15:09:56] (so we can fix the app ourself without bugging you about it :D ) [15:11:52] and this resolves mark's veto because "we can't tell what it changes"? 
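
For context, the two options being weighed here look roughly like this from a checkout of operations/puppet (branch names as in the log; the commit reference is a placeholder, not hashar's actual change):

    # what "plan B" amounted to: carry one change onto production instead of merging the branch
    git checkout production
    git cherry-pick <commit-from-test>     # placeholder for the commit behind change 1647

    # the alternative: a real merge of test, previewed first so you *can* tell what it changes
    git log --oneline production..test     # every commit the merge would pull in
    git diff production...test             # the combined diff of all of them
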
[15:12:02] Well [15:12:05] Look at Gerrit [15:12:07] There's a diff right there [15:12:29] It's also not technically a merge, and it doesn't integrate the whole of the test branch, just hashar's changes [15:12:44] ok [15:13:07] yeah a merge of test into production would have merged a lot of unrelated / possibly problematics changes [15:13:22] all of them package in production as one commit IIRC [15:13:54] (For the record I don't think Mark's "we can't tell what it changes" stance is very well-informed, cause it's easy to get a diff of a merge from the command line. Yes, it sucks that Gerrit doesn't display this diff, but getting the diff is not impossible or even hard) [15:14:46] maybe so, doesn't mean I want to merge in all other changes [15:15:43] True [15:18:04] hehe @ "# Magic stuff for lazy people" [15:21:17] mutante: hehe [15:21:24] mutante: the PHP file was reviewed with Krinkle [15:21:48] mutante: although the latest changes are mostly fix for stuff that did not work :) [15:21:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1647 [15:22:16] path conflict [15:22:53] Hmm [15:22:55] I'll rebase it [15:23:58] hashar: should the .php files have an ending "?>"? [15:24:05] heh, trivial conflict [15:24:08] mutante: na we skip them :) [15:24:10] mutante: No they should not [15:24:20] kk [15:25:22] New patchset: Catrope; "script to fetch mediawiki + puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647 [15:25:33] mark: Rebased patchset ---^^ [15:25:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1647 [15:26:16] The conflicts were because was added at the same spot in the file as [15:26:50] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1647 [15:26:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647 [15:27:14] RoanKattouw: ah, yes, there was a duplicate definition with the nightly dir, that was fixed earlier [15:27:22] mutante: That wasn't it [15:27:32] ok [15:27:34] The patch was based on a revision that didn't contain the nightly dir at all [15:27:58] So git went "oh the prod branch adds the nightly stuff, but your change adds the testswarm stuff in the same place, what should I do" [15:28:07] well, the one in 1654 is the good one that fixed it [15:28:10] add both? 
:D [15:28:16] In this case yes [15:33:00] well thanks for the help :) [15:33:11] mutante: I guess you can get puppetd to update gallium now [15:33:23] on it [15:33:28] \o/ [15:37:31] merged it on sockpuppet..running again [15:37:48] breaks it :/ [15:37:53] wonderful [15:37:59] Duplicate definition: Sudo_user[demon] is already defined [15:38:45] site.pp at line 1022 [15:39:05] cannot redefine on 1023 [15:39:20] yeah, I wasn't sure how to define additional rights [15:39:31] so I just copy pasted the above line and s/jenkins/testswarm/ [15:39:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1623 [15:39:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1623 [15:40:15] mutante: I should merge the privilege arrays [15:40:47] hashar: ok, i was about to get the example, yea [15:42:01] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1649 [15:42:48] yea, i expected the dependency problem, but will fix, thanks for review [15:44:25] New patchset: Hashar; "gallium: avoid duplicate sudo_user definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1658 [15:44:31] RECOVERY - NFS on dataset1 is OK: TCP OK - 0.000 second response time on port 2049 [15:44:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1658 [15:44:57] mutante: 1658 should fix the sudo_user duplication ^^^ [15:46:41] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1658 [15:46:42] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1658 [15:48:22] hashar: fixed it. it's running.. ..but: [15:48:26] crontab: user `testswarm' unknown [15:49:17] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1606 [15:49:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1606 [15:49:23] hashar: BUT.. you got your user:) testswarm:x:2008:1001::/var/lib/testswarm:/bin/bash [15:49:41] mutante: looks like I should have a made an additional subclass named something like contunt::test::testswarm::systemuser [15:49:44] and require it [15:49:48] hashar: looks like an ordering issue, first user, then cron [15:50:07] that would have saved us the trouble to add a require=>User['testswarm'] to each statement [15:50:52] or puppet should handle those himself [15:51:11] hashar: hmm, not sure about more subclasses for users.. cant you just move the cron stuff after the system_user [15:51:52] it is after :) [15:52:07] oh. nvm [15:52:09] maybe ruby/puppet order the hash [15:52:20] and thus 'cron' ends up being executed before 'systemuser' [15:52:21] ' [15:52:30] (cause C < S ) [15:52:34] how do other hosts not run into that? [15:52:58] I have no idea, maybe cause the users have been setup already [15:53:12] or not using systemuser or I am cursed [15:53:25] or maybe this is just a warning on the very first run and doesnt even really matter? [15:53:42] if you rerun puppet, it should add the cron successfully [15:53:43] because it will create the cron on second run [15:53:57] so it is not very clean, but end up working "eventually" [15:54:15] yep [15:54:51] what's the issue? 
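
The first-run/second-run behaviour just described is easy to reproduce by hand; a rough sketch on gallium (the clean fix being discussed is an explicit require on the user rather than relying on the re-run):

    sudo puppetd --test              # run 1: cron fails with "crontab: user `testswarm' unknown"
    getent passwd testswarm          # ...but the system user now exists
    sudo puppetd --test              # run 2: the cron resource applies cleanly this time
    sudo crontab -u testswarm -l     # confirm the testswarm crontab is in place
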
[15:55:38] cronjobs and a system_user are setup via puppet, and on first run you get "crontab: user `testswarm' unknown", and after that you have the system user [15:55:57] ...so cron should require the systemuser [15:56:07] but since it works on second run..it works anyways [15:56:17] sure, but that's not how it should be [15:56:21] it's a dependency [15:56:30] if you wait long enough, it'll probably, hopefully resolve itself [15:56:31] RECOVERY - SSH on dataset1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:56:35] yea, that's what hashar meant with "not very clean" [15:56:41] but if you specify dependencies as they should be, then that's not even necessary [15:56:57] hashar: there is another problem though, unfortunately [15:57:19] the cron job is running and setting up ton of MediaWiki installation right now :)) [15:57:21] looks promising [15:57:31] "Could not evaluate: Could not retrieve information from environment production source(s) puppet:///files/testswarm/testswarm-checkouts.conf at /var/lib/git/operations/puppet/manifests/misc/contint.pp:246" [15:58:27] I thought I removed that one yesterday [15:59:46] hashar: also wanna add the "require" then (cron / systemuser)? [16:01:46] well not really needed [16:01:54] please do [16:02:23] why does puppet does not detect that dependency automatically ? [16:03:02] because it can't [16:03:13] puppet doesn't necessarily manage that user [16:04:19] mutante: I think we have to purge and reinstall the testswarm package [16:04:27] mutante: some user rights are incorrect [16:05:24] ok, purging it, then letting puppet re-install it [16:06:27] by purge you mean incl. to deconfigure the db this time? [16:09:08] would be an option to avoid "dbconfig-common", btw? or is that explicitely wanted [16:09:46] New patchset: Hashar; "testswarm: make sure we have a system user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1659 [16:09:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1659 [16:10:15] dbconfig-common is a nice way to setup a database on Debian that save all the trouble of loading db9 or something else [16:10:33] plus that would let us easily trash the content of that database whenever we have too [16:10:43] the app does not really support that from its web interface [16:10:58] mutante: what I would need is to have the postinst script to run [16:11:05] maybe dpkg-reconfigure triggers that [16:11:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1659 [16:11:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1659 [16:11:19] yea, but just because it asks the dialog question, and wondering how that works if puppet installs the package [16:11:21] mark: thanks ) [16:11:27] (not needing a human) [16:11:41] not sure [16:11:57] so want me to deconfigure db or keep that config? [16:12:26] lets just keep that config [16:12:28] best test would be to remove all [16:13:03] ok, well, it does not need to be told a password, it just needs to be told , and then generates a random pass [16:13:04] that is simple unless you want to dpkg --purge testswarm then have it installed by puppet [16:13:25] that sounds like a good change the same happens if puppet installs it [16:14:09] then it doesn't ask that question [16:14:13] and picks the default [16:14:16] ok, package purged, besides db config.. 
it did not remote /etc/testswarm because that was not empty [16:15:06] mark: ok, good, then it should be fine to use dbconfig-common [16:15:16] not empty because puppet install a sample conf there [16:15:21] so that is expected [16:15:32] hashar: want to have /etc/testswarm deleted ? [16:15:43] before puppet run [16:15:46] yes please :) [16:15:54] so we know what happens there [16:16:28] that is installed by the .deb package anyway [16:16:45] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Dec 21 16:16:37 UTC 2011 [16:18:09] hashar: oh oh.. not so good :/ dpkg error re-installing the package [16:18:43] nothing else ?: / [16:18:44] hashar: and it is dbconfig-common :p [16:19:04] sanity check failed for dbc_dbuser. [16:19:04] error encountered creating user: [16:19:04] No database user specified. [16:19:04] dbconfig-common: testswarm configure: aborted. [16:19:04] dbconfig-common: flushing administrative password [16:19:21] dpkg: error processing testswarm (--configure): [16:20:25] can you somehow specify that database user in your package? [16:20:44] that is by installing the package with puppet? [16:20:54] yes, this is from puppet log [16:21:21] the user should just be "testswarm" aka the default [16:21:56] maybe dpkg just skip all the configuration question since there is no human reading the screen? [16:22:17] I would just dpkg-reconfigure it manually [16:22:19] i guess need to call dbconfig-common with "-u testswarm" or so [16:22:49] but how to pass that from testswarm package to dbconfig-common [16:23:41] hashar: yes, as mark just said it picks the defaults then, and that is fine for the password, because it gets generated, but we still need the user name ..hrmmm [16:24:12] still looking at that :) [16:24:18] sure i can reconfigure it manually for now, but yea [16:24:32] it cant really be installed automatically then [16:25:19] dpkg-reconfigure: testswarm is broken or not fully installed :p [16:25:20] well it at least ask for a password at one point [16:25:25] hehe [16:26:37] ok, fixed manually, it's installed, db config should be untouched, other configs rewritten by puppet [16:26:50] how does it look to you now? [16:27:18] i have lost my terminals [16:28:22] hashar: how does it look for you now? installed package manually, puppet finishes runs [16:28:52] well it is missing user / password [16:29:19] dbconfig-common should generate them, create the db and all and place username/password in /etc/testswarm/config.ini [16:29:32] duh :p told it to not _de_configure [16:29:45] hehe [16:30:10] maybe dpkg-reconfigure regenerate them? [16:30:20] or you have to --purge the package again and reinstall it manually [16:31:11] unix socket or tcp/ip? (havent been asked before ?) [16:31:49] doh [16:32:10] ok, within configuration dialog: [16:32:12] mysql said: ERROR 1050 (42S01) at line 5: Table 'clients' already exists [16:32:12] let s do tcp/ip [16:32:20] you dont care if i "ignore", right [16:32:30] yeah just skip that [16:32:39] hashar: that would be another that is not default :p [16:33:20] hashar: config.ini has content again now [16:33:34] looks better [16:33:59] so the /etc/testswarm/config.ini file is loaded by the app [16:34:07] which is denied access: https://integration.mediawiki.org/testswarm/ [16:34:11] testing manually [16:37:12] hashar: oh you said it is supposed to talk to db9 already? [16:37:19] no [16:37:22] to the local database :) [16:37:27] ok.. [16:37:58] does dbconfig-common just write that one config.ini for you? 
might be easier to just have that config in the private repo.. ? [16:38:34] the idea in using dbconfig-common is that everything is setup automatically [16:38:44] hashar / mutante: to answer questions packages ask during installation you can use debconf-get-selections and debconf-set-selections [16:38:46] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [16:39:00] that's how you teach puppet the answer to a question without hardcoding it in the package. [16:39:08] looks like something got screwed in the process and the testswarm user has a wrong password [16:39:17] maybe the previous password since we did not flush the db [16:39:38] maplebed: ah, if that works in puppet, great! thx [16:40:12] mutante: it's a debian packagke thing, rather than a puppet thing. it's used anywhere you need unattended package installation. [16:40:22] ok, gotta run to post office, back shortly. [16:40:31] ok [16:40:38] mutante: https://integration.mediawiki.org/testswarm/ [16:40:41] mutante: password fixed [16:40:45] maplebed: thanks for the tip [16:40:48] maplebed: yea, but did we use it in puppet yet? [16:41:02] mutante: the DB Password for mysql user testswarm was incorrect :) [16:41:04] I'm not sure. I think leslie might have done something with it. [16:41:14] maplebed: like how to combine it with the puppet package class etc [16:41:23] maplebed: ok, will check, thx [16:41:38] I would suggest that the package installation requeires an exec that runs debconf-set-selections, but there might be a better way. [16:42:06] mutante: the last change would be to chown testswarm:testswarm /etc/testswarm :D [16:42:11] or I can amend puppet [16:42:14] hashar: i dont know how you fixed it and the password was just created, but great :) [16:42:49] maplebed: sounds reasonable, yea [16:43:45] hashar: if you want to know it's really fixed, let puppet do it right away, i guess [16:44:18] hashar: but if you want me to for testing right now, also tell me, np [16:44:38] for someone sick and in bed you are doing too much work :-P [16:46:00] New patchset: Hashar; "testswarm: fix /etc/testswarm permissions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1660 [16:46:03] mutante: here is the change :) [16:46:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1660 [16:47:16] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1660 [16:47:19] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1660 [16:50:48] hashar: just the warning about testswarm-checkouts.conf is left. 
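
maplebed's debconf-get-selections / debconf-set-selections tip above comes down to roughly this; the key names follow dbconfig-common's usual template naming and are assumptions, not values read from the testswarm package, so dump the real ones first:

    sudo apt-get install debconf-utils              # provides debconf-get-selections
    sudo debconf-get-selections | grep ^testswarm   # see which questions the package actually registers
    printf '%s\n' \
      'testswarm testswarm/dbconfig-install boolean true' \
      'testswarm testswarm/db/app-user string testswarm' \
      | sudo debconf-set-selections
    # (a password key along the lines of testswarm/mysql/app-pass usually exists too)
    sudo apt-get install testswarm                  # an unattended install now picks these answers up
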
the directory is owned by testswarm [16:51:19] great [16:51:42] i am calling "deploy testswarm on gallium" resolved [16:52:39] New patchset: Hashar; "testswarm-checkouts.conf is not needed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1661 [16:52:47] yeah looks fine mutante [16:52:56] I am finishing the software configuration [16:52:59] cool [16:53:02] something that can't really be puppetiez [16:58:20] https://integration.mediawiki.org/checkouts/mw/trunk/r105770/tests/qunit/?filter=mediawiki.util&_=1324486738245&swarmURL=https%3A%2F%2Fintegration.mediawiki.org%2Ftestswarm%2F%3Frun_id%3D57%26client_id%3D5%26state%3D [16:58:21] bah [16:58:29] not fully working yet :D [16:58:37] does not recognize index.html [17:01:10] New patchset: Hashar; "add index.html to DirectoryIndex for integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1662 [17:01:57] New review: Hashar; "This is a cherry pick of bf0f391d from test to production" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1662 [17:02:00] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1661 [17:02:00] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1661 [17:02:12] mutante: and 1662 :) [17:02:23] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1662 [17:02:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1662 [17:02:28] mutante: looks like I did not manage to get all changes merged from test to production :(( [17:04:01] mark: you did you poke around at the pmtpa swift cluster at all today? [17:06:35] hashar: your change to add index.html looks good and has been applied, but still does not do what you expected it seems (yes, apache reloaded) [17:06:51] :-( [17:07:33] maplebed: no, not yet [17:07:58] (but not done today yet) [17:09:10] mutante: well it works for http not for https :-D [17:09:45] hashar: ah ..ok [17:09:59] maplebed: get well ;) [17:10:08] tnx. [17:10:46] I wonder if streber is in a bad enough state that i need to fix that first :/ [17:10:52] +1 [17:11:50] I like your idea of only splitting container for commons [17:11:56] or, making it configurable "per wiki" [17:12:01] (and only do it for commons, for now) [17:12:29] New patchset: Hashar; "testswarm: index.html as default for HTTPS too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1663 [17:12:35] mutante: that should be the last change for n ow [17:12:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1663 [17:12:43] mutante: the RT ticket can be closed :) [17:13:00] mutante: will drop a thank you / congratulations mail this evening hopefully [17:13:21] streber is fucked in some weird way, I think it's hardware [17:13:29] it's been weird for some weeks now [17:14:32] yea, more than one of use rebooted it due to stuff like "dpkg being stuck due to kernel bug" [17:14:40] us [17:14:54] yeah [17:15:14] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1663 [17:15:14] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1663 [17:15:23] eep, what happened last night ? [17:15:30] last night? 
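
The http-vs-https asymmetry being chased here usually means the directive only lives in the port-80 vhost; a quick way to confirm from the shell (the URL is just the example from the log, and none of this is the real integration.mediawiki.org config):

    apache2ctl -S                       # dump the parsed vhosts and which config file each one came from
    curl -sI  http://integration.mediawiki.org/WikipediaMobile/nightly/  | head -1   # plain HTTP vhost
    curl -skI https://integration.mediawiki.org/WikipediaMobile/nightly/ | head -1   # -k in case the cert doesn't validate
    sudo apache2ctl graceful            # reload once the SSL vhost has the DirectoryIndex too
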
[17:17:15] hashar: https://integration.mediawiki.org/checkouts/mw/trunk/r105770/tests/qunit/ [17:17:16] New review: Hashar; "It fixed the issue! Thanks for the fast merge 8-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1663 [17:18:45] mutante: wonderful :) [17:18:46] in my scrollback looks like some bad editing speed/site suckage.. but then i closed my laptop and the scrollback goes away... [17:18:53] mutante: so as I said, you can close the testswarm ticket [17:18:55] hashar: np:) RT closed:) [17:18:58] mutante: I can't find it in RT thgouh [17:19:01] oh good [17:19:19] I still have some bugs but that is in the app I think t [17:19:24] the ops part should be fine now [17:19:27] but i can just go find chat logs... [17:19:38] yeah, new bugs are not part of "initial deploy".. /me nods [17:20:16] well that took a looong time but at least I have learned lot of stuff in the process [17:20:27] the first would be to use a topic branch and run the VM out of it [17:21:53] mark, apergos: I'm going to start up the uploader again to keep populating swift. We're at 2.2 million objects in the commons bucket right now and read throughput (if I warm up the dataset) is still >1000qps. [17:22:05] feelfree to kill it if you like (it'll be running as me on hume) [17:22:11] cool [17:22:18] hume had a disk space warning [17:22:28] i'll keep an eye on it [17:22:48] hume is fixed [17:23:09] maplebed: let's create a ganglia cluster for them, (if you haven't already) [17:23:16] i can do that now if you want [17:23:17] I haven't. [17:23:35] I've just been using http://ganglia3-tip.wikimedia.org/?r=day&cs=&ce=&m=&tab=ch&vn=&hreg[]=ms[123]\.pmt [17:23:36] should we do one for swift, or proxy/storage separately? [17:23:58] I guess that works too [17:24:10] oh i lied, there's a new unrelated hume disk space warning. looking [17:24:21] I think just swift is fine rather than splitting them into proxy/storage. [17:24:30] alright [17:24:37] that'll be easy to do by eye or with the aggregate graphs like that likn. [17:24:44] ok [17:24:46] thanks! [17:25:09] ~300M of MegaSAS.log :-) [17:25:17] New patchset: Dzahn; "change max_concurrent_checks from 8 to 64 - rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649 [17:25:29] New patchset: Dzahn; "remove special.cfg from nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1648 [17:25:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1649 [17:25:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1648 [17:25:46] New review: Dzahn; "was already reviewed" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1649 [17:27:18] New patchset: Mark Bergsma; "Create separate Ganglia cluster(s) for Swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1664 [17:27:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1664 [17:27:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1664 [17:27:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1664 [17:28:42] ok (busy) [17:30:56] hi apergos, online? 
[17:31:05] New patchset: Mark Bergsma; "Setup ganglia aggregators for swift clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1665 [17:31:09] !log ran apt-get clean on hume to clear out ~600M space on the / partition [17:31:18] Logged the message, Master [17:31:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1665 [17:31:25] New patchset: Dzahn; "change max_concurrent_checks from 8 to 64 - rebased - remove special.cfg here so it doesnt break" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649 [17:31:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1649 [17:31:46] Jeff_Green: re: hume's disk space - I just cleared out some too. [17:31:53] ya [17:32:06] wow--I didn't know apt was prone to bloat like that [17:32:30] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1648 [17:32:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1648 [17:32:39] nfs is still hurting on hume though. [17:32:40] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1649 [17:32:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649 [17:32:58] it's taking forever to give me ls in my homedir. [17:33:26] I really want to mute the gerrit bot. [17:33:49] use /ignore ? :) [17:34:02] i wrote up a work summary on linkedin relating to the work done in tampa a few months ago. need to be sure it's kosher and doesn't say anything that should not be public. [17:34:04] hi mark [17:34:07] /var/tmp/refreshImageMetadata.core [17:34:26] from 10/25, 200M, nuking [17:35:12] andrewS: then I suggest you restrict it to "did some hands physical work on Wikimedia equipment in its datacenters" or something resembling that [17:35:34] but how can i impress google and microsoft and apple with only that!? [17:35:36] j/k [17:35:39] restricting [17:35:43] can i url you though? [17:35:54] well you failed to impress us in the first place, so no chance anyway [17:36:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1665 [17:36:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1665 [17:36:31] yeah you can [17:36:37] thank you [17:38:52] hume has a lot of kernels installed [17:39:59] apt-get -s autoremove [17:40:01] a gb of / is php versions [17:40:18] mutante: I have great fear of that command, it's done me wrong nearly every time I've used it [17:40:31] Jeff_Green: the "-s" makes it a simulation [17:40:53] then remove packages by hand? [17:41:23] depends what it wants to remove i guess [17:41:50] interestingly, not kernels: byobu cpu-checker java-common odbcinst odbcinst1debian1 python-newt screen unixodbc update-notifier-common [17:41:55] but it's kind of new that apt-get can do it, and not just aptitude [17:41:56] damn: http://ganglia3-tip.wikimedia.org/?c=Miscellaneous%20pmtpa&h=nfs1.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [17:42:08] before used to be stuff like "debfoster"/"deborphan" for that [17:42:23] maplebed: wow [17:42:34] no wonder fenari and hume are not particularly responsive. [17:43:03] who would be responsible for these: hume:/usr/local/apache/common-local/php-* ? 
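
The cautious version of the cleanup being discussed, along the "simulate first, remove old kernels by hand" line (the kernel version in the last command is an example only, not one checked against hume):

    df -h /                                    # see how bad it is
    sudo apt-get clean                         # drop cached .debs (the ~600M freed earlier)
    sudo apt-get -s autoremove                 # -s = simulate only; just review what it *would* remove
    dpkg -l 'linux-image-*' | grep ^ii         # list installed kernels
    uname -r                                   # never remove the one that's running
    sudo apt-get remove linux-image-2.6.32-24-server   # example version; pick an old, unused one
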
[17:43:28] Jeff_Green: i would remove old kernels manually then i guess, if its for saving disk space [17:43:32] are those propagated with code or hand-installed [17:43:52] we're down to 75% so it's not urgent [17:46:08] (busy) [17:48:42] ah it all becomes clearer looking at fenari:/home/w/common . . . [17:49:06] maplebed: Is that you hogging NFS? [17:49:19] 1251 ben 20 0 11908 1072 496 D 21 0.0 0:01.99 rsync -avP ms5:/tmp/wikipedia-filelist.txt ./wikipedia-filelist.txt [17:49:32] hah [17:52:16] /usr/local/apache/common appears in "wikimedia-task-appserver" package, not common-local though [17:52:57] RoanKattouw: no. [17:53:31] I ran that after it started getting hogged; I don't think the transfer's even started yet. [17:53:34] Grr, now fenari is even less responsive [17:53:50] It's hanging even on things like ls -l /proc/1251 [17:54:05] mutante: i'm not sure how it works--but hume:/usr/local/apache/common is a symlink to common-local, and common-local has the same php-* dirs as fenari:/home/w/common [17:54:16] I killed that process anyways though; it certainly isn't helping. [17:55:22] NFS overloads always take a while to calm donw [17:55:24] Up to 15 mins [17:56:22] lrwxrwxrwx 1 ben wikidev 0 2011-12-21 17:54 /proc/1251/cwd -> /home/ben/swift/filelists [17:56:26] meh [17:56:32] iptables is now blocking ganglia access :/ [17:56:40] I think that means your process was at least contributing to the NFS lockup, even if it wasn't the cause [17:56:42] RoanKattouw: is that (what Jeff says) related to the task we created to have a smaller package , which is a subset of wikimedia-task-appserver, and contains some scripts only? no..hm? [17:56:47] the ghost has taken over the machine [17:56:56] mark: you notice streber's upgrade went very poorly? [17:57:04] mutante: It sounds like it's a bit of a mess so we might as well clean that up too [17:57:12] Ryan_Lane: wasn't the upgrade [17:57:19] streber was upgraded to lucid a long time ago [17:57:27] it's broken hardware wise I think, it's been weird for a few weeks now [17:57:49] access blocked by nfsdeath == good time for more coffee [17:58:07] Jeff_Green: ^^ and RT 2129 [17:58:15] mark: there's a ton of shit broken [17:58:48] bad permissions and screwed up users, and broken dpkg and.... [17:58:50] etc etc etc [17:59:07] mutante: i don't see the connection (yet)? [17:59:28] Jeff_Green: the connection for me was: dpkg -L wikimedia-task-appserver | grep local [17:59:58] Jeff_Green: because that package puts stuff there, and you say its a symlink.. etc.. we are all not sure right now ..hrmm [18:00:02] oh, i haven't been able to get a session back on hume in a while, will check when fenari is usable again [18:00:53] Ryan_Lane: I know [18:01:06] it's scheduling or I/O related [18:01:07] back [18:01:29] yes that would be fenari [18:01:34] what was that alternative bastion box leslie was working on? [18:01:37] is he copying directly to home or something? [18:01:46] New patchset: Mark Bergsma; "Allow gmond access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1666 [18:01:47] bast1001 [18:01:54] thx [18:01:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1666 [18:02:03] mark: o.O [18:02:08] it's not using nfs yet, so it's quite speedy at the moment Jeff [18:02:11] that's weird [18:02:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1666 [18:02:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1666 [18:02:13] let's see, I was busy for 1 hour and got pings in three channels... [18:02:40] leslie let's just remove nfs support from the kernel on that box :-) [18:02:49] +1 [18:02:52] +1111111111111111111111 [18:02:53] :) [18:02:54] +1 [18:03:06] let's just put nfs everywhere, so it doesn't time out in one spot [18:03:08] s/that/any/ [18:03:18] no NFS on any box [18:03:19] * Ryan_Lane ducks [18:03:33] s/in one/in just one/ [18:03:35] Hey guys, Ariel's busy so I'll tell you guys instead: srv224 (scaler) has a full dsik [18:03:40] I'm probably going to switch to cluster for labs [18:03:40] heh [18:03:44] *gluster [18:03:47] no NFS there either [18:04:38] can we remove the stuff in /tmp on srv224 ? [18:04:42] yes [18:04:43] go ahead [18:05:13] I was just over there [18:05:18] cleaning it up :-P [18:05:23] oh [18:05:26] s'ok [18:05:32] now I can go do dishes and eat instead [18:05:34] i just killed all the /tmp and it's down to 68 [18:05:35] :) [18:05:40] I'm actually done for the day :-D [18:05:45] awesome [18:05:46] nfs has recovered. [18:06:03] maplebed: ok kill it before it rises again to attack us! [18:06:10] I also apt-get cleaned as the first emergency measure so we would have a few spare bytes [18:06:14] but that didn't get us much [18:10:15] wow, swift does indeed use a lot of cpu on the storage nodes [18:10:43] it's not i/o bound at all [18:11:13] New patchset: Bhartshorne; "allowing swift hosts to be gmond listeners" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1667 [18:11:22] mark: ^^^ [18:11:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1667 [18:11:28] er I already did that ;) [18:11:31] thanks though [18:11:34] oh. [18:11:37] drnit. [18:11:57] in almost exactly the same way [18:12:07] yours wants multicast, but this is actually unicast tcp from spence [18:12:11] (and some other ganglia servers) [18:12:32] I'm very curious to see how long the find takes to run on ms5... [18:12:39] I think they're both necessary. [18:13:07] unicast from spence for gmetad to pick up the data, multicast/udp for the host to hear what its peers are saying. [18:13:43] apergos: it's been running for over 12 hours already. [18:13:49] I didn't time it though; shoulda. [18:13:49] right... mine works because it accepts it from the right source [18:14:04] but only for public servers [18:14:11] maplebed: will you amend yours then? [18:14:16] mark - but only tcp [18:14:20] true [18:14:25] I wonder why it's working for ms1-3 then [18:14:31] I'll abandon mine [18:14:36] they're reporting to the misc aggregator. [18:14:42] which dosn't have these rules. [18:15:01] http://ganglia3-tip.wikimedia.org/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [18:15:03] owa1 and 2 now [18:15:18] i'm having dinner now [18:15:22] shall I run oprofile afterwards? 
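
maplebed's unicast-vs-multicast split above translates into two firewall holes, roughly as below; 239.2.11.71 is ganglia's stock multicast group and the rules are a sketch, not copied from the actual puppet iptables definitions:

    # let gmetad on the aggregator (spence etc.) poll gmond's XML port over TCP
    iptables -A INPUT -p tcp --dport 8649 -j ACCEPT
    # let the swift hosts hear each other's multicast metric packets over UDP
    iptables -A INPUT -p udp -d 239.2.11.71 --dport 8649 -j ACCEPT
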
[18:15:24] on ms1 [18:15:44] oh, ms123 work because there's a rule 'accept everything from 10.0.0.0/8' [18:15:58] ah :) [18:16:01] ok [18:16:20] I'll abandon my change and redo it for udp. [18:16:21] food now [18:16:23] thanks [18:16:25] bbl [18:16:55] Change abandoned: Bhartshorne; "mark already did this." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1667 [18:22:49] New patchset: Bhartshorne; "allowing swift hosts to hear their peers' multicast traffic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1668 [18:22:59] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1668 [18:24:37] New patchset: Bhartshorne; "allowing swift hosts to hear their peers' multicast traffic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1668 [18:24:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1668 [18:25:08] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1668 [18:25:08] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1668 [18:29:10] RECOVERY - Disk space on hume is OK: DISK OK [18:32:26] ffs [18:32:41] Are we now in NFS death /again/? [18:37:10] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:59] yeah, you can use bast1001 though as a bastion host [18:38:44] Does it have a public DNS name? [18:39:16] Hah, it does [18:39:19] bast1001.wikimedia.org [18:39:46] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/site.pp;h=84c9925f142cd50ebb6a60bcddc7b9a4d02651cb;hb=HEAD#l609 says... damn RoanKattouw beat me to it [18:39:51] it's magic [18:40:00] LeslieCarr: bast1001 doesn't have /home mounted, so I can't sync code from it [18:40:08] oh it doesn't have nfs [18:40:12] which is why it is magic [18:40:17] and not dying [18:40:29] heh [18:40:40] And also useless for what I want to do right now [18:42:01] New patchset: Hashar; "bug 33301, bad SSL cert at integration.mediawiki.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [18:42:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1669 [18:42:27] ssh'ing into fenari is timing out for me - something going on? [18:43:05] nfs death [18:43:35] sadness [18:43:41] yes [18:45:41] when oh when is the netapp we spent tons of money on going to serve /home [18:46:12] * RoanKattouw can't wait [18:46:40] that's waiting on mark .. [18:47:13] LeslieCarr: want to take it over and tell mark after its serving /home?? [18:47:17] hehehe [18:47:27] tempting [18:48:32] the entire eng team will chip in to buy you a pony. where pony might be whisky. [18:48:51] binasher: there is now process monitoring for "mobile traffic loggers", they turned CRIT first because i configured them to check for _exactly_ 2 processes, changed to accept 1-4 now [18:49:01] mm whisky pony! [18:49:21] whiskey pony?!?! [18:49:30] i finally know what i want for christmas [18:49:43] there's still the HTTP WARNING: HTTP/1.1 404 Not Found issue on cp 1041-1044 … mutante got an idea what's up with that ? 
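The gmond access being sorted out above comes in two flavours: a unicast TCP poll from the gmetad collector, and the multicast UDP chatter that lets peers hear each other. A hedged illustration of the pair of rules involved, using ganglia's stock port and multicast group and a placeholder collector address (the actual puppet rules are not reproduced here):

    # let the gmetad collector poll the aggregated XML over TCP (10.0.0.1 is a placeholder, not the real collector)
    iptables -A INPUT -p tcp -s 10.0.0.1 --dport 8649 -j ACCEPT
    # let the host hear its peers' multicast metric announcements over UDP (ganglia's default group and port)
    iptables -A INPUT -p udp -d 239.2.11.71 --dport 8649 -j ACCEPT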
[18:50:00] i tried doing the get checks myself and it seemed ok [18:50:21] whiskey made from unicorn tears [18:50:28] haha http://www.youtube.com/watch?v=zv-mHSWMWnQ [18:50:42] * AaronSchulz is always afraid to click youtube links [18:51:09] LeslieCarr: its something that broke after the puppet version upgrades, when mark refactored the varnish configs for the new version [18:51:47] AaronSchulz: for that it's cool to have a bot read the of pasted URLs.. this is just "Who are you? Whisky pony " [18:52:22] * AaronSchulz would like an advanced AI rick-roll detector [18:53:10] <mutante> LeslieCarr: let met check.. [18:53:10] <binasher> mutante: its weird that the check would have returned 1 proc in the last few days, they both have 2 running and they've been running since before you implemented the check. so if it returned 1, there's something broken with the nagios proc check [18:53:27] <binasher> mutante: i will fix the varnish backend check [18:53:29] <nagios-wm> PROBLEM - Misc_Db_Slave on db10 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [18:53:39] <mutante> binasher: it was 1 and also 3 :p [18:53:53] <mutante> binasher: ah, cool [18:54:31] <binasher> 3 could be a stupid "ps … | grep .." sort of thing. either way, a check that accepts 1 proc running isn't really helpful [18:54:47] <mutante> binasher: to be exact it checks for "processes with args varnishncsa" [18:55:10] <mutante> that is the -a option of check_procs , as opposed to -C, but it worked better for most other process checks like this [18:57:00] <hexmode> LeslieCarr: you have a bugzilla account? [18:57:15] <LeslieCarr> hexmode: actually not ... [18:57:20] <LeslieCarr> i should create one [18:58:12] <hexmode> LeslieCarr: I was gonna add you to this one https://bugzilla.wikimedia.org/show_bug.cgi?id=33293 ... but I'll let you add yourself if you want [18:58:15] <mutante> still using fenari btw.. [18:58:59] <mutante> ssh from there is ok, just dont "ls" :p [18:59:52] <binasher> mutante: i just tried "/usr/lib/nagios/plugins/check_procs -C varnishncsa" on cp1043 1000 times, and got "2 processes" every time.. -a is more of a fit if something is spawned via an interpreter (i.e. python) or subshell [18:59:55] <RoanKattouw> Don't do any FS access to /home in fact [19:00:45] <mutante> binasher: ok, let me create a custom check command then, i just made a generic one that replaced all others, and that uses -a [19:00:49] <LeslieCarr> cool added , thanks hexmode :) [19:01:00] <Jeff_Green> did someone just mess with db9? [19:01:19] <mutante> binasher: or maybe i'll make that configurable in the generic one right away [19:01:30] <binasher> Jeff_Green: whats up with db9? [19:02:15] <Jeff_Green> db10 replication fail, complaining about master's binary log [19:02:28] <mutante> binasher: and then you want a CRIT right away if it's 1 or 3, so 2!2!2!2 for the warn and crit threshholds, right [19:02:39] <Jeff_Green> also db10's mysql install is screwed--mysql-at-facebook is what's running, but the install has been corrupted [19:02:41] <binasher> Jeff_Green: that has been broken since last week, and why i have to have a db9 outage this evening [19:02:47] <binasher> please don't touch db10 [19:02:56] <Jeff_Green> i won't--just observing [19:05:01] <mark> if anyone wants to setup the netapp, go ahead eh ;) [19:05:07] <mark> I'm not particularly attached to the thing [19:05:35] <mark> or if you need a quick fix, tune drbd on nfs1/nfs2 to do async replication [19:07:06] <RoanKattouw> I thought there was a reason we didn't do that async? 
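The two styles of process check being weighed above look roughly like this on the command line. The thresholds shown are the relaxed 1 to 4 range from the earlier change and the strict exactly-2 check binasher argues for; the real nagios command definitions in puppet may differ.

    # match on the argument string, accept anywhere from 1 to 4 processes (the relaxed check)
    /usr/lib/nagios/plugins/check_procs -w 1:4 -c 1:4 -a varnishncsa
    # match on the command name and insist on exactly 2 (binasher's preference)
    /usr/lib/nagios/plugins/check_procs -w 2:2 -c 2:2 -C varnishncsa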
[19:07:09] <mark> binasher: so, sorry I didn't setup the netapp and fixed puppet instead ;p [19:07:24] <RoanKattouw> OK fenari has calmed down [19:07:31] <mark> what is up with fenari? [19:07:36] <RoanKattouw> It looks like my fatalmonitor script may have been keeping it in the NFS death state [19:07:47] <RoanKattouw> mark: It was in NFS death [19:07:47] <mark> what is that doing on fenari? :P [19:08:13] <mark> I hope NFS on the netapp is going to be a bit better, but I don't have all that much hope [19:08:14] <RoanKattouw> Better question: what are the Apache logs doing on /home [19:08:48] <LeslieCarr> so where are the netapps hooked up ? [19:08:52] <LeslieCarr> and is there a ticket i can check out ? [19:09:19] <binasher> mark: not waiting hours for puppet updates to work is pretty great :) [19:09:20] <mark> not really I think [19:09:25] <mark> binasher: I figured! [19:09:46] <mark> personally, NFS on /home hasn't bothered me much [19:09:55] <mark> perhaps that's because I use neither /home nor fenari much at all ;) [19:10:14] <mark> but yeah, I'll setup NFS /home on the netapp soon if noone beats me to it [19:10:15] <RoanKattouw> Deployment goes off of home [19:10:22] <RoanKattouw> So if you're doing deployments, it's a PITA [19:10:30] <mark> I am well aware [19:10:39] <mark> but I felt that fixing puppet was an even higher prio (for instance) [19:11:26] <RoanKattouw> Sure [19:11:28] <mark> there is no particular reason for us to sync drbd replication that I know of [19:11:48] <RoanKattouw> I vaguely recall someone protesting to setting it to async, but I'm not sure [19:12:38] <mark> protocol C is safest [19:12:46] <mark> http://www.drbd.org/users-guide/s-replication-protocols.html [19:13:45] <mark> binasher: so I see that puppet dashboard is equally braindead performance wise as puppet itself [19:13:53] <binasher> yeah :( [19:13:54] <mark> I guess it's just adding reports to the database and never wiping them [19:13:57] <mark> I can turn it off I guess :( [19:14:25] <mark> those rails people just don't care about performance at all [19:14:41] <hexmode> mark: you said ask next week about the exim puppetization. Any news for me and Nemo_bis? [19:14:47] <mark> hexmode: no [19:14:50] <hexmode> :( [19:14:53] <hexmode> ok [19:15:06] <binasher> ok.. i was going to just truncate all of its tables tonight before restoring db10 as a db9 replica but killing it would be good if it doesn't have a "don't be stupid" switch [19:15:16] <mark> binasher: I don't think it does [19:15:17] <hexmode> mark: ok to ask in new year? [19:15:20] <mark> but I haven't investigated it [19:15:25] <mark> hexmode: might be best [19:15:29] <mark> I have a few more urgent things to do [19:15:45] <hexmode> np, just trying to understand where it is at [19:16:02] <LeslieCarr> oh mark, EU router stuff, do we order from someone special out there or should we just have TP ship it over ? 
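To make the DRBD protocol discussion above concrete: the sync/async choice is a single keyword in the resource definition. The fragment below is purely illustrative (resource name, devices and addresses are invented, not the real nfs1/nfs2 configuration):

    resource home {
        protocol A;    # asynchronous: a write is "done" once it hits the local disk and the TCP send buffer
        # protocol C;  # fully synchronous: a write is "done" only after the peer confirms it; safest, slowest
        on nfs1 { device /dev/drbd0; disk /dev/sda5; address 10.0.5.1:7788; meta-disk internal; }
        on nfs2 { device /dev/drbd0; disk /dev/sda5; address 10.0.5.2:7788; meta-disk internal; }
    }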
[19:16:18] <mark> LeslieCarr: if they CAN deliver in europe, that would be best [19:16:23] <mark> but typically they can't [19:16:26] <mark> and then I find someone locally [19:16:30] <mark> but we don't really have someone for j right now [19:17:17] <RoanKattouw> Hmm, wait CPU on fenari is back to zero but NFS isn't out of the woods set, see Ganglia for nfs1/nfs2 [19:17:17] <mark> LeslieCarr: but anyway, an MX80 doesn't replace the core switch part [19:17:30] <RoanKattouw> So any FS operation on /home will send that wait CPU right back up [19:17:38] * RoanKattouw waits some more [19:18:35] <LeslieCarr> no it doesn't [19:18:47] <LeslieCarr> but it replaces the edge, which is the most important part [19:18:53] <LeslieCarr> and will make me much happier :) [19:19:06] <mark> me as well [19:19:42] <mark> I guess I'll have a new coffee table [19:19:48] <mark> or BBQ [19:19:57] <mark> ;) [19:20:29] <LeslieCarr> hehehe [19:23:55] <mark> !log Migrated DRBD sync between nfs1 and nfs2 from protocol C (sync) to A (async) [19:24:00] <mark> there you go [19:24:02] <mark> is NFS better now? [19:24:05] <morebots> Logged the message, Master [19:25:06] <mark> if not, the problem probably isn't drbd, but the slowness of the drives in nfs1 [19:25:34] <RoanKattouw> whoa [19:25:37] <RoanKattouw> It's fast now [19:25:41] <mark> heh [19:25:48] <gerrit-wm> New patchset: Asher; "fix the nagios check for non port 80 varnish instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1670 [19:26:07] <RoanKattouw> I ran 'fatalmonitor' and it started *instantly* [19:26:10] <RoanKattouw> That never happens [19:26:50] <gerrit-wm> New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1670 [19:26:53] <gerrit-wm> Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1670 [19:27:29] <mark> !log Started oprofile run on ms1 [19:27:38] <morebots> Logged the message, Master [19:27:56] <mark> samples % image name app name symbol name [19:27:56] <mark> 517941 54.9241 python2.6 python2.6 /usr/bin/python2.6 [19:27:56] <mark> 216447 22.9527 no-vmlinux no-vmlinux /no-vmlinux [19:27:56] <mark> 94617 10.0335 libsqlite3.so.0.8.6 libsqlite3.so.0.8.6 /usr/lib/libsqlite3.so.0.8.6 [19:27:56] <mark> 86888 9.2139 libc-2.11.1.so libc-2.11.1.so /lib/libc-2.11.1.so [19:27:56] <mark> 4155 0.4406 _sqlite3.so _sqlite3.so /usr/lib/python2.6/lib-dynload/_sqlite3.so [19:28:13] <mark> why was I thinking swift was C code? 
[19:29:21] <nagios-wm> RECOVERY - Varnish HTTP mobile-backend on cp1043 is OK: HTTP OK HTTP/1.1 200 OK - 691 bytes in 0.064 seconds [19:31:16] <mark> samples % image name app name symbol name [19:31:17] <mark> 2981938 31.1515 no-vmlinux no-vmlinux /no-vmlinux [19:31:17] <mark> 1016399 10.6181 python2.6 python2.6 PyEval_EvalFrameEx [19:31:17] <mark> 702962 7.3437 libc-2.11.1.so libc-2.11.1.so /lib/libc-2.11.1.so [19:31:17] <mark> 625996 6.5396 libsqlite3.so.0.8.6 libsqlite3.so.0.8.6 /usr/lib/libsqlite3.so.0.8.6 [19:31:17] <mark> 380234 3.9722 python2.6 python2.6 lookdict_string [19:31:17] <mark> 248843 2.5996 python2.6 python2.6 PyObject_GenericGetAttr [19:31:18] <mark> 233595 2.4403 python2.6 python2.6 dict_traverse [19:31:18] <mark> 195101 2.0382 python2.6 python2.6 visit_reachable [19:31:19] <mark> 125498 1.3110 python2.6 python2.6 visit_decref [19:31:19] <mark> 102711 1.0730 python2.6 python2.6 PyEval_EvalCodeEx [19:31:20] <mark> 98153 1.0254 python2.6 python2.6 tupledealloc [19:34:09] <mark> heh [19:34:20] <mark> isn't it nice how debian/ubuntu split up stuff in separate packages [19:34:47] <apergos> yes, it just warms my heart, every time [19:34:48] <mark> so you have swift-proxy for the proxy server, swift-object for the object server, etc [19:34:57] <mark> and really it doesn't matter at all [19:35:07] <mark> because all they contain are short stubs in /usr/bin [19:35:16] <mark> that call into the entire swift stack under /usr/lib/python [19:35:22] <mark> ...which is entirely contained in python-swift [19:35:29] <apergos> someone over there had a reason for it but whatever [19:35:41] <mark> it looks nice on the surface ;-) [19:35:48] <apergos> :-D [19:41:11] <nagios-wm> RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 691 bytes in 0.063 seconds [19:42:54] <mark> the container servers are the main problem indeed, on the storage nodes [19:44:25] <mark> !log Ended oprofile run on ms1 [19:44:33] <morebots> Logged the message, Master [19:53:09] <mark> maplebed: still here? [20:03:14] <notpeter> mark: re: lily. is there anything I can do to help you out? [20:03:55] <mark> notpeter: I think the exim config is sort of like how it's supposed to be now [20:04:09] <mark> i'm not entirely sure about spamassassin and mailman yet [20:04:19] <mark> they could do with a bit more templatization and such [20:04:24] <mark> and after that's all done [20:04:31] <mark> we'll have to reinstall the box and start over afresh [20:04:32] <mark> and migrate all data [20:04:37] <notpeter> yep [20:04:51] <notpeter> ok, I'll take a look at spamassassin and mailman some more [20:04:55] <mark> ok [20:05:05] <mark> go over the existing docs carefully [20:05:11] <mark> I think everything is in there more or less [20:07:09] <notpeter> ok, sounds good [20:10:53] <hexmode> db9 is ok? [20:11:03] <hexmode> bugzilla seems wonky to me [20:11:16] <hexmode> oh, maybe just slow [20:11:20] <hexmode> nm then [20:13:11] <nagios-wm> PROBLEM - NTP on dataset1 is CRITICAL: NTP CRITICAL: Offset unknown [20:13:16] <apergos> grrrrr [20:14:14] <apergos> can't check itnow [20:14:16] <apergos> busy [20:16:34] <hexmode> bugzilla still slow as molasses, but I was able to leave a comment on the blog w/o a problem [20:16:51] <hexmode> that would require write to the same db, right? 
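The exact invocation behind the profile runs above isn't in the log; a minimal sketch using the classic opcontrol front end, consistent with the "no-vmlinux" placeholder in the pasted output, would be something like:

    opcontrol --no-vmlinux      # system-wide profiling without a kernel image, hence "no-vmlinux" in the reports
    opcontrol --start
    sleep 300                   # let the workload run for a while
    opcontrol --stop
    opreport                    # per-image summary, like the first paste
    opreport --symbols          # per-symbol breakdown, like the second paste
    opcontrol --reset           # clear samples before the next run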
[20:17:09] <hexmode> anyway: taking a break, now [20:17:11] <Jeff_Green> think so [20:17:30] <Jeff_Green> db9 looks very quiet atm [20:17:41] <Jeff_Green> load average: 0.09, 0.07, 0.07 [20:18:02] <gerrit-wm> New patchset: Mark Bergsma; "Template swift storage server configurations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1671 [20:18:17] <gerrit-wm> New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1671 [20:18:18] <hexmode> Jeff_Green: fwiw bugzilla server is just sitting there spinning when I try to hit the front page [20:18:33] <Jeff_Green> looking [20:18:35] <hexmode> so nothing with db9 writing probably [20:18:55] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1671 [20:18:55] <hexmode> now, really leaving for a walk while you work ;) [20:18:56] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1671 [20:18:58] <Jeff_Green> k [20:21:19] <Jeff_Green> oh my [20:21:22] <Jeff_Green> swapdeath [20:21:28] <gerrit-wm> New patchset: Mark Bergsma; "Experimentally raise worker counts on account/container/object servers to processorcount" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1672 [20:21:41] <gerrit-wm> New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1672 [20:21:41] <Jeff_Green> http://ganglia3.wikimedia.org/graph.php?r=hour&z=xlarge&h=kaulen.wikimedia.org&m=cpu_report&s=descending&mc=2&g=mem_report&c=Miscellaneous%20pmtpa [20:21:50] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1672 [20:21:51] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1672 [20:25:47] <Jeff_Green> looks like kaulen needs a power cycle--can't get an ssh session [20:26:46] <gerrit-wm> New patchset: Hashar; "testswarm: disable mobile browsers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1673 [20:26:58] <gerrit-wm> New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1673 [20:27:21] <nagios-wm> PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:38] <gerrit-wm> New patchset: Mark Bergsma; "Restart swift processes on config changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1674 [20:32:47] <Jeff_Green> can't get a terminal via drac either, just stalls and dumps after the password prompt. anything anyone would like to do before I power cycle it? 
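The worker-count experiment in the two changes above comes down to a single knob in each of the storage server configs. An illustrative object-server.conf fragment (the port is just the swift default, and the comment describes the intent of the change rather than the actual template):

    # /etc/swift/object-server.conf, illustrative fragment
    [DEFAULT]
    bind_port = 6000
    workers = 4        # now derived from puppet's $processorcount instead of the previous fixed value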
[20:32:59] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/1674 [20:34:22] <Jeff_Green> !log power cycled kaulen because it's deathswapped and unresponsive [20:34:32] <morebots> Logged the message, Master [20:34:41] <mark> !log Restarted swift-container on ms1 with higher worker count (4 instead of 2) [20:34:49] <morebots> Logged the message, Master [20:37:42] <nagios-wm> RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:40:51] <gerrit-wm> New review: Demon; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1669 [20:48:02] <gerrit-wm> New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1673 [20:48:12] <hashar> can someone vote / merge https://gerrit.wikimedia.org/r/#change,1673 please ? [20:48:19] <hashar> that is a really minor change :D [20:48:28] <hashar> in a php script used for testswarm. [20:48:30] <hashar> thanks in advance 8-) [20:49:50] <Krinkle> hashar: Can I do that ? [20:50:39] <gerrit-wm> New review: Krinkle; "OK" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1673 [20:50:58] <Krinkle> Hm.. I can only choose "good but someone else must approve" [20:51:18] <Krinkle> like "check off" in mediawiki CR [20:54:00] <hashar> yup [20:54:08] <hashar> only ops can +2 / approve it [20:54:23] <hashar> or devs would be able to have change merged in production :) [20:54:37] <hashar> (without ops knowing about it) [20:58:52] <gerrit-wm> New patchset: Hashar; "bug 33301, bad SSL cert at integration.mediawiki.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [20:59:15] <Jeff_Green> whee: FastCGI: server "/srv/org/wikimedia/bzapi/script/bugzilla_api_fastcgi.pl" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds [21:00:46] <gerrit-wm> New patchset: Hashar; "testswarm: disable mobile browsers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1673 [21:00:52] <nagios-wm> PROBLEM - Disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10742 MB (3% inode=99%): [21:02:09] <Krinkle> hashar: what's the difference ? [21:02:31] <hashar> on 1673 ? I have changed its parent [21:02:53] <hashar> so instead of depending upon an unmerged / unrelated change, it depends upon an already merged change [21:03:07] <hashar> so whenever someone validate the change in gerrit, it will be merged [21:03:18] <Krinkle> k [21:03:21] <hashar> instead of being SUBMITTED which mean the change is on hold pending validation of its parent [21:03:28] <hashar> gerrit is fun but a bit complicated [21:05:32] <nagios-wm> PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10724 MB (3% inode=99%): [21:10:05] <apergos> I think that cron job will finish by tmorrow [21:10:12] <apergos> ( maplebed ) [21:36:53] <mark> !log Running ben's swift thumb loader script in a screen on hume [21:37:02] <morebots> Logged the message, Master [21:40:25] <maplebed> it looks like throughput's higher. [21:40:31] <mark> hey [21:40:33] <mark> I only increased ms1 [21:40:35] <mark> so not ms2 and ms3 [21:40:46] <mark> (for comparison, although with everything being intertwined, it's still hard to say) [21:40:52] <maplebed> I tried that yesterday, but I only increased one at a time. 
[21:40:54] <mark> I see all requests are logged in syslog [21:41:10] <mark> should we turn that off already? [21:41:17] <mark> makes testing harder, but might affect performance too... [21:41:31] <maplebed> writes have to come back from 2 storage nodes before the proxy will return 200, so just increasing the count on one is unlikely to make a visible difference. [21:41:52] <mark> yeah [21:41:57] <mark> but it was just doing replication before [21:42:02] <mark> and even that takes a ridiculous amount of cpu [21:42:07] <mark> so I wanted to see if I saw a difference there [21:42:10] <maplebed> I'd rather leave it on; they're not competing for the same spindles (/ is not a swift storage spindle), so I'd rather leave it on. [21:42:11] <mark> but not really [21:42:19] <maplebed> the logging isn't blocking since it's handed off to syslog. [21:42:21] <mark> yeah, as long as the / i/o is low [21:42:26] <mark> indeed [21:42:30] <mark> might take some more cpu though [21:42:37] <mark> then again, it's not like swift is highly optimized C code either... [21:42:42] <apergos> :-D [21:42:45] <mark> i'll up ms2 and ms3 too [21:43:08] <maplebed> mark: after running puppet, you did restart the swift stuff, right? [21:43:13] <mark> only on ms1 [21:43:20] <mark> the others ran puppet, but I didn't restart yet [21:43:24] <maplebed> "swift-init all restart" or something? [21:43:24] <mark> doing that now [21:43:31] <mark> I did them individually [21:43:34] <maplebed> k. [21:43:37] <mark> and I have a change waiting for puppet to do that automatically [21:43:39] <mark> for your review [21:43:42] <mark> not sure if we want that now [21:43:45] <mark> but it's easy to take out or disable [21:43:48] <mark> it's in gerrit [21:44:34] <mark> !log Ran swift-init all restart on ms2 [21:44:42] <morebots> Logged the message, Master [21:45:33] <maplebed> mark: don't you also need to teach puppet how to restart the service? [21:45:43] <maplebed> (or add in /etc/init.d/ scripts or something) [21:45:45] <mark> the defaults should work [21:45:52] <maplebed> huh. [21:45:53] <maplebed> ok. [21:46:01] <mark> it will use /etc/init.d/swift-container reload (or so) [21:46:04] <mark> and status [21:46:14] <mark> sometimes you need to tweak a bit, but normally reload will work [21:46:42] <maplebed> oh, silli me - I ony looked on owa2, where obviously the swift-container etc. stuff didn't exist. [21:46:59] <maplebed> yeah +1 commit. [21:47:05] <mark> ok [21:47:09] <gerrit-wm> New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1674 [21:47:18] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1674 [21:47:19] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1674 [21:48:56] <mark> maplebed: can we upgrade to the newer packages easily? [21:49:06] <mark> the recon scripts might be useful, and they're in the 1.4.4 packages [21:49:28] <mark> I'd love to have some better graphs of swift metrics [21:50:14] <maplebed> I haven't looked at the upgrade path. [21:50:24] <maplebed> I'd imagie it'd be pretty easy. [21:50:52] <mark> yeah [21:50:58] <maplebed> all the swift packages are imported into our own repo [21:51:07] <gerrit-wm> New patchset: Mark Bergsma; "Fix paths" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1675 [21:51:22] <Jeff_Green> mark: recall that exim tuning conversation the other day? well . . . 
exim is not my friend. [21:51:23] <maplebed> I chose 1.4.3 because it was "stable". [21:51:25] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1675 [21:51:26] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1675 [21:51:56] <mark> Jeff_Green: it's not? ;) [21:51:59] <Jeff_Green> doesn't like that conditional syntax for some reason, though it checks out via exim -bp [21:52:08] <Jeff_Green> no it's my enemy. my mortal enemy. [21:52:12] <maplebed> mark: can I join your screen session on hume to check out the uploader thingy? [21:52:17] <mark> maplebed: sure [21:52:56] <maplebed> mark: would you mark it multiuser for me? (ctrl-a :multli on) [21:53:15] <maplebed> (whoops. multiuser on, not multi on) [21:53:28] <mark> done [21:53:29] <maplebed> (then ctrl-a :acladd root) [21:53:44] <maplebed> that part might not be necessary. [21:53:48] <mark> done too [21:53:53] <maplebed> thanks! [21:54:18] <mark> I wasn't sure if I was just fetching existing files now or uploading new ones [21:54:22] <maplebed> this looks like a read test. [21:54:27] <mark> ok [21:54:50] <maplebed> you started on the wikipedia-filelist-urls.txt, right? [21:54:58] <mark> I just used the command on the wiki [21:55:05] <mark> since I had no idea where you were [21:55:14] <mark> so *-urls.txt [21:55:20] <maplebed> the first 2.2m of those (more or less) will be a read test. [21:55:26] <mark> ok [21:55:43] <maplebed> I think it'll be more interesting to switch to a write test. shall I? [21:55:48] <mark> I just needed something more interesting to look at than swift replicating itself without any other load [21:55:50] <mark> yes go ahead [21:55:55] <maplebed> k. [21:56:46] <mark> Jeff_Green: so what is not working then? [21:57:10] <mark> you can turn on debugging and see exactly what it does... [21:57:26] <Jeff_Green> ah i haven't tried that [21:57:34] <mark> !log Ran swift-init all restart on ms3 [21:57:42] <Jeff_Green> I put these under the remote_smtp tranport [21:57:43] <morebots> Logged the message, Master [21:57:47] <Jeff_Green> multi_domain = false [21:57:50] <mark> including how it expands the string expansion, if you turn up debugging high enough [21:57:58] <Jeff_Green> connect_timeout = ${lookup {$domain} lsearch{/etc/exim4/deadbeats} {30s}{5m} [21:58:14] <mark> yeah [21:58:24] <mark> that's missing one } [21:58:29] <Jeff_Green> deadbeats is there, and I ran the test with i.e. {google.com} in $domain's spot and that works [21:58:38] <Jeff_Green> oh that's just a bad irc paste, it's there in the config [21:58:42] <mark> ok [21:58:55] <mark> so perhaps it's not expanding $domain correctly despite what the docs say [21:58:59] <Jeff_Green> test works, properly with/without domain that matches on in the list [21:59:02] <Jeff_Green> yeah that's what I suspect [21:59:05] <mark> running with full debugging should reveal that [21:59:13] <Jeff_Green> k. i'll try that [21:59:35] <Jeff_Green> the on startup: invalid time value for connect_timeout [21:59:36] <maplebed> mark: switched. [21:59:48] <mark> Jeff_Green: oooh [21:59:55] <mark> perhaps it doesn't accept string expansions for that option at all [21:59:57] <mark> now that would suck [22:00:00] <maplebed> damn, that geturls script was destroying hume! I wonder why it didn't do that on fenari. 
[22:00:11] <Jeff_Green> well that's a possibility I suppose [22:00:38] <mark> Jeff_Green: then an easy way to do it is to make two transports [22:00:47] <mark> and select in the routers [22:00:52] <Jeff_Green> mark: that's part of why it is not my friend. it's very hard to parse the documentation for exim [22:00:59] <Jeff_Green> ah ok, that makes sense [22:01:03] <mark> oh I find that very easy [22:01:05] <Jeff_Green> I think I can actually figure that out [22:01:10] <mark> but I have a lot of experience with it, I guess [22:01:15] <mark> there's a book about it which takes another route [22:01:33] <mark> maplebed: it was destroying hume just now? [22:01:34] <Jeff_Green> well yeah, but here we are at "maybe it doesn't take string expansions just there" [22:01:45] <mark> maplebed: I think I made nfs lots faster earlier [22:02:05] <mark> Jeff_Green: yeah, unfortunately that's not listed there [22:02:11] <mark> although the docs are generally very expansive [22:02:12] <mark> which I like [22:02:18] <Jeff_Green> I gotta go feed the children, will try the transport-switching approach tomorrow [22:02:22] <mark> most of exim's options support string expansions, just few left that don't [22:02:24] <mark> ok [22:03:01] <Jeff_Green> i've probably not found the right docs yet, but what I've found doesn't say much about that config variable--just that you can set it [22:03:24] * Jeff_Green chowtime! [22:04:44] <mark> maplebed: gonna do a quick oprofile run on ms1 again [22:06:01] <maplebed> mark: the cpu utilization spike on hume over the last hour corresponds perfectly with the runtime of geturls. [22:06:20] <mark> maplebed: and fenari didn't saturize earlier? [22:06:28] <mark> then probably because I made /home NFS faster a few hours ago [22:06:37] <mark> saturate [22:07:27] <maplebed> the other difference is that yesterday the source file (the list of filenames) was ~600M, now it's 2.4G. [22:07:45] <mark> hehe [22:07:50] <maplebed> I loadthe whole thing into memory... :P [22:07:55] <mark> oh ouch [22:08:02] <mark> why is that needed? [22:08:15] <maplebed> it's not. it just makes it easier. [22:09:12] <maplebed> you can see it loading on hume's memory graph. it's amusing. [22:10:00] <mark> hume has more mem than fenari [22:10:47] <maplebed> (there are two parts it makes easier - I get a line count for free and I only have to lock the threads arount incrementing a counter rather than reading a line from the file) [22:10:59] <mark> right [22:11:31] <mark> I wonder if we'd get higher throughput with multiple copies of the script [22:11:36] <mark> I don't trust python's threading [22:11:43] <mark> it's known to have the global interpreter lock [22:12:10] <maplebed> there's also ab in ~ben/swift [22:12:28] <mark> can't we simply run a second copy on a different portion of the files? [22:12:28] <maplebed> it's better suited for high speed performance testing. [22:12:32] <mark> perhaps on a separate box ;) [22:12:35] <maplebed> we could, yes. [22:12:53] <mark> makes it harder to get stats of course... [22:12:56] <maplebed> when I did tests earlier I ran 4 copies of ab on 4 boxes and got 4x speed increases. :) [22:13:06] <maplebed> (20 threads each) [22:13:12] <mark> ok [22:13:12] <maplebed> (that was on the eqiad cluster) [22:13:38] <mark> I think the amount of cpu swift uses on the storage nodes in idle state is worrying [22:13:45] <maplebed> but yes, running multiple copies on separate client boxes is a good idea. [22:14:04] <maplebed> keep in mind, it uses its idle time to do integrity checks. 
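Back to the exim problem: since connect_timeout apparently refuses a string expansion, the two-transport approach mark suggests would look roughly like the sketch below. The transport names, timeout values and reuse of the deadbeats file are illustrative, and it assumes the router's transport option does accept an expansion:

    # router: pick a transport per recipient domain
    dnslookup:
        driver = dnslookup
        domains = ! +local_domains
        transport = ${lookup{$domain}lsearch{/etc/exim4/deadbeats}{remote_smtp_deadbeats}{remote_smtp}}
        no_more

    # transports: identical apart from the timeout
    remote_smtp:
        driver = smtp
        connect_timeout = 5m

    remote_smtp_deadbeats:
        driver = smtp
        connect_timeout = 30s
        multi_domain = false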
[22:14:12] <mark> yeah [22:14:18] <mark> I just hope it doesn't go up a lot as the data set increases [22:14:31] <mark> we can probably turn off some of the integrity checks temporarily right [22:14:34] <mark> to see how much that matters [22:14:40] <mark> they're separate processes I think [22:15:00] <maplebed> hmm... dunno! yeah, I think they are sepraate proceses. [22:15:20] <mark> the -auditor processes [22:16:54] <maplebed> write throughput is still ~50qps. you kicked ms1 and 2 with the new processorcount, right? [22:17:00] <mark> and ms3 too [22:17:02] <mark> so it's not helping [22:17:05] <maplebed> bummer. [22:17:31] <maplebed> i haven't done these things yet: http://docs.openstack.org/bexar/openstack-object-storage/admin/content/ch04s06.html#d5e1206 [22:17:34] <mark> we can try it on the proxies [22:17:45] <mark> but i'm less optimistic that it'll help there [22:18:03] <maplebed> I know the ip_conntrack_max change is necessary on the proxies. [22:18:04] <mark> oi! [22:18:05] <mark> good point [22:18:09] <mark> [435250.560336] nf_conntrack: table full, dropping packet. [22:18:10] <mark> [435250.560638] nf_conntrack: table full, dropping packet. [22:18:10] <mark> [435250.561473] nf_conntrack: table full, dropping packet. [22:18:11] <mark> haha [22:18:16] <maplebed> lol [22:18:24] <mark> can we please, please, temporarily disable all of iptables on ms* [22:18:29] <mark> and test again afterwards? ;) [22:18:36] <maplebed> sure. [22:18:43] <mark> ok, flushing now [22:18:56] <mark> not in puppet yet [22:19:01] <mark> so puppet can restore any moment [22:19:16] <maplebed> restarting the write test so we get fresh numbers. [22:19:30] <mark> !log Flushed all iptables rules down the drain on ms1-3 (live hack, puppet will restore) [22:19:39] <morebots> Logged the message, Master [22:19:44] <mark> aiai [22:19:46] <mark> check dmesg on ms1 [22:19:48] <mark> xfs errors [22:19:54] <mark> sorry ms2 [22:19:56] <maplebed> I suspect, though, that that will be relevant only for reads, since they can hit 1100qps whereas reads are capped at 50qps. [22:20:03] <maplebed> grr... [22:20:12] <maplebed> *writes* are 50qps, reads 1100. [22:20:14] <mark> might be a broken disk [22:20:21] <mark> ok how is it looking now? [22:20:36] <maplebed> 53qps. [22:20:40] <mark> bah [22:21:35] <mark> right, those iptables errors were half an hour ago [22:22:29] <mark> do you have numbers on how long a typical write takes? [22:22:44] <mark> if it takes almost a second, then 30 threads is not enough to saturate [22:22:44] <maplebed> no, sadly. only throughput, no latency stats. [22:22:54] <mark> did you try with more threads? [22:23:18] <maplebed> yeah, but not rigorously. I just ran it randomly with different numbers. [22:23:24] <mark> ok [22:24:35] <mark> so sdab1 on ms2 is toast [22:24:40] <mark> how can we take it out of the test? [22:25:42] <maplebed> sure! [22:25:45] <maplebed> just unmount it! [22:26:01] <mark> and the empty mountpoint won't hurt? [22:26:04] <maplebed> joking aside, yes, we can take it out. [22:26:12] <maplebed> huh. [22:26:16] <maplebed> I don't think so. [22:26:24] <maplebed> but I don't know! [22:26:29] <maplebed> that'll be interesting to see. [22:26:33] <mark> indeed ;) [22:27:12] <maplebed> the docs all say "if it'll be down for not too long, just take it down and ignore it. if it'll be down for longer, adjust the rings and remove it then re-add it when it returns." [22:27:18] <maplebed> so we can remove it from the ring. 
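For reference, the "nf_conntrack: table full, dropping packet" lines above are the classic symptom of the connection-tracking table being too small for the request rate. A hedged sketch of the knob involved (the number is only an example, and the key is spelled net.netfilter.nf_conntrack_max on some kernels):

    # enlarge the conntrack table so new connections stop being dropped
    sysctl -w net.ipv4.netfilter.ip_conntrack_max=262144
    # the live hack above instead flushed every filter rule outright
    iptables -F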
[22:27:28] <mark> whatever you prefer for testing now [22:27:55] <maplebed> let's unmount it, watch it for 5m and see if swift drops any files in the unmounted directory, then remove it from the ring. [22:28:01] <mark> hrmf [22:28:03] <mark> that's a new drive [22:28:05] <mark> the one we replaced [22:28:10] <mark> oki [22:28:14] <mark> if we can unmount [22:28:15] <maplebed> maybe the connector's busted? [22:28:15] <mark> may be busy [22:28:20] <mark> yeah possible [22:28:45] <mark> do you want to or shall i? [22:30:14] <maplebed> please go ahead. [22:30:32] <maplebed> from the swift-ring-builder help doc for the 'remove' command: [22:30:34] <maplebed> Removes the device(s) from the ring. This should normally just be used for [22:30:34] <maplebed> a device that has failed. For a device you wish to decommission, it's best [22:30:34] <maplebed> to set its weight to 0, wait for it to drain all its data, then use this [22:30:37] <maplebed> remove command [22:30:59] <mark> unmounted [22:31:13] <mark> no new files have appeared yet [22:31:27] <mark> !log Unmounted /srv/swift-storage/sdab1 on ms2 (borken filesystem) [22:31:36] <morebots> Logged the message, Master [22:31:52] <mark> and this woud be why ;) [22:31:53] <mark> root@ms2:/srv/swift-storage# ls -ld sdab1 [22:31:53] <mark> drwxr-xr-x 2 root root 4096 2011-12-13 16:37 sdab1 [22:32:18] <mark> and that in itself is good I guess... [22:33:07] <mark> Dec 21 22:32:48 ms2 object-replicator Error syncing partition: #012Traceback (most recent call last):#012 File "/usr/lib/pymodules/python2.6/swift/obj/replicator.py", line 392, in update#012 reclaim_age=self.reclaim_age)#012 File "/usr/lib/pymodules/python2.6/eventlet/tpool.py", line 75, in tworker#012 rv = meth(*args,**kwargs)#012 File "/usr/lib/pymodules/python2.6/swift/obj/replicator.py", line 207, in tpooled_get_hashes#012 return [22:33:08] <mark> Dec 21 22:32:53 ms2 account-replicator Skipping sdab1 as it is not mounted [22:34:27] <maplebed> that answers that. [22:34:36] <maplebed> ok, removing it from the ring now. [22:39:07] <hexmode> Jeff_Green: any final word on what the bz problem was? [22:39:12] <hexmode> just not enough mem? [22:39:50] <mark> heh [22:39:55] <mark> maplebed: the proxies have far too few workers I think [22:40:00] <mark> I increased owa1 from 8 to 64 [22:40:07] <mark> and in top it seems that lots are being usd [22:40:09] <mark> used [22:40:30] <mark> shall I increase on all in puppet? [22:40:35] <maplebed> IIRC recommendation was #workers = #cores for the proxies. [22:40:40] <mark> really? [22:41:02] <mark> well those boxes have 12 cores at least ;) [22:41:20] <maplebed> oh, sorry. 2x # cores. [22:41:27] <mark> yeah that sounds sensible [22:41:28] <maplebed> from here: http://docs.openstack.org/bexar/openstack-object-storage/admin/content/ch04s06.html#d5e1200 [22:41:34] <mark> let me enforce that in puppet then [22:41:38] <mark> based on $processorcount [22:41:39] <mark> ok? [22:42:28] <maplebed> ok [22:42:34] <gerrit-wm> New patchset: Pyoungmeister; "adding udplogging capabilites for varnish mobilez" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [22:42:38] <maplebed> the disk is now removed and teh ring files distributed. 
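Spelled out, the removal path quoted from the swift-ring-builder help above is roughly the following (builder file and device id are placeholders):

    swift-ring-builder object.builder remove d42          # failed device: drop it from the ring outright
    # for a planned decommission, drain it first instead:
    #   swift-ring-builder object.builder set_weight d42 0
    swift-ring-builder object.builder rebalance
    # then push the regenerated object.ring.gz to every proxy and storage node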
[22:43:01] <gerrit-wm> New patchset: Mark Bergsma; "Set proxy worker count to 2x # CPU cores" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1677 [22:43:21] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1677 [22:43:22] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1677 [22:46:19] <maplebed> write throughput is now wandering around 58 to 65 [22:46:30] <mark> still not great [22:46:39] <mark> hrm [22:46:47] <mark> seems like owa3 has HT turned on, while ow1/2 have it off [22:46:52] <mark> $processorcount is 24 on it [22:47:05] <mark> oh well, let's us test that too [22:47:31] <maplebed> :P [22:47:52] <maplebed> I figured it was just different hardware. [22:48:00] <mark> !log proxy worker processes increased from 8 to 24 on owa1-2, 48 on owa3 [22:48:05] <maplebed> ms3 is behaving very differently from ms1 and 2... [22:48:05] <mark> probably is just a bios setting [22:48:09] <morebots> Logged the message, Master [22:48:11] <mark> ms3 is different hardware yes [22:48:18] <mark> but I think owa1-3 are the same [22:48:25] <mark> ms3 actually has twice as many cores ;) [22:48:50] <maplebed> do you know the Right(tm) way to put the sysctl stuff into puppet? [22:49:06] <mark> there is some simple stuff in generic-definitions [22:49:17] <mark> just puts simple sysctl files in /etc/sysctl.d/ [22:49:17] <maplebed> the interesting different between ms1/2 and 3 to me is that ms3 has no iowait CPU time. [22:49:23] <maplebed> ah, good. [22:49:32] <mark> could be better [22:49:40] <mark> I once tried putting all sysctl files in facter [22:49:44] <maplebed> that'll make doing the conntrack and other tcp settings easier. [22:49:54] <mark> but then puppet was stupid, and put them all in one giant GET URL param request [22:50:10] <mark> maplebed: I think you can just include high-http-performance which already exists [22:50:13] <mark> and tune that [22:50:20] <mark> since those settings are pretty similar for the squids and varnish servers [22:50:32] <mark> probably no need to differentiate for swift there [22:51:00] <mark> btw, ms3 has twice the amount of memory vs ms1-2 [22:51:07] <mark> that's probaby your i/o wait ;) [22:51:13] <maplebed> ah. [22:51:33] <mark> 16 vs 32 GB [22:51:37] <maplebed> in theory, I should probably increase the weight of the drives in ms3 in the ring, but with only three hosts it won't actually change anything. [22:51:39] <maplebed> :( [22:52:01] <mark> at least now we can see what the difference in hardware matters [22:52:23] <mark> let's get 12-core boxes for storage servers :P [22:52:39] <mark> it can't hurt anyway [22:53:00] <Ryan_Lane> anyone looked at this yet? http://referencearchitecture.org/ [22:53:34] <mark> Ryan_Lane: that looks like what we arrived at ;) [22:53:41] <Ryan_Lane> heh [22:53:43] <mark> dell C2100, heh [22:53:57] <maplebed> the settings in high-performance-http are not the same as what the swift page recommensd. [22:54:26] <mark> you can probably put them in [22:54:40] <maplebed> oh, just add the swift stuff to the high-http settionsg? [22:54:47] <mark> yeah [22:54:50] <mark> tcp time wait stuff, right? [22:55:00] <maplebed> disables TIME_WAIT, [22:55:03] <maplebed> turns off syn cookies, [22:55:09] <maplebed> increases the conntrack table size. 
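The 2x-cores rule merged above amounts to one line in the proxy config. A hedged sketch of how a template might derive it from facter (the file layout and the exact erb are assumptions, not the merged change):

    # proxy-server.conf.erb, illustrative fragment
    [DEFAULT]
    bind_port = 8080
    workers = <%= processorcount.to_i * 2 %>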
[22:55:20] <mark> we don't run conntrack on squids and such [22:55:30] <mark> (and hopefully never will, that's quite a performance hit) [22:55:58] <mark> I did some extensive testing on that 2-3 years ago [22:56:06] <mark> it may be a bit better now [22:56:15] <mark> (or worse ;) [22:56:46] <mark> what's the qps now? [22:57:34] <mark> I guess it doesn't help very much indeed, throughput on the proxies doesn't seem higher [22:58:03] <maplebed> you didn't ever publish the performance testingc data, did you? [22:58:25] <maplebed> current throughput is about the same - 18s for 1000 urls. [22:58:48] <maplebed> (aka 55qps) [22:59:06] <mark> "publish", no, I have some notes somewhere on a wiki [22:59:28] <nagios-wm> PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [22:59:34] <maplebed> ah well. [22:59:38] <mark> biggest result was actually that running ntpd on LVS servers halved their kpps throughput [22:59:52] <maplebed> fascinating. why? [23:00:11] <mark> didn't find out why, something with the time adjust syscalls I think [23:00:23] <maplebed> wild. [23:00:26] <mark> just stop ntpd and pps doubled [23:00:52] <mark> and back then our LVS servers really couldn't take that hit [23:01:09] <mark> it was the difference between dropping 5% of packets or not [23:03:24] <Ryan_Lane> mark: this may or may not interest you, since you like networking: http://openvswitch.org/openstack/ [23:03:25] <Ryan_Lane> :D [23:03:44] <mark> heh [23:03:49] <Ryan_Lane> new openstack project, quantum. likely to be usable in essex timeframe [23:03:53] <Ryan_Lane> replaces nova-network [23:04:01] <Ryan_Lane> uses openvswitch [23:04:11] <mark> nice [23:04:20] <Ryan_Lane> can be configured via api [23:06:55] <mark> maplebed: can we see from the Date or LM header that swift returns whether the object was read or put? [23:07:10] <mark> (or, can we put a temporary debug header in which tells us that?) [23:07:54] <gerrit-wm> New patchset: Bhartshorne; "apply swift TCP tuning settings to all high-http-performance hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1678 [23:08:20] <maplebed> not sure what you mean. [23:08:39] <mark> hmm wait [23:08:45] <mark> those tw_recycle settings [23:08:46] <maplebed> whether the 404 handler triggered you mean? [23:08:51] <mark> I now recall those violate some spec [23:08:53] <mark> (yeah) [23:09:03] <mark> I recall some clients having issues when we enabled those [23:09:07] <mark> so I had to disable that again on squids [23:09:21] <mark> so perhaps let's not reenable that on public hosts, sorry [23:09:34] <mark> there's something in the linux kernel docs about it [23:09:43] <gerrit-wm> New patchset: Pyoungmeister; "adding udplogging capabilites for varnish mobilez" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [23:10:09] <maplebed> ok, I'll make a separate swift sysctl conf. [23:10:11] <mark> looking... [23:11:25] <mark> some firewalls or proxy servers broke and then couldn't access wikipedia anymore [23:14:23] <gerrit-wm> Change abandoned: Bhartshorne; "need to do this separately for swift to avoid getting the time_wait stuff on the squids." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1678 [23:15:00] <mark> you probably also want the squid sysctl settings on swift, though [23:15:16] <mark> they're useful for many tcp connections/requests [23:15:22] <maplebed> ok, I'll pull them in. 
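What the swift-only sysctl file ends up carrying is roughly the set of knobs just discussed. An illustrative /etc/sysctl.d fragment follows; the file name and exact values are assumptions, and tcp_tw_recycle is the setting that has to stay off the public squids for the reason mark recalls:

    # /etc/sysctl.d/60-swift-performance.conf, illustrative
    net.ipv4.tcp_tw_recycle = 1        # tolerable between internal swift boxes, known to break NATed clients publicly
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_syncookies = 0
    net.ipv4.netfilter.ip_conntrack_max = 262144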
[23:16:07] <gerrit-wm> New patchset: Pyoungmeister; "adding udplogging capabilites for varnish mobilez" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [23:17:28] <mark> maplebed: so the swift docs also recommend adjusting the xfs parameters during mkfs [23:17:36] <mark> which we didn't do [23:17:42] <mark> we have the mount options, not the mkfs options [23:17:45] <mark> bigger inode size [23:18:08] <maplebed> yeah, I noticed that. (I haven't made it all the way through this performance doc yet) [23:18:21] <maplebed> we have a bunch of pretty small files though, [23:18:26] <mark> root@ms1:~# xfs_info /srv/swift-storage/sdd1/ [23:18:26] <mark> meta-data=/dev/sdd1 isize=256 agcount=4, agsize=15262336 blks [23:18:26] <mark> = sectsz=512 attr=2 [23:18:26] <mark> data = bsize=4096 blocks=61049344, imaxpct=25 [23:18:26] <mark> = sunit=0 swidth=0 blks [23:18:27] <mark> naming =version 2 bsize=4096 ascii-ci=0 [23:18:27] <mark> log =internal bsize=4096 blocks=29809, version=2 [23:18:28] <mark> = sectsz=512 sunit=0 blks, lazy-count=1 [23:18:28] <mark> realtime =none extsz=4096 blocks=0, rtextents=0 [23:19:29] <gerrit-wm> New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1676 [23:19:29] <gerrit-wm> Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [23:22:16] <mark> rsyslogd is using quite a bit of cpu... [23:22:21] <mark> let me temporarily disable access logging [23:27:35] <gerrit-wm> New patchset: Asher; "fix template name for varnishncsa.init" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1679 [23:27:48] <gerrit-wm> New patchset: Hashar; "bug 32645, add testswarm to integration homepage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1680 [23:28:46] <gerrit-wm> New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1679 [23:28:47] <gerrit-wm> Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1679 [23:30:26] <mark> lots of these in syslog: [23:30:26] <mark> Dec 21 09:52:36 ms1 object-server ERROR container update failed with 10.0.0.249:6001/sdr1 (saving for async update later): Timeout (3s) [23:30:27] <mark> Dec 21 09:52:36 ms1 object-server ERROR container update failed with 10.0.0.249:6001/sdr1 (saving for async update later): Timeout (3s) [23:30:27] <mark> Dec 21 09:52:37 ms1 object-server ERROR container update failed with 10.0.0.249:6001/sdr1 (saving for async update later): Timeout (3s) [23:30:33] <mark> all sdr1 [23:30:52] <maplebed> another bad disk? [23:31:06] <mark> maybe, maybe not... [23:31:16] <mark> ms2 is complaining about others too [23:33:23] <mark> I don't see the kernel complaining about those disks anyway [23:35:02] <gerrit-wm> New patchset: Bhartshorne; "adding recommended tcp settings to swift hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1681 [23:35:15] <maplebed> mark: wanna review? [23:36:02] <mark> go ahead [23:36:19] <mark> ok, it's half past midnight here [23:36:24] <gerrit-wm> New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1681 [23:36:24] <gerrit-wm> Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1681 [23:36:31] <mark> i'm gonna call it a day soon [23:37:09] <maplebed> cool to run puppet on all the swift hosts? 
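The mkfs-time tuning mark points out was skipped is the larger inode size the swift docs recommend (the xfs_info paste above shows isize=256, the default). An illustrative invocation, with the device and mount options given as examples only:

    mkfs.xfs -f -i size=1024 /dev/sdd1      # bigger inodes so swift's xattr metadata stays inline
    mount -o noatime,nodiratime,nobarrier,logbufs=8 /dev/sdd1 /srv/swift-storage/sdd1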
[23:37:14] <mark> go ahead [23:37:20] <mark> it will just reenable access logging on ms1 [23:37:40] <maplebed> throughput hasn't changed. [23:38:07] <mark> no I don't think it matters enough [23:38:45] <mark> a lot of time is spent in sqlite libs actually [23:38:57] <mark> how many objects do we have per container now? [23:48:15] <mark> ok, good luck [23:48:20] <mark> more tomorrow [23:49:05] <maplebed> 2.5m objects in the wikipedia-commons-thumb container. [23:49:31] <maplebed> if we leave the job running on hume overnight, we'll probably have 5-8m tomorrow. [23:50:07] <mark> I think that's interesting [23:50:14] <mark> if it slows down to a crawl, that's good to know [23:50:17] <mark> we can always wipe and start over [23:50:20] <maplebed> yup. [23:50:29] <maplebed> I set up the eqiad cluster to do the hashed containers, b [23:50:38] <maplebed> but I don't think it's big enough to do a high volume test like this. [23:50:43] <mark> yeah [23:51:01] <maplebed> I think I'll try and do the hash-for-commons-only thing. [23:51:05] <mark> we have 2 ms servers free there [23:51:10] <mark> do we have an es server too? [23:51:12] <mark> that's 3 ;) [23:51:36] <maplebed> I think the ES server is still down - RobH is the perc RAID card back in it? [23:51:47] <maplebed> we could set it back up as a raid10 device and just weight it incredibly heavily. [23:52:13] <mark> well the ms servers are the same [23:52:17] <mark> so also raid 10 then :/ [23:52:36] <maplebed> hrmph. [23:53:09] <mark> i'm not impressed with these thumpers so far [23:53:21] <maplebed> I'll just stick with code for the time being. the eqiad cluster is much more of a functional testing ground with its current setup; I'm not too inclined to change it. [23:53:30] <mark> ok [23:54:00] <maplebed> after we get through the obvious stuff, [23:54:18] <maplebed> we can invite notmyname to look at our ganglia stats and configs and see if he'd be willing to give us some advice. [23:54:32] <maplebed> (he's posted bunches of stuff about swift tuning aroun the web and hangs out in #openstack) [23:54:33] <mark> alright [23:55:02] <maplebed> till tomorrow... [23:56:31] * mark yawns [23:56:32] <mark> good night
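As a footnote to the objects-per-container question: the count comes straight back as a header on a HEAD request against the container. A hedged example with placeholder endpoint, account and token:

    # read X-Container-Object-Count off the container (URL, account and token are placeholders)
    curl -sI -H 'X-Auth-Token: AUTH_tk_example' \
        http://ms-fe.example:8080/v1/AUTH_mw/wikipedia-commons-thumb | grep -i x-container-object-count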