[00:00:58] !log readded /dev/sda2 partition on streber, it was somehow deleted, borking the raidset [00:01:07] Logged the message, Master [00:01:13] which is stupid, because it's a raid1. wtf is the point of a raid1 that doesn't allow access when a disk is down? [00:01:55] where the hell is the bot? [00:01:56] morebots: poke poke [00:04:12] did someone upgrade streber recently? [00:04:17] it's pretty royally fucked [01:47:27] Error connecting to 10.0.6.47: Lost connection to MySQL server at 'reading authorization packet', system error: 0 [01:47:49] that's an odd one. [01:48:02] db37, s7 slave [01:48:10] ganglia only shows a drop in api and application traffic but not squid. This makes me think it's not something out there on the internet (eg routing issue) but something internal to us. [01:48:16] !replag [01:48:19] does that work? [01:48:26] isn't there a bot that gives us replag? [01:48:30] anon is quick, logged-in is slow [01:48:33] It did [01:49:47] Yeah... Squid load is constant [01:49:52] look on the daily graph [01:50:03] TimStarling: which daily graph? [01:50:15] of network [01:50:34] I'm sorry, we have so many graphs. could you link the one you mean? [01:50:40] http://ganglia.wikimedia.org/graph.php?g=network_report&z=medium&c=Application%20servers%20pmtpa&m=network_report&r=day&s=descending&hc=3&mc=3&st=1324432189 [01:51:01] oh, you mean that it's not a drop but the end of a spike. [01:51:02] interesting. [01:51:05] yeah [01:51:27] I can buy that. [01:51:55] http://ganglia.wikimedia.org/?m=bytes_out&r=day&s=descending&c=Application+servers+pmtpa&h=&sh=1 [01:52:28] it was just on srv229 apparently [01:53:16] TimStarling: you should start playing with ganglia3-tip.wikimedia.org. way better features. [01:53:43] ganglia.wikimedia.org is bookmarked and in my history a million times [01:53:55] you should move ganglia3-tip to ganglia if you think it's better [01:54:18] TimStarling: leslie's currently packaging the 2.2.0 release; we'll move to that as soon as it's puppetized. [01:54:42] but back to site suckage... [01:54:47] I don't see a candidate yet, still poking. [01:56:02] is there actually any problem on the site? [01:56:20] to me it just looks like there was a single fast download [01:56:31] just "[5:45 PM] People reporting editing is slow" [01:56:51] let's profile then [01:57:06] apparently the conversation is happening in -tech, not here. [01:57:08] ::sigh:: [02:55:16] PROBLEM - mobile traffic loggers on cp1043 is CRITICAL: PROCS CRITICAL: 1 process with args varnishncsa [02:55:38] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Dec 21 02:55:20 UTC 2011 [04:41:43] New patchset: tstarling; "Attempting to fix l10nupdate on the image scalers. Everything in the mediawiki-installation dsh node group should be able to get LU updates. Hume is also broken and should probably be in applicationserver::home-no-service, but I'll leave that for another " [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1653 [04:41:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1653 [04:42:44] New review: tstarling; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1653 [04:42:44] Change merged: tstarling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1653 [04:48:16] New review: tstarling; "Tested." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/1653 [05:47:53] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [08:18:31] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [08:19:31] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [08:21:51] PROBLEM - Disk space on hume is CRITICAL: DISK CRITICAL - free space: / 341 MB (5% inode=79%): /a/static/uncompressed 23167 MB (2% inode=99%): [08:29:02] Good evening TimStarling [08:29:16] What was going on with LocalisationUpdate earlier today? [09:06:35] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1606 [09:13:14] PROBLEM - Disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10754 MB (3% inode=99%): [10:17:50] anyone keeping track: [10:17:56] ds1 kernel panics, log ful of em [10:18:08] so *no one close the ticket* this means you robh [10:22:24] RECOVERY - MySQL slave status on es1004 is OK: OK: [10:24:14] 1200 GiB transferred [10:54:44] RoanKattouw: some servers didn't have the l10nupdate user [10:55:32] https://gerrit.wikimedia.org/r/#change,1653 [10:56:46] also LocalisationCache::recache() was showing up in profiling for a while, probably triggered by LU [10:57:27] I'd set up manualRecache if our sync scripts didn't suck so much [11:13:36] Oh, I've long known not all servers have that user, but I figured it wasn't harmful [11:14:12] Were there any "real" issues? [11:14:33] (Like increased CPU usage, downtime, etc) [11:27:08] can someone please have a look at gallium puppet run? [11:27:31] change https://gerrit.wikimedia.org/r/#change,1644 was merged yesterday but is not yet applied [11:28:10] it is supposed to copy a bunch of html / css files in /srv/org/mediawiki/integration/WikipediaMobile/nightly/ [11:28:41] that then make http://integration.mediawiki.org/WikipediaMobile/nightly/ available (nightly builds for the wikipedia mobile application) [11:39:13] hashar: ok, checking [11:39:41] good morning David [11:39:58] there must be an error somewhere in the puppet file [11:40:15] morning [11:40:19] Duplicate definition: File[/srv/org/mediawiki/integration/WikipediaMobile/nightly] is already defined [11:40:29] \o/ [11:40:31] in file /var/lib/git/operations/puppet/manifests/misc/contint.pp at line 160; cannot redefine at /var/lib/git/operations/puppet/manifests/misc/contint.pp:160 [11:41:11] fixing it. Thanks [11:41:17] k:) [11:44:51] New patchset: Hashar; "nightly mobile build dir was duplicated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1654 [11:45:01] mutante: ^^^ that one should fix it [11:45:01] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1654 [11:45:08] well it does not lint of course [11:46:01] New patchset: Hashar; "nightly mobile build dir was duplicated" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1654 [11:46:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1654 [11:46:34] that one is better [11:47:23] New patchset: Dzahn; "make the process check on mobile traffic loggers a bit more relaxed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1655 [11:47:35] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1655 [11:48:02] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1655 [11:48:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1655 [11:49:33] New review: Dzahn; "looks good. should fix gallium. checking" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1654 [11:49:33] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1654 [11:50:31] ask for a few secs [11:50:32] afk [11:50:34] err [11:50:37] well will be back soon [11:51:07] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Wed Dec 21 11:50:55 UTC 2011 [11:51:13] hashar: done:) [11:51:45] great! [11:51:48] it created several files in /integration/WikipediaMobile/nightly .. [11:52:01] I can access them at http://integration.mediawiki.org/WikipediaMobile/nightly/ [11:52:07] but / give a 403 error [11:52:18] I forgot an apache directive osomewhere [11:52:28] or add an index.html? [11:52:34] na [11:52:40] the idea is to list the files [11:52:57] and add HTML header & footer to format the default Apache directory listing [11:55:56] I don't get it [11:56:00] the apache conf says: [11:56:01] [11:56:02] Options +Indexes [11:56:15] +Indexes should allow directory browsing as I understand it [11:56:27] Did you graceful Apache after changing the config? [11:56:33] lol [11:56:36] that must be it [11:56:41] I have forgot add a subscribe [11:56:48] let me do it for you, i am still on gallium [11:56:52] \O/ [11:57:11] gracefulled [11:57:21] and there they are:) [11:57:30] yeah https://integration.mediawiki.org/WikipediaMobile/nightly/ [11:57:44] please applaud my wonderful design :D [11:58:00] thanks again mutante ! [11:58:05] nice! [11:58:12] * mutante claps hands [11:59:17] heh, i guess i will have to install one of those .apk files now:) [12:14:19] heading lunch :D [12:14:45] will bug you this afternoon to get the testswarm class enabled on gallium :) [12:14:45] https://gerrit.wikimedia.org/r/#change,1646 [12:14:55] but that will be for after the lunch [12:18:56] does anyone happen to know how to use the pdus to powercycle a server? [12:19:08] first time I've needed to and there's no rob here [12:19:48] I don't have the faintest idea what make they are or anything, so while I could try to ssh to the right ip, after that I would be stuck (doesn't seem right to monkey around on one of those guessing) [12:24:29] it doesnt have drac nor lom? [12:25:06] no, it's ds1, no ilom, [12:25:40] you can get onto the serial console via pmshell but here it does us no good, I'm on the host but it won't shut down cause of the kernel issue [12:25:51] (SM hardware) [12:26:44] maybe you can still send a magic sysrequest sequence? [12:27:12] http://en.wikipedia.org/wiki/Magic_SysRq_key#Alternate_ways_to_invoke_Magic_SysRq [12:27:15] I need to hard pwoer it down [12:27:27] I can type commands on the box, that's not the isssue [12:28:09] it just won't complete the shutdown sequence, see [12:29:03] hmm,ok, it doesn't shutdown "-h" [12:29:17] if I shutdown -h then I can't bring it back up [12:29:22] no ilom! 
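
For reference, the magic-SysRq route linked above boils down to a handful of echoes; a rough sketch, assuming root on ds1's console and a kernel still healthy enough to service the trigger (which, wedged in kswapd, it may not be):

    echo 1 > /proc/sys/kernel/sysrq    # make sure the magic SysRq interface is enabled
    echo s > /proc/sysrq-trigger       # s: sync all mounted filesystems
    echo u > /proc/sysrq-trigger       # u: remount everything read-only
    echo b > /proc/sysrq-trigger       # b: reboot immediately, skipping the hung shutdown scripts
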
[12:32:20] i can tell you more about the PDUs, from the snmp traps, OIDs we use [12:32:28] $servertech_tree = ".1.3.6.1.4.1.1718" [12:32:28] great, I'll take it [12:32:35] I was looking at the snmp stuff [12:32:45] so they are "Servertech" [12:32:49] and I saw those but didn't know how to get anything more oout of them [12:32:49] servertech.com [12:32:56] going to look em up now [12:34:02] http://wikitech.wikimedia.org/view/ServerTech_CDU [12:34:28] that says almost nothing about them :-D [12:40:56] ok, found a manual for their firmware generally, going to have a look at that [12:41:17] also I am *starving* [12:41:43] was stuck somewhere around http://www.servertech.com/products/remotepwrmgmtconsoleportaccess/sentry-commander-pt40 [12:45:36] ah [12:45:45] PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [12:46:09] http://www.servertech.com/products/smart-pdus/smart-pdu-cs-48vd I look ed at this one randomly and found the firmware pdfs [12:48:27] apergos: yes, yours is probably better, they look more like something from that category (CDUs), not like "sentry-commander" [12:48:46] I guess I'll poke at this after I eat [12:55:26] Happy holidays guys. Try not to work too hard ^^. [13:03:42] apergos powercycling dataset1? [13:03:53] yes [13:03:56] don't do it please [13:04:01] did you figure it out? [13:04:11] you login to powerstrip via browser [13:04:13] but if you can tell me (since I just got done eating lunch and am now looking for the username :-P) [13:04:18] via a browser? [13:04:25] I was gonna ssh in [13:04:27] no, huh? [13:04:41] I am not aware of a way to do via command line [13:04:54] so you have to have the proxy setup for lan browsing the internal vlan [13:04:59] so do I assume rigiht, first off, that it would be ps1-a1-sdtpa? [13:05:08] so i do a ssh to fenari with -D 8080 [13:05:17] well, dataset1 is in b1 i think [13:05:28] so ps1-b1-sdtpa.mgmt.pmtpa.wmnet [13:05:45] username? [13:06:09] root with mgmt info [13:06:22] I'm gonna try ssh :-P [13:06:32] just to see, might learn something! (if it's been set up [13:06:32] ) [13:07:37] nice! I'm in :-D [13:07:48] Sentry Switched CDU Version 6.0h (090310) [13:08:26] heh and now we try a fee "show me some info" commands.... [13:08:54] there is a group setup for dataset1 [13:08:54] t [13:08:54] hat [13:08:55] [13:09:00] that will cycle the array and chassis [13:11:06] hrmm, my foxyproxy stuff isnt working ... [13:13:11] wtf. [13:14:18] so when you say there's a group set up, what do you mean? (I'm gonna try to translate what you say about the web interface tothe right command line stuff) [13:16:07] hmm I am looking at show traps (very interesting), there are a couple lines: [13:16:13] .AC6 dataset1_a:xz:6 ON OFF ON 0 A 12 A [13:16:14] and [13:16:20] .BC6 dataset1_b:xz:6 ON OFF ON 0 A 12 A [13:16:24] ah and one for the array also [13:16:29] .BC7 dataset1-array1_b:xz:7 ON OFF ON 0 A 12 A [13:16:42] two for the array, sorry [13:16:44] .AC7 dataset1-array1_z:xz:7 ON OFF ON 0 A 12 A [13:20:46] yea, and the array and chassis are in the dataset1 group in the software [13:20:54] so it can do a power cycle on all 4 ports at the same time [13:21:04] but sudddenly my tunnel via -D 8080 isnt working [13:21:11] i cannot bring up anything on lan =P [13:21:16] the dataset1 group... 
not sure what you mean by that [13:21:21] * apergos continues to poke around [13:21:33] i have no idea how to do in command line, not sure if you can [13:21:38] so no idea how to help you there [13:22:08] wtf =P [13:23:22] apergos: so its my config locally, but i can get in now kinda [13:23:27] huh, I am on 6.0 of the firmware, better get that manual :-D [13:23:35] so in the web interfect its on outlet control group [13:23:54] so if you want i can do it here, but not touching unless you say. [13:25:28] re [13:25:34] ooohh I see a group [13:25:37] dataset1-all [13:25:46] sounds right [13:25:53] plus some groups for storage1 and 2 :-P [13:26:13] (I am totally going to document this plus a link to the manual when I get done :-P) [13:27:12] cool [13:28:23] +1 [13:29:11] so, did you try installing the wikipedia app on your android yet;) [13:29:31] the "nightly builds" link works since a couple hours [13:32:15] http://integration.mediawiki.org/WikipediaMobile/nightly/ [13:32:51] RECOVERY - mobile traffic loggers on cp1043 is OK: PROCS OK: 3 processes with args varnishncsa [13:32:51] RECOVERY - mobile traffic loggers on cp1044 is OK: PROCS OK: 1 process with args varnishncsa [13:33:10] going to try it on a tablet [13:38:01] PROBLEM - NFS on dataset1 is CRITICAL: Connection refused [13:40:52] New patchset: Hashar; "enable testswarm on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1646 [13:41:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1646 [13:41:54] mutante: can you ping me when you have time to unleash testswarm on gallium? [13:44:00] "latest" installs and starts on HTC Desire Z aka. T-mobile G2 - check. - installs and starts on Sony Tablet S - check [13:45:08] hashar: yea, will do, just a little while to finish initial commit for my labs instance [13:45:24] take your time [13:45:30] I am doing MediaWiki code review meanwhile [13:53:21] PROBLEM - SSH on dataset1 is CRITICAL: Connection refused [14:04:23] you saw the backread RobH that dataset1 kernel panic etc etc same old same old? [14:04:42] unfortunately [14:04:51] only we have the first instance of the panic, it was kswapd, then the resst are all scp broking [14:04:53] *borking [14:05:08] so this could be [14:05:29] still hardware, some obscure kernel bug, an issue with the kernel and the particular boards, [14:05:34] whoe th fsck knows [14:05:51] that was "who the" [14:09:39] the shadow knows [14:09:54] none of these kids will get that. [14:29:06] http://wikitech.wikimedia.org/view/PDU [14:29:08] no they won't. [14:29:19] so please feel free to add anything you like, oh just a sec actually [14:30:07] chris will be updating the firmware on all of them [14:30:22] cuz the newest row in sdtpa has new firmware, and its interface is a lot nicer for balacing the power [14:30:40] and he wanted something more sysadmin to do [14:30:53] and this was something that would be nice, but isnt time sensitive, so he can take his time and get it right [14:31:15] but i doubt the command line options will have changed, as its going grom 6.0 to 6.2 [14:31:37] well the 6.1 manual is there, I'm pretty sure these command will be good through it [14:31:53] so now you can add whatever you like, I just finished the one thing I wantd to put in [14:32:08] for example how do I know which pdu really is the right one for some host? 
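
Pulling the two access paths from the exchange above into one place; the hostnames are the ones quoted in the log, but the CLI verbs come from the Sentry firmware manual rather than being verified on these units, so treat this as a sketch:

    # web UI, as RobH describes it: tunnel a SOCKS proxy through a bastion first
    ssh -D 8080 fenari.wikimedia.org        # then point the browser's SOCKS proxy at localhost:8080
    # and browse to http://ps1-b1-sdtpa.mgmt.pmtpa.wmnet/

    # or talk to the Sentry Switched CDU's own CLI over ssh (what worked here)
    ssh root@ps1-b1-sdtpa.mgmt.pmtpa.wmnet
    # at the CDU prompt, roughly (double-check against the 6.x manual):
    #   status                  show all outlets and their on/off state
    #   reboot dataset1-all     power-cycle every outlet in the dataset1 group
    #   off .AC6 / on .AC6      drive a single outlet by its ID
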
[14:32:15] which I sort ofhandwaved over [14:32:47] for second example ds1 will automagically come back up when we do that, if it was powered on when we turned off the outlet [14:32:51] what about other hosts? [14:32:54] well, there are only 3 racks with the switching [14:32:59] the other hosts cannot be powercycled [14:33:05] ok [14:33:17] wanna scribble something about that stuff onthe page? [14:33:21] so only the network racks (a1-sdtpa, a1/8-eqiad) [14:33:25] and b1 have them [14:33:33] b1 was for old servers like storage1/2 [14:33:36] ds1 [14:33:40] sure [14:33:44] thanks. [14:34:11] ahh shit [14:34:19] so i used keepassx to redo my passwords [14:34:23] and now im locked out of wikitech [14:34:37] cuz i tried to set it to an invlaid one, it trimmed off the excess, and i have no idea what characters were tossed [14:34:44] apergos: you an admin on wikitech? [14:34:46] the "excess" eh? [14:34:50] pretty sure its email is borked [14:34:57] so someone has to reset it for me. [14:35:00] I think I am an admin over there [14:35:42] but I don't know how I reset your password [14:35:58] hrmm [14:36:04] I'm not even sure that's a feature we have in mw [14:36:09] it may have to be done via php console, in which case i can do [14:36:18] yea, i was being stupid thinking mediawiki could do something like that ;P [14:36:42] try the email though first, you never know [14:36:43] of course mediawiki expects the server to actually work [14:36:51] with email and basic services ;] [14:37:24] heh, im ssh'd into ps1-b1 now [14:37:37] apergos: i normally do this to do the initial setup, then never again [14:37:50] which should *also* be documented someplace *cough* [14:37:52] since most of our power strips are stupid, there is no reason to login to them, so never bothered to mess with this [14:37:56] feel free to add it.... [14:38:01] i cannot login! [14:38:03] heh [14:38:05] hahahaha [14:38:12] it's funny cause it's true [14:38:23] maybe we can find some other thing for you to document while we're at it [14:39:44] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [14:41:38] apergos: so yea, i just used php console to do that [14:41:42] ok [14:41:43] you know how to do it, wanna learn? [14:42:17] its interesting, cuz this is the only way we used to have to scramble user accounts, so back in the day I would do a bunch of these for projects that were internal [14:42:17] changePassword.php out of maintenance directory? [14:42:38] using eval.php with dbdefined [14:42:46] so php eval.php wikidbname [14:43:05] then you have the php console up and can just directly update entries [14:43:15] ok [14:43:27] its a root level access thing but its decent, if you want i can email you my notes [14:43:32] so changePassowrd just takes a username and a password [14:43:52] in the past i have used this to null passwords and email fields [14:43:57] thus preventing accounts from logging in [14:44:21] I think we lock accounts now for that [14:44:31] yep [14:44:42] when i did this stuff mediawiki didnt allow users to do that stuff [14:44:51] but it was handy for this, heh [14:44:51] feel free to send em, I just think I'll probably use these other tools since we have em [14:45:28] which reminds me that we'll prolly have eval.php a bit longer since we aren't on the php-hiphop train right now [14:47:05] ah so... please add stuff about the racks that have these smart pdus, how to know which one is the one you want, and how to set 'em up... 
now that your account is working :-P [14:48:33] updated page with notes on limited deployment of the switched cdus [14:49:01] yay! [14:49:08] how to setup from scratch is already on both my and chris's todo on whoever sets up a new one next [14:49:20] since i walked him through how to do just that the other day [14:49:40] ok awesome [14:49:42] New patchset: Dzahn; "small fixes to make the "nightly builds"-page validate as XHTML 1.0 Strict" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1657 [14:49:44] if he doesnt write the page before me, we should be building out row c next month [14:49:50] since we are paying for it already [14:49:53] just gotta point him to the wikitech page so he can add em [14:49:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1657 [14:50:05] oops, yeah well let's use it then for sure! [14:50:14] so I look ed at ds1's wikitech page today [14:50:16] it was depressing [14:50:19] huh, my wikitech password reset finally showed up [14:50:22] guess email is owrking on it [14:50:25] heh [14:50:29] =P [14:50:33] I should add all the rt stuff there but it's waaaay too depressing [14:50:41] (nice. snail mail eh?) [14:50:51] New review: Dzahn; "this makes it look nice on validator.w3.org" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1657 [14:51:10] i hate that we dont have a better solitoin for our inventory =P [14:51:31] i just want a mediawiki extension that does all of what racktables does, and can flag specific fields private for only speciifc groups to see [14:51:39] and that will visually lay out racks [14:51:57] and that ties into RT custom field to pull all info tied to an asset tag, which we would populate in RT tickets [14:52:12] so pulling up the server on the inventory mgmt can at minimum list all tickets involving the server. [14:52:20] thats all...... [14:52:24] yeah I would love it [14:52:54] !change 1657 | hashar [14:52:54] hashar: https://gerrit.wikimedia.org/r/1657 [14:53:14] but by rt I meant rt, not racktables [14:53:28] the last entry on the page is from late 2010. [14:53:38] no, jan 2011. [14:53:39] hashar: re.. we can look at testswarm too now [14:53:40] anyways... [14:54:00] mutante: great! maybe in private to let apergos/RobH works ? [14:54:03] work [14:54:08] sure, lets do that [14:54:30] ? [14:54:48] we dont have an outage, you guys dont have to not work in here on our account [14:54:51] ;] [14:55:44] if we're cluttering up your work, let us know [14:56:51] so here we ware with dataset1. I could... try upgrading to amore recent kernel and see if tht does anything. we could... [14:56:58] call sm back. we could... [14:56:59] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1657 [14:57:07] get a frkickin sledgehammer [14:57:21] any thoughts? [14:57:30] s/ware/aree/ [14:57:32] -e [14:57:32] hehe,ok [14:57:55] that was about "dont have to not work in here" [15:00:18] mutante: let s make it public :D [15:00:27] alright [15:00:34] apt-get remove testswarm [15:00:44] Deconfigure database for testswarm with dbconfig-common? 
Y|N [15:00:49] N [15:00:52] we can keep it [15:00:55] it should be fine [15:01:12] the only issue was with the testswarm system user not being created by the package [15:01:18] ok, done,merging the change to include it then [15:01:24] yep [15:01:26] merge + puppet run [15:02:02] most usefull linux command: watch [15:02:06] ex: [15:02:11] New review: Dzahn; "re-enabling after manual package removal and fix for user account being created" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1646 [15:02:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1646 [15:02:13] watch grep /etc/passwd [15:02:46] RobH: ? [15:03:13] everything should work on this one though [15:03:26] i would email back sm saying there are still issues and include all the info [15:03:49] hashar: merged on sockpuppet, running puppet on gallium..applying config ..NOW [15:03:54] if they want us to run some diagnostics they can provide thats fine [15:04:08] apergos: though in all honesty i wish we could just give up on this [15:04:14] * hashar grabs a coffee while puppet install stuff :D [15:04:15] ok. It's just that it's possible it's a kernel issue [15:04:16] its just been a stupid amount of time [15:04:23] notice: /Stage[main]/Misc::Contint::Test::Testswarm/Package[testswarm]/ensure: create [15:04:26] apergos: Ok, well, lets try a different kernel then [15:04:43] and hopefully it also doesnt work, so we can throw this thing away and just use the hard disks elsewhere =P [15:04:45] hashar: done, no errors [15:04:51] ok, I can put 2.6.37 on there I guess, it's not in the standard repos but whatever, this is not a production machine right now [15:04:56] the disk shelf in its entirely could just plug into a dell ;] [15:05:06] mutante: but no testswarm user :/ [15:05:11] looking at production branch [15:05:57] hashar: but last time we saw an error about the user missing ..uh? [15:06:13] mutante: yeah the .deb does not create a testswarm user. But puppet should [15:06:29] ok,yea [15:06:40] oh yeah [15:06:45] is the user creation stuff still just in test? [15:06:47] THE STUPID CHANGES WERE NOT MERGED [15:06:51] abahabhaa [15:07:27] sorry also need to merge https://gerrit.wikimedia.org/r/#change,1647 :) [15:07:56] that is the change I have been working on in 'test'. They got merged by Roan and fully reviewed yesterday by me. [15:08:02] aarr.:p [15:08:10] i was hoping we dont need that one [15:08:22] the change actually implements the contint::test::testswarm class [15:08:24] as per the discussion yesterday and Roan trying to fix it [15:08:33] ? [15:08:48] the whole thing about merging from test to production for hashar [15:09:00] where you could not push merges [15:09:04] and then went on to "plan b" [15:09:09] Oh [15:09:12] Yeah I executed plan B [15:09:17] And it led to 1647 [15:09:40] then it modify the apache conf for integration.mediawiki.org to make testswarm public and finally give some sudo privileges to the platformeng people [15:09:56] (so we can fix the app ourself without bugging you about it :D ) [15:11:52] and this resolves mark's veto because "we can't tell what it changes"? 
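
For context, the two options being weighed here look roughly like this from a checkout of operations/puppet (branch names as in the log; the commit reference is a placeholder, not hashar's actual change):

    # what "plan B" amounted to: carry one change onto production instead of merging the branch
    git checkout production
    git cherry-pick <commit-from-test>     # placeholder for the commit behind change 1647

    # the alternative: a real merge of test, previewed first so you *can* tell what it changes
    git log --oneline production..test     # every commit the merge would pull in
    git diff production...test             # the combined diff of all of them
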
[15:12:02] Well [15:12:05] Look at Gerrit [15:12:07] There's a diff right there [15:12:29] It's also not technically a merge, and it doesn't integrate the whole of the test branch, just hashar's changes [15:12:44] ok [15:13:07] yeah a merge of test into production would have merged a lot of unrelated / possibly problematics changes [15:13:22] all of them package in production as one commit IIRC [15:13:54] (For the record I don't think Mark's "we can't tell what it changes" stance is very well-informed, cause it's easy to get a diff of a merge from the command line. Yes, it sucks that Gerrit doesn't display this diff, but getting the diff is not impossible or even hard) [15:14:46] maybe so, doesn't mean I want to merge in all other changes [15:15:43] True [15:18:04] hehe @ "# Magic stuff for lazy people" [15:21:17] mutante: hehe [15:21:24] mutante: the PHP file was reviewed with Krinkle [15:21:48] mutante: although the latest changes are mostly fix for stuff that did not work :) [15:21:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1647 [15:22:16] path conflict [15:22:53] Hmm [15:22:55] I'll rebase it [15:23:58] hashar: should the .php files have an ending "?>"? [15:24:05] heh, trivial conflict [15:24:08] mutante: na we skip them :) [15:24:10] mutante: No they should not [15:24:20] kk [15:25:22] New patchset: Catrope; "script to fetch mediawiki + puppetization" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647 [15:25:33] mark: Rebased patchset ---^^ [15:25:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1647 [15:26:16] The conflicts were because was added at the same spot in the file as [15:26:50] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1647 [15:26:50] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1647 [15:27:14] RoanKattouw: ah, yes, there was a duplicate definition with the nightly dir, that was fixed earlier [15:27:22] mutante: That wasn't it [15:27:32] ok [15:27:34] The patch was based on a revision that didn't contain the nightly dir at all [15:27:58] So git went "oh the prod branch adds the nightly stuff, but your change adds the testswarm stuff in the same place, what should I do" [15:28:07] well, the one in 1654 is the good one that fixed it [15:28:10] add both? 
:D [15:28:16] In this case yes [15:33:00] well thanks for the help :) [15:33:11] mutante: I guess you can get puppetd to update gallium now [15:33:23] on it [15:33:28] \o/ [15:37:31] merged it on sockpuppet..running again [15:37:48] breaks it :/ [15:37:53] wonderful [15:37:59] Duplicate definition: Sudo_user[demon] is already defined [15:38:45] site.pp at line 1022 [15:39:05] cannot redefine on 1023 [15:39:20] yeah, I wasn't sure how to define additional rights [15:39:31] so I just copy pasted the above line and s/jenkins/testswarm/ [15:39:44] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1623 [15:39:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1623 [15:40:15] mutante: I should merge the privilege arrays [15:40:47] hashar: ok, i was about to get the example, yea [15:42:01] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1649 [15:42:48] yea, i expected the dependency problem, but will fix, thanks for review [15:44:25] New patchset: Hashar; "gallium: avoid duplicate sudo_user definitions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1658 [15:44:31] RECOVERY - NFS on dataset1 is OK: TCP OK - 0.000 second response time on port 2049 [15:44:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1658 [15:44:57] mutante: 1658 should fix the sudo_user duplication ^^^ [15:46:41] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1658 [15:46:42] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1658 [15:48:22] hashar: fixed it. it's running.. ..but: [15:48:26] crontab: user `testswarm' unknown [15:49:17] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1606 [15:49:18] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1606 [15:49:23] hashar: BUT.. you got your user:) testswarm:x:2008:1001::/var/lib/testswarm:/bin/bash [15:49:41] mutante: looks like I should have a made an additional subclass named something like contunt::test::testswarm::systemuser [15:49:44] and require it [15:49:48] hashar: looks like an ordering issue, first user, then cron [15:50:07] that would have saved us the trouble to add a require=>User['testswarm'] to each statement [15:50:52] or puppet should handle those himself [15:51:11] hashar: hmm, not sure about more subclasses for users.. cant you just move the cron stuff after the system_user [15:51:52] it is after :) [15:52:07] oh. nvm [15:52:09] maybe ruby/puppet order the hash [15:52:20] and thus 'cron' ends up being executed before 'systemuser' [15:52:21] ' [15:52:30] (cause C < S ) [15:52:34] how do other hosts not run into that? [15:52:58] I have no idea, maybe cause the users have been setup already [15:53:12] or not using systemuser or I am cursed [15:53:25] or maybe this is just a warning on the very first run and doesnt even really matter? [15:53:42] if you rerun puppet, it should add the cron successfully [15:53:43] because it will create the cron on second run [15:53:57] so it is not very clean, but end up working "eventually" [15:54:15] yep [15:54:51] what's the issue? 
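
The first-run/second-run behaviour just described is easy to reproduce by hand; a rough sketch on gallium (the clean fix being discussed is an explicit require on the user rather than relying on the re-run):

    sudo puppetd --test              # run 1: cron fails with "crontab: user `testswarm' unknown"
    getent passwd testswarm          # ...but the system user now exists
    sudo puppetd --test              # run 2: the cron resource applies cleanly this time
    sudo crontab -u testswarm -l     # confirm the testswarm crontab is in place
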
[15:55:38] cronjobs and a system_user are setup via puppet, and on first run you get "crontab: user `testswarm' unknown", and after that you have the system user [15:55:57] ...so cron should require the systemuser [15:56:07] but since it works on second run..it works anyways [15:56:17] sure, but that's not how it should be [15:56:21] it's a dependency [15:56:30] if you wait long enough, it'll probably, hopefully resolve itself [15:56:31] RECOVERY - SSH on dataset1 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [15:56:35] yea, that's what hashar meant with "not very clean" [15:56:41] but if you specify dependencies as they should be, then that's not even necessary [15:56:57] hashar: there is another problem though, unfortunately [15:57:19] the cron job is running and setting up ton of MediaWiki installation right now :)) [15:57:21] looks promising [15:57:31] "Could not evaluate: Could not retrieve information from environment production source(s) puppet:///files/testswarm/testswarm-checkouts.conf at /var/lib/git/operations/puppet/manifests/misc/contint.pp:246" [15:58:27] I thought I removed that one yesterday [15:59:46] hashar: also wanna add the "require" then (cron / systemuser)? [16:01:46] well not really needed [16:01:54] please do [16:02:23] why does puppet does not detect that dependency automatically ? [16:03:02] because it can't [16:03:13] puppet doesn't necessarily manage that user [16:04:19] mutante: I think we have to purge and reinstall the testswarm package [16:04:27] mutante: some user rights are incorrect [16:05:24] ok, purging it, then letting puppet re-install it [16:06:27] by purge you mean incl. to deconfigure the db this time? [16:09:08] would be an option to avoid "dbconfig-common", btw? or is that explicitely wanted [16:09:46] New patchset: Hashar; "testswarm: make sure we have a system user" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1659 [16:09:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1659 [16:10:15] dbconfig-common is a nice way to setup a database on Debian that save all the trouble of loading db9 or something else [16:10:33] plus that would let us easily trash the content of that database whenever we have too [16:10:43] the app does not really support that from its web interface [16:10:58] mutante: what I would need is to have the postinst script to run [16:11:05] maybe dpkg-reconfigure triggers that [16:11:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1659 [16:11:12] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1659 [16:11:19] yea, but just because it asks the dialog question, and wondering how that works if puppet installs the package [16:11:21] mark: thanks ) [16:11:27] (not needing a human) [16:11:41] not sure [16:11:57] so want me to deconfigure db or keep that config? [16:12:26] lets just keep that config [16:12:28] best test would be to remove all [16:13:03] ok, well, it does not need to be told a password, it just needs to be told , and then generates a random pass [16:13:04] that is simple unless you want to dpkg --purge testswarm then have it installed by puppet [16:13:25] that sounds like a good change the same happens if puppet installs it [16:14:09] then it doesn't ask that question [16:14:13] and picks the default [16:14:16] ok, package purged, besides db config.. 
it did not remote /etc/testswarm because that was not empty [16:15:06] mark: ok, good, then it should be fine to use dbconfig-common [16:15:16] not empty because puppet install a sample conf there [16:15:21] so that is expected [16:15:32] hashar: want to have /etc/testswarm deleted ? [16:15:43] before puppet run [16:15:46] yes please :) [16:15:54] so we know what happens there [16:16:28] that is installed by the .deb package anyway [16:16:45] RECOVERY - Puppet freshness on ms1002 is OK: puppet ran at Wed Dec 21 16:16:37 UTC 2011 [16:18:09] hashar: oh oh.. not so good :/ dpkg error re-installing the package [16:18:43] nothing else ?: / [16:18:44] hashar: and it is dbconfig-common :p [16:19:04] sanity check failed for dbc_dbuser. [16:19:04] error encountered creating user: [16:19:04] No database user specified. [16:19:04] dbconfig-common: testswarm configure: aborted. [16:19:04] dbconfig-common: flushing administrative password [16:19:21] dpkg: error processing testswarm (--configure): [16:20:25] can you somehow specify that database user in your package? [16:20:44] that is by installing the package with puppet? [16:20:54] yes, this is from puppet log [16:21:21] the user should just be "testswarm" aka the default [16:21:56] maybe dpkg just skip all the configuration question since there is no human reading the screen? [16:22:17] I would just dpkg-reconfigure it manually [16:22:19] i guess need to call dbconfig-common with "-u testswarm" or so [16:22:49] but how to pass that from testswarm package to dbconfig-common [16:23:41] hashar: yes, as mark just said it picks the defaults then, and that is fine for the password, because it gets generated, but we still need the user name ..hrmmm [16:24:12] still looking at that :) [16:24:18] sure i can reconfigure it manually for now, but yea [16:24:32] it cant really be installed automatically then [16:25:19] dpkg-reconfigure: testswarm is broken or not fully installed :p [16:25:20] well it at least ask for a password at one point [16:25:25] hehe [16:26:37] ok, fixed manually, it's installed, db config should be untouched, other configs rewritten by puppet [16:26:50] how does it look to you now? [16:27:18] i have lost my terminals [16:28:22] hashar: how does it look for you now? installed package manually, puppet finishes runs [16:28:52] well it is missing user / password [16:29:19] dbconfig-common should generate them, create the db and all and place username/password in /etc/testswarm/config.ini [16:29:32] duh :p told it to not _de_configure [16:29:45] hehe [16:30:10] maybe dpkg-reconfigure regenerate them? [16:30:20] or you have to --purge the package again and reinstall it manually [16:31:11] unix socket or tcp/ip? (havent been asked before ?) [16:31:49] doh [16:32:10] ok, within configuration dialog: [16:32:12] mysql said: ERROR 1050 (42S01) at line 5: Table 'clients' already exists [16:32:12] let s do tcp/ip [16:32:20] you dont care if i "ignore", right [16:32:30] yeah just skip that [16:32:39] hashar: that would be another that is not default :p [16:33:20] hashar: config.ini has content again now [16:33:34] looks better [16:33:59] so the /etc/testswarm/config.ini file is loaded by the app [16:34:07] which is denied access: https://integration.mediawiki.org/testswarm/ [16:34:11] testing manually [16:37:12] hashar: oh you said it is supposed to talk to db9 already? [16:37:19] no [16:37:22] to the local database :) [16:37:27] ok.. [16:37:58] does dbconfig-common just write that one config.ini for you? 
might be easier to just have that config in the private repo.. ? [16:38:34] the idea in using dbconfig-common is that everything is setup automatically [16:38:44] hashar / mutante: to answer questions packages ask during installation you can use debconf-get-selections and debconf-set-selections [16:38:46] PROBLEM - Puppet freshness on es1002 is CRITICAL: Puppet has not run in the last 10 hours [16:39:00] that's how you teach puppet the answer to a question without hardcoding it in the package. [16:39:08] looks like something got screwed in the process and the testswarm user has a wrong password [16:39:17] maybe the previous password since we did not flush the db [16:39:38] maplebed: ah, if that works in puppet, great! thx [16:40:12] mutante: it's a debian packagke thing, rather than a puppet thing. it's used anywhere you need unattended package installation. [16:40:22] ok, gotta run to post office, back shortly. [16:40:31] ok [16:40:38] mutante: https://integration.mediawiki.org/testswarm/ [16:40:41] mutante: password fixed [16:40:45] maplebed: thanks for the tip [16:40:48] maplebed: yea, but did we use it in puppet yet? [16:41:02] mutante: the DB Password for mysql user testswarm was incorrect :) [16:41:04] I'm not sure. I think leslie might have done something with it. [16:41:14] maplebed: like how to combine it with the puppet package class etc [16:41:23] maplebed: ok, will check, thx [16:41:38] I would suggest that the package installation requeires an exec that runs debconf-set-selections, but there might be a better way. [16:42:06] mutante: the last change would be to chown testswarm:testswarm /etc/testswarm :D [16:42:11] or I can amend puppet [16:42:14] hashar: i dont know how you fixed it and the password was just created, but great :) [16:42:49] maplebed: sounds reasonable, yea [16:43:45] hashar: if you want to know it's really fixed, let puppet do it right away, i guess [16:44:18] hashar: but if you want me to for testing right now, also tell me, np [16:44:38] for someone sick and in bed you are doing too much work :-P [16:46:00] New patchset: Hashar; "testswarm: fix /etc/testswarm permissions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1660 [16:46:03] mutante: here is the change :) [16:46:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1660 [16:47:16] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1660 [16:47:19] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1660 [16:50:48] hashar: just the warning about testswarm-checkouts.conf is left. 
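
maplebed's debconf-get-selections / debconf-set-selections tip above comes down to roughly this; the key names follow dbconfig-common's usual template naming and are assumptions, not values read from the testswarm package, so dump the real ones first:

    sudo apt-get install debconf-utils              # provides debconf-get-selections
    sudo debconf-get-selections | grep ^testswarm   # see which questions the package actually registers
    printf '%s\n' \
      'testswarm testswarm/dbconfig-install boolean true' \
      'testswarm testswarm/db/app-user string testswarm' \
      | sudo debconf-set-selections
    # (a password key along the lines of testswarm/mysql/app-pass usually exists too)
    sudo apt-get install testswarm                  # an unattended install now picks these answers up
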
the directory is owned by testswarm [16:51:19] great [16:51:42] i am calling "deploy testswarm on gallium" resolved [16:52:39] New patchset: Hashar; "testswarm-checkouts.conf is not needed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1661 [16:52:47] yeah looks fine mutante [16:52:56] I am finishing the software configuration [16:52:59] cool [16:53:02] something that can't really be puppetiez [16:58:20] https://integration.mediawiki.org/checkouts/mw/trunk/r105770/tests/qunit/?filter=mediawiki.util&_=1324486738245&swarmURL=https%3A%2F%2Fintegration.mediawiki.org%2Ftestswarm%2F%3Frun_id%3D57%26client_id%3D5%26state%3D [16:58:21] bah [16:58:29] not fully working yet :D [16:58:37] does not recognize index.html [17:01:10] New patchset: Hashar; "add index.html to DirectoryIndex for integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1662 [17:01:57] New review: Hashar; "This is a cherry pick of bf0f391d from test to production" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1662 [17:02:00] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1661 [17:02:00] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1661 [17:02:12] mutante: and 1662 :) [17:02:23] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1662 [17:02:24] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1662 [17:02:28] mutante: looks like I did not manage to get all changes merged from test to production :(( [17:04:01] mark: you did you poke around at the pmtpa swift cluster at all today? [17:06:35] hashar: your change to add index.html looks good and has been applied, but still does not do what you expected it seems (yes, apache reloaded) [17:06:51] :-( [17:07:33] maplebed: no, not yet [17:07:58] (but not done today yet) [17:09:10] mutante: well it works for http not for https :-D [17:09:45] hashar: ah ..ok [17:09:59] maplebed: get well ;) [17:10:08] tnx. [17:10:46] I wonder if streber is in a bad enough state that i need to fix that first :/ [17:10:52] +1 [17:11:50] I like your idea of only splitting container for commons [17:11:56] or, making it configurable "per wiki" [17:12:01] (and only do it for commons, for now) [17:12:29] New patchset: Hashar; "testswarm: index.html as default for HTTPS too" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1663 [17:12:35] mutante: that should be the last change for n ow [17:12:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1663 [17:12:43] mutante: the RT ticket can be closed :) [17:13:00] mutante: will drop a thank you / congratulations mail this evening hopefully [17:13:21] streber is fucked in some weird way, I think it's hardware [17:13:29] it's been weird for some weeks now [17:14:32] yea, more than one of use rebooted it due to stuff like "dpkg being stuck due to kernel bug" [17:14:40] us [17:14:54] yeah [17:15:14] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1663 [17:15:14] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1663 [17:15:23] eep, what happened last night ? [17:15:30] last night? 
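
The http-vs-https asymmetry being chased here usually means the directive only lives in the port-80 vhost; a quick way to confirm from the shell (the URL is just the example from the log, and none of this is the real integration.mediawiki.org config):

    apache2ctl -S                       # dump the parsed vhosts and which config file each one came from
    curl -sI  http://integration.mediawiki.org/WikipediaMobile/nightly/  | head -1   # plain HTTP vhost
    curl -skI https://integration.mediawiki.org/WikipediaMobile/nightly/ | head -1   # -k in case the cert doesn't validate
    sudo apache2ctl graceful            # reload once the SSL vhost has the DirectoryIndex too
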
[17:17:15] hashar: https://integration.mediawiki.org/checkouts/mw/trunk/r105770/tests/qunit/ [17:17:16] New review: Hashar; "It fixed the issue! Thanks for the fast merge 8-)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1663 [17:18:45] mutante: wonderful :) [17:18:46] in my scrollback looks like some bad editing speed/site suckage.. but then i closed my laptop and the scrollback goes away... [17:18:53] mutante: so as I said, you can close the testswarm ticket [17:18:55] hashar: np:) RT closed:) [17:18:58] mutante: I can't find it in RT thgouh [17:19:01] oh good [17:19:19] I still have some bugs but that is in the app I think t [17:19:24] the ops part should be fine now [17:19:27] but i can just go find chat logs... [17:19:38] yeah, new bugs are not part of "initial deploy".. /me nods [17:20:16] well that took a looong time but at least I have learned lot of stuff in the process [17:20:27] the first would be to use a topic branch and run the VM out of it [17:21:53] mark, apergos: I'm going to start up the uploader again to keep populating swift. We're at 2.2 million objects in the commons bucket right now and read throughput (if I warm up the dataset) is still >1000qps. [17:22:05] feelfree to kill it if you like (it'll be running as me on hume) [17:22:11] cool [17:22:18] hume had a disk space warning [17:22:28] i'll keep an eye on it [17:22:48] hume is fixed [17:23:09] maplebed: let's create a ganglia cluster for them, (if you haven't already) [17:23:16] i can do that now if you want [17:23:17] I haven't. [17:23:35] I've just been using http://ganglia3-tip.wikimedia.org/?r=day&cs=&ce=&m=&tab=ch&vn=&hreg[]=ms[123]\.pmt [17:23:36] should we do one for swift, or proxy/storage separately? [17:23:58] I guess that works too [17:24:10] oh i lied, there's a new unrelated hume disk space warning. looking [17:24:21] I think just swift is fine rather than splitting them into proxy/storage. [17:24:30] alright [17:24:37] that'll be easy to do by eye or with the aggregate graphs like that likn. [17:24:44] ok [17:24:46] thanks! [17:25:09] ~300M of MegaSAS.log :-) [17:25:17] New patchset: Dzahn; "change max_concurrent_checks from 8 to 64 - rebased" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649 [17:25:29] New patchset: Dzahn; "remove special.cfg from nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1648 [17:25:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1649 [17:25:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1648 [17:25:46] New review: Dzahn; "was already reviewed" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1649 [17:27:18] New patchset: Mark Bergsma; "Create separate Ganglia cluster(s) for Swift" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1664 [17:27:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1664 [17:27:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1664 [17:27:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1664 [17:28:42] ok (busy) [17:30:56] hi apergos, online? 
[17:31:05] New patchset: Mark Bergsma; "Setup ganglia aggregators for swift clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1665 [17:31:09] !log ran apt-get clean on hume to clear out ~600M space on the / partition [17:31:18] Logged the message, Master [17:31:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1665 [17:31:25] New patchset: Dzahn; "change max_concurrent_checks from 8 to 64 - rebased - remove special.cfg here so it doesnt break" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649 [17:31:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1649 [17:31:46] Jeff_Green: re: hume's disk space - I just cleared out some too. [17:31:53] ya [17:32:06] wow--I didn't know apt was prone to bloat like that [17:32:30] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1648 [17:32:31] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1648 [17:32:39] nfs is still hurting on hume though. [17:32:40] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1649 [17:32:41] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1649 [17:32:58] it's taking forever to give me ls in my homedir. [17:33:26] I really want to mute the gerrit bot. [17:33:49] use /ignore ? :) [17:34:02] i wrote up a work summary on linkedin relating to the work done in tampa a few months ago. need to be sure it's kosher and doesn't say anything that should not be public. [17:34:04] hi mark [17:34:07] /var/tmp/refreshImageMetadata.core [17:34:26] from 10/25, 200M, nuking [17:35:12] andrewS: then I suggest you restrict it to "did some hands physical work on Wikimedia equipment in its datacenters" or something resembling that [17:35:34] but how can i impress google and microsoft and apple with only that!? [17:35:36] j/k [17:35:39] restricting [17:35:43] can i url you though? [17:35:54] well you failed to impress us in the first place, so no chance anyway [17:36:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1665 [17:36:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1665 [17:36:31] yeah you can [17:36:37] thank you [17:38:52] hume has a lot of kernels installed [17:39:59] apt-get -s autoremove [17:40:01] a gb of / is php versions [17:40:18] mutante: I have great fear of that command, it's done me wrong nearly every time I've used it [17:40:31] Jeff_Green: the "-s" makes it a simulation [17:40:53] then remove packages by hand? [17:41:23] depends what it wants to remove i guess [17:41:50] interestingly, not kernels: byobu cpu-checker java-common odbcinst odbcinst1debian1 python-newt screen unixodbc update-notifier-common [17:41:55] but it's kind of new that apt-get can do it, and not just aptitude [17:41:56] damn: http://ganglia3-tip.wikimedia.org/?c=Miscellaneous%20pmtpa&h=nfs1.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [17:42:08] before used to be stuff like "debfoster"/"deborphan" for that [17:42:23] maplebed: wow [17:42:34] no wonder fenari and hume are not particularly responsive. [17:43:03] who would be responsible for these: hume:/usr/local/apache/common-local/php-* ? 
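
The cautious version of the cleanup being discussed, along the "simulate first, remove old kernels by hand" line (the kernel version in the last command is an example only, not one checked against hume):

    df -h /                                    # see how bad it is
    sudo apt-get clean                         # drop cached .debs (the ~600M freed earlier)
    sudo apt-get -s autoremove                 # -s = simulate only; just review what it *would* remove
    dpkg -l 'linux-image-*' | grep ^ii         # list installed kernels
    uname -r                                   # never remove the one that's running
    sudo apt-get remove linux-image-2.6.32-24-server   # example version; pick an old, unused one
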
[17:43:28] Jeff_Green: i would remove old kernels manually then i guess, if its for saving disk space [17:43:32] are those propagated with code or hand-installed [17:43:52] we're down to 75% so it's not urgent [17:46:08] (busy) [17:48:42] ah it all becomes clearer looking at fenari:/home/w/common . . . [17:49:06] maplebed: Is that you hogging NFS? [17:49:19] 1251 ben 20 0 11908 1072 496 D 21 0.0 0:01.99 rsync -avP ms5:/tmp/wikipedia-filelist.txt ./wikipedia-filelist.txt [17:49:32] hah [17:52:16] /usr/local/apache/common appears in "wikimedia-task-appserver" package, not common-local though [17:52:57] RoanKattouw: no. [17:53:31] I ran that after it started getting hogged; I don't think the transfer's even started yet. [17:53:34] Grr, now fenari is even less responsive [17:53:50] It's hanging even on things like ls -l /proc/1251 [17:54:05] mutante: i'm not sure how it works--but hume:/usr/local/apache/common is a symlink to common-local, and common-local has the same php-* dirs as fenari:/home/w/common [17:54:16] I killed that process anyways though; it certainly isn't helping. [17:55:22] NFS overloads always take a while to calm donw [17:55:24] Up to 15 mins [17:56:22] lrwxrwxrwx 1 ben wikidev 0 2011-12-21 17:54 /proc/1251/cwd -> /home/ben/swift/filelists [17:56:26] meh [17:56:32] iptables is now blocking ganglia access :/ [17:56:40] I think that means your process was at least contributing to the NFS lockup, even if it wasn't the cause [17:56:42] RoanKattouw: is that (what Jeff says) related to the task we created to have a smaller package , which is a subset of wikimedia-task-appserver, and contains some scripts only? no..hm? [17:56:47] the ghost has taken over the machine [17:56:56] mark: you notice streber's upgrade went very poorly? [17:57:04] mutante: It sounds like it's a bit of a mess so we might as well clean that up too [17:57:12] Ryan_Lane: wasn't the upgrade [17:57:19] streber was upgraded to lucid a long time ago [17:57:27] it's broken hardware wise I think, it's been weird for a few weeks now [17:57:49] access blocked by nfsdeath == good time for more coffee [17:58:07] Jeff_Green: ^^ and RT 2129 [17:58:15] mark: there's a ton of shit broken [17:58:48] bad permissions and screwed up users, and broken dpkg and.... [17:58:50] etc etc etc [17:59:07] mutante: i don't see the connection (yet)? [17:59:28] Jeff_Green: the connection for me was: dpkg -L wikimedia-task-appserver | grep local [17:59:58] Jeff_Green: because that package puts stuff there, and you say its a symlink.. etc.. we are all not sure right now ..hrmm [18:00:02] oh, i haven't been able to get a session back on hume in a while, will check when fenari is usable again [18:00:53] Ryan_Lane: I know [18:01:06] it's scheduling or I/O related [18:01:07] back [18:01:29] yes that would be fenari [18:01:34] what was that alternative bastion box leslie was working on? [18:01:37] is he copying directly to home or something? [18:01:46] New patchset: Mark Bergsma; "Allow gmond access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1666 [18:01:47] bast1001 [18:01:54] thx [18:01:58] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1666 [18:02:03] mark: o.O [18:02:08] it's not using nfs yet, so it's quite speedy at the moment Jeff [18:02:11] that's weird [18:02:12] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1666 [18:02:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1666 [18:02:13] let's see, I was busy for 1 hour and got pings in three channels... [18:02:40] leslie let's just remove nfs support from the kernel on that box :-) [18:02:49] +1 [18:02:52] +1111111111111111111111 [18:02:53] :) [18:02:54] +1 [18:03:06] let's just put nfs everywhere, so it doesn't time out in one spot [18:03:08] s/that/any/ [18:03:18] no NFS on any box [18:03:19] * Ryan_Lane ducks [18:03:33] s/in one/in just one/ [18:03:35] Hey guys, Ariel's busy so I'll tell you guys instead: srv224 (scaler) has a full dsik [18:03:40] I'm probably going to switch to cluster for labs [18:03:40] heh [18:03:44] *gluster [18:03:47] no NFS there either [18:04:38] can we remove the stuff in /tmp on srv224 ? [18:04:42] yes [18:04:43] go ahead [18:05:13] I was just over there [18:05:18] cleaning it up :-P [18:05:23] oh [18:05:26] s'ok [18:05:32] now I can go do dishes and eat instead [18:05:34] i just killed all the /tmp and it's down to 68 [18:05:35] :) [18:05:40] I'm actually done for the day :-D [18:05:45] awesome [18:05:46] nfs has recovered. [18:06:03] maplebed: ok kill it before it rises again to attack us! [18:06:10] I also apt-get cleaned as the first emergency measure so we would have a few spare bytes [18:06:14] but that didn't get us much [18:10:15] wow, swift does indeed use a lot of cpu on the storage nodes [18:10:43] it's not i/o bound at all [18:11:13] New patchset: Bhartshorne; "allowing swift hosts to be gmond listeners" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1667 [18:11:22] mark: ^^^ [18:11:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1667 [18:11:28] er I already did that ;) [18:11:31] thanks though [18:11:34] oh. [18:11:37] drnit. [18:11:57] in almost exactly the same way [18:12:07] yours wants multicast, but this is actually unicast tcp from spence [18:12:11] (and some other ganglia servers) [18:12:32] I'm very curious to see how long the find takes to run on ms5... [18:12:39] I think they're both necessary. [18:13:07] unicast from spence for gmetad to pick up the data, multicast/udp for the host to hear what its peers are saying. [18:13:43] apergos: it's been running for over 12 hours already. [18:13:49] I didn't time it though; shoulda. [18:13:49] right... mine works because it accepts it from the right source [18:14:04] but only for public servers [18:14:11] maplebed: will you amend yours then? [18:14:16] mark - but only tcp [18:14:20] true [18:14:25] I wonder why it's working for ms1-3 then [18:14:31] I'll abandon mine [18:14:36] they're reporting to the misc aggregator. [18:14:42] which dosn't have these rules. [18:15:01] http://ganglia3-tip.wikimedia.org/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Swift+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4 [18:15:03] owa1 and 2 now [18:15:18] i'm having dinner now [18:15:22] shall I run oprofile afterwards? 
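
maplebed's unicast-vs-multicast split above translates into two firewall holes, roughly as below; 239.2.11.71 is ganglia's stock multicast group and the rules are a sketch, not copied from the actual puppet iptables definitions:

    # let gmetad on the aggregator (spence etc.) poll gmond's XML port over TCP
    iptables -A INPUT -p tcp --dport 8649 -j ACCEPT
    # let the swift hosts hear each other's multicast metric packets over UDP
    iptables -A INPUT -p udp -d 239.2.11.71 --dport 8649 -j ACCEPT
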
[18:15:24] on ms1 [18:15:44] oh, ms123 work because there's a rule 'accept everything from 10.0.0.0/8' [18:15:58] ah :) [18:16:01] ok [18:16:20] I'll abandon my change and redo it for udp. [18:16:21] food now [18:16:23] thanks [18:16:25] bbl [18:16:55] Change abandoned: Bhartshorne; "mark already did this." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1667 [18:22:49] New patchset: Bhartshorne; "allowing swift hosts to hear their peers' multicast traffic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1668 [18:22:59] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/1668 [18:24:37] New patchset: Bhartshorne; "allowing swift hosts to hear their peers' multicast traffic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1668 [18:24:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1668 [18:25:08] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1668 [18:25:08] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1668 [18:29:10] RECOVERY - Disk space on hume is OK: DISK OK [18:32:26] ffs [18:32:41] Are we now in NFS death /again/? [18:37:10] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:37:59] yeah, you can use bast1001 though as a bastion host [18:38:44] Does it have a public DNS name? [18:39:16] Hah, it does [18:39:19] bast1001.wikimedia.org [18:39:46] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/site.pp;h=84c9925f142cd50ebb6a60bcddc7b9a4d02651cb;hb=HEAD#l609 says... damn RoanKattouw beat me to it [18:39:51] it's magic [18:40:00] LeslieCarr: bast1001 doesn't have /home mounted, so I can't sync code from it [18:40:08] oh it doesn't have nfs [18:40:12] which is why it is magic [18:40:17] and not dying [18:40:29] heh [18:40:40] And also useless for what I want to do right now [18:42:01] New patchset: Hashar; "bug 33301, bad SSL cert at integration.mediawiki.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [18:42:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1669 [18:42:27] ssh'ing into fenari is timing out for me - something going on? [18:43:05] nfs death [18:43:35] sadness [18:43:41] yes [18:45:41] when oh when is the netapp we spent tons of money on going to serve /home [18:46:12] * RoanKattouw can't wait [18:46:40] that's waiting on mark .. [18:47:13] LeslieCarr: want to take it over and tell mark after its serving /home?? [18:47:17] hehehe [18:47:27] tempting [18:48:32] the entire eng team will chip in to buy you a pony. where pony might be whisky. [18:48:51] binasher: there is now process monitoring for "mobile traffic loggers", they turned CRIT first because i configured them to check for _exactly_ 2 processes, changed to accept 1-4 now [18:49:01] mm whisky pony! [18:49:21] whiskey pony?!?! [18:49:30] i finally know what i want for christmas [18:49:43] there's still the HTTP WARNING: HTTP/1.1 404 Not Found issue on cp 1041-1044 … mutante got an idea what's up with that ? 
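The gmond access being sorted out above comes in two flavours: a unicast TCP poll from the gmetad collector, and the multicast UDP chatter that lets peers hear each other. A hedged illustration of the pair of rules involved, using ganglia's stock port and multicast group and a placeholder collector address (the actual puppet rules are not reproduced here):

    # let the gmetad collector poll the aggregated XML over TCP (10.0.0.1 is a placeholder, not the real collector)
    iptables -A INPUT -p tcp -s 10.0.0.1 --dport 8649 -j ACCEPT
    # let the host hear its peers' multicast metric announcements over UDP (ganglia's default group and port)
    iptables -A INPUT -p udp -d 239.2.11.71 --dport 8649 -j ACCEPT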
[18:50:00] i tried doing the get checks myself and it seemed ok [18:50:21] whiskey made from unicorn tears [18:50:28] haha http://www.youtube.com/watch?v=zv-mHSWMWnQ [18:50:42] * AaronSchulz is always afraid to click youtube links [18:51:09] LeslieCarr: its something that broke after the puppet version upgrades, when mark refactored the varnish configs for the new version [18:51:47] AaronSchulz: for that it's cool to have a bot read the of pasted URLs.. this is just "Who are you? Whisky pony " [18:52:22] * AaronSchulz would like an advanced AI rick-roll detector [18:53:10] <mutante> LeslieCarr: let met check.. [18:53:10] <binasher> mutante: its weird that the check would have returned 1 proc in the last few days, they both have 2 running and they've been running since before you implemented the check. so if it returned 1, there's something broken with the nagios proc check [18:53:27] <binasher> mutante: i will fix the varnish backend check [18:53:29] <nagios-wm> PROBLEM - Misc_Db_Slave on db10 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [18:53:39] <mutante> binasher: it was 1 and also 3 :p [18:53:53] <mutante> binasher: ah, cool [18:54:31] <binasher> 3 could be a stupid "ps … | grep .." sort of thing. either way, a check that accepts 1 proc running isn't really helpful [18:54:47] <mutante> binasher: to be exact it checks for "processes with args varnishncsa" [18:55:10] <mutante> that is the -a option of check_procs , as opposed to -C, but it worked better for most other process checks like this [18:57:00] <hexmode> LeslieCarr: you have a bugzilla account? [18:57:15] <LeslieCarr> hexmode: actually not ... [18:57:20] <LeslieCarr> i should create one [18:58:12] <hexmode> LeslieCarr: I was gonna add you to this one https://bugzilla.wikimedia.org/show_bug.cgi?id=33293 ... but I'll let you add yourself if you want [18:58:15] <mutante> still using fenari btw.. [18:58:59] <mutante> ssh from there is ok, just dont "ls" :p [18:59:52] <binasher> mutante: i just tried "/usr/lib/nagios/plugins/check_procs -C varnishncsa" on cp1043 1000 times, and got "2 processes" every time.. -a is more of a fit if something is spawned via an interpreter (i.e. python) or subshell [18:59:55] <RoanKattouw> Don't do any FS access to /home in fact [19:00:45] <mutante> binasher: ok, let me create a custom check command then, i just made a generic one that replaced all others, and that uses -a [19:00:49] <LeslieCarr> cool added , thanks hexmode :) [19:01:00] <Jeff_Green> did someone just mess with db9? [19:01:19] <mutante> binasher: or maybe i'll make that configurable in the generic one right away [19:01:30] <binasher> Jeff_Green: whats up with db9? [19:02:15] <Jeff_Green> db10 replication fail, complaining about master's binary log [19:02:28] <mutante> binasher: and then you want a CRIT right away if it's 1 or 3, so 2!2!2!2 for the warn and crit threshholds, right [19:02:39] <Jeff_Green> also db10's mysql install is screwed--mysql-at-facebook is what's running, but the install has been corrupted [19:02:41] <binasher> Jeff_Green: that has been broken since last week, and why i have to have a db9 outage this evening [19:02:47] <binasher> please don't touch db10 [19:02:56] <Jeff_Green> i won't--just observing [19:05:01] <mark> if anyone wants to setup the netapp, go ahead eh ;) [19:05:07] <mark> I'm not particularly attached to the thing [19:05:35] <mark> or if you need a quick fix, tune drbd on nfs1/nfs2 to do async replication [19:07:06] <RoanKattouw> I thought there was a reason we didn't do that async? 
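The two styles of process check being weighed above look roughly like this on the command line. The thresholds shown are the relaxed 1 to 4 range from the earlier change and the strict exactly-2 check binasher argues for; the real nagios command definitions in puppet may differ.

    # match on the argument string, accept anywhere from 1 to 4 processes (the relaxed check)
    /usr/lib/nagios/plugins/check_procs -w 1:4 -c 1:4 -a varnishncsa
    # match on the command name and insist on exactly 2 (binasher's preference)
    /usr/lib/nagios/plugins/check_procs -w 2:2 -c 2:2 -C varnishncsa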
[19:07:09] <mark> binasher: so, sorry I didn't setup the netapp and fixed puppet instead ;p [19:07:24] <RoanKattouw> OK fenari has calmed down [19:07:31] <mark> what is up with fenari? [19:07:36] <RoanKattouw> It looks like my fatalmonitor script may have been keeping it in the NFS death state [19:07:47] <RoanKattouw> mark: It was in NFS death [19:07:47] <mark> what is that doing on fenari? :P [19:08:13] <mark> I hope NFS on the netapp is going to be a bit better, but I don't have all that much hope [19:08:14] <RoanKattouw> Better question: what are the Apache logs doing on /home [19:08:48] <LeslieCarr> so where are the netapps hooked up ? [19:08:52] <LeslieCarr> and is there a ticket i can check out ? [19:09:19] <binasher> mark: not waiting hours for puppet updates to work is pretty great :) [19:09:20] <mark> not really I think [19:09:25] <mark> binasher: I figured! [19:09:46] <mark> personally, NFS on /home hasn't bothered me much [19:09:55] <mark> perhaps that's because I use neither /home nor fenari much at all ;) [19:10:14] <mark> but yeah, I'll setup NFS /home on the netapp soon if noone beats me to it [19:10:15] <RoanKattouw> Deployment goes off of home [19:10:22] <RoanKattouw> So if you're doing deployments, it's a PITA [19:10:30] <mark> I am well aware [19:10:39] <mark> but I felt that fixing puppet was an even higher prio (for instance) [19:11:26] <RoanKattouw> Sure [19:11:28] <mark> there is no particular reason for us to sync drbd replication that I know of [19:11:48] <RoanKattouw> I vaguely recall someone protesting to setting it to async, but I'm not sure [19:12:38] <mark> protocol C is safest [19:12:46] <mark> http://www.drbd.org/users-guide/s-replication-protocols.html [19:13:45] <mark> binasher: so I see that puppet dashboard is equally braindead performance wise as puppet itself [19:13:53] <binasher> yeah :( [19:13:54] <mark> I guess it's just adding reports to the database and never wiping them [19:13:57] <mark> I can turn it off I guess :( [19:14:25] <mark> those rails people just don't care about performance at all [19:14:41] <hexmode> mark: you said ask next week about the exim puppetization. Any news for me and Nemo_bis? [19:14:47] <mark> hexmode: no [19:14:50] <hexmode> :( [19:14:53] <hexmode> ok [19:15:06] <binasher> ok.. i was going to just truncate all of its tables tonight before restoring db10 as a db9 replica but killing it would be good if it doesn't have a "don't be stupid" switch [19:15:16] <mark> binasher: I don't think it does [19:15:17] <hexmode> mark: ok to ask in new year? [19:15:20] <mark> but I haven't investigated it [19:15:25] <mark> hexmode: might be best [19:15:29] <mark> I have a few more urgent things to do [19:15:45] <hexmode> np, just trying to understand where it is at [19:16:02] <LeslieCarr> oh mark, EU router stuff, do we order from someone special out there or should we just have TP ship it over ? 
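To make the DRBD protocol discussion above concrete: the sync/async choice is a single keyword in the resource definition. The fragment below is purely illustrative (resource name, devices and addresses are invented, not the real nfs1/nfs2 configuration):

    resource home {
        protocol A;    # asynchronous: a write is "done" once it hits the local disk and the TCP send buffer
        # protocol C;  # fully synchronous: a write is "done" only after the peer confirms it; safest, slowest
        on nfs1 { device /dev/drbd0; disk /dev/sda5; address 10.0.5.1:7788; meta-disk internal; }
        on nfs2 { device /dev/drbd0; disk /dev/sda5; address 10.0.5.2:7788; meta-disk internal; }
    }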
[19:16:18] <mark> LeslieCarr: if they CAN deliver in europe, that would be best [19:16:23] <mark> but typically they can't [19:16:26] <mark> and then I find someone locally [19:16:30] <mark> but we don't really have someone for j right now [19:17:17] <RoanKattouw> Hmm, wait CPU on fenari is back to zero but NFS isn't out of the woods set, see Ganglia for nfs1/nfs2 [19:17:17] <mark> LeslieCarr: but anyway, an MX80 doesn't replace the core switch part [19:17:30] <RoanKattouw> So any FS operation on /home will send that wait CPU right back up [19:17:38] * RoanKattouw waits some more [19:18:35] <LeslieCarr> no it doesn't [19:18:47] <LeslieCarr> but it replaces the edge, which is the most important part [19:18:53] <LeslieCarr> and will make me much happier :) [19:19:06] <mark> me as well [19:19:42] <mark> I guess I'll have a new coffee table [19:19:48] <mark> or BBQ [19:19:57] <mark> ;) [19:20:29] <LeslieCarr> hehehe [19:23:55] <mark> !log Migrated DRBD sync between nfs1 and nfs2 from protocol C (sync) to A (async) [19:24:00] <mark> there you go [19:24:02] <mark> is NFS better now? [19:24:05] <morebots> Logged the message, Master [19:25:06] <mark> if not, the problem probably isn't drbd, but the slowness of the drives in nfs1 [19:25:34] <RoanKattouw> whoa [19:25:37] <RoanKattouw> It's fast now [19:25:41] <mark> heh [19:25:48] <gerrit-wm> New patchset: Asher; "fix the nagios check for non port 80 varnish instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1670 [19:26:07] <RoanKattouw> I ran 'fatalmonitor' and it started *instantly* [19:26:10] <RoanKattouw> That never happens [19:26:50] <gerrit-wm> New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1670 [19:26:53] <gerrit-wm> Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1670 [19:27:29] <mark> !log Started oprofile run on ms1 [19:27:38] <morebots> Logged the message, Master [19:27:56] <mark> samples % image name app name symbol name [19:27:56] <mark> 517941 54.9241 python2.6 python2.6 /usr/bin/python2.6 [19:27:56] <mark> 216447 22.9527 no-vmlinux no-vmlinux /no-vmlinux [19:27:56] <mark> 94617 10.0335 libsqlite3.so.0.8.6 libsqlite3.so.0.8.6 /usr/lib/libsqlite3.so.0.8.6 [19:27:56] <mark> 86888 9.2139 libc-2.11.1.so libc-2.11.1.so /lib/libc-2.11.1.so [19:27:56] <mark> 4155 0.4406 _sqlite3.so _sqlite3.so /usr/lib/python2.6/lib-dynload/_sqlite3.so [19:28:13] <mark> why was I thinking swift was C code? 
[19:29:21] <nagios-wm> RECOVERY - Varnish HTTP mobile-backend on cp1043 is OK: HTTP OK HTTP/1.1 200 OK - 691 bytes in 0.064 seconds [19:31:16] <mark> samples % image name app name symbol name [19:31:17] <mark> 2981938 31.1515 no-vmlinux no-vmlinux /no-vmlinux [19:31:17] <mark> 1016399 10.6181 python2.6 python2.6 PyEval_EvalFrameEx [19:31:17] <mark> 702962 7.3437 libc-2.11.1.so libc-2.11.1.so /lib/libc-2.11.1.so [19:31:17] <mark> 625996 6.5396 libsqlite3.so.0.8.6 libsqlite3.so.0.8.6 /usr/lib/libsqlite3.so.0.8.6 [19:31:17] <mark> 380234 3.9722 python2.6 python2.6 lookdict_string [19:31:17] <mark> 248843 2.5996 python2.6 python2.6 PyObject_GenericGetAttr [19:31:18] <mark> 233595 2.4403 python2.6 python2.6 dict_traverse [19:31:18] <mark> 195101 2.0382 python2.6 python2.6 visit_reachable [19:31:19] <mark> 125498 1.3110 python2.6 python2.6 visit_decref [19:31:19] <mark> 102711 1.0730 python2.6 python2.6 PyEval_EvalCodeEx [19:31:20] <mark> 98153 1.0254 python2.6 python2.6 tupledealloc [19:34:09] <mark> heh [19:34:20] <mark> isn't it nice how debian/ubuntu split up stuff in separate packages [19:34:47] <apergos> yes, it just warms my heart, every time [19:34:48] <mark> so you have swift-proxy for the proxy server, swift-object for the object server, etc [19:34:57] <mark> and really it doesn't matter at all [19:35:07] <mark> because all they contain are short stubs in /usr/bin [19:35:16] <mark> that call into the entire swift stack under /usr/lib/python [19:35:22] <mark> ...which is entirely contained in python-swift [19:35:29] <apergos> someone over there had a reason for it but whatever [19:35:41] <mark> it looks nice on the surface ;-) [19:35:48] <apergos> :-D [19:41:11] <nagios-wm> RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 691 bytes in 0.063 seconds [19:42:54] <mark> the container servers are the main problem indeed, on the storage nodes [19:44:25] <mark> !log Ended oprofile run on ms1 [19:44:33] <morebots> Logged the message, Master [19:53:09] <mark> maplebed: still here? [20:03:14] <notpeter> mark: re: lily. is there anything I can do to help you out? [20:03:55] <mark> notpeter: I think the exim config is sort of like how it's supposed to be now [20:04:09] <mark> i'm not entirely sure about spamassassin and mailman yet [20:04:19] <mark> they could do with a bit more templatization and such [20:04:24] <mark> and after that's all done [20:04:31] <mark> we'll have to reinstall the box and start over afresh [20:04:32] <mark> and migrate all data [20:04:37] <notpeter> yep [20:04:51] <notpeter> ok, I'll take a look at spamassassin and mailman some more [20:04:55] <mark> ok [20:05:05] <mark> go over the existing docs carefully [20:05:11] <mark> I think everything is in there more or less [20:07:09] <notpeter> ok, sounds good [20:10:53] <hexmode> db9 is ok? [20:11:03] <hexmode> bugzilla seems wonky to me [20:11:16] <hexmode> oh, maybe just slow [20:11:20] <hexmode> nm then [20:13:11] <nagios-wm> PROBLEM - NTP on dataset1 is CRITICAL: NTP CRITICAL: Offset unknown [20:13:16] <apergos> grrrrr [20:14:14] <apergos> can't check itnow [20:14:16] <apergos> busy [20:16:34] <hexmode> bugzilla still slow as molasses, but I was able to leave a comment on the blog w/o a problem [20:16:51] <hexmode> that would require write to the same db, right? 
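The exact invocation behind the profile runs above isn't in the log; a minimal sketch using the classic opcontrol front end, consistent with the "no-vmlinux" placeholder in the pasted output, would be something like:

    opcontrol --no-vmlinux      # system-wide profiling without a kernel image, hence "no-vmlinux" in the reports
    opcontrol --start
    sleep 300                   # let the workload run for a while
    opcontrol --stop
    opreport                    # per-image summary, like the first paste
    opreport --symbols          # per-symbol breakdown, like the second paste
    opcontrol --reset           # clear samples before the next run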
[20:17:09] <hexmode> anyway: taking a break, now [20:17:11] <Jeff_Green> think so [20:17:30] <Jeff_Green> db9 looks very quiet atm [20:17:41] <Jeff_Green> load average: 0.09, 0.07, 0.07 [20:18:02] <gerrit-wm> New patchset: Mark Bergsma; "Template swift storage server configurations" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1671 [20:18:17] <gerrit-wm> New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1671 [20:18:18] <hexmode> Jeff_Green: fwiw bugzilla server is just sitting there spinning when I try to hit the front page [20:18:33] <Jeff_Green> looking [20:18:35] <hexmode> so nothing with db9 writing probably [20:18:55] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1671 [20:18:55] <hexmode> now, really leaving for a walk while you work ;) [20:18:56] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1671 [20:18:58] <Jeff_Green> k [20:21:19] <Jeff_Green> oh my [20:21:22] <Jeff_Green> swapdeath [20:21:28] <gerrit-wm> New patchset: Mark Bergsma; "Experimentally raise worker counts on account/container/object servers to processorcount" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1672 [20:21:41] <gerrit-wm> New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1672 [20:21:41] <Jeff_Green> http://ganglia3.wikimedia.org/graph.php?r=hour&z=xlarge&h=kaulen.wikimedia.org&m=cpu_report&s=descending&mc=2&g=mem_report&c=Miscellaneous%20pmtpa [20:21:50] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1672 [20:21:51] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1672 [20:25:47] <Jeff_Green> looks like kaulen needs a power cycle--can't get an ssh session [20:26:46] <gerrit-wm> New patchset: Hashar; "testswarm: disable mobile browsers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1673 [20:26:58] <gerrit-wm> New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1673 [20:27:21] <nagios-wm> PROBLEM - SSH on kaulen is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:32:38] <gerrit-wm> New patchset: Mark Bergsma; "Restart swift processes on config changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1674 [20:32:47] <Jeff_Green> can't get a terminal via drac either, just stalls and dumps after the password prompt. anything anyone would like to do before I power cycle it? 
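The worker-count experiment in the two changes above comes down to a single knob in each of the storage server configs. An illustrative object-server.conf fragment (the port is just the swift default, and the comment describes the intent of the change rather than the actual template):

    # /etc/swift/object-server.conf, illustrative fragment
    [DEFAULT]
    bind_port = 6000
    workers = 4        # now derived from puppet's $processorcount instead of the previous fixed value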
[20:32:59] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/1674 [20:34:22] <Jeff_Green> !log power cycled kaulen because it's deathswapped and unresponsive [20:34:32] <morebots> Logged the message, Master [20:34:41] <mark> !log Restarted swift-container on ms1 with higher worker count (4 instead of 2) [20:34:49] <morebots> Logged the message, Master [20:37:42] <nagios-wm> RECOVERY - SSH on kaulen is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [20:40:51] <gerrit-wm> New review: Demon; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1669 [20:48:02] <gerrit-wm> New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1673 [20:48:12] <hashar> can someone vote / merge https://gerrit.wikimedia.org/r/#change,1673 please ? [20:48:19] <hashar> that is a really minor change :D [20:48:28] <hashar> in a php script used for testswarm. [20:48:30] <hashar> thanks in advance 8-) [20:49:50] <Krinkle> hashar: Can I do that ? [20:50:39] <gerrit-wm> New review: Krinkle; "OK" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/1673 [20:50:58] <Krinkle> Hm.. I can only choose "good but someone else must approve" [20:51:18] <Krinkle> like "check off" in mediawiki CR [20:54:00] <hashar> yup [20:54:08] <hashar> only ops can +2 / approve it [20:54:23] <hashar> or devs would be able to have change merged in production :) [20:54:37] <hashar> (without ops knowing about it) [20:58:52] <gerrit-wm> New patchset: Hashar; "bug 33301, bad SSL cert at integration.mediawiki.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1669 [20:59:15] <Jeff_Green> whee: FastCGI: server "/srv/org/wikimedia/bzapi/script/bugzilla_api_fastcgi.pl" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds [21:00:46] <gerrit-wm> New patchset: Hashar; "testswarm: disable mobile browsers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1673 [21:00:52] <nagios-wm> PROBLEM - Disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10742 MB (3% inode=99%): [21:02:09] <Krinkle> hashar: what's the difference ? [21:02:31] <hashar> on 1673 ? I have changed its parent [21:02:53] <hashar> so instead of depending upon an unmerged / unrelated change, it depends upon an already merged change [21:03:07] <hashar> so whenever someone validate the change in gerrit, it will be merged [21:03:18] <Krinkle> k [21:03:21] <hashar> instead of being SUBMITTED which mean the change is on hold pending validation of its parent [21:03:28] <hashar> gerrit is fun but a bit complicated [21:05:32] <nagios-wm> PROBLEM - MySQL disk space on db9 is CRITICAL: DISK CRITICAL - free space: /a 10724 MB (3% inode=99%): [21:10:05] <apergos> I think that cron job will finish by tmorrow [21:10:12] <apergos> ( maplebed ) [21:36:53] <mark> !log Running ben's swift thumb loader script in a screen on hume [21:37:02] <morebots> Logged the message, Master [21:40:25] <maplebed> it looks like throughput's higher. [21:40:31] <mark> hey [21:40:33] <mark> I only increased ms1 [21:40:35] <mark> so not ms2 and ms3 [21:40:46] <mark> (for comparison, although with everything being intertwined, it's still hard to say) [21:40:52] <maplebed> I tried that yesterday, but I only increased one at a time. 
[21:40:54] <mark> I see all requests are logged in syslog [21:41:10] <mark> should we turn that off already? [21:41:17] <mark> makes testing harder, but might affect performance too... [21:41:31] <maplebed> writes have to come back from 2 storage nodes before the proxy will return 200, so just increasing the count on one is unlikely to make a visible difference. [21:41:52] <mark> yeah [21:41:57] <mark> but it was just doing replication before [21:42:02] <mark> and even that takes a ridiculous amount of cpu [21:42:07] <mark> so I wanted to see if I saw a difference there [21:42:10] <maplebed> I'd rather leave it on; they're not competing for the same spindles (/ is not a swift storage spindle), so I'd rather leave it on. [21:42:11] <mark> but not really [21:42:19] <maplebed> the logging isn't blocking since it's handed off to syslog. [21:42:21] <mark> yeah, as long as the / i/o is low [21:42:26] <mark> indeed [21:42:30] <mark> might take some more cpu though [21:42:37] <mark> then again, it's not like swift is highly optimized C code either... [21:42:42] <apergos> :-D [21:42:45] <mark> i'll up ms2 and ms3 too [21:43:08] <maplebed> mark: after running puppet, you did restart the swift stuff, right? [21:43:13] <mark> only on ms1 [21:43:20] <mark> the others ran puppet, but I didn't restart yet [21:43:24] <maplebed> "swift-init all restart" or something? [21:43:24] <mark> doing that now [21:43:31] <mark> I did them individually [21:43:34] <maplebed> k. [21:43:37] <mark> and I have a change waiting for puppet to do that automatically [21:43:39] <mark> for your review [21:43:42] <mark> not sure if we want that now [21:43:45] <mark> but it's easy to take out or disable [21:43:48] <mark> it's in gerrit [21:44:34] <mark> !log Ran swift-init all restart on ms2 [21:44:42] <morebots> Logged the message, Master [21:45:33] <maplebed> mark: don't you also need to teach puppet how to restart the service? [21:45:43] <maplebed> (or add in /etc/init.d/ scripts or something) [21:45:45] <mark> the defaults should work [21:45:52] <maplebed> huh. [21:45:53] <maplebed> ok. [21:46:01] <mark> it will use /etc/init.d/swift-container reload (or so) [21:46:04] <mark> and status [21:46:14] <mark> sometimes you need to tweak a bit, but normally reload will work [21:46:42] <maplebed> oh, silli me - I ony looked on owa2, where obviously the swift-container etc. stuff didn't exist. [21:46:59] <maplebed> yeah +1 commit. [21:47:05] <mark> ok [21:47:09] <gerrit-wm> New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/1674 [21:47:18] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1674 [21:47:19] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1674 [21:48:56] <mark> maplebed: can we upgrade to the newer packages easily? [21:49:06] <mark> the recon scripts might be useful, and they're in the 1.4.4 packages [21:49:28] <mark> I'd love to have some better graphs of swift metrics [21:50:14] <maplebed> I haven't looked at the upgrade path. [21:50:24] <maplebed> I'd imagie it'd be pretty easy. [21:50:52] <mark> yeah [21:50:58] <maplebed> all the swift packages are imported into our own repo [21:51:07] <gerrit-wm> New patchset: Mark Bergsma; "Fix paths" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1675 [21:51:22] <Jeff_Green> mark: recall that exim tuning conversation the other day? well . . . 
exim is not my friend. [21:51:23] <maplebed> I chose 1.4.3 because it was "stable". [21:51:25] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1675 [21:51:26] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1675 [21:51:56] <mark> Jeff_Green: it's not? ;) [21:51:59] <Jeff_Green> doesn't like that conditional syntax for some reason, though it checks out via exim -bp [21:52:08] <Jeff_Green> no it's my enemy. my mortal enemy. [21:52:12] <maplebed> mark: can I join your screen session on hume to check out the uploader thingy? [21:52:17] <mark> maplebed: sure [21:52:56] <maplebed> mark: would you mark it multiuser for me? (ctrl-a :multli on) [21:53:15] <maplebed> (whoops. multiuser on, not multi on) [21:53:28] <mark> done [21:53:29] <maplebed> (then ctrl-a :acladd root) [21:53:44] <maplebed> that part might not be necessary. [21:53:48] <mark> done too [21:53:53] <maplebed> thanks! [21:54:18] <mark> I wasn't sure if I was just fetching existing files now or uploading new ones [21:54:22] <maplebed> this looks like a read test. [21:54:27] <mark> ok [21:54:50] <maplebed> you started on the wikipedia-filelist-urls.txt, right? [21:54:58] <mark> I just used the command on the wiki [21:55:05] <mark> since I had no idea where you were [21:55:14] <mark> so *-urls.txt [21:55:20] <maplebed> the first 2.2m of those (more or less) will be a read test. [21:55:26] <mark> ok [21:55:43] <maplebed> I think it'll be more interesting to switch to a write test. shall I? [21:55:48] <mark> I just needed something more interesting to look at than swift replicating itself without any other load [21:55:50] <mark> yes go ahead [21:55:55] <maplebed> k. [21:56:46] <mark> Jeff_Green: so what is not working then? [21:57:10] <mark> you can turn on debugging and see exactly what it does... [21:57:26] <Jeff_Green> ah i haven't tried that [21:57:34] <mark> !log Ran swift-init all restart on ms3 [21:57:42] <Jeff_Green> I put these under the remote_smtp tranport [21:57:43] <morebots> Logged the message, Master [21:57:47] <Jeff_Green> multi_domain = false [21:57:50] <mark> including how it expands the string expansion, if you turn up debugging high enough [21:57:58] <Jeff_Green> connect_timeout = ${lookup {$domain} lsearch{/etc/exim4/deadbeats} {30s}{5m} [21:58:14] <mark> yeah [21:58:24] <mark> that's missing one } [21:58:29] <Jeff_Green> deadbeats is there, and I ran the test with i.e. {google.com} in $domain's spot and that works [21:58:38] <Jeff_Green> oh that's just a bad irc paste, it's there in the config [21:58:42] <mark> ok [21:58:55] <mark> so perhaps it's not expanding $domain correctly despite what the docs say [21:58:59] <Jeff_Green> test works, properly with/without domain that matches on in the list [21:59:02] <Jeff_Green> yeah that's what I suspect [21:59:05] <mark> running with full debugging should reveal that [21:59:13] <Jeff_Green> k. i'll try that [21:59:35] <Jeff_Green> the on startup: invalid time value for connect_timeout [21:59:36] <maplebed> mark: switched. [21:59:48] <mark> Jeff_Green: oooh [21:59:55] <mark> perhaps it doesn't accept string expansions for that option at all [21:59:57] <mark> now that would suck [22:00:00] <maplebed> damn, that geturls script was destroying hume! I wonder why it didn't do that on fenari. 
[22:00:11] <Jeff_Green> well that's a possibility I suppose [22:00:38] <mark> Jeff_Green: then an easy way to do it is to make two transports [22:00:47] <mark> and select in the routers [22:00:52] <Jeff_Green> mark: that's part of why it is not my friend. it's very hard to parse the documentation for exim [22:00:59] <Jeff_Green> ah ok, that makes sense [22:01:03] <mark> oh I find that very easy [22:01:05] <Jeff_Green> I think I can actually figure that out [22:01:10] <mark> but I have a lot of experience with it, I guess [22:01:15] <mark> there's a book about it which takes another route [22:01:33] <mark> maplebed: it was destroying hume just now? [22:01:34] <Jeff_Green> well yeah, but here we are at "maybe it doesn't take string expansions just there" [22:01:45] <mark> maplebed: I think I made nfs lots faster earlier [22:02:05] <mark> Jeff_Green: yeah, unfortunately that's not listed there [22:02:11] <mark> although the docs are generally very expansive [22:02:12] <mark> which I like [22:02:18] <Jeff_Green> I gotta go feed the children, will try the transport-switching approach tomorrow [22:02:22] <mark> most of exim's options support string expansions, just few left that don't [22:02:24] <mark> ok [22:03:01] <Jeff_Green> i've probably not found the right docs yet, but what I've found doesn't say much about that config variable--just that you can set it [22:03:24] * Jeff_Green chowtime! [22:04:44] <mark> maplebed: gonna do a quick oprofile run on ms1 again [22:06:01] <maplebed> mark: the cpu utilization spike on hume over the last hour corresponds perfectly with the runtime of geturls. [22:06:20] <mark> maplebed: and fenari didn't saturize earlier? [22:06:28] <mark> then probably because I made /home NFS faster a few hours ago [22:06:37] <mark> saturate [22:07:27] <maplebed> the other difference is that yesterday the source file (the list of filenames) was ~600M, now it's 2.4G. [22:07:45] <mark> hehe [22:07:50] <maplebed> I loadthe whole thing into memory... :P [22:07:55] <mark> oh ouch [22:08:02] <mark> why is that needed? [22:08:15] <maplebed> it's not. it just makes it easier. [22:09:12] <maplebed> you can see it loading on hume's memory graph. it's amusing. [22:10:00] <mark> hume has more mem than fenari [22:10:47] <maplebed> (there are two parts it makes easier - I get a line count for free and I only have to lock the threads arount incrementing a counter rather than reading a line from the file) [22:10:59] <mark> right [22:11:31] <mark> I wonder if we'd get higher throughput with multiple copies of the script [22:11:36] <mark> I don't trust python's threading [22:11:43] <mark> it's known to have the global interpreter lock [22:12:10] <maplebed> there's also ab in ~ben/swift [22:12:28] <mark> can't we simply run a second copy on a different portion of the files? [22:12:28] <maplebed> it's better suited for high speed performance testing. [22:12:32] <mark> perhaps on a separate box ;) [22:12:35] <maplebed> we could, yes. [22:12:53] <mark> makes it harder to get stats of course... [22:12:56] <maplebed> when I did tests earlier I ran 4 copies of ab on 4 boxes and got 4x speed increases. :) [22:13:06] <maplebed> (20 threads each) [22:13:12] <mark> ok [22:13:12] <maplebed> (that was on the eqiad cluster) [22:13:38] <mark> I think the amount of cpu swift uses on the storage nodes in idle state is worrying [22:13:45] <maplebed> but yes, running multiple copies on separate client boxes is a good idea. [22:14:04] <maplebed> keep in mind, it uses its idle time to do integrity checks. 
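Back to the exim problem: since connect_timeout apparently refuses a string expansion, the two-transport approach mark suggests would look roughly like the sketch below. The transport names, timeout values and reuse of the deadbeats file are illustrative, and it assumes the router's transport option does accept an expansion:

    # router: pick a transport per recipient domain
    dnslookup:
        driver = dnslookup
        domains = ! +local_domains
        transport = ${lookup{$domain}lsearch{/etc/exim4/deadbeats}{remote_smtp_deadbeats}{remote_smtp}}
        no_more

    # transports: identical apart from the timeout
    remote_smtp:
        driver = smtp
        connect_timeout = 5m

    remote_smtp_deadbeats:
        driver = smtp
        connect_timeout = 30s
        multi_domain = false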
[22:14:12] <mark> yeah [22:14:18] <mark> I just hope it doesn't go up a lot as the data set increases [22:14:31] <mark> we can probably turn off some of the integrity checks temporarily right [22:14:34] <mark> to see how much that matters [22:14:40] <mark> they're separate processes I think [22:15:00] <maplebed> hmm... dunno! yeah, I think they are sepraate proceses. [22:15:20] <mark> the -auditor processes [22:16:54] <maplebed> write throughput is still ~50qps. you kicked ms1 and 2 with the new processorcount, right? [22:17:00] <mark> and ms3 too [22:17:02] <mark> so it's not helping [22:17:05] <maplebed> bummer. [22:17:31] <maplebed> i haven't done these things yet: http://docs.openstack.org/bexar/openstack-object-storage/admin/content/ch04s06.html#d5e1206 [22:17:34] <mark> we can try it on the proxies [22:17:45] <mark> but i'm less optimistic that it'll help there [22:18:03] <maplebed> I know the ip_conntrack_max change is necessary on the proxies. [22:18:04] <mark> oi! [22:18:05] <mark> good point [22:18:09] <mark> [435250.560336] nf_conntrack: table full, dropping packet. [22:18:10] <mark> [435250.560638] nf_conntrack: table full, dropping packet. [22:18:10] <mark> [435250.561473] nf_conntrack: table full, dropping packet. [22:18:11] <mark> haha [22:18:16] <maplebed> lol [22:18:24] <mark> can we please, please, temporarily disable all of iptables on ms* [22:18:29] <mark> and test again afterwards? ;) [22:18:36] <maplebed> sure. [22:18:43] <mark> ok, flushing now [22:18:56] <mark> not in puppet yet [22:19:01] <mark> so puppet can restore any moment [22:19:16] <maplebed> restarting the write test so we get fresh numbers. [22:19:30] <mark> !log Flushed all iptables rules down the drain on ms1-3 (live hack, puppet will restore) [22:19:39] <morebots> Logged the message, Master [22:19:44] <mark> aiai [22:19:46] <mark> check dmesg on ms1 [22:19:48] <mark> xfs errors [22:19:54] <mark> sorry ms2 [22:19:56] <maplebed> I suspect, though, that that will be relevant only for reads, since they can hit 1100qps whereas reads are capped at 50qps. [22:20:03] <maplebed> grr... [22:20:12] <maplebed> *writes* are 50qps, reads 1100. [22:20:14] <mark> might be a broken disk [22:20:21] <mark> ok how is it looking now? [22:20:36] <maplebed> 53qps. [22:20:40] <mark> bah [22:21:35] <mark> right, those iptables errors were half an hour ago [22:22:29] <mark> do you have numbers on how long a typical write takes? [22:22:44] <mark> if it takes almost a second, then 30 threads is not enough to saturate [22:22:44] <maplebed> no, sadly. only throughput, no latency stats. [22:22:54] <mark> did you try with more threads? [22:23:18] <maplebed> yeah, but not rigorously. I just ran it randomly with different numbers. [22:23:24] <mark> ok [22:24:35] <mark> so sdab1 on ms2 is toast [22:24:40] <mark> how can we take it out of the test? [22:25:42] <maplebed> sure! [22:25:45] <maplebed> just unmount it! [22:26:01] <mark> and the empty mountpoint won't hurt? [22:26:04] <maplebed> joking aside, yes, we can take it out. [22:26:12] <maplebed> huh. [22:26:16] <maplebed> I don't think so. [22:26:24] <maplebed> but I don't know! [22:26:29] <maplebed> that'll be interesting to see. [22:26:33] <mark> indeed ;) [22:27:12] <maplebed> the docs all say "if it'll be down for not too long, just take it down and ignore it. if it'll be down for longer, adjust the rings and remove it then re-add it when it returns." [22:27:18] <maplebed> so we can remove it from the ring. 
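For reference, the "nf_conntrack: table full, dropping packet" lines above are the classic symptom of the connection-tracking table being too small for the request rate. A hedged sketch of the knob involved (the number is only an example, and the key is spelled net.netfilter.nf_conntrack_max on some kernels):

    # enlarge the conntrack table so new connections stop being dropped
    sysctl -w net.ipv4.netfilter.ip_conntrack_max=262144
    # the live hack above instead flushed every filter rule outright
    iptables -F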
[22:27:28] <mark> whatever you prefer for testing now [22:27:55] <maplebed> let's unmount it, watch it for 5m and see if swift drops any files in the unmounted directory, then remove it from the ring. [22:28:01] <mark> hrmf [22:28:03] <mark> that's a new drive [22:28:05] <mark> the one we replaced [22:28:10] <mark> oki [22:28:14] <mark> if we can unmount [22:28:15] <maplebed> maybe the connector's busted? [22:28:15] <mark> may be busy [22:28:20] <mark> yeah possible [22:28:45] <mark> do you want to or shall i? [22:30:14] <maplebed> please go ahead. [22:30:32] <maplebed> from the swift-ring-builder help doc for the 'remove' command: [22:30:34] <maplebed> Removes the device(s) from the ring. This should normally just be used for [22:30:34] <maplebed> a device that has failed. For a device you wish to decommission, it's best [22:30:34] <maplebed> to set its weight to 0, wait for it to drain all its data, then use this [22:30:37] <maplebed> remove command [22:30:59] <mark> unmounted [22:31:13] <mark> no new files have appeared yet [22:31:27] <mark> !log Unmounted /srv/swift-storage/sdab1 on ms2 (borken filesystem) [22:31:36] <morebots> Logged the message, Master [22:31:52] <mark> and this woud be why ;) [22:31:53] <mark> root@ms2:/srv/swift-storage# ls -ld sdab1 [22:31:53] <mark> drwxr-xr-x 2 root root 4096 2011-12-13 16:37 sdab1 [22:32:18] <mark> and that in itself is good I guess... [22:33:07] <mark> Dec 21 22:32:48 ms2 object-replicator Error syncing partition: #012Traceback (most recent call last):#012 File "/usr/lib/pymodules/python2.6/swift/obj/replicator.py", line 392, in update#012 reclaim_age=self.reclaim_age)#012 File "/usr/lib/pymodules/python2.6/eventlet/tpool.py", line 75, in tworker#012 rv = meth(*args,**kwargs)#012 File "/usr/lib/pymodules/python2.6/swift/obj/replicator.py", line 207, in tpooled_get_hashes#012 return [22:33:08] <mark> Dec 21 22:32:53 ms2 account-replicator Skipping sdab1 as it is not mounted [22:34:27] <maplebed> that answers that. [22:34:36] <maplebed> ok, removing it from the ring now. [22:39:07] <hexmode> Jeff_Green: any final word on what the bz problem was? [22:39:12] <hexmode> just not enough mem? [22:39:50] <mark> heh [22:39:55] <mark> maplebed: the proxies have far too few workers I think [22:40:00] <mark> I increased owa1 from 8 to 64 [22:40:07] <mark> and in top it seems that lots are being usd [22:40:09] <mark> used [22:40:30] <mark> shall I increase on all in puppet? [22:40:35] <maplebed> IIRC recommendation was #workers = #cores for the proxies. [22:40:40] <mark> really? [22:41:02] <mark> well those boxes have 12 cores at least ;) [22:41:20] <maplebed> oh, sorry. 2x # cores. [22:41:27] <mark> yeah that sounds sensible [22:41:28] <maplebed> from here: http://docs.openstack.org/bexar/openstack-object-storage/admin/content/ch04s06.html#d5e1200 [22:41:34] <mark> let me enforce that in puppet then [22:41:38] <mark> based on $processorcount [22:41:39] <mark> ok? [22:42:28] <maplebed> ok [22:42:34] <gerrit-wm> New patchset: Pyoungmeister; "adding udplogging capabilites for varnish mobilez" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [22:42:38] <maplebed> the disk is now removed and teh ring files distributed. 
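Spelled out, the removal path quoted from the swift-ring-builder help above is roughly the following (builder file and device id are placeholders):

    swift-ring-builder object.builder remove d42          # failed device: drop it from the ring outright
    # for a planned decommission, drain it first instead:
    #   swift-ring-builder object.builder set_weight d42 0
    swift-ring-builder object.builder rebalance
    # then push the regenerated object.ring.gz to every proxy and storage node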
[22:43:01] <gerrit-wm> New patchset: Mark Bergsma; "Set proxy worker count to 2x # CPU cores" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1677 [22:43:21] <gerrit-wm> New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1677 [22:43:22] <gerrit-wm> Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1677 [22:46:19] <maplebed> write throughput is now wandering around 58 to 65 [22:46:30] <mark> still not great [22:46:39] <mark> hrm [22:46:47] <mark> seems like owa3 has HT turned on, while ow1/2 have it off [22:46:52] <mark> $processorcount is 24 on it [22:47:05] <mark> oh well, let's us test that too [22:47:31] <maplebed> :P [22:47:52] <maplebed> I figured it was just different hardware. [22:48:00] <mark> !log proxy worker processes increased from 8 to 24 on owa1-2, 48 on owa3 [22:48:05] <maplebed> ms3 is behaving very differently from ms1 and 2... [22:48:05] <mark> probably is just a bios setting [22:48:09] <morebots> Logged the message, Master [22:48:11] <mark> ms3 is different hardware yes [22:48:18] <mark> but I think owa1-3 are the same [22:48:25] <mark> ms3 actually has twice as many cores ;) [22:48:50] <maplebed> do you know the Right(tm) way to put the sysctl stuff into puppet? [22:49:06] <mark> there is some simple stuff in generic-definitions [22:49:17] <mark> just puts simple sysctl files in /etc/sysctl.d/ [22:49:17] <maplebed> the interesting different between ms1/2 and 3 to me is that ms3 has no iowait CPU time. [22:49:23] <maplebed> ah, good. [22:49:32] <mark> could be better [22:49:40] <mark> I once tried putting all sysctl files in facter [22:49:44] <maplebed> that'll make doing the conntrack and other tcp settings easier. [22:49:54] <mark> but then puppet was stupid, and put them all in one giant GET URL param request [22:50:10] <mark> maplebed: I think you can just include high-http-performance which already exists [22:50:13] <mark> and tune that [22:50:20] <mark> since those settings are pretty similar for the squids and varnish servers [22:50:32] <mark> probably no need to differentiate for swift there [22:51:00] <mark> btw, ms3 has twice the amount of memory vs ms1-2 [22:51:07] <mark> that's probaby your i/o wait ;) [22:51:13] <maplebed> ah. [22:51:33] <mark> 16 vs 32 GB [22:51:37] <maplebed> in theory, I should probably increase the weight of the drives in ms3 in the ring, but with only three hosts it won't actually change anything. [22:51:39] <maplebed> :( [22:52:01] <mark> at least now we can see what the difference in hardware matters [22:52:23] <mark> let's get 12-core boxes for storage servers :P [22:52:39] <mark> it can't hurt anyway [22:53:00] <Ryan_Lane> anyone looked at this yet? http://referencearchitecture.org/ [22:53:34] <mark> Ryan_Lane: that looks like what we arrived at ;) [22:53:41] <Ryan_Lane> heh [22:53:43] <mark> dell C2100, heh [22:53:57] <maplebed> the settings in high-performance-http are not the same as what the swift page recommensd. [22:54:26] <mark> you can probably put them in [22:54:40] <maplebed> oh, just add the swift stuff to the high-http settionsg? [22:54:47] <mark> yeah [22:54:50] <mark> tcp time wait stuff, right? [22:55:00] <maplebed> disables TIME_WAIT, [22:55:03] <maplebed> turns off syn cookies, [22:55:09] <maplebed> increases the conntrack table size. 
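The 2x-cores rule merged above amounts to one line in the proxy config. A hedged sketch of how a template might derive it from facter (the file layout and the exact erb are assumptions, not the merged change):

    # proxy-server.conf.erb, illustrative fragment
    [DEFAULT]
    bind_port = 8080
    workers = <%= processorcount.to_i * 2 %>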
[22:55:20] <mark> we don't run conntrack on squids and such [22:55:30] <mark> (and hopefully never will, that's quite a performance hit) [22:55:58] <mark> I did some extensive testing on that 2-3 years ago [22:56:06] <mark> it may be a bit better now [22:56:15] <mark> (or worse ;) [22:56:46] <mark> what's the qps now? [22:57:34] <mark> I guess it doesn't help very much indeed, throughput on the proxies doesn't seem higher [22:58:03] <maplebed> you didn't ever publish the performance testingc data, did you? [22:58:25] <maplebed> current throughput is about the same - 18s for 1000 urls. [22:58:48] <maplebed> (aka 55qps) [22:59:06] <mark> "publish", no, I have some notes somewhere on a wiki [22:59:28] <nagios-wm> PROBLEM - Puppet freshness on amssq53 is CRITICAL: Puppet has not run in the last 10 hours [22:59:34] <maplebed> ah well. [22:59:38] <mark> biggest result was actually that running ntpd on LVS servers halved their kpps throughput [22:59:52] <maplebed> fascinating. why? [23:00:11] <mark> didn't find out why, something with the time adjust syscalls I think [23:00:23] <maplebed> wild. [23:00:26] <mark> just stop ntpd and pps doubled [23:00:52] <mark> and back then our LVS servers really couldn't take that hit [23:01:09] <mark> it was the difference between dropping 5% of packets or not [23:03:24] <Ryan_Lane> mark: this may or may not interest you, since you like networking: http://openvswitch.org/openstack/ [23:03:25] <Ryan_Lane> :D [23:03:44] <mark> heh [23:03:49] <Ryan_Lane> new openstack project, quantum. likely to be usable in essex timeframe [23:03:53] <Ryan_Lane> replaces nova-network [23:04:01] <Ryan_Lane> uses openvswitch [23:04:11] <mark> nice [23:04:20] <Ryan_Lane> can be configured via api [23:06:55] <mark> maplebed: can we see from the Date or LM header that swift returns whether the object was read or put? [23:07:10] <mark> (or, can we put a temporary debug header in which tells us that?) [23:07:54] <gerrit-wm> New patchset: Bhartshorne; "apply swift TCP tuning settings to all high-http-performance hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1678 [23:08:20] <maplebed> not sure what you mean. [23:08:39] <mark> hmm wait [23:08:45] <mark> those tw_recycle settings [23:08:46] <maplebed> whether the 404 handler triggered you mean? [23:08:51] <mark> I now recall those violate some spec [23:08:53] <mark> (yeah) [23:09:03] <mark> I recall some clients having issues when we enabled those [23:09:07] <mark> so I had to disable that again on squids [23:09:21] <mark> so perhaps let's not reenable that on public hosts, sorry [23:09:34] <mark> there's something in the linux kernel docs about it [23:09:43] <gerrit-wm> New patchset: Pyoungmeister; "adding udplogging capabilites for varnish mobilez" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [23:10:09] <maplebed> ok, I'll make a separate swift sysctl conf. [23:10:11] <mark> looking... [23:11:25] <mark> some firewalls or proxy servers broke and then couldn't access wikipedia anymore [23:14:23] <gerrit-wm> Change abandoned: Bhartshorne; "need to do this separately for swift to avoid getting the time_wait stuff on the squids." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1678 [23:15:00] <mark> you probably also want the squid sysctl settings on swift, though [23:15:16] <mark> they're useful for many tcp connections/requests [23:15:22] <maplebed> ok, I'll pull them in. 
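What the swift-only sysctl file ends up carrying is roughly the set of knobs just discussed. An illustrative /etc/sysctl.d fragment follows; the file name and exact values are assumptions, and tcp_tw_recycle is the setting that has to stay off the public squids for the reason mark recalls:

    # /etc/sysctl.d/60-swift-performance.conf, illustrative
    net.ipv4.tcp_tw_recycle = 1        # tolerable between internal swift boxes, known to break NATed clients publicly
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_syncookies = 0
    net.ipv4.netfilter.ip_conntrack_max = 262144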
[23:16:07] <gerrit-wm> New patchset: Pyoungmeister; "adding udplogging capabilites for varnish mobilez" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [23:17:28] <mark> maplebed: so the swift docs also recommend adjusting the xfs parameters during mkfs [23:17:36] <mark> which we didn't do [23:17:42] <mark> we have the mount options, not the mkfs options [23:17:45] <mark> bigger inode size [23:18:08] <maplebed> yeah, I noticed that. (I haven't made it all the way through this performance doc yet) [23:18:21] <maplebed> we have a bunch of pretty small files though, [23:18:26] <mark> root@ms1:~# xfs_info /srv/swift-storage/sdd1/ [23:18:26] <mark> meta-data=/dev/sdd1 isize=256 agcount=4, agsize=15262336 blks [23:18:26] <mark> = sectsz=512 attr=2 [23:18:26] <mark> data = bsize=4096 blocks=61049344, imaxpct=25 [23:18:26] <mark> = sunit=0 swidth=0 blks [23:18:27] <mark> naming =version 2 bsize=4096 ascii-ci=0 [23:18:27] <mark> log =internal bsize=4096 blocks=29809, version=2 [23:18:28] <mark> = sectsz=512 sunit=0 blks, lazy-count=1 [23:18:28] <mark> realtime =none extsz=4096 blocks=0, rtextents=0 [23:19:29] <gerrit-wm> New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1676 [23:19:29] <gerrit-wm> Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1676 [23:22:16] <mark> rsyslogd is using quite a bit of cpu... [23:22:21] <mark> let me temporarily disable access logging [23:27:35] <gerrit-wm> New patchset: Asher; "fix template name for varnishncsa.init" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1679 [23:27:48] <gerrit-wm> New patchset: Hashar; "bug 32645, add testswarm to integration homepage" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1680 [23:28:46] <gerrit-wm> New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1679 [23:28:47] <gerrit-wm> Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1679 [23:30:26] <mark> lots of these in syslog: [23:30:26] <mark> Dec 21 09:52:36 ms1 object-server ERROR container update failed with 10.0.0.249:6001/sdr1 (saving for async update later): Timeout (3s) [23:30:27] <mark> Dec 21 09:52:36 ms1 object-server ERROR container update failed with 10.0.0.249:6001/sdr1 (saving for async update later): Timeout (3s) [23:30:27] <mark> Dec 21 09:52:37 ms1 object-server ERROR container update failed with 10.0.0.249:6001/sdr1 (saving for async update later): Timeout (3s) [23:30:33] <mark> all sdr1 [23:30:52] <maplebed> another bad disk? [23:31:06] <mark> maybe, maybe not... [23:31:16] <mark> ms2 is complaining about others too [23:33:23] <mark> I don't see the kernel complaining about those disks anyway [23:35:02] <gerrit-wm> New patchset: Bhartshorne; "adding recommended tcp settings to swift hosts." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1681 [23:35:15] <maplebed> mark: wanna review? [23:36:02] <mark> go ahead [23:36:19] <mark> ok, it's half past midnight here [23:36:24] <gerrit-wm> New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1681 [23:36:24] <gerrit-wm> Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1681 [23:36:31] <mark> i'm gonna call it a day soon [23:37:09] <maplebed> cool to run puppet on all the swift hosts? 
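The mkfs-time tuning mark points out was skipped is the larger inode size the swift docs recommend (the xfs_info paste above shows isize=256, the default). An illustrative invocation, with the device and mount options given as examples only:

    mkfs.xfs -f -i size=1024 /dev/sdd1      # bigger inodes so swift's xattr metadata stays inline
    mount -o noatime,nodiratime,nobarrier,logbufs=8 /dev/sdd1 /srv/swift-storage/sdd1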
[23:37:14] <mark> go ahead [23:37:20] <mark> it will just reenable access logging on ms1 [23:37:40] <maplebed> throughput hasn't changed. [23:38:07] <mark> no I don't think it matters enough [23:38:45] <mark> a lot of time is spent in sqlite libs actually [23:38:57] <mark> how many objects do we have per container now? [23:48:15] <mark> ok, good luck [23:48:20] <mark> more tomorrow [23:49:05] <maplebed> 2.5m objects in the wikipedia-commons-thumb container. [23:49:31] <maplebed> if we leave the job running on hume overnight, we'll probably have 5-8m tomorrow. [23:50:07] <mark> I think that's interesting [23:50:14] <mark> if it slows down to a crawl, that's good to know [23:50:17] <mark> we can always wipe and start over [23:50:20] <maplebed> yup. [23:50:29] <maplebed> I set up the eqiad cluster to do the hashed containers, b [23:50:38] <maplebed> but I don't think it's big enough to do a high volume test like this. [23:50:43] <mark> yeah [23:51:01] <maplebed> I think I'll try and do the hash-for-commons-only thing. [23:51:05] <mark> we have 2 ms servers free there [23:51:10] <mark> do we have an es server too? [23:51:12] <mark> that's 3 ;) [23:51:36] <maplebed> I think the ES server is still down - RobH is the perc RAID card back in it? [23:51:47] <maplebed> we could set it back up as a raid10 device and just weight it incredibly heavily. [23:52:13] <mark> well the ms servers are the same [23:52:17] <mark> so also raid 10 then :/ [23:52:36] <maplebed> hrmph. [23:53:09] <mark> i'm not impressed with these thumpers so far [23:53:21] <maplebed> I'll just stick with code for the time being. the eqiad cluster is much more of a functional testing ground with its current setup; I'm not too inclined to change it. [23:53:30] <mark> ok [23:54:00] <maplebed> after we get through the obvious stuff, [23:54:18] <maplebed> we can invite notmyname to look at our ganglia stats and configs and see if he'd be willing to give us some advice. [23:54:32] <maplebed> (he's posted bunches of stuff about swift tuning aroun the web and hangs out in #openstack) [23:54:33] <mark> alright [23:55:02] <maplebed> till tomorrow... [23:56:31] * mark yawns [23:56:32] <mark> good night
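As a footnote to the objects-per-container question: the count comes straight back as a header on a HEAD request against the container. A hedged example with placeholder endpoint, account and token:

    # read X-Container-Object-Count off the container (URL, account and token are placeholders)
    curl -sI -H 'X-Auth-Token: AUTH_tk_example' \
        http://ms-fe.example:8080/v1/AUTH_mw/wikipedia-commons-thumb | grep -i x-container-object-count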