[00:18:29] !log updating OpenStackManager to r114754 on virt0
[00:18:31] Logged the message, Master
[00:45:13] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[00:50:01] New patchset: Bhartshorne; "teaching swiftcleaner to save the files it deletes for later inspection" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4382
[00:50:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4382
[00:50:49] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4382
[00:50:52] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4382
[00:55:35] New patchset: Bhartshorne; "enabling saving deleted files in the swift cleaner" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4383
[00:55:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4383
[00:55:52] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4383
[00:55:55] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4383
[01:07:39] New patchset: Bhartshorne; "grumble. overlapped option letters." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4384
[01:07:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4384
[01:07:55] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4384
[01:07:58] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4384
[01:21:43] !log updating OpenStackManager to r114757 on virt0
[01:21:47] Logged the message, Master
[05:26:04] PROBLEM - Squid on brewster is CRITICAL: Connection refused
[05:40:46] RECOVERY - Squid on brewster is OK: TCP OK - 0.001 second response time on port 8080
[05:47:04] PROBLEM - Squid on brewster is CRITICAL: Connection refused
[06:32:22] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 202 seconds
[06:33:43] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 247 seconds
[06:33:52] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 210 seconds
[06:34:01] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 214 seconds
[06:51:21] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[06:51:21] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[07:06:21] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[07:06:21] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[07:12:57] RECOVERY - Squid on brewster is OK: TCP OK - 0.004 second response time on port 8080
[08:24:21] PROBLEM - Puppet freshness on search1022 is CRITICAL: Puppet has not run in the last 10 hours
[08:24:48] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:28:51] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[08:39:21] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours
[08:45:48] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:47:45] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[09:11:09] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:13:06] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s)
[09:19:33] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 196 seconds
[09:19:51] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 203 seconds
[09:51:21] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[09:55:24] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds
[09:55:33] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds
[10:46:24] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[11:06:52] apergos: wanna review my "change master" mysql query, re: reslaving es-* ?
[11:07:14] oookkkkk
[11:07:37] (that's me answerign with some skepticism about review abilities, but yeah I'll look at it)
[11:08:34] so i followed the steps on External_storage#Making_a_new_slave_using_snapshots on es1004
[11:08:42] rsyncing from es1001
[11:08:45] and that finished now
[11:08:51] right
[11:09:12] so the next step is cat /a/slave_status_YYYY-MM-DD.txt
[11:09:20] uh huh
[11:09:34] es1004: /a/slave_status_2011-04-04.txt
[11:09:48] this file has been created by leslie on es1002 though
[11:10:01] and we have been rsyncing from the same snapshot
[11:10:12] yeah I thought that was the case
[11:10:22] since ther were two rsyncs going and only the one snap mounted
[11:10:33] mutante: jenkins still alive :-D
[11:10:34] now I of course have a separate snap mounted :-P anyways
[11:10:41] so i need to relace the values for master_host, master_user, password , log_file, log_pos
[11:10:51] uh huh
[11:11:23] the one i would execute is : cat /root/query.sql
[11:12:02] the password 'xxxx' i replaced with the one i found as "repl-password" on fenari
[11:12:39] note though, the master_host IP resolves to es3 , not es1001
[11:12:48] is that really right?
[11:14:11] so lemme do what I do (which is make sure I undrstand how these commands are supposed to work)
[11:14:18] then I'll chime in in a bit
[11:14:18] es1001 has that "intermediate master" status
[11:14:29] uh huh
[11:14:39] ok
[11:14:43] thanks
[11:14:46] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 8 seconds
[11:15:21] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds
[11:20:54] RECOVERY - MySQL replication status on es1004 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : s
[11:39:57] !log restarting lsearchd on search3... again...
[11:40:00] Logged the message, notpeter
[11:51:17] !lop es1004 - rsync was finished, deleted all binlogs from old host, mysqld_safe& , but did not "change master.." and "start slave" (see mail)
[11:51:23] !log es1004 - rsync was finished, deleted all binlogs from old host, mysqld_safe& , but did not "change master.." and "start slave" (see mail)
[11:51:25] Logged the message, Master
[12:01:19] New review: Dzahn; "looks good." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/4380
[12:10:38] New review: Dzahn; "should fix RT #2777 and BZ #35709. "SSL certificate problem, verify that the CA cert is OK." on git ..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/4334
[12:11:44] we should make the bot take popular tyops
[12:11:47] typos!
[12:12:12] hehe, that always happens
[12:14:58] imho we should set SSLCACertificateFile as well on all SSL hosts, besides SSLCertificateFile SSLCertificateKeyFile
[12:35:27] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[12:37:18] checking spence
[12:58:45] New patchset: Hashar; "testwarm: set innodb buffer pool size to 256M" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4395
[12:59:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4395
[12:59:19] that one is for you mutante :-)
[13:00:48] RECOVERY - Puppet freshness on spence is OK: puppet ran at Fri Apr 6 13:00:34 UTC 2012
[13:02:59] New review: Hashar; "Some context:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4395
[13:14:50] New patchset: Mark Bergsma; "Set SO_REUSEADDR in varnishhtcpd, to fix conflict with ganglia HTCP monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4396
[13:15:05] New patchset: Mark Bergsma; "Use Upstart for varnishhtcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4397
[13:15:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4396
[13:15:19] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/4397
[13:16:26] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4396
[13:16:28] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4396
[13:17:35] New patchset: Mark Bergsma; "Use Upstart for varnishhtcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4397
[13:17:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4397
[13:18:00] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4397
[13:18:03] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4397
[13:20:14] New patchset: Mark Bergsma; "upstart_job will overwrite the init.d file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4398
[13:20:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4398
[13:20:38] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4398
[13:20:40] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4398
[13:22:53] New patchset: Mark Bergsma; "install => true doesn't work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4399
[13:23:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4399
[13:23:17] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4399
[13:23:20] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4399
[13:26:46] New patchset: Hashar; "testswarm: log slow queries (bug 35028)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4400
[13:27:02] New patchset: Hashar; "testwarm: set innodb buffer pool size to 256M" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4395
[13:27:15] OH NO
[13:27:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4400
[13:27:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4395
[13:27:47] stupid git-review rebased my change :-/
[13:28:17] New review: Hashar; "second patchset is just a rebase done automatically by git-review :-/" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4395
[13:30:03] <^demon> hashar: `git config --global alias.r 'review -R'`
[13:30:12] <^demon> I keep forgetting the -R, so I aliased it :)
[13:30:30] I use git-review
[13:30:36] from their master
[13:31:34] <^demon> ah
[13:31:37] which means I can use defaultrebase=0 in .gitreview!!!!
[13:32:03] <^demon> git-review doesn't explode on unknown items in .gitreview does it?
[13:32:22] <^demon> If not, I'd say go ahead and start preemptively adding it.
[13:33:51] here is the change set : https://review.openstack.org/#change,5784
[13:34:09] I don't think It will cause any trouble with previous versions of git-review
[13:36:13] <^demon> Just tried it out, doesn't seem to break anything.
[13:36:20] \O/
[13:36:54] <^demon> So between disabling this behavior and the coming "automatic rebase" in gerrit, we've almost fix the annoying rebases for simple cases.
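The two git-review workarounds discussed above (^demon's alias to skip the automatic rebase, and hashar's `defaultrebase = 0` knob in `.gitreview`) can be sketched in a few shell commands; the repository path below is a hypothetical stand-in, not an actual Wikimedia checkout:

```shell
# Sketch of the workarounds discussed above (demo-repo is hypothetical).
# 1) ^demon's alias, so `git r` runs `git review -R` (no automatic rebase):
#      git config --global alias.r 'review -R'
# 2) hashar's approach: preemptively add defaultrebase = 0 to .gitreview,
#    which newer git-review honors and older versions simply ignore.
mkdir -p demo-repo
printf '[gerrit]\nhost=gerrit.wikimedia.org\nproject=operations/puppet.git\n' > demo-repo/.gitreview
echo 'defaultrebase = 0' >> demo-repo/.gitreview
cat demo-repo/.gitreview
```

The unknown-key tolerance is what ^demon verifies a few lines up ("doesn't seem to break anything"), which is why adding the key early is safe.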
[13:37:11] for i in extensions/*/.gitreview; do echo defaultrebase = 0 >> $i; done;
[13:37:47] the last thing we will have to fix are the annoying merge conflicts in RELEASE-NOTES
[13:38:04] I think we should use the RELEASE-NOTES to explain stuff
[13:38:25] and have the list of bugs fixed automatically generated on release and added to something like a ChangeLog file
[13:40:50] <^demon> Yeah, we can bikeshed over that a bit more on the list and come up with something.
[13:41:06] <^demon> I think we all agree that release-notes as we do them now is going to become a rebase nightmare.
[13:44:08] I would not say it is a nightmare
[13:44:12] but that is surely annoying
[13:45:37] you could make it not nightmare by having each entry in it's own file and then a build step or jenkins job to aggregate them into one file
[13:46:02] that would be fun
[13:46:07] but I prefer enforcing firstline standards and generating automatically
[13:46:08] there is no way my changes that eliminate that many lines can be right mutante
[13:46:11] i must be doing it wrong
[13:46:13] =P
[13:46:19] <^demon> jeremyb: We've also discussed using something like "Release-Notes: " in the commit message footer and parse those out.
[13:46:37] ^demon: for something that has to be more than one line?
[13:46:44] i haven't watched the bikeshedding
[13:47:24] <^demon> Well the standard for git commit footers is to prefix your info with something like "Change-Id:" or "Signed-Off-By:"
[13:47:41] <^demon> We could do something similar for Release-Notes:
[13:47:56] right. but in what case could we not just use the first line of the msg?
[13:48:10] <^demon> Well not all commits deserve a release notes entry.
[13:48:20] <^demon> "Coding style fixes -- braces and indentation" probably doesn't.
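The one-entry-per-file idea floated above (each release-notes entry in its own file, aggregated by a build step so parallel changes never edit the same file and cannot conflict on merge) might look like this minimal sketch; the `release-notes.d/` layout and file names are invented for illustration:

```shell
# Hypothetical sketch of per-entry release notes: each change adds its own
# small file, and a build/Jenkins step concatenates them into RELEASE-NOTES.
# Two commits touching different files under release-notes.d/ can never
# produce a merge conflict in RELEASE-NOTES itself.
mkdir -p release-notes.d
echo '* (bug 35028) testswarm: log slow queries' > release-notes.d/bug35028.txt
echo '* Use Upstart for varnishhtcpd'            > release-notes.d/varnishhtcpd.txt
cat release-notes.d/*.txt > RELEASE-NOTES
cat RELEASE-NOTES
```

The footer-based alternative ^demon mentions (`Release-Notes:` parsed out of commit messages, like `Change-Id:`) trades this extra file per change for a parsing step at release time.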
[13:48:39] hah
[13:48:56] so, we could have a minor flag of some kind
[13:48:59] either in footer
[13:49:17] of course, git doesnt wanna take my review for productoin
[13:49:17] Tim argument is that the commit summary is intended to other developers where as the release notes are for end users
[13:49:18] or "[m] Coding style fixes -- braces and indentation"
[13:49:19] which make sense
[13:49:20] ;_;
[13:49:39] <^demon> I think training people to "include the Release-Notes: if you want them" is easier than saying "Mark changes as minor when they shouldn't clutter release notes."
[13:49:52] there is proposal for an alternate git merge algorithm that could potentially merge release notes :-D
[13:49:58] New patchset: RobH; "endless corrections of tabulation and spacing, updating dhcp related files for install server Change-Id: I0a41915af9982819d19766df1747daff6f9f82bd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4374
[13:50:10] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/4374
[13:50:11] the fact i corrected a git error just now on my own frightens me.
[13:50:14] RobH: still on that one?!
[13:50:24] i completely redid the file mgmt to use recursion
[13:50:33] nicely took out like 30 lines of shite
[13:50:50] <^demon> RobH: Would you mind approving a puppet change for me? It's just a capitalization fix.
[13:51:03] hrmm, now i have an entirely different git error
[13:51:42] ^demon is it already in gerrit?
[13:51:46] <^demon> Yep. https://gerrit.wikimedia.org/r/#change,4357
[13:52:14] ^demon: i think no matter what the solution is people will forget to flag for release notes (or forget to flag for exclusion). and we can't change commit msgs post merge. at least not without lots of headache. so idk what to do
[13:52:18] heh, that is quite the change!
[13:52:37] <^demon> Yeah. Took me longer to diagnose than to fix ;-)
[13:52:45] <^demon> By a factor of about 10:1.
[13:52:46] New review: RobH; "yes, capitalization matters" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4357
[13:52:49] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4357
[13:53:28] i ahve crazy erorrs, i imagine i need to rebase or something since my checkout was old?
[13:53:34] cuz lint is bitching about things that i didnt touch.
[13:53:44] https://gerrit.wikimedia.org/r/#change,4374
[13:54:17] (anyone can feel free to advise, as everyone prolly knows more about git/gerrit than me ;)
[13:54:37] <^demon> Files you didn't touch? Eww
[13:54:42] RobH: source => "puppet:///files/dhcpd".
[13:54:46] I think that's your problem
[13:54:48] New patchset: Mark Bergsma; "Actually drop privileges; remove old unused code" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4408
[13:54:56] shouldn't be .
[13:54:59] I think should be ;
[13:55:03] well shit.
[13:55:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4408
[13:55:24] check that in with a --amend
[13:55:25] notpeter: or ,
[13:55:27] and it should work
[13:55:34] heh, that patch set has 4 ammends
[13:55:38] yeah, although ; is proper
[13:55:43] on that i am clear how to do, heh, oh is it?
[13:55:48] i should do it properly.
[13:55:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4408
[13:55:53] <^demon> RobH: We've got a couple in mediawiki with 8 or so patchsets.
[13:55:53] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4408
[13:56:19] well, all my patchsets except one are due to typos and tabulation
[13:56:20] RobH: the gerrit errors are hella misleading. usually you only need to pay attention to the last line
[13:56:26] bad tabbing makes mark sad
[13:56:47] Syntax error at '.'; expected '}' at ./manifests/misc/install-server.pp:185 err: Try 'puppet help parser validate' for usage
[13:56:50] New patchset: RobH; "endless corrections of tabulation and spacing, updating dhcp related files for install server Change-Id: I0a41915af9982819d19766df1747daff6f9f82bd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4374
[13:57:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4374
[13:57:12] notpeter: good to know!
[13:57:18] and indeed, it is much happier now
[13:57:21] ja
[13:57:26] says it passed lint now
[13:57:29] mutante: what apache file did you want me to peek at?
[13:57:40] i am happy to do so
[13:57:44] ah, i didnt even scroll down:)
[13:57:48] cool
[13:58:08] !change 4334 | RobH
[13:58:08] RobH: https://gerrit.wikimedia.org/r/4334
[13:58:19] oh, this is apache in gerrit, i assumed you meant in plain old httpd heh
[13:58:42] it just changes a line in the apache config file though
[13:58:55] plain file in file/apache/sites
[13:59:16] so the pem file is copied down by an unrelated manifest that i dont need to hunt down and confirm exists?
[13:59:25] i assume yes, but figured it would ask ;]
[13:59:29] right
[13:59:33] coolness
[13:59:35] i checked it is on the server
[13:59:45] I wouldn't merge that rob
[14:00:02] you mean my change or the apache one for daniel?
[14:00:03] it looks full of file conflicts and potential for data loss
[14:00:07] your dhcp stuff
[14:00:17] ok, can you notate the change?
[14:00:35] no i'm busy
[14:00:53] but you're doing a recursive file type on a directory you're also putting files in?
[14:01:10] why purge true?
[14:01:14] yes, palcing the files via the recurse type
[14:01:22] the purge ditches any local cruft folks put in they shouldnt
[14:01:30] is that really necessary?
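The lint failure notpeter diagnoses above (`Syntax error at '.'; expected '}'`) comes from a stray `.` after an attribute value, where Puppet expects `,` (or `;`) as the separator. A small shell sketch of the fix; the resource body is an illustrative reconstruction modeled on the snippet quoted in channel, not RobH's actual manifest:

```shell
# Illustrative reconstruction of the parse error discussed above: a '.'
# terminating an attribute, where Puppet wants ',' (or ';').
cat > install-server-fixed.pp <<'EOF'
file { "/etc/dhcp3":
    # was:  source => "puppet:///files/dhcpd".
    # that trailing '.' triggered: Syntax error at '.'; expected '}'
    source  => "puppet:///files/dhcpd",
    recurse => true,
}
EOF
# Re-running `puppet parser validate install-server-fixed.pp` (as the error
# message itself suggests) would now pass; here we just show the fixed line.
grep 'source' install-server-fixed.pp
```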
[14:01:45] it seems like a good practice since folks may not be used to editing this stuff in puppet
[14:01:55] but if the purge is gone, would that alliviate the issue?
[14:02:23] you don't want to mention all the files individually for subscribe
[14:02:27] mutante: im not sure of our proper protocol here, would i do public and submit, or just publish?
[14:02:38] (for your patch)
[14:03:06] mark: just file "directory" and subscribe "directory" ?
[14:03:17] yes that should work
[14:03:30] so i will change that and dump the purge
[14:03:54] though i like the purge, its bold ;]
[14:03:55] i'll check again when you've done that
[14:04:05] RobH: if you want to change the "Verified" and "Code Review" values, "Publish and Submit". If you want _only_ text comments, ""Publish Comments"
[14:04:31] mutante: yea but thats my question, which is what i should do. i assume publish and submit, as it then saves you from self-submit
[14:04:36] which i imagine is the point of the code review?
[14:05:06] but then its goign to be pending a push on sockpuppet, which then you take care of right?
[14:05:23] this is Ops, right? https://rt.wikimedia.org/Ticket/Display.html?id=2783
[14:05:35] i wouldnt't mind a merge, yeah:). so if it looks ok, +2 and Publish and Submit. then one of use, i guess by protocol the one who clicked +2 , should also merge on sockpuppet
[14:05:53] New review: RobH; "simple enough change, looks good to me." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4334
[14:05:55] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4334
[14:05:59] ok, well, this is me passing that buck back to you
[14:06:05] RobH: i'm happy to do the sockpuppet part
[14:06:05] merge your own change on sockpuppet ;]
[14:06:09] heh
[14:06:11] hexmode: yes
[14:06:13] thx
[14:06:35] RobH: tyvm :)
[14:06:41] hexmode: those redirects live in an ops only area in /home/wikipedia/conf/httpd
[14:06:48] you can of course read them on NOC
[14:06:51] but not edit
[14:07:06] someday those will be in gerrit, but not until someone takes a ton of time to clean them up
[14:07:18] messing with the cluster apache files is like playing jenga.
[14:07:38] sounds scary
[14:07:59] RobH: so, I have a primative bz4 package... I need to spend some time this w/e working on puppetizing it
[14:08:07] <^demon> Now I have an image of RobH shouting "jenga" when he breaks the site.
[14:08:21] i have not had to yell that in years.
[14:08:27] i hope that doesn't jinx me.
[14:08:42] I'm really surprised that no one has packaged bz4 yet
[14:08:45] <^demon> Oh man, you said that on a Friday too.
[14:08:47] we could have a live, in person jenga game
[14:09:02] <^demon> jeremyb: I'd hate to be the person on the bottom of the tower.
[14:09:40] hexmode: so I want to help, but not sure how useful i will be, as when i am doing puppet stuff like now I am totally letting a shitton of other things slide
[14:09:40] ^demon: so... 06 13:52:13 < jeremyb> ^demon: i think no matter what the solution is people will forget to flag for release notes (or forget to flag for exclusion). and we can't change commit msgs post merge. at least not without lots of headache. so idk what to do
[14:09:46] like every single eqiad ticket =P
[14:10:03] ^demon: don't forget about all the people in the basement!
[14:10:24] <^demon> jeremyb: I don't know either. I'm willing to let everyone else bikeshed over it :)
[14:10:25] RobH: just giving you an update, not asking for help :) I'll just randomly bang on people here and in -labs ;)
[14:10:48] awesomesauce
[14:10:54] jeremyb: you are more than welcome to comment on wikitech-l :-)))
[14:10:59] jeremyb: that is a fun place
[14:11:09] hashar: thread too long...
[14:11:18] too many lists
[14:11:48] same issue there
[14:11:54] I have so many mail that I end up using two different clients
[14:12:21] wikitech-l, mediawiki-l, wikidata-l. plus at least a few private lists. at least no foundation-l
[14:13:18] hashar: seen notmuch?
[14:13:18] notmuch ?
[14:13:18] sorry I don't understand
[14:13:36] http://packages.debian.org/notmuch
[14:13:54] !log manganese (gerrit) now sends SSL CA certificate on https, (curl -vvv says verify ok), should resolve [[RT:2777]] and [[BZ:35709]]
[14:13:56] Logged the message, Master
[14:14:05] RobH: <- looks good live
[14:14:23] jeremyb: well I have them filtered nicely
[14:14:25] hexmode: ^
[14:14:34] * jeremyb waits for a slow identi.ca
[14:14:39] yay
[14:14:41] jeremyb: I am just missing a way to dynamically create sub folders based on rules like X-Gerrit-Change-ID
[14:15:18] mutante: tyvm
[14:15:40] oh, i know someone that's done dynamic creation by list name. should be similar i guess
[14:16:09] I might end up writing my own thunderbird extension to do that
[14:17:23] hashar: http://identi.ca/notice/92346721
[14:17:40] oh yeah mutt
[14:18:26] would have to learn that one day
[14:18:55] but then people will be staring at me
[14:20:06] paravoid won't stare ;)
[14:22:00] * RobH stares at hashar
[14:22:22] * hashar hands RobH a few gerrit changes to review :D
[14:22:31] cant, busy staring
[14:22:45] New patchset: RobH; "endless corrections of tabulation and spacing, updating dhcp related files for install server Change-Id: I0a41915af9982819d19766df1747daff6f9f82bd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4374
[14:22:49] who mentioned 8 patch sets on a change?
[14:22:52] im closing in on it.
[14:22:54] <^demon> Me :)
[14:23:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4374
[14:23:31] <^demon> RobH: https://gerrit.wikimedia.org/r/#change,3363
[14:24:06] i am sure there will be a mistake in my work
[14:24:09] and i shall also hit 8
[14:24:35] http://upload.wikimedia.org/wikipedia/commons/6/64/Time_100_Jimmy_Wales_stares_and_grins.jpg \o/
[14:24:38] coffee break
[14:24:41] brback
[14:25:08] <^demon> Upgrading to 2.3 is going to be cool. It makes the "Drafts" easier to use.
[14:25:17] <^demon> That'll be a nice user-facing feature.
[14:26:10] mark: whenever you have a moment, https://gerrit.wikimedia.org/r/#change,4374
[14:26:57] hashar: http://www.flickr.com/photos/williambrawley/3277223827/
[14:27:04] the past three days have had more puppet style guidelines pushed into my eyeballs than the past three months.
[14:27:18] will check in a bit
[14:27:33] ok, im gonna run down the street, insulin pickup time at the Rx.
[14:27:36] back shortly.
[14:27:42] <^demon> RobH: gerrit has made the trailing whitespace in mediawiki painfully obvious.
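mark's advice earlier in the hour (subscribe to the managed directory rather than listing every file, and drop `purge => true` to avoid deleting unmanaged local files) might look like the following sketch; the class body, service name, and paths are invented stand-ins, not the actual contents of change 4374:

```shell
# Hypothetical sketch of the recursive-directory pattern discussed above.
# One file resource manages the whole directory; one subscribe on the
# directory (not one per file) restarts the service on any change.
cat > dhcp-sketch.pp <<'EOF'
file { "/etc/dhcp3":
    source  => "puppet:///files/dhcpd",
    recurse => true,
    # purge => true would also delete unmanaged local files in /etc/dhcp3;
    # mark flagged that as potential data loss, so it is left out here.
}

service { "dhcp3-server":
    ensure    => running,
    subscribe => File["/etc/dhcp3"],
}
EOF
grep -c 'subscribe' dhcp-sketch.pp
```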
[14:28:02] it does that to everything, i know understand why mark is hard on excess tabs and whitespace
[14:28:22] i even had to PM ben/maplebed to change my view on vim whitespace markers.
[14:28:24] heh
[14:28:43] now know why even =P
[14:30:40] why is every nrpe check_procs a separate check_command?
[14:31:55] because we cant pass arguments why the nrpe, it is disabled for security reasons, so it can just take hardcoded stuff
[14:32:03] via
[14:32:17] to NRPE
[14:32:21] but you can to checkcommands, right?
[14:32:26] like all other checkcommands?
[14:34:11] eh, yeah, there need to be matching checkcommands it the nrpe.local.cfg
[14:34:19] sure
[14:34:31] but why does there ALSO need to be a matching checkcommand in the nagios config?
[14:34:36] that seems pointless
[14:34:39] unless I'm missing something
[14:36:56] New patchset: Mark Bergsma; "Setup process monitoring for varnishhtcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4412
[14:36:57] mutante: ^^ why wouldn't this work?
[14:37:08] assuming I make a generic nrpe_check_procs checkcommand, once
[14:37:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4412
[14:37:12] mark -- the matching checkcommand is in the server side nagios config?
[14:37:22] not yet but it will be
[14:37:39] and the nrpe.local.cfg is client side?
[14:37:42] yes
[14:37:54] note that I also used /etc/nagios/nrpe.d instead of nrpe_local
[14:37:58] sounds like the latter is a janky ACL?
[14:38:06] no need to have every possible NRPE command on every server
[14:41:38] New patchset: Mark Bergsma; "Add one generic NRPE check_procs command to rule^replace them all..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4413
[14:41:41] there
[14:41:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4413
[14:41:58] speak now, or forever be more productive.
[14:42:30] sweet indifference :-P
[14:42:43] i like simpler, sounds like that's what you've done, so I will upvote.
[14:42:57] and I would like to make it even nicer
[14:43:00] is there any reason to not setup a seperate root crontab list for labs?
[14:43:09] by having a nice definition for installing a separate NRPE file for each NRPE service we want to monitor
[14:43:24] notpeter: raising that b/c of the cronspam topic of the AM?
[14:43:43] yeah
[14:43:58] so you can say nrpe::check { "check_varnishhtcpd": command => "/usr/lib/nagios/plugins/check_procs -c 1:1 -u varnishhtcpd -a 'varnishhtcpd worker'" }
[14:44:04] for what I've now done with a manual file in the above
[14:44:08] or just have the mail go to people who are on the project
[14:44:25] notpeter: that one was just insane, and I finally got annoyed enough to fix it
[14:44:43] thing is, it wasn't project-specific--it was ldap stuff on the nfs server
[14:44:58] hurray....
[14:45:17] wel, I gues I meant "only have it go to people who have access to that instance"
[14:45:25] that way there's some ownership/accountability
[14:45:36] i agree re. projects
[14:45:52] mark: that looks nice! i was looking for a literal problem in the varnishhtcpd check. i can use that to replace others
[14:45:53] but in this case I think it was labs infrastructure
[14:46:35] i'm making it even nicer now
[14:46:39] gimme a few mins
[14:47:12] Jeff_Green: yeah, but the by-project thing might reduce the "somebody else's problem" effect
[14:47:20] even on infrastructure
[14:47:47] yeah
[14:47:54] i'm not by any means disagreeing with you
[14:48:02] i'm just not sure in this case whose project it would be?
[14:48:03] WHY ARE YOU DISAGREEING WITH ME?!?!?!
[14:48:17] anyone who has access?
[14:48:19] SEND ME MORE COFFEE DAMMIT.
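The per-service NRPE file mark describes (one fragment under `/etc/nagios/nrpe.d/`, with the full command hardcoded because NRPE argument passing is disabled for security) can be sketched as below; the command string is the one quoted in mark's `nrpe::check` example, and writing to a local `nrpe.d/` directory here is just for demonstration:

```shell
# Sketch of the client-side fragment mark's nrpe::check define would install:
# one file per monitored service, full command hardcoded (no remote args).
# On a real host this would live in /etc/nagios/nrpe.d/; we use ./nrpe.d here.
mkdir -p nrpe.d
cat > nrpe.d/check_varnishhtcpd.cfg <<'EOF'
command[check_varnishhtcpd]=/usr/lib/nagios/plugins/check_procs -c 1:1 -u varnishhtcpd -a 'varnishhtcpd worker'
EOF
cat nrpe.d/check_varnishhtcpd.cfg
```

Server-side, a single generic `nrpe_check_procs` checkcommand (change 4413) invokes these by name, which is what removes the need for one Nagios checkcommand per check.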
[14:48:23] MOAR
[14:48:38] * hashar sends coffee to everyone
[14:48:40] I'm at a sweet hipster coffee shop, actually
[14:48:48] I rode my fixxie here :)
[14:48:56] another approach would be to revert insane cronjobs
[14:49:09] yeah
[14:49:26] or feed them to a script that does a git history and pummels the last commiter
[14:49:29] I mean, the real problem is that it's a tragedy of the commons
[14:49:40] a tragicomedy, if you will
[14:49:56] tragicommondy
[14:52:24] any op willing to review / merge my pending changes in puppet ? :D
[14:52:33] most of them are trivial ones. List at https://gerrit.wikimedia.org/r/#q,owner:hashar+project:operations/puppet+status:open+branch:production,n,z
[14:56:28] Jeff_Green: http://upload.wikimedia.org/wikipedia/commons/6/60/Turkishcoffee....jpg
[14:56:34] New patchset: Demon; "Squashing 11 local commit for svn2git rules and such." [operations/software] (master) - https://gerrit.wikimedia.org/r/4415
[14:56:54] apergos: ooh, that looks tasty
[14:57:17] apergos: I konw what I'm having with lunch!
[14:57:44] New review: Dzahn; "well, like it says. minimmal git cli config, error description makes sense" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4295
[14:57:47] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4295
[15:00:23] frappes are big around here but meh
[15:00:34] New patchset: Mark Bergsma; "Two new definitions for dealing with NRPE checks in a nice way, monitor varnishhtcpd with it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4416
[15:00:43] mutante: check change 4416, how do you like that?
[15:00:44] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/4416
[15:00:57] after I fix syntax errors ;)
[15:01:21] heh,ok
[15:01:51] I also have two changes to testswarm mysql configuration (on gallium) https://gerrit.wikimedia.org/r/4395 https://gerrit.wikimedia.org/r/4400
[15:02:01] diederik: /j #wikimedia-tech ?
[15:02:07] one just describe in puppet what is already in production, the other allow slow query logs :-)
[15:06:26] why are there two gerrit bots...
[15:06:35] New patchset: Mark Bergsma; "Two new definitions for dealing with NRPE checks in a nice way, monitor varnishhtcpd with it" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4416
[15:06:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4416
[15:07:08] mutante: check now
[15:08:40] mark: i was already looking. it looks really cool. one comment on this " Command run by NRPE, e.g. "/usr/lib/nagios/plugins/check_procs -c 1:1 -C varnishtcpd""
[15:08:54] sometimes we wanted to use -c and sometimes -a with check_procs
[15:09:02] I know
[15:09:05] so do use that
[15:09:10] look how I did it in varnish.pp
[15:09:12] it's not like this example
[15:11:08] Jeff_Green: to battle the cron spam from labs, we're gonna setup a special labs mail relay
[15:11:15] which sends it to the instance owners instead of to us :P
[15:12:27] i see :) yes. i shall use that to replace other checks
[15:12:30] notpeter: have a look at https://gerrit.wikimedia.org/r/#change,4416,patchset=2 as well if you want
[15:12:45] mark: On adding the tftp pxelinux.cfg files. Would you suggest appending them to the misc::install-server::tftp-server class, or as a subclass of that since they are technically distro specific?
[15:12:54] i assume the latter, but i wanted to get your take.
[15:12:55] a subclass [15:13:02] cool [15:13:28] will do misc::install-server::tftp-server::distroname [15:14:04] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4416 [15:14:08] New review: Dzahn; "looks cool, elegant way to reduce the number of check command definitions and i also like how we don..." [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/4416 [15:14:09] distroname? [15:14:17] lucid. [15:14:19] just do 'ubuntu-pxeboot' or something like that [15:14:24] with all distros in one class [15:14:32] or no [15:14:34] the boot.txt is populated automatically? [15:14:37] just make one big tftpboot directory [15:14:40] and use one recursive dir [15:14:49] much easier [15:14:57] mark: are you going to have a scoping issue with [15:14:57] https://gerrit.wikimedia.org/r/#change,4416,patchset=2 [15:14:59] er [15:15:08] monitor_service{ $title: [15:15:21] yeah I might [15:15:26] in that case I'll have to make that ::monitor_service [15:15:29] yeah [15:15:37] but I mean, do you like the general idea of this? [15:15:48] I never understood why you all were adding nrpe checkcommands all the time [15:15:52] the tftp directory has each distro listed then subdirectories for the pxelinux.cfg [15:16:11] not sure how to best accomplish what you are saying since those individual distro directories have files that differ [15:16:14] RobH: just replicate that exactly under files/tftpboot/ [15:16:40] the /tftpboot has edgy-installer, lucid-installer, etc... [15:16:43] mark: yep! [15:16:48] RobH: yes [15:16:49] so? [15:17:04] remove all the distros we no longer use, then put what remains exactly in git [15:17:08] so then puppet will have to replicate each of those directories in the file structure? 
[15:17:10] and add one file resource to put that there [15:17:11] mark: is a very good bit of cleanup [15:17:13] yes [15:17:45] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4412 [15:17:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4412 [15:17:53] ok, on removing old distros, i imagine i should actually remove all the stuff, not just the tftp directories yes? [15:18:04] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4413 [15:18:07] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4416 [15:18:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4413 [15:18:13] what do you mean by all the stuff? [15:18:38] i mean if we no longer use karmic anywhere, and wont be installing it, should I take the time to remove all its data from elsewhere on brewster? [15:18:51] what data? [15:18:54] like in /srv/wikimedia/conf/distributions [15:18:57] no [15:18:59] don't touch that [15:19:17] so just remove the install data, dont bother with apt data located elsewhere [15:19:18] that's the apt repositories, they are unrelated [15:19:23] indeed [15:19:25] we might need that some day [15:19:31] ok [15:19:40] we only install hardy and lucid presently right? [15:19:43] yes [15:19:50] good enough, thanks [15:20:01] mark: eh? it says i merged it, but i didnt [15:20:11] I did [15:20:33] ok. it had my name next to it in Gerrit all of a sudden [15:21:16] New patchset: Mark Bergsma; "I'm proud of you Peter." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4421 [15:21:31] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4421 [15:21:51] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4421 [15:21:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4421 [15:22:31] mark: re. labs cronspam relay, yay! [15:23:45] New patchset: Mark Bergsma; "Need to qualify one more reference" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4423 [15:24:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4423 [15:24:11] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4423 [15:24:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4423 [15:24:50] :) [15:26:24] New patchset: Mark Bergsma; "Fix erroneous }" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4425 [15:26:36] fwiw gmail is considerably snappier now that i've purged ~100K messages [15:26:39] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4425 [15:26:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4425 [15:26:41] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4425 [15:28:39] mark: on brewster ubuntu-installer is linked to hardy-installer, shouldnt that be lucid to force lucid to be default install at this time? [15:28:56] im puppetizing that link, so i ask. [15:30:48] New patchset: Mark Bergsma; "Add NRPE dependencies" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4426 [15:30:58] nevermind ubuntu-installer [15:31:03] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4426 [15:31:03] that's done explicitly these days [15:31:17] but that is what sets the default install right? [15:31:18] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4426 [15:31:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4426 [15:31:25] I don't believe so [15:31:33] oh, then wtf does it do =/ [15:31:45] dhcpd.conf sets that [15:32:18] ahh, indeed it does [15:32:21] i can see it now. [15:32:31] So that still begs the question, what exactly is that soft link for? [15:32:41] nothing anymore [15:32:43] old cruft [15:32:50] ahh, so i can yank it out right? [15:32:53] yes [15:32:57] cool, thank you [15:39:38] mark: i have a couple .php files, imported from an existing small webapp, mw statistics stuff, i want them in gerrit to let volunteers enhance them, and put them on a labs instance. there would also be labs users with access to the instance it is running on. since there arent too many files, i would just like to put them in puppet or manage dir recursive. ok to merge these in operations/puppet, when it stays in test branch? 
if not, should i [15:40:11] no, that's really bad practice [15:40:32] you should probably make a separate git repository for it [15:40:38] or even better, a deb package [15:40:44] a repo which can also make a deb package [15:41:07] yea, option b) was "tell puppet to clone from another repo" and option c) .deb package [15:41:10] we should not bloat the puppet repository with each and every app we're too lazy to make packages for [15:41:58] so how would i go for creating a new repo that is a "project" in gerrit's terms [15:42:26] it's not really operations, so I wouldn't put it under operations/ [15:42:39] although [15:42:47] * Jeff_Green gets interested [15:42:49] just make operations/debs/packagename [15:43:05] you can import your php files in there, and also add the debian package dir in the same repo [15:43:11] PROBLEM - RAID on db40 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:43:13] that's easiest now I guess [15:43:20] you can do that by ssh'ing into gerrit [15:43:23] the gerrit port [15:43:38] ssh dzahn@gerrit.wikimedia.org:29whatever gerrit --help [15:43:48] there's a command to make a new repo [15:44:09] cool, thanks! [15:44:11] hrmm, so the changes i am making now are dependent on the changes I already made. when i got to commit, i just keep working in my local non production branch and do the rest normally right? [15:44:26] push for review, etc... they just cannot merge until my other changes merge. [15:44:29] RobH: yes [15:44:41] thought so but wanted to ask before i ran into potential issues [15:44:46] thank ya [15:44:49] <^demon> mark, others: `ssh -p 29418 gerrit.wikimedia.org gerrit create-project --name=foo/bar.git --parent=foo.git` [15:45:02] yeah that [15:45:14] you can probably modify the description in the web interface [15:45:43] <^demon> Yep. 
[15:46:11] so use --name=operations/debs/appname.git --parent=operations/debs here [15:47:23] RECOVERY - RAID on db40 is OK: OK: 1 logical device(s) checked [15:47:24] thanks again mark and ^demon [15:51:16] Change abandoned: Hashar; "After discussion with Ryan Lane this week, it seems we want to separate the shell login and the web ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4166 [15:52:20] New patchset: RobH; "tftp directory with remote recursive set, including all the serial configuration files for tftpboot" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4429 [15:52:32] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/4429 [15:53:08] robh: ssd's and adapters arrived and resolved [15:53:18] RobH: cool, but where are the tftpboot images? [15:53:44] you mean the stuff in ubuntu-installer under each distro dir? [15:53:48] yes [15:53:53] is that what you meant? [15:53:56] I assumed those were auto populated when you added them, i misunderstood [15:54:08] hmm [15:54:09] i thought only the manual add was the serial configurations [15:54:14] they might make the repo quite big though [15:54:31] how big are those files in total? [15:54:33] i suppose they are custom images, but are generated on brewster right? [15:54:47] they're generated by ubuntu [15:54:49] cmjohnson1: uh, i need to follow up with you on that, as im unclear which orders, but will shortly [15:55:04] rt 2720 and 2728 [15:55:05] my brain can only do gerrit or procurement, not both at same time ;] [15:55:24] mark/^demon perhaps you can humor the git noobs here--does operations/debs already exist? 
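The `gerrit create-project` invocation quoted above (SSH to Gerrit's command port 29418, then the server-side `create-project` with `--name` and `--parent`) could be scripted. A minimal sketch; the helper function is hypothetical, but the port and flags are the ones from the log:

```python
# Hypothetical helper (not part of any real ops tooling) that assembles the
# argv for `gerrit create-project` as run over SSH, matching the command
# ^demon pasted: ssh -p 29418 gerrit.wikimedia.org gerrit create-project ...
def gerrit_create_project_cmd(name, parent,
                              host="gerrit.wikimedia.org", port=29418):
    """Build the argv for creating a new Gerrit project over SSH."""
    return [
        "ssh", "-p", str(port), host,
        "gerrit", "create-project",
        "--name=%s.git" % name,
        "--parent=%s.git" % parent,
    ]

# The wikistats repo from the log would be created with:
cmd = gerrit_create_project_cmd("operations/debs/wikistats", "operations/debs")
```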
[15:55:26] mark: since they are generated, i would assume we didnt want to put in puppet and bloat the repo [15:55:27] !log used gerrit create-project to create operations/debs/wikistats.git [15:55:29] Logged the message, Master [15:55:41] unless their generation is an ordeal? [15:55:45] RobH: they are NOT generated [15:55:46] <^demon> Jeff_Green: Yes [15:55:49] oh, sorry [15:55:52] not by us [15:55:57] right, but by ubuntu [15:56:13] i just do not understand how they are generated, hence my confusion, sorry =/ [15:56:16] then you need another method to get those files installed [15:56:24] ^demon can you help me figure out how to fetch a checkout? as a git noob I did what I thought made sense and failed [15:56:48] its only 14m [15:56:55] so tossing in puppet isnt exactly horrible. [15:57:06] i just dont get how they are made/placed [15:57:17] manually, from ubuntu ftp servers [15:57:34] 14mb is not small either [15:57:37] ahh, so either have to toss files in puppet, or have some kind of puppet check to ensure they are there [15:57:40] and if not pull via ftp [15:57:46] the latter prolly best right? [15:57:59] yeah that would be best [15:58:06] puppet can perhaps do that too [15:58:13] you have any notes on the ftp pull info? [15:58:21] or point me in right direction? [15:59:03] meh [15:59:14] puppet can't pull from arbitrary http servers [15:59:32] RobH: use the volatile file module [16:00:30] put the files in /var/lib/puppet/volatile/tftpboot/ [16:00:47] and have puppet pull them from puppet:///volatile/tftpboot/ [16:00:50] the install image files that is [16:00:50] then they're not in git [16:00:52] not the other ones. [16:00:53] but they are in puppet [16:00:53] yes [16:01:19] and store those files on sockpuppet, or stafford /var/lib? 
[16:01:24] stafford [16:01:36] ok, spiffy i get to figure out a new module ;] [16:01:41] * RobH is already googling away [16:01:48] squid configs are installed that way too [16:01:59] so i can check there for reference, gtk [16:02:08] no need to add that module, it already exists [16:02:12] it's just a subdir under the volatile module [16:02:25] you only need to copy the files there [16:02:51] noted [16:08:32] PROBLEM - Varnish HTCP daemon on cp1043 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:08:55] (i'm investigating that atm) [16:08:59] PROBLEM - Varnish HTCP daemon on cp1041 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:08:59] PROBLEM - Varnish HTCP daemon on cp1027 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:09:08] PROBLEM - Varnish HTCP daemon on cp1023 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:09:08] PROBLEM - Varnish HTCP daemon on cp1025 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:09:08] PROBLEM - Varnish HTCP daemon on cp1021 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:09:17] PROBLEM - Varnish HTCP daemon on cp1042 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:09:17] PROBLEM - Varnish HTCP daemon on cp1044 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:09:35] PROBLEM - Varnish HTCP daemon on cp1024 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:09:35] PROBLEM - Varnish HTCP daemon on cp1022 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:09:35] PROBLEM - Varnish HTCP daemon on cp1028 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:09:35] PROBLEM - Varnish HTCP daemon on cp1026 is CRITICAL: NRPE: Command check_varnishhtcpd not defined [16:11:04] ok, time to take puppet break and order all the osm stuff [16:11:41] PROBLEM - MySQL disk space on db1047 is CRITICAL: DISK CRITICAL - free space: /a 68144 MB (3% inode=99%): [16:11:53] ah, that looks like it just caught the revision 
before "erroneous {" [16:11:59] RECOVERY - Varnish HTCP daemon on cp1021 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:12:00] New review: RobH; "it has issues, bad syntax, plus after discussion with mark i will be adding in support for the insta..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/4429 [16:12:24] ah [16:12:27] files need to be named .cfg [16:12:34] New patchset: Demon; "Revert "Temporary hack to disable extension list cron for the next 24 hours so I can unbreak Translatewiki" like I promised I would." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4430 [16:12:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4430 [16:14:14] New patchset: Mark Bergsma; "NRPE include_dir configs need to be named .cfg" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4431 [16:14:29] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4431 [16:14:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4431 [16:14:46] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4431 [16:16:47] RECOVERY - Varnish HTCP daemon on cp1022 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:18:08] RECOVERY - Varnish HTCP daemon on cp1026 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:18:08] RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:21:53] RECOVERY - Varnish HTCP daemon on cp1023 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:22:11] RECOVERY - Varnish HTCP daemon on cp1042 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:25:12] morning! [16:25:12] ping mutante [16:25:24] hi maplebed [16:25:27] morning! [16:25:29] morning [16:25:34] err... evening? [16:25:44] true, yes [16:25:49] thanks for stopping where you did on the es slaving! [16:26:05] you're quite right that we want to slave from es1001, not es3. [16:26:09] and that section differs from the docs. [16:26:20] do you have 20 minutes now for us to finish? [16:26:22] ok, good [16:26:24] yea [16:26:30] sweet. [16:26:36] LeslieCarr: ^^^ [16:26:58] the crucial difference - we want to pull the new slave position from the output of 'show master status' that we captured rather than the 'show slave status'. [16:27:07] so the mysql process running in this state didnt make a difference to it being stopped ,right [16:27:16] I think that file only exists on es1001; IIRC we captured it after taking the snapshot not before. [16:27:20] mutante: correct. [16:27:57] apergos: notpeter ^ [16:28:08] mm? 
[16:28:20] PROBLEM - MySQL Slave Running on db1007 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Incorrect key file for table ./centralauth/spoofuser.MYI: tr [16:28:23] right now mysql is running but since you didn't start slaving, it's not doing anything except sitting there ready. [16:28:24] uh huh, the show master status on the master, [16:28:26] what you said about "show master status" vs. "slave status" [16:28:37] cause otherwise you've copied over a snap with writes in it that [16:28:45] likely will be duplicated [16:28:56] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 242 seconds [16:28:56] apergos: no, that's not it. [16:28:57] cool, nice to know I understood right [16:29:01] no? [16:29:24] if you're duplicating a slave and setting up the new one to slave off the same master as the old one you use the output of show slave status. [16:29:36] if you're setting up a slave and you want it to slave from the one you're copying from, you use show master status. [16:29:42] let me anchor that in an example. [16:29:53] pull up http://noc.wikimedia.org/dbtree/ [16:29:53] yes, that's what I'm saying [16:30:00] we're taking a copy from the master [16:30:12] oh, ok. I wasn't sure what you meant by stuff would be duplicated. [16:30:22] well if we use the master snap [16:30:35] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [16:30:35] and then tell the slave to start at whatever positions from the slave status [16:30:46] that we recorded (who knows what those are) [16:30:55] presumably those point to stuff that's a bit behind [16:31:18] so you might get dups [16:31:22] nope. [16:31:31] ok, please tell me what you would get [16:31:37] both are valid paths. [16:31:40] this means I don't understand well [16:31:50] if you look at dbtree, imagine we're taking the snapshot from db1017. [16:31:50] em [16:31:55] how is it valid in this case? 
[16:31:56] RECOVERY - Varnish HTCP daemon on cp1025 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:32:05] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [16:32:25] if we take the snap from db1017 and use the output of 'show slave status' to start slaving on the new host, it will be in the same position as, say, db53. [16:32:32] RECOVERY - Varnish HTCP daemon on cp1024 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:32:40] if we take the snap from db1017 and use the output of 'show master status' to start slaving, then it will be in the same position as db1033. [16:32:52] the only difference is who it treats as a master and therefore where it lives in the replication tree. [16:33:33] yes, but if you use the one master ip and the other log info [16:33:37] what does that get you? [16:34:13] true, you cannot mix and match masters and log position. [16:34:26] you can't use db1017's log position to slave off of db53, for example. [16:34:33] but that's not what we were doing here... [16:34:50] I still don't get it I guess [16:34:51] sorry [16:35:19] I think it'd be easier by voice. 
[16:35:23] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [16:35:27] so first is, we want to use a different master_host= IP, the one of es1001, right [16:35:33] can we talk in a few, after we get the es slaves running? [16:35:41] why not get them going [16:35:48] mutante: yes; we want to use es1001 as the master host [16:35:49] I'll keep looking at what you wrote [16:35:56] see if I can make it make any sense [16:35:59] RECOVERY - Varnish HTCP daemon on cp1027 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:36:08] RECOVERY - Varnish HTCP daemon on cp1041 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:36:44] RECOVERY - Varnish HTCP daemon on cp1028 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:36:56] mutante: and then the master log and position come from the output of master status that we captured on es1001; es1001-bin.000548 and 797575763 [16:38:11] and was i right about the password? [16:38:18] yes. [16:38:35] we are so bad :( [16:39:48] ok, im executing that query and starting slave [16:40:02] RECOVERY - Varnish HTCP daemon on cp1043 is OK: PROCS OK: 1 process with UID = 1001 (varnishhtcpd), args varnishhtcpd worker [16:41:04] mutante: rock on. [16:41:15] error in sql syntax, looking [16:41:27] can you copy/paste all but the password here? [16:42:03] mysql> change master to master_host='10.64.0.25', master='reply', master_password='xxxxxxx', master_log_file='es1001-bin.000548', master_log_pos=797575763; [16:42:15] repl .arg [16:42:20] what's the 'master=reply' bit? [16:43:11] RECOVERY - MySQL replication status on es1002 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : s [16:43:25] yea, stupid copy/paste mistake, all good [16:43:39] it should be master_user='repl'; right? [16:43:42] !log changed master and started slave on es1004 [16:43:44] Logged the message, Master [16:43:45] yes [16:43:58] rock on. 
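The syntax error above came from typing `master='reply'` where `master_user='repl'` was intended; `master` is simply not a valid CHANGE MASTER TO option name. A small illustrative sketch of the kind of sanity check that catches this (the option set is a deliberately incomplete subset, and the checker itself is hypothetical, not a real MySQL client):

```python
# A few CHANGE MASTER TO option names (deliberately incomplete subset).
VALID_OPTIONS = {"master_host", "master_user", "master_password",
                 "master_log_file", "master_log_pos"}

def unknown_options(options):
    """Return the option names MySQL would reject as unknown."""
    return set(options) - VALID_OPTIONS

# As originally typed in the log: 'master' is not a valid option name.
typed = {"master_host": "10.64.0.25", "master": "reply",
         "master_password": "xxxxxxx",
         "master_log_file": "es1001-bin.000548",
         "master_log_pos": 797575763}

# The fix discussed in the channel: replace master='reply' with master_user='repl'.
fixed = {k: v for k, v in typed.items() if k != "master"}
fixed["master_user"] = "repl"
```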
[16:44:21] if you look at the output of 'show slave status' on es1004 now, you'll see that it's got a huge seconds_behind_master [16:44:24] but that it's catching up. [16:44:59] RECOVERY - MySQL slave status on es1004 is OK: OK: [16:45:08] LeslieCarr: http://wikitech.wikimedia.org/view/Setting_up_a_MySQL_replica [16:46:31] maplebed: confirmed, i see it going down [16:47:14] PROBLEM - MySQL replication status on es1004 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 116580s [16:47:30] well, nagios tells us [16:50:59] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [16:51:15] 09:30 < maplebed> true, you cannot mix and match masters and log position. [16:51:18] 09:30 < maplebed> you can't use db1017's log position to slave off of db53, for example. [16:51:38] fwiw, if you use mk-slave-move you can do operations like --slave-of-sibling pretty trivially [16:51:46] * rcoli says the lurker-who-happens-to-be-a-DBA [16:52:02] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours [16:52:02] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [16:53:20] huzzah! [16:53:27] LeslieCarr: mutante congrats! [16:53:45] :) [16:54:17] RECOVERY - MySQL slave status on es1002 is OK: OK: [16:56:41] PROBLEM - MySQL replication status on es1002 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 122535s [16:57:35] RECOVERY - MySQL replication status on es1004 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [16:58:39] done catching up [17:00:22] LeslieCarr: you want to umount and lvremove on 1001 ? then we are through the steps [17:01:51] oh no, wait, apergos is still rsyncing from it [17:01:55] right [17:02:00] from a different snap [17:02:04] not from yours [17:02:04] ok [17:02:13] LeslieCarr just took off for the gym. 
[17:02:33] but yes, that's the last step and ok to run now that both slaves are good. [17:03:19] i see, "apersnap" i wont touch:) [17:03:50] hey, that name was not my idea :-P [17:04:10] you should totally take credit for it because it is such an AWESOME name. [17:04:17] er [17:04:19] noooooooo [17:04:44] ok, umount /mnt/snap , lvremove /dev/es1001/snap [17:04:48] done [17:06:17] apergos: http://flickr.com/gp/maplebed/FL747N [17:06:36] that might help us talk through which slave / master status to use where and what would happen. [17:07:02] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [17:07:02] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [17:07:05] see here's my question [17:07:36] so as I understand it we are taking a full copy of the master db with whatever logs it writes at the time of the rsync [17:07:47] (stop me as soon as I say something wrong) [17:08:09] correct (though slaving is stopped when we take the snap so it's not modifying any of the logs) [17:08:23] RECOVERY - MySQL replication status on es1002 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s [17:08:28] ok [17:08:50] hmm basically I guess I don't understand that mechanism either [17:09:22] do you want me to talk through the various logs in that picture? [17:09:23] it might still be making changes to the tables [17:09:37] not yet [17:09:43] Is today a WMF holiday? and/or is Monday? [17:09:46] it won't be making any changes to the tables. [17:09:55] andrewbogott: today isn't (at least for me) [17:09:59] I don't think monday is either, but will check. [17:10:24] no, monday is not. [17:10:32] (according to the HR holiday list) [17:10:33] Huh. OK. [17:10:35] monday is for me [17:10:40] I didn't have plans, anyway :) [17:11:01] sorry, mark, I was only looking at the US holiday list. [17:11:13] yeah... 
I just realized it is for me ;) [17:11:22] holiday on monday over here [17:11:27] I realized last night I'm flying to Berlin during a holiday. [17:11:35] hehe [17:11:37] what prevents some process from writing to the master in between the time we note the master status and when we take the snapshot? [17:12:01] apergos: us saying 'stop slave io_thread' and the fact that we're not running on the actual master (the tip of the tree). [17:12:03] sorry to be so slow. but if I don't understand this then all I'm doing is guessing about how it works, or sort of doing things by rote, and I don't like either [17:12:11] +1 asking questions. [17:12:20] oh, because it's not the actual master [17:12:22] that's the piece [17:12:37] so if we ever need to take a copy from the actual master (the one to which random clients are connecting) we also have to run 'flush tables with read lock' [17:12:44] uh huh [17:12:49] that statement locks all the tables so that nobody can write to them. [17:12:58] I figured [17:13:12] since we were copying from a slave (even a slave that's master to other hosts), it's set to 'read_only=true' [17:13:21] ok [17:13:25] and nobody can write to the tables (except the replication thread, which is exempt from the read_only setting) [17:13:45] and just running 'stop slave' is sufficient to stop all changes to the content (both tables and logs) of the db. [17:13:54] sure [17:14:33] ok now show slave status, run from the master [17:14:52] shows what exactly? I mean it shows a log position and log name of... which boxes exactly? [17:15:38] the names 'master' and 'slave' are only useful when describing the relationship between two hosts; it becomes vague when we have tree replication. [17:15:41] so looking at http://www.flickr.com/photos/maplebed/6905042356/lightbox/ [17:15:53] 'show slave status' on the real master (host a) is empty. [17:15:57] !log Power cycled down host lvs5 [17:15:57] it has no slave status. 
[17:15:59] Logged the message, Master [17:16:02] uh huh, wait is that the same image? [17:16:07] cause if it is I won't reload it [17:16:17] don't reload. [17:16:19] ok [17:16:23] it's the same, just in the fullscreen mode. [17:16:29] (easier to read) [17:16:33] ok, I have the original large size loaded up here [17:16:36] maplebed: that's normal... a master shouldn't have slave status [17:16:55] (assuming this is mysqld?) [17:16:56] jeremyb: pull up that flickr link for the tree we're talking about. [17:17:12] i think it's asking me to login to yahoo? [17:17:29] so show slave status run on box X should show where it is in reading some binlog in order to load up transactions it received from elsewhere [17:17:30] sorry, try this one: http://flickr.com/gp/maplebed/FL747N [17:17:33] is that correct? [17:17:57] I have that. I have a good copy of the image. it's all good. really. [17:18:08] (that was for jeremyb) [17:18:11] oh :-D [17:18:14] show slave status shows two logs [17:18:17] RECOVERY - Host lvs5 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:18:31] the binary log, which is the position and log file on the master from which it is replicating, [17:18:35] RECOVERY - BGP status on cr1-sdtpa is OK: OK: host 208.80.152.196, sessions up: 8, down: 0, shutdown: 0 [17:18:43] and the relay log, where mysql queues up the statements before executing them. [17:18:47] ok [17:19:07] so host c's show slave status shows the same log (for the master bin log) as the output of 'show master status' on host a. [17:19:26] lemme stare at that for a minute [17:19:58] ok [17:20:02] so far so good [17:20:11] maplebed: is this still about slaving a new es? [17:20:21] then what does show master show over there? 
[17:20:44] host b and host c have the same values in the output of 'show slave status' for the binary log (a-bin.005) but different values for the relay log (because it's only used internally on the host) [17:20:51] looks like i have a lot of scrollback to catch up on [17:20:56] uh huh [17:21:05] jeremyb: yeah, though this is more just a general 'how mysql slaving works' at this point. [17:21:20] you will likely know all this already jeremyb [17:22:02] the process that host c goes through when replicating a query: [17:22:17] * read it from the binary log on host a, write it to the relay log on host c [17:22:25] (that's the io_thread) [17:22:29] ok [17:22:38] * the sql_thread reads it from the relay log on host c and executes it. [17:22:55] * mysql then says "I ran a query" and so writes it to its own binary log. [17:23:06] ok now let me stop you [17:23:09] k. [17:23:09] * jeremyb gave up and plugged in ethernet [17:23:13] the in logs from a get to c by..? [17:23:17] *bin logs [17:23:38] the io_thread logs into host a as the repl user and reads them. [17:23:57] ok [17:24:03] if you look at 'show processlist;' on host a, you'll see two users logged in as 'repl' - host b and c. [17:24:51] I've seen that in the processlist but not known what it was except that obviously it had to do with making the binlog available someway [17:24:53] b and c both maintain a persistent mysql connection (same as a regular mysql client) waiting for new content to read from the binlog. [17:25:53] ok so now, if I take a copy of c at this point, stuff it on, say, e, and then give it c as its master but a 's bin log and position, how can that work? [17:25:58] here is where I just [17:26:01] dont... get it [17:26:31] won't it likely either miss transactions or have dups, or something other than the right thing? [17:26:33] it don't work [17:26:41] ok, so when you take a copy of c (and record both the output of slave and master status) then copy it all over to host e [17:26:46] you have a choice. 
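The io_thread/sql_thread pipeline maplebed walks through above (io_thread copies events from the master's binlog into the slave's relay log; sql_thread executes them, after which they land in the slave's own binlog) can be sketched as a toy model. This is pure illustration with no real MySQL involved; the class and method names are invented for the sketch:

```python
# Toy model of MySQL statement-based replication as described in the log.
class Host:
    def __init__(self, name):
        self.name = name
        self.binlog = []      # statements this host has executed
        self.relay_log = []   # statements queued for execution

    def execute(self, stmt):
        # executing a statement also writes it to this host's own binlog
        self.binlog.append(stmt)

    def io_thread(self, master):
        # copy new events from the master's binlog into our relay log
        already_seen = len(self.binlog) + len(self.relay_log)
        self.relay_log.extend(master.binlog[already_seen:])

    def sql_thread(self):
        # execute everything queued in the relay log, in order
        while self.relay_log:
            self.execute(self.relay_log.pop(0))

a, c = Host("a"), Host("c")
a.execute("INSERT 1")
a.execute("INSERT 2")
c.io_thread(a)   # events land in c's relay log
c.sql_thread()   # ...then get executed and appear in c's own binlog
```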
[17:26:50] give it a's log+pos+host or c's log+pos+host [17:26:54] don't mix+match [17:26:58] ok [17:27:20] host e can be placed in the tree alongside b and c or it can be a new child of c (next to d) [17:27:34] (guys, if you use "start slave until" you can guarantee the slaves stop at the same position in the master binlog) [17:27:38] so what I was asking about all this time was how we could use es1001 slave show status info (log name and position) and yet have the new slave using es1001 as its master [17:27:44] (also, this is a very good reference : http://www.mysqlperformanceblog.com/2008/07/07/how-show-slave-status-relates-to-change-master-to/) [17:27:53] let's take the first case - e wants to be a new host at the same level as b and c. [17:27:54] apergos: that link answers your question [17:27:54] and what I thought you were telling me is that this was possible [17:28:49] rcoli: wait a little bit please [17:29:02] we would want to use the output of 'show slave status' and get the binlog position for a-bin.005. [17:29:18] b and c and e would all then have the same a-bin.005 in the output of 'show slave status' and be siblings. [17:29:21] apergos: for sure, just an outsider (who happens to be a DBA in RL) trying to be of assistance. [17:29:31] you're answering a different question, I think maplebed [17:29:32] that's the "normal" way that the replication doc we were following suggests. [17:29:35] I am. [17:29:40] Now on to the second case. [17:29:51] where we want e to be a child of c (sibling to d) [17:29:55] apergos: you don't use es1001 show slave status+make es1001 the master. use show master status and make es1001 the master or use show slave status and use the master in that output as the master. 
(so es1001's master) [17:30:02] jeremyb: please wait [17:30:17] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 199 seconds [17:30:17] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 199 seconds [17:30:24] in that case, we need the output of 'show master status' in order to tell e where to start replicating from. [17:30:40] you're correct that the output of 'show slave status' is useless when we want e to be a child of c. [17:30:58] * jeremyb wonders if apergos is typing a question [17:31:20] when replication is running again, e and d will have the same output of 'show slave status' - they'll both be at c-bin.204 [17:31:38] no jeremyb, I just have three people all telling me different things when in the end I have one very particular question and that's all [17:33:00] apergos: does that make more sense yet? [17:33:08] !log Sending Japanese upload traffic to varnish in eqiad [17:33:10] Logged the message, Master [17:34:15] maplebed: case 1, e wants to be a sibling of b and c, we want the output of show slave status from... where? [17:34:28] not from a surely, as it's empty [17:34:36] so I guess I am stuck there [17:34:47] you want show master status from a [17:34:48] from whichever host we're using as the source of our copy, in this case c [17:34:54] ok, that's fine [17:35:07] 100% works for me and that's how I thought that worked. [17:35:43] both of those cases are how I thought they worked at the beginning of this discussion [17:35:49] case 1 is the case for which the ES repl doc was written. [17:35:50] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 185 seconds [17:36:13] uh huh [17:36:31] it got confusing because we wanted case 2 but were following the same docs. 
[17:36:32] :( [17:37:00] Jeff_Green: funny enough, that cronspam was legitimate [17:37:02] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 226 seconds [17:37:08] Jeff_Green: and not spam [17:37:17] Ryan_Lane: thousands per day? [17:37:18] if that script is outputting errors, it's a problem [17:37:29] well, I should have noticed it [17:37:31] it's been going on for at least weeks, if not months [17:37:36] I don't read cronspam often enough [17:37:36] :-P [17:37:46] apergos: so you're clear on all of it? [17:37:56] (repeating from the beginning of this conversation) well if we use the master snap [to put on the new slave] and then tell the slave to start at whatever positions from the slave status [taken from the master] [17:38:01] this can't possibly work right? [17:38:07] correct. [17:38:25] ok, that is what I was saying earlier when you weren't here so yay I actually had a clue [17:38:26] thanks [17:38:27] wait. [17:38:29] hang on. [17:38:41] * apergos waits [17:38:48] it does work, so long as you take *all* the data from the 'show slave status'. [17:38:51] that's case 1. [17:39:07] no, I meant: log position and name from show slave status [17:39:10] you take the snap from the master (host c), copy it to host e, and use the output of 'show slave status' on host c to start up replication [17:39:17] then it'll be a sibling of c and replicating from a. [17:39:28] but not the master ip [17:39:31] log name / log position / master host is a triple that can't be broken. [17:39:41] I guess you read that differently than I wrote it [17:40:04] but the whole discussion was: using es1001 for master, but the log name and position from show slave status on es1001 [17:40:24] which I could not see how that could work [17:40:55] I agree that we're saying the same thing; so long as you keep the host/log/position triple fixed, you'll be fine. so 'show slave status' gives a-bin and needs host a; 'show master status' gives c-bin and is fixed to host c. 
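The two valid triples from this exchange can be sketched as follows. The status values are hypothetical, but the field names match what SHOW SLAVE STATUS and SHOW MASTER STATUS actually report:

```python
# Sketch of the host/log/position rule discussed above, for cloning host c
# (a slave of a, and a master to d) onto a new host e. Either triple below
# is valid on its own -- but they must never be mixed.

slave_status_on_c = {            # hypothetical SHOW SLAVE STATUS output on c
    "Master_Host": "a",
    "Relay_Master_Log_File": "a-bin.005",
    "Exec_Master_Log_Pos": 120,
}
master_status_on_c = {           # hypothetical SHOW MASTER STATUS output on c
    "File": "c-bin.204",
    "Position": 98,
}

def change_master_triple(placement):
    """Return the (master host, log file, position) triple to give host e."""
    if placement == "sibling_of_c":     # case 1: e replicates from a, next to b and c
        s = slave_status_on_c
        return (s["Master_Host"], s["Relay_Master_Log_File"], s["Exec_Master_Log_Pos"])
    if placement == "child_of_c":       # case 2: e replicates from c, next to d
        m = master_status_on_c
        return ("c", m["File"], m["Position"])
    raise ValueError("unknown placement")

print(change_master_triple("sibling_of_c"))  # ('a', 'a-bin.005', 120)
print(change_master_triple("child_of_c"))    # ('c', 'c-bin.204', 98)
```

Mixing the two (host c with a-bin.005 and its position) is exactly the broken triple the conversation warns about: replication would either fail outright or apply the wrong events.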
[17:40:59] New patchset: Mark Bergsma; "New version 3.0.2-2wm2" [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4435 [17:41:03] uh huh [17:41:21] you're right that you can't take a-bin/pos and host c and use that triple to start up replication; it'll fail. [17:41:25] ok [17:41:27] Jeff_Green: heh. fixed it [17:41:34] Jeff_Green: we should revert that change [17:41:41] cool. jeremyb rcoli - wanna chime in? [17:41:43] thanks for waiting. [17:41:43] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4435 [17:41:45] Change merged: Mark Bergsma; [operations/debs/varnish] (master) - https://gerrit.wikimedia.org/r/4435 [17:41:46] I do actually need to see if that's broken [17:41:47] we should set up pt-heartbeat on the es cluster [17:41:53] binasher: yes we should! [17:41:53] I'll pay more attention, I promise :) [17:41:56] :) [17:42:28] although i'd much rather the current es cluster be 100% read only archival storage where replication doesn't matter :) [17:42:32] ok so I think I (mostly) just wasted a lot of your time [17:42:37] for which I apologized [17:42:41] *apologize [17:44:38] apergos: no waste at all. [17:45:54] do you actually have daisy chained replication anywhere? [17:46:16] if nothing else, we just spent time thinking about replication and reaffirmed our understanding! ;) [17:46:20] jeremyb: we do; [17:46:37] most of our clusters that replicate cross-colo have one cross-colo link and the rest of the hosts in the other colo replicate from that host. [17:46:37] i've read about some of the relevant options once upon a time but never heard of anyone actually using it [17:46:46] jeremyb: take a look at http://noc.wikimedia.org/dbtree/ [17:46:48] ahhh [17:47:01] and what about the TS slaves? [17:47:07] TS? [17:47:15] toolserver [17:47:22] does anyone actually understand how trainwreck works? [17:47:26] heh... [17:47:46] barely. I prefer not to think about it. 
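The daisy-chaining just described (one cross-colo link, with the remaining hosts in the other colo replicating from that host) can be sketched as a parent map. Host names here are made up; the point is that a tiered tree leaves far fewer direct children hanging off the top master:

```python
# Illustrative comparison of a flat vs a tiered (daisy-chained) replication
# tree. Only the *direct* children of a crashed master need to be repointed
# at a new master, so tiering localizes the damage.

def direct_children(tree, host):
    """All hosts whose replication parent is `host`."""
    return [child for child, parent in tree.items() if parent == host]

# 30 slaves replicating straight off the master:
flat = {f"db{i}": "master" for i in range(1, 31)}

# Same 30 hosts, but 3 relay slaves fan out to 27 leaves:
tiered = {"relay1": "master", "relay2": "master", "relay3": "master"}
tiered.update({f"db{i}": f"relay{i % 3 + 1}" for i in range(1, 28)})

print(len(direct_children(flat, "master")))    # 30 hosts to move on a crash
print(len(direct_children(tiered, "master")))  # only 3
```

This matches the advantage raised right after in the channel: with tiering, a master rotation touches a handful of relay hosts instead of every slave, and each relay's children are guaranteed consistent with each other.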
[17:47:51] hah [17:48:07] the other advantage of using tiered replication [17:48:11] * jeremyb wonders how this chart is related to the new org chart for employees/contractors [17:48:26] is that when you need to do a master rotation there are fewer slaves to move. [17:48:56] so if you have, say, 30 slaves (which we don't), tiered replication lets you move 3-4 hosts in the case of a crash instead of all 30. [17:49:35] at the same time it guarantees consistency among the children of a specific middle slave (whereas the various children of a crashed master might have slightly different content in their binlogs) [17:50:33] jeremyb: same software builds both charts is why it looks familiar. [17:51:29] so, what's the deal with db42? [17:51:48] it failed to do the 1.19 schema migration [17:52:06] it's got a bunch of additional databases (that don't exist on its master) for the researchers [17:52:17] I think it ran out of space trying to do the schema change. [17:52:22] ahh [17:52:33] IIRC binasher managed to coax extra space out of its users and it's running the schema change again tonight. [17:52:34] lag is way off the charts and its name indicates sdtpa but the chart says it's replicating off eqiad [17:52:46] re: chained replication on the core db's, we can "select * from heartbeat.heartbeat" on for example the slave of a slave and see: [17:52:48] in a little while I'll look at the other links that were tossed into the channel [17:52:51] | 2012-04-06T17:51:28.000750 | 101623 | db1034-bin.000144 | 803062442 | db13-bin.000337 | 555969791 | [17:52:51] | 2012-04-06T17:51:28.000830 | 1006 | db1002-bin.000137 | 534054918 | db1034-bin.000144 | 803062442 | [17:52:52] | 2012-04-06T17:51:28.001190 | 10623 | db13-bin.000337 | 555970128 | NULL | NULL | [17:52:54] see if I can get any new details out of em [17:53:06] jeremyb: yes, it's in pmtpa and replicating off of eqiad [17:54:04] binasher: what are the 2nd, 4th, and 6th columns? 
[17:54:35] also maplebed thanks for taking the time to draw up the diagram [17:54:40] or just all columns ;) [17:54:44] shouldn't it go on commons though? :-P [17:55:00] I'm sure others have more "professional" diagrams that have the same content... [17:55:17] 2nd col = serverid, 4th = master position (the master or slave's own binlogs), 6th = exec master position (null = the actual master) [17:55:31] cool. [17:55:59] why are there two rows with the same server ID? [17:56:27] i think you're misreading 101623 vs 10623 [17:56:36] oh so I am. [17:57:05] from 10.64.16.23 vs. 10.0.6.23 [18:01:55] binasher: did you see MHA? [18:02:13] binasher: http://code.google.com/p/mysql-master-ha/ - we just hired the author [18:03:24] jeremyb: re: db42, i don't want non-prod slaves that get trashed and super behind replicating directly off prod masters.. it can cause issues with switching the master [18:04:02] domas: oooh [18:05:31] mha does look pretty. [18:05:49] domas: i still think fb should fix all the limitations in google's global txn id patch [18:06:45] hehe [18:08:13] I just deployed my consistent hashing director in the eqiad varnish upload cluster [18:08:17] it's serving japanese traffic [18:08:24] it would make the world a better place! [18:08:40] consistent hashing in varnish does too [18:09:04] 400 purges a second do not :( [18:09:24] how do we get that many purges on upload? 
[18:09:33] it's one stream for both wikis and upload [18:09:40] ah [18:09:40] we should probably fix that at some point [18:10:38] i think when the frontend varnish instances weren't getting purges and instead just gave everything a <= 5min ttl (eventual consistency!), their cpu utilization was a lot lower [18:10:48] yes [18:11:18] but I think user experience is our main goal, not low cpu ;) [18:11:29] it sucks indeed [18:11:31] i think the low ttl also helped reduce lru work [18:11:58] we can play with that [18:12:09] there's not a lot of point in keeping objects > 5 min anyway [18:13:25] i don't know if taking up to a few min for an image purge to propagate globally (and that assumes the image to be purged is staying in the small frontend ram cache) is that bad of an experience [18:14:41] for images that's fine I think [18:14:43] for text it's not [18:16:05] New patchset: Bhartshorne; "moving the save-object stuff so we only save stuff that's different from ms5" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4473 [18:16:07] that's fair, although logged in users altering the text would only see a previously cached version if they're checking it from a different browser within that window [18:16:20] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4473 [18:16:30] but i guess the faster we can clear vandalism, the better [18:18:25] oooh, MHA is nice ;) [18:19:28] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4473 [18:19:31] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4473 [18:25:02] PROBLEM - Puppet freshness on search1022 is CRITICAL: Puppet has not run in the last 10 hours [18:27:20] !log updating OpenStackManager to r114758 on virt0 [18:27:21] Logged the message, Master [18:29:14] domas: he has a bit of a spam comment problem: http://yoshinorimatsunobu.blogspot.com/2012/01/mha-for-mysql-053-released.html [18:30:10] i blame google [18:31:09] they probably put the blogger team in shackles and forced them to work on nothing but google+ [18:31:10] I blame google too [18:31:33] blogging isn't social enough ;) [18:31:36] haha binasher [18:31:45] when did fb hire yoshinori? [18:31:57] look at his latest blog ;) [18:32:28] oh, duh [18:32:36] binasher: which mysql build do you use? do your own cherry picking? [18:34:16] facebook's patch set here, percona binaries elsewhere  [18:36:01] thanks [18:40:02] PROBLEM - Puppet freshness on search1021 is CRITICAL: Puppet has not run in the last 10 hours [19:05:15] New patchset: Lcarr; "fixing icinga nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4479 [19:05:30] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4479 [19:05:46] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4479 [19:05:49] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4479 [19:44:51] Ryan_Lane: can I name it vumi.pp [19:45:03] you should call it mobile.pp [19:45:13] if you want to call the subclass vumi, that's fine [19:45:17] mobile::vumi [19:45:29] no one will know what it's for if you call it vumi.pp [19:47:38] Ryan_Lane: I guess I'll just ask maplebed to help me [19:48:16] heh [19:48:57] when you say help, you really mean "do this for me" [19:52:01] Ryan_Lane: NO I DO NOT [19:52:02] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [19:52:15] :D [19:53:36] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4374 [19:53:38] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4374 [20:14:14] New patchset: preilly; "Add new mobile puppet manifest with simple vumi class * provides ussd application server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4483 [20:14:30] New patchset: preilly; "Add X-Carrier to response from Varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4032 [20:14:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4483 [20:14:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4032 [20:15:48] New patchset: preilly; "Add new mobile puppet manifest with simple vumi class * provides ussd application server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4483 [20:16:03] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4483 [20:17:24] New patchset: preilly; "Add new mobile puppet manifest with simple vumi class * provides ussd application server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4483 [20:17:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4483 [20:31:56] New patchset: preilly; "Add new mobile puppet manifest with simple vumi class * provides ussd application server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4483 [20:32:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4483 [20:39:03] New patchset: preilly; "Add new mobile puppet manifest with simple vumi class * provides ussd application server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4483 [20:39:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4483 [20:45:05] maplebed: fixed things in a way that you might notice if you look really carefully ;) [20:45:17] heh... [20:45:46] thanks. I didn't know we had a template for channels. [20:46:12] no problem, I think it might be newish [20:46:56] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [20:48:18] oh actually i edited yours on meta - yeah it's been around for a while [20:48:25] we also have {{irc| for the same effect IIRC [20:49:58] the reason for removing my ssh key - I've got commit access now so it's unnecessary? [21:13:38] PROBLEM - Squid on brewster is CRITICAL: Connection refused [21:14:15] so, hoping someone has some ideas … i've got a lot of squids on neon (new nagios) with 403 errors … but i haven't seen any configurations in puppet that would be locking out everything but spence … anyone know what might be going on ? 
actually i believe it is all squids [21:18:44] User-Agent header? [21:20:11] +1 [21:20:55] hrm, good thought, let me see if spence is doing something special [21:25:20] New patchset: Lcarr; "adding in ubuntu logo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4488 [21:25:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4488 [21:25:36] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4488 [21:25:38] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4488 [21:32:08] !log restarted squid on brewster [21:32:10] Logged the message, Mistress of the network gear. [21:33:41] oh, that's why it's borked [21:33:48] brewster is 100% full [21:33:54] again huh? [21:34:44] hrm [21:34:48] who put jenkins on root home ? [21:34:53] that's pretty big [21:35:28] !log moved jenkins_1.458_all.deb to /srv/wikimedia/incoming/ on brewster [21:35:30] Logged the message, Mistress of the network gear. [21:36:12] LeslieCarr: re: the backend squid 403's [21:36:22] it's based on the tiertwo acl in squid.conf [21:36:39] it contains 208.80.152.0/24 for pmtpa but only 10.64.0.0/22 for eqiad [21:36:56] ah [21:37:04] so it needs to be a tiertwo ? [21:37:06] only hosts matching the tiertwo acl get to talk to backend squids [21:37:16] which is why the monitoring is ok for the frontend instances [21:37:33] okay :) [21:37:37] thanks binasher ! [21:38:25] i wonder if we should move nagios (and most everything else) to hosts only on internal ip's [21:38:56] and let lvs or a software proxy expose 80/443 to the world [21:39:12] hah i was just saying nagios was a bit harder because it's public but we could do a port forwarding ;) [21:40:56] maybe an ops nginx cluster that did ssl termination and forwarded to private hosts.. 
we could have all services behind one ip address and force ssl while we're at it [21:48:40] RECOVERY - Squid on brewster is OK: TCP OK - 0.003 second response time on port 8080 [21:55:35] !log restarted puppet on spence [21:55:37] Logged the message, Mistress of the network gear. [22:00:03] binasher: so how do we generate our squid config files ? [22:00:15] it's not a git file ... [22:00:53] nope, see /home/w/conf/squid on fenari [22:01:38] mark reworked the deployment process around a month ago, there might be a wiki page about it, but try searching email [22:02:13] but generate.php creates the configs [22:02:24] and the deploy script in there should have --help options [22:02:40] after using generate [22:03:04] do a recursive diff of generated/ against deployed/ [22:03:56] hrm, any idea what the email could be titled ? though the wikitech page actually looks sort of in date :) so let's see if that works or if i kill the site [22:04:40] wow we have some old old acl's in there [22:04:40] binasher / LeslieCarr when I was doing the swift stuff, it was 'make' rather than 'generate' to create the configs. [22:04:41] 2005 [22:04:42] hehe [22:05:43] the makefile contains nothing but - all: php -n generate.php [22:06:36] so that works too [22:06:41] New patchset: Ryan Lane; "Disabling access log for install server, it keeps filling up the disk" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4490 [22:06:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4490 [22:06:58] Ryan_Lane: you saw my answer about the cisco install right? [22:07:10] LeslieCarr: did you find http://wikitech.wikimedia.org/view/Squid#Configuration [22:07:12] the one puppet change is live, so you are able to actually install cisco's now. 
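The deploy workflow just described (run generate.php, then recursively diff generated/ against deployed/ before pushing) could be scripted with something like the sketch below. The function name and directory layout are hypothetical, not the actual tooling on fenari:

```python
# Hypothetical helper for the squid config review step: recursively compare
# the generated/ tree against deployed/ and report files that differ or
# exist on only one side, so changes can be reviewed before deploying.
import filecmp
import os

def changed_files(generated, deployed):
    """Return sorted relative paths that differ between the two trees."""
    diffs = []

    def walk(cmp, prefix=""):
        # files whose contents differ between the two trees
        diffs.extend(os.path.join(prefix, f) for f in cmp.diff_files)
        # files present on only one side
        diffs.extend(os.path.join(prefix, f) for f in cmp.left_only + cmp.right_only)
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(generated, deployed))
    return sorted(diffs)
```

An empty result would mean generated/ and deployed/ agree and there is nothing to push; anything listed is worth eyeballing before deploying to a single host (e.g. amssq35) and only then to 'all'.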
[22:07:23] gah [22:07:31] i went to "squids" which is linked off the main page [22:07:39] just need to add whatever cisco it is to the linuxsoandso.ttyS0-115200 file via puppet [22:07:47] same page, I think. [22:07:48] oh same thing [22:07:49] whew [22:08:16] the thing that page doesn't mention is that you can deploy to groups other than 'all' [22:08:33] such as individual servers (eg sq85) or specific groups (eg 'upload') [22:08:36] ahha i see the issue [22:08:44] we have tiertwo marked by which subnet the squids are in [22:08:54] in pmtpa that happens to be the same subnet as spence [22:08:58] and in eqiad it's not [22:09:02] :) [22:09:09] and it sneakily got around my greps [22:09:11] very sneaky [22:09:40] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4490 [22:09:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4490 [22:10:58] so will deploying to all kill anything ? [22:12:22] LeslieCarr: only if there's an error in your config. [22:12:29] :) [22:13:03] LeslieCarr: I strongly suggest deploying to one host and verifying there first. [22:13:31] ok [22:13:37] !log deploying new squid config to amssq35 [22:13:39] Logged the message, Mistress of the network gear. [22:14:11] !log added neon into tiertwo of squid allowed hosts [22:14:13] Logged the message, Mistress of the network gear. [22:18:19] yay looks like it's working … and icinga is happy [22:18:36] does that mean you resolved all the duplication issues?!?! [22:19:18] no, they magically solved themselves [22:19:20] somehow [22:19:26] eerie. [22:19:27] i did sacrifice a chicken [22:21:22] New patchset: Ryan Lane; "Move virt5 into correct file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4493 [22:21:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4493 [22:22:37] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4493 [22:22:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4493 [22:23:01] !log deploying new squid config to all squids [22:23:03] Logged the message, Mistress of the network gear. [22:48:46] so… binasher you may know this, is there a config file where we have granted mysql permissions to login or would i have to go to each database server (it appears that spence has permission to login but neon does not) [22:50:21] it's stored internally per db, although it should be the same for all dbs within a shard [22:51:05] the best way to see all of the permissions is to run pt-show-grants [22:52:39] i think rather than changing that, mysql checks that aren't run via nrpe should be [22:52:49] so the mysql login is from localhost [22:54:50] root@es1001:~# pt-show-grants | grep -v 10.0 | grep -v 10.64 | grep -v localhost [23:15:03] Ryan_Lane: quick question, what's the puppet cmd to force a puppet run? [23:15:14] puppetd -tv [23:15:20] thx! [23:15:21] or puppetd --test [23:25:45] python-iso8601_0.1.4-0_all.deb [23:25:46] python-redis_2.4.5-1_all.deb [23:25:47] python-smpp_0.1-0_all.deb [23:25:49] python-ssmi_0.0.4-0_all.deb [23:25:50] python-txamqp_0.6.1-0_all.deb [23:25:52] redis-doc_2.4.10-ubuntu1~lucid_all.deb [23:25:53] redis-server_2.4.10-ubuntu1~lucid_amd64.deb [23:25:54] vumi_0.4.0~a+git2012040612-0_all.deb [23:25:55] vumi-wikipedia_0.1~a+git2012040614-0_all.deb [23:25:56] Do those all look like sane names for us? [23:26:01] New patchset: Lcarr; "removing old mysql checks that are overtaken by nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4495 [23:26:05] binasher: can you check this out ? ^^ [23:26:19] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4495 [23:29:09] LeslieCarr: it looks like the ES hosts do use those [23:29:15] and nothing else [23:29:47] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds [23:29:56] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds [23:30:31] oh really ? hrm, i thought it was included in line 111 on mysql.pp ? [23:30:43] (which i could be wrong on) [23:33:50] those don't currently get included on the es hosts [23:35:06] and once we have another es cluster, we wouldn't want a lot of them on the es hosts [23:35:48] oh, db_cluster [23:35:49] sigh [23:36:14] so, would including these manually be not desired then ? [23:36:16] so i think it makes sense for the es monitoring to be defined separately from the core dbs [23:36:25] * LeslieCarr hates nagios so much [23:36:29] it's pretty different. [23:37:07] (the es monitoring is pretty different from the main cluster monitoring; I'd go along with separating it in puppet/nagios) [23:38:39] ok [23:38:53] Change abandoned: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4495 [23:39:06] the es is a temporally sharded key value store that just happens to be stored in mysql.. things like long running transaction monitors aren't at all applicable [23:40:16] New review: Reedy; "Eh?" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3815 [23:41:39] maybe i'll move the ES to mongodb.. that could be lulz [23:41:59] es should totally live in a non-mysql-based persistent key/value store. [23:42:29] yes [23:43:29] mongo might actually be a very good match [23:44:27] the actual migration would suck [23:45:12] but might not be as difficult as swift [23:47:47] New patchset: Lcarr; "moving es monitoring to nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4498 [23:48:02] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4498 [23:48:31] binasher: check it out this time ? (though not going to merge it until monday) [23:51:17] in mysql.pp, delete line 292 (heartbeat monitor would fail) and maybe lvs, since they don't run snapshots [23:52:03] and one of the hosts is always going to deliberately be very behind on replication, so we'll have to figure out monitoring for that [23:52:20] New patchset: Lcarr; "moving es monitoring to nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4498 [23:52:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4498 [23:52:37] how about that ? [23:55:35] the $crit = $master (line 285) isn't going to work in that context [23:55:58] it looks like the es master includes db::es::master [23:56:22] i wonder if you can test in puppet if a class has been included or not [23:56:58] well we can test the variable [23:57:05] so if mysql_role = master ? [23:57:23] maybe the db::es::master and db::es::slave classes can just be removed [23:57:50] yeah, set crit to true if mysql_role = master [23:58:09] actually yeah, it's a bit silly [23:58:16] we should get paged if that host blows up [23:58:40] but $mysql_role was mainly being used for the monitoring bits that you're removing [23:58:46] so i think it could be cleaned up some
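The "set crit to true if mysql_role = master" idea at the end, plus the deliberately-delayed slave that must not alert, could look roughly like the sketch below. Python is used purely for illustration here; the real logic would live in the puppet manifests and nagios/nrpe checks, and all names and thresholds are hypothetical:

```python
# Hypothetical replication-lag alerting rule reflecting the discussion above:
# only a master-role host should page (CRITICAL), an ordinary slave should
# just warn, and an intentionally delayed slave should not alert at all.

def lag_check(mysql_role, lag_seconds, warn=120, delayed_slave=False):
    """Return the nagios-style state for a replication-lag check."""
    if delayed_slave:
        return "OK"          # deliberately far behind; alerting is noise
    if lag_seconds < warn:
        return "OK"
    return "CRITICAL" if mysql_role == "master" else "WARNING"

print(lag_check("master", 300))                      # CRITICAL -> pages
print(lag_check("slave", 300))                       # WARNING
print(lag_check("slave", 300, delayed_slave=True))   # OK
```

This mirrors the point that the paging decision can key off a simple role variable rather than the db::es::master / db::es::slave classes, while the delayed host still needs its own separate monitoring story.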