[00:12:39] RECOVERY - MySQL Replication Heartbeat on db44 is OK: OK replication delay 0 seconds
[00:13:06] RECOVERY - MySQL Slave Delay on db44 is OK: OK replication delay 1 seconds
[03:03:19] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[03:03:19] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[03:19:22] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[03:19:22] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[04:42:57] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[04:44:18] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[05:32:46] is it just me or is http://wikitech.wikimedia.org/ down?
[05:33:44] PROBLEM - udp2log log age on emery is CRITICAL: CRITICAL: log files /var/log/squid/e3_necromancy_idle1year.log, /var/log/squid/e3_necromancy_idle3month.log, have not been written to in 6 hours
[05:39:40] pgehres: works for me...
[05:40:19] and so it does for me now again
[05:40:36] paravoid: thanks for checking
[05:51:00] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[05:53:15] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[06:02:15] PROBLEM - Host bellin is DOWN: PING CRITICAL - Packet loss = 100%
[06:04:03] RECOVERY - Host bellin is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[06:38:39] RECOVERY - udp2log log age on emery is OK: OK: all log files active
[06:47:29] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[07:00:32] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[07:56:55] hello there
[08:58:52] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours
[12:17:14] PROBLEM - Host ssl1002 is DOWN: PING CRITICAL - Packet loss = 100%
[12:19:42] hashar: hi.
gallium has an issue with Misc::Contint::Test::Testswarm/File[/var/lib/testswarm/mediawiki-trunk
[12:19:53] duplicate file
[12:20:05] PROBLEM - LVS HTTPS on bits-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:20:06] PROBLEM - LVS HTTPS on wikimedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:20:16] oh
[12:20:52] mark: you about?
[12:21:01] yes
[12:21:08] yea, whats up with ssl1002
[12:21:09] is that you changing stuff?
[12:21:11] hi all
[12:21:11] no
[12:21:14] ok
[12:21:17] PROBLEM - LVS HTTPS on wikipedia-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:21:23] hmm
[12:22:10] pybal is not depooling ssl1002
[12:22:43] mutante: oops
[12:22:54] !log Manually depooled down ssl1002 in pybal
[12:22:56] Logged the message, Master
[12:23:39] ok, I'm going to get back to waking up...
[12:23:59] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 183 seconds
[12:24:08] RECOVERY - LVS HTTPS on wikipedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79291 bytes in 0.279 seconds
[12:24:08] RECOVERY - LVS HTTPS on bits-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 3932 bytes in 0.112 seconds
[12:24:08] RECOVERY - LVS HTTPS on wikimedia-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79291 bytes in 0.246 seconds
[12:24:48] can someone look at ssl1002 at some point?
[12:25:02] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 205 seconds
[12:25:43] connects to mgmt
[12:27:14] !log powercycling frozen ssl1002
[12:27:16] Logged the message, Master
[12:28:25] mutante: what is the root cause for the duplicate file ( Misc::Contint::Test::Testswarm/File[/var/lib/testswarm/mediawiki-trunk )
[12:28:29] RECOVERY - Host ssl1002 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms
[12:28:34] mutante: greping syslog does not give me any message :-(
[12:30:28] getting a snack for lunch.
will be back soo
[12:30:29] n
[12:32:42] mark: it's back, nginx running, looks normal, nothing obvious. syslog just stops in the middle of normal stuff at 12:15. i guess you could repool it
[12:33:06] hashar: /var/log/daemon.log
[12:33:16] !log repooled ssl1002
[12:33:18] Logged the message, Master
[12:33:23] what worries me is that pybal didn't depool it
[12:34:21] mutante: I do not have access to that
[12:34:30] (should have applied for a junior ops position :D )
[12:34:51] then nnark would have scratched his head even more!
[12:34:57] hashar: in labs we just made "common logs" readable for non-roots
[12:34:59] all the cool stuff happens in ops
[12:35:27] mutante: anyway, the whole /var/lib/testswarm/mediawiki-trunk should no more be there. I have marked it as ensure=>absent
[12:38:11] hashar: are you on gallium?
[12:38:18] yup
[12:39:44] PROBLEM - udp2log log age on emery is CRITICAL: CRITICAL: log files /var/log/squid/e3_necromancy_idle1year.log, /var/log/squid/e3_necromancy_idle3month.log, have not been written to in 6 hours
[12:40:47] mutante: starving for food. give me 10 minutes ;)
[12:40:47] hashar: cat /home/hashar/daemon.log
[12:40:59] oh reading
[12:41:01] sure, i want coffee too
[12:41:02] * hashar dies
[12:41:16] k so I get a snack and you get coffee
[12:41:20] sounds like a great plan
[12:57:25] hashar: grep Filebucketed
[12:58:15] there are all those files in mediawiki-trunk/checkouts
[12:58:49] mutante: back
[13:01:25] grep duplicate and you see the md5 sums of the duplicates ..
its a long file :p
[13:01:50] the first line I got is: Recursively backing up to filebucket
[13:02:03] looks like puppet do a recursive copy of the content of /var/lib/testswarm to some place
[13:02:10] yes
[13:02:21] then takes md5 fingerprints to check if the file already exist in the bucket
[13:02:31] there is "recurse" somewhere in the contint class then
[13:03:35] maybe
[13:03:52] anyway the whole /var/lib/testswarm/mediawiki-trunk/ should be gone by now
[13:04:02] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[13:04:02] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[13:04:24] mutante: I have marked the path as ensure=>absent
[13:05:02] yea, i saw that.. and found it normal
[13:05:05] hmm
[13:05:15] running puppet again
[13:05:48] relevant change is https://gerrit.wikimedia.org/r/#change,4364
[13:05:53] commit 9ff43e547f995fe1fa12896be5b8e545a5960fb4
[13:06:10] well, it is running all the time.. because it sits there going through all those duplicates
[13:06:16] hey hashar , today I'm working sane european hours for a change :)
[13:06:25] merged in git yesterday, maybe you forgot to pull it in puppet master ?
[13:06:57] Krinkle: is that every Thursday or just this week ? :)
[13:08:06] Every thursday
[13:08:10] hashar: you know what, this is not actually a broken puppet , it is not a "duplicate file/resource" thing that breaks a puppet run
[13:08:24] 12-19 instead of 19-02
[13:08:42] hashar: it is just running for a looooong time, and FileBucket telling us it finds duplicate files
[13:09:02] mutante: we should just disable that
[13:09:33] mutante: some /var/lib/testswarm paths are holding full mediawiki copies which are going to be a nightmare for any recursive job
[13:10:08] I am going to delete some
[13:10:20] meanwhile can you have a look at https://gerrit.wikimedia.org/r/4743 ?
:-]
[13:10:23] it is for gallium too
[13:10:30] now that its running i am not sure if we should just let it finish
[13:10:45] I can delete the file and then have puppet to rerun?
[13:10:50] might be faster I don't know
[13:11:32] * hashar deleting
[13:11:40] yeah, but whats better? killing it or just letting it do the job
[13:12:13] as long as its not keeping you from using the services...
[13:14:41] load average: 8.34, 4.60, 2.91
[13:14:42] hehe
[13:18:00] it's recursive delete via puppet, yea. reading through some closed bugs .. but http://projects.puppetlabs.com/issues/3835 http://projects.puppetlabs.com/issues/3180 ..but yeah, it does NOT say there is a problem setting anything to absent, it just tells us it finds a lot of files in /.svn/ with duplicate md5 sums.. and takes a long time to do that
[13:18:24] cause it has to md5sum all the millions of files there
[13:18:42] it is still doing its job though
[13:18:47] then most probably create a a file bucket with a huuuuuuge index
[13:19:13] so I guess you should just ^C puppet
[13:19:27] I have launched the rm commands to delete the files
[13:20:07] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[13:20:07] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[13:21:11] hmm, ok..
[13:21:19] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.09349632813 (gt 8.0)
[13:21:55] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[13:22:10] Change abandoned: Pyoungmeister; "file has changed significantly.
going to redo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2083
[13:22:37] doh a bit more than 60 secs of real time to delete a directory
[13:22:43] and there is a thousand of them
[13:23:03] puppet agent would not react on regular stop anymore anyways
[13:23:07] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[13:23:13] !log killed puppets on gallium
[13:23:16] Logged the message, Master
[13:26:56] mutante: can you possibly review https://gerrit.wikimedia.org/r/#change,4743 ?
[13:27:13] looked at rm working hard
[13:27:22] looking
[13:34:45] wait, what?
[13:34:47] what are you guys doing?
[13:34:52] deleting tons of files with puppet!?
[13:37:41] I got million of files in /var/lib/testswarm/mediawiki-trunk/
[13:37:50] that holds various checkouts of mediawiki source files
[13:37:51] we stopped puppet
[13:38:02] recently had puppet to ensure=>absent that directory
[13:40:34] I ber, atleast a million
[13:40:48] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/4743
[13:40:50] 1100 checkouts roughl
[13:40:51] complete svn checkouts of every rev in trunk/phase3 for several months
[13:41:13] and with all the .svn metadata
[13:41:13] hashar: I see why you want to leave out .git now :)
[13:41:21] :-]]
[13:41:28] although leaving that out does mean mw won't be able to get gitinfo
[13:41:35] I currently do a recursive copy excluding .git
[13:41:35] but whatever
[13:41:40] but will have to use git clone instead
[13:41:44] nah
[13:41:46] it does hardlinks
[13:42:04] static snapshot of files related to mediawiki should be fine
[13:42:31] hashar: btw, what's the status? anything I can do to help?
[13:42:45] otherwise I'll pick up the next section to implement for testswarm
[13:42:58] "rm" on gallium still going ...
[13:43:21] oh..
[13:43:22] xD
[13:43:24] Krinkle: maybe the addJob() script
[13:43:25] the testswarm user does it, yea
[13:44:23] mutante: I am confused by your comment on 4743 about setting the db directory to 2775
[13:44:59] I need members of the www-data group to be able to write to the mediawiki-git/db dir
[13:45:05] and old dirs contained there
[13:45:19] the sticky bit is merely to make sure any subdirs are in the www-data group
[13:45:22] puppet adds the +x on directories
[13:45:28] so you just need a 6
[13:46:01] well given that declaration is only on a directory and not recursive, I think it is better to directly fill the real rights
[13:47:07] but I can move it to 2664 , that will avoid a mistake if we ever make that declaration recursive :-D
[13:49:39] ok, yeah. i dont know about the part that sqlite databases need to be writable by apache
[13:50:11] they do need to be writable :D
[13:50:24] cause we are going to have MediaWiki (run under apache) to insert data in the sqlite databases
[13:50:37] apache running as apache:www-data , it would be able to do so
[13:50:44] Jenkins will take care of g+w
[13:51:43] ok. makes sense
[14:03:33] that is still rm -f ing
[14:14:38] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4743
[14:16:36] hashar: so 2664 , i checked for the SGID bit in puppet because in the type reference it does not mention it. it works as expected and i got to try with "puppet apply". thats a nice way to test
[14:16:53] echo 'file { "/tmp/foo_puppet_test_123": ensure => directory, mode => 2644; }' | puppet apply
[14:17:37] directory created in there belongs to group
[14:26:13] http://docs.puppetlabs.com/references/2.7.0/type.html#file (at "mode" still has that 3 digit example and about puppet adding +1 but the "quite limited" description and does not mention SGID etc. then on latest same section, it now says should use four-digit notation and mentions setuid/setgid/sticky ..
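Note on the mode discussion above: a minimal sketch, in Python rather than Puppet, of why 2664 and 2775 end up equivalent on a directory. It assumes the behaviour described in the channel (for directories, the search bit is added wherever the read bit is set). The names effective_dir_mode and has_setgid are made up for this illustration and are not part of Puppet. The leading 2 is the setgid bit, which is what makes new subdirectories inherit the group; the sticky bit proper would be a leading 1.

```python
import stat


def effective_dir_mode(mode: int) -> int:
    """Return the mode a directory ends up with if the search (x) bit is
    added for every class (user, group, other) that has the read (r) bit.
    This mirrors the assumed "puppet adds the +x on directories" rule."""
    for read_bit, exec_bit in (
        (stat.S_IRUSR, stat.S_IXUSR),
        (stat.S_IRGRP, stat.S_IXGRP),
        (stat.S_IROTH, stat.S_IXOTH),
    ):
        if mode & read_bit:
            mode |= exec_bit
    return mode


def has_setgid(mode: int) -> bool:
    """True if the setgid bit (the leading 2 in 2775) is set."""
    return bool(mode & stat.S_ISGID)


if __name__ == "__main__":
    declared = 0o2664
    print(oct(effective_dir_mode(declared)))  # 0o2775
    print(has_setgid(declared))               # True
```

Under that assumption, declaring 2664 or 2775 on a non-recursive directory resource gives the same result; the difference only matters if the declaration later becomes recursive and also applies to plain files.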
[14:26:40] http://docs.puppetlabs.com/references/latest/type.html#file
[14:29:21] mutante: whhaaa I never heard of puppet apply
[14:29:28] sounds like a great way to test stuff :D
[14:30:01] about the mode, it is indeed supposed to be 4 digits
[14:30:16] puppet-lint complains about mode using 3 digits
[14:30:56] i got that from puppet channel . yes, nice to test
[14:34:47] so /bin/ls -r1 |xargs -P16 -n1 rm -fR
[14:34:50] still running
[14:35:12] New patchset: Demon; "Adding some extra debugging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4781
[14:35:25] ext3 is sooo slow at deleting files :-(
[14:35:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4781
[14:35:39] <^demon> Someone mind approving ^ so I can do some debugging on manganese?
[14:36:03] commit message is lame, does not reference a bug number, also messages.inc
[14:36:17] <^demon> messages.inc?
[14:36:43] sorry copy pasted my standard reply
[14:36:57] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4781
[14:36:59] looks good to me
[14:37:13] though you probably would want to log any parameters passed to the hook
[14:37:36] New review: Asher; "(no comment)" [operations/debs/varnish] (patches/udplog); V: 0 C: 1; - https://gerrit.wikimedia.org/r/4729
[14:39:10] <^demon> I'm logging options.project, options.branch and options.uploader. I don't need the rest.
[14:40:17] k :)
[14:40:26] so now you need someone to merge the apply :-D
[14:40:39] <^demon> Yeah, I need an opsen :\
[14:42:02] New review: Dzahn; "2 does what it should. SGID tested with puppet apply." [operations/puppet] (production); V: 1 C: 0; - https://gerrit.wikimedia.org/r/4743
[14:42:34] ^demon: at this time there is pgehres but is probably still asleep today
[14:43:17] Jeff_Green maybe you are available to have a change merged?
:-) https://gerrit.wikimedia.org/r/4781
[14:43:55] New patchset: Pyoungmeister; "udp2log cleanup, new instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4782
[14:44:05] hashar: looking
[14:44:08] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/4782
[14:44:51] <^demon> Jeff_Green: Thanks. I'm trying to debug something and this is logging stuff to make that easier :)
[14:44:56] sure
[14:45:21] mutante: want me to do 2664 ?
[14:45:33] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/4781
[14:45:45] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/4781
[14:45:48] hashar: yes please
[14:46:02] "requires code review"
[14:46:12] I haven't run into that yet--what does it mean?
[14:46:20] <^demon> Could you not +2 it?
[14:46:50] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4781
[14:46:50] mutante: note that I have used 2775 for /var/lib/jenkins a few step above :-D
[14:46:53] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4781
[14:47:19] ah, I thought my +1 would combine with hashar's to make for +2 action
[14:47:22] <^demon> Nope.
[14:47:36] +1 * n < +2
[14:47:52] that's some crazy math
[14:47:54] we need that +2 message to be shown as a check mark or something
[14:47:56] hashar: arr..hmm.. what should i say, we need our own style guide? :p
[14:48:07] <^demon> hashar: +2 is show as a check mark after you click it :)
[14:48:16] mutante: I propose to keep the 2775 to have consistent modes across the file
[14:48:30] mutante: then we can submit another change that change all 2775 to 2664 :-D
[14:48:43] so if 5 people all clicked +1 you would have a score of 5?
naaahhhh
[14:48:52] apergos: nooooooo
[14:48:59] apergos: you still are at the +1 state
[14:49:01] it seems to me the +1's are token at that point
[14:49:03] I know
[14:49:03] New patchset: Mark Bergsma; "No longer apply the previous varnishncsa patch" [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4783
[14:49:04] New patchset: Mark Bergsma; "Fix compile errors" [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4784
[14:49:05] New patchset: Mark Bergsma; "glibc ignores setvbuf's size argument in some cases" [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4785
[14:49:09] well they aren't
[14:49:15] apergos: i know you know :-D
[14:49:19] hashar: ok, convinced, since technically it doesnt make a difference, you're right about keeping it consistent across file
[14:49:28] the idea if there are several people reviewing is that yo can see who lkes a change and who doesn't to what degree,
[14:49:29] apergos: is a +1 required in some contexts for a subsequent +2 to go through?
[14:49:32] <^demon> Nope.
[14:49:34] mutante: :-]]
[14:49:35] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (patches/udplog); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4729
[14:49:37] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4729
[14:49:48] Jeff_Green: not that I have ever seen
[14:49:49] <^demon> +1's are designed so people can review code without approving it.
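Note on the +1/+2 exchange above: a small Python sketch of why "+1 * n < +2". It assumes the Code-Review label behaves as described in the channel and as in Gerrit's default setup, where votes are never summed: a change needs at least one maximum vote (+2) and must have no minimum vote (-2). code_review_ok is a hypothetical helper written for this illustration, not a Gerrit API.

```python
from typing import Iterable


def code_review_ok(votes: Iterable[int], max_value: int = 2, min_value: int = -2) -> bool:
    """Decide whether the Code-Review requirement is met.

    Assumed rule: any minimum vote blocks the change, and otherwise at
    least one maximum vote is required. The votes are not added together.
    """
    votes = list(votes)
    if any(v == min_value for v in votes):
        return False  # a single -2 blocks, whatever else was voted
    return any(v == max_value for v in votes)


if __name__ == "__main__":
    print(code_review_ok([1, 1, 1, 1, 1]))  # False: five +1s are still not a +2
    print(code_review_ok([1, 2]))           # True: one +2 is enough
    print(code_review_ok([2, -2]))          # False: the -2 blocks
```

This matches what happened with change 4781: two separate +1 reviews left it at "requires code review" until one reviewer gave a +2.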
[14:49:58] so it's just a memo
[14:50:03] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (patches/udplog); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4783
[14:50:10] Jeff_Green: in MediaWiki we have volunteer to +1, so when you have some familiar names that +1 stuff you can sometime assume the change is fine
[14:50:15] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (patches/udplog); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4783
[14:50:17] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4783
[14:50:20] well just a memo would be with no score
[14:50:33] the score actually has a (somewhat subjective) meaning
[14:50:44] <^demon> "Looks good to me, but someone else must approve?"
[14:50:46] we could replace the +1 with a thumbs-up icon, and the +2 with a check icon
[14:50:47] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (patches/udplog); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4784
[14:50:50] no, wait, if 5 people clicked +1 you would be higher on the list of patchsets than the one that just has one +1
[14:50:56] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (patches/udplog); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4784
[14:50:58] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4784
[14:51:03] ahah
[14:51:08] if the 5 peple who clicked have any knowledge about the code
[14:51:08] mutante: oh interesting, I didn't realize there was a ranking thing in play
[14:51:14] if 5 random people clicked...
[14:51:16] indeed, +1 can be seen as a bumping status
[14:51:20] <^demon> mutante: I thought it was just based off latest review, not number of reviews.
[14:51:22] though -1 will keep it on top too
[14:51:25] the logging change is deployed
[14:51:34] ^demon: deployed!
[14:51:34] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (patches/udplog); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4785
[14:51:37] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/udplog) - https://gerrit.wikimedia.org/r/4785
[14:51:39] thanks jeff
[14:51:42] <^demon> Jeff_Green: Thank you.
[14:51:48] np
[14:51:53] maybe i was just expecting that to be score, and got it wrong, because until now not many sets got more than one +1
[14:52:10] but the one that has the latest comment is back up in the list for sure
[14:53:01] <^demon> I think it's the comment that bumps it :)
[14:53:07] how can one literal quotes/double quotes in a puppet manifest? specifically, in an array/hash?
[14:53:26] shell out to perl and use qw// ?
[14:54:39] heh
[14:54:45] * mark stabs hashar
[14:55:05] notpeter: what do you mean? just by escaping?
[14:55:14] yeah
[14:55:18] * hashar fall on floor
[14:57:32] New review: Dzahn; "ok, convinced to keep it consistent across the file since the 755 vs. 644 discussion is just style w..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4743
[14:57:35] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4743
[14:58:10] mutante: we still have to wait for the files deletion to end on gallium :/
[14:58:29] New patchset: Pyoungmeister; "udp2log cleanup, new instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4782
[14:58:33] <^demon> Jeff_Green: Could you also force a puppet run on manganese? Puppet ran just minutes before the change :p
[14:58:34] yes, but puppet is stopped, so we could do this anyways
[14:58:36] hashar:
[14:58:45] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4782
[14:59:05] then it will attempt to file bucket all the files in /var/lib/jenkins/mediawiki-trunk
[14:59:17] oohh let me rename it
[14:59:49] ^demon: ya, done
[14:59:53] mutante: I have renamed the directory, that should hides it from puppet view
[15:00:33] <^demon> Jeff_Green: Ok, changes live now, thanks.
[15:00:55] wtb puppet broadcast changes feature
[15:02:02] New patchset: Pyoungmeister; "udp2log cleanup, new instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4782
[15:02:18] Project does not begin with mediawiki/
[15:02:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4782
[15:02:46] mark: would you be willing to look at ^ ?
[15:02:58] or binasher
[15:02:59] sure
[15:03:13] ct asked me to spin up oxygen the rest of the way
[15:03:23] I thought it was a nice excuse to finally finish my udp2log conf cleanup
[15:03:42] hashar: running puppet
[15:03:45] and watching it
[15:04:12] File[/var/lib/testswarm/mediawiki-git/db/]/group: group changed 'jenkins' to 'www-data'
[15:04:13] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Thu Apr 12 15:03:37 UTC 2012
[15:05:06] hashar: it is also changing mobile nightly builds from "jenkins" to "wikidev". . and it finishes fine
[15:05:12] ok
[15:05:16] the NRPE stuff can be done in a nicer way now
[15:05:38] cool!
[15:05:47] with nrpe::monitor_service
[15:05:51] ah, yes, right!
[15:06:00] that will setup the nrpe.d/ file for you etc
[15:06:21] you may want to rename /etc/udp2log/squid to something more appropriate while you're at it
[15:06:22] yep
[15:06:24] (cdn or so)
[15:06:28] i'll take a close look soon
[15:06:36] mark or anyone--I'm failing to find the knobs to control file/directory attributes when packaging a deb, is there hidden magic somewhere or is it usually done with a makefile?
[15:06:37] sure, cdn is fine
[15:06:39] you've already seen it mutante
[15:06:54] ah,ok, sounded like you made another change just now
[15:07:01] Jeff_Green: it just takes them as they are installed by the debian/rules file
[15:07:07] usually with "install" commands in there
[15:07:18] mutante: http://integration.mediawiki.org/clone/mw/master/test.txt
[15:07:22] mutante: thanks! :-]
[15:07:30] hashar: :)
[15:07:42] notpeter: just wondering, is there any good reason to list all filters in a puppet array instead of simply in the config file?
[15:07:58] to not have a config per host?
[15:08:07] and let puppet populate it from the array
[15:08:12] New review: Hashar; "I have tested the deployment and it works well. Thanks!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4743
[15:08:20] that's nice if puppet can do other cool stuff with it
[15:08:31] if it doesn't, I would say, just keep it simple
[15:08:35] ok
[15:08:38] fair enough
[15:08:41] not that I object to this
[15:08:43] still simple enough
[15:08:47] just wondering what you were planning to do with it
[15:09:03] you could for example automatically setup monitoring for each filter or something like that
[15:09:04] mark: huh, ok I'll look more closely there
[15:09:42] yeah. the monitoring I have parses that conf file...
[15:09:52] so I could probably reuse and make it a little less janky
[15:10:09] also
[15:10:16] can we get rid of that aft thing? ;)
[15:10:23] !log gallium - after files have been deleted/moved, puppet back to normal operation (and new clone directory in Apache)
[15:10:23] it looks more like a separate instance of udp2log
[15:10:25] Logged the message, Master
[15:10:29] it is
[15:10:29] right?
[15:10:31] yeah
[15:10:37] then it would be cleanest if you made generic multiple instances support
[15:10:41] much like we do with varnish
[15:10:46] ah, yes
[15:10:49] makes sense
[15:10:50] then we can easily add more instances
[15:10:56] i'm not sure how necessary this is
[15:11:08] so perhaps you shouldn't spend a lot of time on it
[15:11:11] but it would be cleanest
[15:11:25] shouldn't be too hard
[15:11:35] perhaps multithreaded udp2log is around the corner, and we may not need multiple instances at all
[15:11:36] I don't know
[15:11:53] (afaik it's only done for performance reasons atm)
[15:14:06] ok, cool. I shall make these changes
[15:14:23] as long as I'm cleaning tings up, might as well put in the extra hour to actually make it nice
[15:14:32] \o/
[15:18:18] hashar: one small thing is left. "File[/etc/testswarm]: Not removing directory; use 'force' to override".
[15:18:48] hashar: backend uses mediawiki-git fronted says ./mw/master which is correct?
[15:19:10] mutante: indeed, we have to keep /etc/testswarm it is used by the debian package
[15:19:23] also, should wgScriptPath include {changePath} >?
[15:19:24] ok, i figured
[15:19:45] seeing the config.Debian.ini
[15:20:07] I need to revert that part
[15:20:23] Krinkle: mediawiki-git is a filesystem path
[15:20:24] probably should since it is asynchronous and even if not, it has to be cache resident, every changeset a different url
[15:20:41] I know, but this thing isn't limited to mediawiki/core:master right ?
[15:20:53] Krinkle: /clone/mw/master is an Apache alias used by clients (aka an URL like http://integration.mw.org/clone/mw/master/1234/1/index.php?title=foobar
[15:20:56] so why /clone/mw/master as front-end path (assuming that is correct)
[15:21:27] clone/gerrit maybe, but that's just naming.
Ah, okay, so it does include changePath at the end
[15:21:56] it looked like it didn't since $wgScriptPath = "${testswarm.URL.path ; testswarm.URL.path=/clone/mw/master
[15:22:13] ah testswarm.URL.path="${testswarm.URL.basepath}/${changePath}
[15:22:20] gotcha
[15:23:36] http://integration.mediawiki.org/clone/mw/master/test.txt wee
[15:24:14] New patchset: Hashar; "testswarm: restore /etc/testswarm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4787
[15:24:16] mutante: ^^^^
[15:24:27] oh? why
[15:24:30] Project does not begin with mediawiki/
[15:24:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4787
[15:24:51] mutante: we do not want puppet to delete /etc/testswarm since the Debian package put its stuff there
[15:25:01] mutante: so I am just reverting the change to ensure=>absent
[15:25:22] "Project does not begin with mediawiki/"
[15:25:27] hashar: got that part, ok good.thought for a moment you deleted stuff you still needed
[15:28:24] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4787
[15:28:27] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4787
[15:29:20] k my Facebook friends are not geek
[15:29:28] only two persons recognized: git push gerrit master:refs/for/master
[15:29:36] heh
[15:29:47] hashar: you need better friends :)
[15:29:53] NOOOO
[15:30:16] the good point otherwise is that when I invite them at home, we talk about everything but computer science :D
[15:30:46] do you ever invite non-geeks and geeks on the same night?
[15:30:46] <^demon> Other than my wmf friends I don't know anyone on facebook who'd get it.
[15:31:40] my geek environment eventually grown up, got married and have childs :-D
[15:31:43] at least one other who'd get it is anti-Facebook
[15:32:02] so yeah, sometime we wine about how Everquest and RPG nights were the good old time
[15:32:05] but that is about it
[15:32:17] the main subject nowadays are diapers, school and wine :-]
[15:33:35] nothing but info and notice in gallium puppet run now. and done after 60 seconds
[15:34:34] \O/
[15:35:03] * hashar offers a virtual round of "alcohol free" beer
[15:35:34] takes one
[15:38:55] quick question: how do i find the hostname of the mysql db when i want to connect a specific wiki project?
[15:38:59] <^demon> This beer is woefully unsatisfying :\
[15:39:57] <^demon> declerambaul: The db clusters are defined in db.php
[15:40:02] <^demon> noc.wikimedia.org/conf/
[15:40:33] thanks a lot!
[15:40:40] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.594426875 (gt 8.0)
[15:45:03] hashar: wow, jenkins has a lot of room for storing data about builds. a lot more than just passed/failed + console output
[15:45:17] hashar: https://integration.mediawiki.org/ci/job/MediaWiki-CheckStyle/1/checkstyleResult/ that is an awesome sample
[15:45:23] goes many levels deep
[15:45:31] although probably also hard to create such thing
[15:48:31] <^demon> https://integration.mediawiki.org/ci/view/MediaWiki/job/MediaWiki-Tests-Misc/355/testReport/ is pretty nice too.
[15:48:40] <^demon> It's just a pain to dig into this stuff if you don't know where to look.
[15:49:09] yeah
[15:49:27] https://integration.mediawiki.org/ci/view/MediaWiki/job/MediaWiki-Tests-Misc/355/testReport/(root)/BlockTest__testBug29116LoadWithEmptyIp/testBug29116LoadWithEmptyIp_with_data_set__1/?
[15:49:31] wow
[15:49:38] "Took 1ms" :)
[15:49:52] it stores "EVERYTHING"
[15:50:00] atleast could
[15:50:31] Krinkle: sorry was out to get some fresh air and a cup of water
[15:50:31] <^demon> hashar: Do we need PHP_Invoker enabled or can we configure that to be more sane? It times out way to easily.
[15:50:44] we need invoker
[15:50:44] hashar: np
[15:51:06] ^demon: I have submitted/merged a patch yesterday that raise the default timeout to 2s
[15:51:09] Krinkle: Jenkins is slick that way. That's why I want to deploy nightlies to beta labs from Jenkins instead of say cron. Also, you can make builds interdependent.
[15:51:19] <^demon> hashar: Upstream?
[15:51:24] chrismcmahon: sure
[15:51:31] ^demon: na in our tests/phpunit/suite.xml
[15:51:42] <^demon> chrismcmahon: There's a "deployment" plugin that I looked at for that sort of thing before. Worth looking at.
[15:52:00] <^demon> hashar: I'd say 3s.
[15:52:12] <^demon> We still hit a timeout at 2s.
[15:52:34] ^demon: yes, or Jenkins can just shell out to issue a command and get the result off stdout/stderr if it comes to that (I think. something like that at least)
[15:52:45] ^demon: are they small tests?
[15:52:53] <^demon> https://integration.mediawiki.org/ci/view/MediaWiki/job/MediaWiki-Tests-Databaseless/599/testReport/junit/(root)/ExifRotationTest__testRotationRendering/testRotationRendering_with_data_set__0/
[15:52:55] The testswarm plugin for Jenkins needs updating though, its completely broken and outdated, and even when it worked it was just a console-output view with a link to testswarm. I'd like to use TestSwarm exclusively as an API and get all data in Jenkins (TestSwarm's latest version has an API that allows extracting all this data in a machine readable format), so that we don't rely on TestSwarm for viewing the data.
[15:53:00] if not we want to add them to the medium group [15:53:04] and can easily format it whenever we want [15:53:17] ^demon: :-( [15:53:17] <^demon> chrismcmahon: Yeah, we could do something with shell commands too, but if there's an extension designed to deploy it's worth looking at :) [15:53:34] for sure [15:53:35] <^demon> Can we adjust the timeout on a per-test basis? [15:53:48] well, "latest", the "next" version rather. I'm currently working on that with jQuery. [15:53:57] ^demon: yup using "@group medium" and "@group large" IIRC [15:54:01] ^demon: relevant change is https://gerrit.wikimedia.org/r/#change,4741 [15:54:07] (jQuery team that is, not the js library) [15:54:18] ^demon: basic doc https://github.com/sebastianbergmann/phpunit/issues/58 [15:54:24] ohwell, back to work :) [15:54:27] hooo @small @medium and @large [15:55:00] <^demon> Gotcha, makes sense. [15:55:48] Krinkle: the check style plugin use PHP_CodeSniffer . I have created a very basic style syntax somewhere ( I think mediawiki/tools/codesniffer.git ) [15:56:02] ok [15:56:07] ^demon: I did a lot of I/O operations on gallium this afternoon, that probably explains the timeout [15:56:20] anyway the recommendation is to have tests doing disk I/O in @medium [15:56:35] hashar: I'm installing node right now, I'll see if I can check out our "jslint vs node vs cli vs batching" plan tomorrow [15:57:06] for node you will have to convince ops that it is something we can install on our gallium :-D [15:57:44] well, there is no actively developed (trusted) linter for JS out there that I know of that isn't written in javascript. [15:57:59] node is pretty lightweight from what I gather, shouldn't be too much of a problem. [15:58:13] or we'll keep it for testing only in labs [15:58:31] since jenkins + stuff is going to labs in the long term, right ? 
[15:58:39] (so that it doesn't risk production environment) [15:58:58] workflow wouldn't change since we can still use git and puppet on labs the same way) [15:59:51] yeah it is probably going to move to labs [16:00:04] I am not sure there is any need to have it on real hardware nor on production [16:00:59] hashar: performance might impact a little bit [16:01:10] hashar: although I could get a higher level instance if needed [16:01:11] I don't think so [16:01:25] those labs server are killer machines [16:01:59] probably a better server than the production hardware currently used (although it'd be shared ofcourse) [16:02:25] not sure though, can anyone confirm? avg. labs instance server vs. golom [16:03:30] <^demon> Krinkle: You were the one getting the minor parser test failures right? Where just the array key format was different? [16:04:02] It was a while ago, but with the HtmlTest yeah, the issue was related to \n \r if I recall correctly [16:04:16] no, not array key [16:04:51] <^demon> http://p.defau.lt/?vGnB18QHJzY1r7dhtbkaNw [16:04:56] I did a comparison to '…' with actual line breaks in it with my editor, it was fixed by ending them on each line and concatenating with \b [16:04:58] \n * [16:05:07] need to get out [16:05:09] brb [16:05:15] <^demon> eg: ["width"] v 'width' [16:05:32] - ["filename"]=> [16:05:32] + 'filename' => [16:05:35] never seen that before [16:05:41] is it using the same dump function? [16:06:13] the + lines look foreign to me [16:06:33] not var_dump or var_export (unless maybe new in php 3.4 haven't tried there yet) [16:06:46] 5.4 * [16:07:16] ^demon: Thanks for the help. If i want to connect to a more obscure project like e.g. qu, how would i find the mysql host? [16:07:19] ^demon: weird, not all tests are affected [16:07:19] <^demon> Ha, seems it's xdebug's fault. [16:07:27] <^demon> I disabled xdebug and it worked. 
[16:11:56] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 1.94461055118 [16:27:50] PROBLEM - MySQL Replication Heartbeat on db24 is CRITICAL: CRIT replication delay 221 seconds [16:27:50] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 194 seconds [16:28:08] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 182 seconds [16:28:17] PROBLEM - MySQL Slave Delay on db24 is CRITICAL: CRIT replication delay 243 seconds [16:29:40] Reedy: the OTRS upgrade, what's the RT ticket number? Has it been updated recently? From what I've been told, we're waiting for Martin now [16:30:09] tho the rt ticket numbers are in the bug higher up somewhere [16:30:16] oh ok [16:30:41] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 0 seconds [16:30:54] is there a test version up somewhere yet RD? I think the legal issue has already been sorted [16:31:08] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 0 seconds [16:31:18] a demo? [16:31:26] otrs has a demo of 3.0 i think [16:31:34] (on their site) [16:32:01] I meant with some of our queues or emails - I guess not yet [16:34:53] RECOVERY - MySQL Replication Heartbeat on db24 is OK: OK replication delay 0 seconds [16:35:20] RECOVERY - MySQL Slave Delay on db24 is OK: OK replication delay 0 seconds [16:48:41] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours [16:51:50] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 9.36890070313 (gt 8.0) [16:55:12] Question: If I want to connect to a somewhat obscure project, e.g. 'qu', how do I find the appropriate mysql host? In http://noc.wikimedia.org/conf/db.php.txt there is no reference to it. [16:56:01] there's a default host that has all projects that aren't specifically mentioned. [16:56:17] I think at this point s3 is the default, but I'm not positive about that. 
[16:56:24] correct [16:57:32] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 0.747402204724 [16:57:44] thanks a lot! [16:57:53] mark: (since you're here) I need to stop an email address from going to sanger. Do I need to actually run sql against the sqlite db to change that or is there a tool? [17:00:23] oh, wait. [17:00:27] maybe I don't need to. [17:01:01] the ldap section comes before the imap section; maybe if it exists in ldap it'll stop processing there and the imap entry will be ignored. [17:01:44] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours [17:16:37] Jeff_Green: you around ? [17:16:41] yep [17:17:53] can you check out https://gerrit.wikimedia.org/r/#change,4759 ? [17:18:08] it's switching how fundraising does monitoring so i figured you're the best person to check it out -- moving it to nrpe [17:18:54] i saw this earlier, and I didn't fully understand it [17:19:03] New review: Dzahn; "i see this as a review request, but re: slow queries and "worlds largest database" comment, i'll jus..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/4400 [17:20:23] LeslieCarr: but I will say that since monitoring has consistently failed us on the fundraising db's my take is, change is good :-P [17:25:18] hehehe [17:25:18] cool [17:25:33] well basically it's making the checks happen on the boxes themselves and then they send back the data :) [17:25:40] ah cool [17:25:45] and it's how it's working on almost everything else [17:25:48] almost… :) [17:25:50] yep [17:26:08] that's how I did all the payments stuff, but I did it using nsca instead of nrpe [17:27:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4759 [17:27:09] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4759 [17:28:28] all i care about is not having to change the databases themselves to allow more access blocks [17:28:33] well not all [17:28:35] but it's a big part [17:29:15] I think I'm missing too much context [17:29:31] so https://neon.wikimedia.org/icinga/ [17:29:45] however if we do direct database checks from nagios/icinga, we need to allow access from neon to said database [17:29:50] which is manual and annoying [17:29:54] but how we did it with spence [17:30:18] but if we use nrpe, don't need to allow neon to access the db directly [17:30:36] oh yeah that's better [17:32:19] * Jeff_Green prefers to have an exploitable monitoring box than an exploitable rig with a possibly-secure monitoring box [17:34:20] :) [17:34:44] maplebed: could you help me figure out a mystery (wrapped in an enigma?) [17:34:52] hm. [17:35:06] that might be puzzling. [17:35:14] for some reason, when restarting nagios-nrpe-server, while it says it's reading from the proper config file, it's not really [17:36:34] notpeter: rt! 2798 [17:36:41] rt! 2798 [17:37:15] which old sun storage servers? 
[17:38:30] New patchset: Demon; "Remove debugging, fix the L10n user for the final time" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4792 [17:38:30] htose should be talking about search1-12 [17:38:44] I think that those shouldn't be pulled for a couple of months [17:38:46] Project does not begin with mediawiki/ [17:38:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4792 [17:39:36] okay..i received 2 new db's wasn't sure what rob was talking about...first i heard about removing the existing storage [17:40:04] er? [17:40:08] wait, now I'm confuxed [17:40:17] 2798 should be about search servers, yes? [17:40:55] yep...search [17:40:57] not storage [17:41:09] cmjohnson1 - then those 2 are not for rt2798 [17:41:09] either way..first i heard about it [17:41:32] they're for 2614 [17:42:02] ok, they are for binasher [17:42:29] ah… even better maplebed, it's referring to an old file that was commented out but not "ensure => absent" so it was never removed :) [17:42:40] nice. [17:43:17] Allowing connections from: 127.0.0.1,208.80.152.185,208.80.152.161,208.80.154.14 [17:43:18] yay!!! [17:43:23] notpeter: i mixed the 2 tickets...sorry to bug you...i will later once the search boxes arrive [17:43:26] \o/ [17:45:03] cmjohnson1: cool! [17:49:29] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 8.4283684252 (gt 8.0) [17:56:41] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.5097984252 [17:58:47] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CRIT replication delay 213 seconds [17:59:41] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CRIT replication delay 263 seconds [18:04:00] maplebed: in any case, yes there's a tool [18:04:04] wmfmailadmin on sanger [18:04:20] on sanger? but the sqlite db is on mchenry. 
[18:08:14] RECOVERY - MySQL Slave Delay on db42 is OK: OK replication delay 0 seconds [18:08:50] RECOVERY - MySQL Replication Heartbeat on db42 is OK: OK replication delay 0 seconds [18:12:38] New review: Ryan Lane; "This user is used for ssh, this is going to break." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/4792 [18:15:47] maplebed: it's rsynced to mchenry [18:15:56] by cron [18:16:06] wow. [18:16:13] this is extensively documented on wikitech btw [18:16:25] (I know, you wouldn't expect it) [18:17:13] ah, I see it buried in http://wikitech.wikimedia.org/view/Mail [18:17:19] thanks. [18:23:47] New patchset: Lcarr; "making swift not monitor test clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4793 [18:24:02] Project does not begin with mediawiki/ [18:24:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4793 [18:24:05] maplebed: ^^ can you review plz ? [18:24:16] looking [18:24:56] hm. [18:25:12] New patchset: Jgreen; "hacky backup strategy for hacky shared 'boilerroom' script dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4794 [18:25:14] I don't like depending on the cluster name like that. [18:25:27] Project does not begin with mediawiki/ [18:25:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4794 [18:25:29] it'll work on our physical clusters, but will break with labs. [18:25:36] ah [18:25:58] if you want to use the name, I would suggest only monitoring names with -prod rather than not monitoring clusters with -test. [18:26:05] cool [18:26:32] but it'd be better to have a config parameter somewhere in the roles.pp cluster def that say whether or not you want it monitored. [18:26:44] which would let you declare that you do want a labs cluster monitored (for example, to test monitoring). 
[18:27:01] just include a monitoring class in the role classes that need monitoring, and not for test clusters [18:27:05] that's the idea of role classes [18:31:30] New patchset: Lcarr; "making swift not monitor test clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4793 [18:31:45] Project does not begin with mediawiki/ [18:31:45] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/4793 [18:32:22] mark/maplebed ^^ look better ? [18:33:14] not according to lint ;-) [18:33:15] but i'll look [18:33:36] hehe [18:33:40] well the basic idea at least [18:34:16] yeah exactly [18:35:07] LeslieCarr: did you see the db1004 puppet facts ticket? [18:35:25] i think i created it... [18:35:34] ah you updated [18:35:35] I debugged it [18:35:38] LeslieCarr: +1. while you're in there, want to do the same thing to the swift::proxy::monitoring class? :D [18:35:50] cool :) [18:37:36] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4794 [18:37:38] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4794 [18:38:26] New patchset: Lcarr; "making swift not monitor test clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4793 [18:38:41] Project does not begin with mediawiki/ [18:38:43] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/4793 [18:39:50] New patchset: Lcarr; "making swift not monitor test clusters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4793 [18:39:59] puppet is so bizarre sometimes... [18:40:06] Project does not begin with mediawiki/ [18:40:06] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4793 [18:40:52] maplebed/mark ^^ :) [18:42:08] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/4793 [18:42:24] New patchset: Pyoungmeister; "udp2log cleanup, new instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4782 [18:42:40] Project does not begin with mediawiki/ [18:42:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4782 [18:42:44] New patchset: Lcarr; "pushing udp-filter on stat1 to analyze old data" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4795 [18:43:00] Project does not begin with mediawiki/ [18:43:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4795 [18:43:20] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4793 [18:43:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4793 [18:43:27] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4795 [18:43:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4795 [18:46:40] lunchtime for me :) [18:47:06] notpeter: you're a brave man. [18:47:23] maplebed: how so? [18:47:32] that change you just checked in. [18:47:36] hahaha [18:47:48] well, I mean, it looks better to me. I don't kow if it will break everything... but... [18:47:50] peter is taking on all the stuff noone dares touching, lately [18:47:59] I am liking him more and more. [18:48:06] New review: preilly; "Ryan Lane — Please remember to replace the <%= REPLACE_ME %> in the ERB files." 
[operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/4483 [18:48:13] heh :) [18:48:21] ct wanted me to help out with oxygen [18:48:25] I figured it was time... [18:48:38] yes and instead of making it worse you're making it better, that's always good [18:48:44] notpeter: if your courage makes you blind, you could try moving all the actual filter definitions into roles.pp or site.pp instead of logging.pp [18:48:59] I might not cry on my next gerrit review run ;) [18:49:44] yes, maplebed has a good point [18:50:03] IF you keep those filters in a puppet array, they should move to a role class probably [18:50:22] but currently I don't see much of a reason to have them in a puppet array at all, so just having them as a static config file would be fine too [18:51:49] notpeter: I think you've got an old checkout; I added some filters to emery yesterday that you're missing. [18:52:11] maplebed: oooo, ok. I'll make sure to update [18:52:12] they're interesting because they need single quotes around part of teh command line (which means you'll have to be careful with the quotes around the whole line) [18:52:22] yeah [18:52:53] mark: well... hmmm... ideally those hashes could be used to make better check commands [18:53:06] but... I feel like that is outside the scope of this current rampage [18:53:14] I wonder why observium is doing ALTER TABLE all the time... [18:53:42] mark: I'm easy either way. what do you think is a better way to go? [18:53:53] if you have no reason to have that array now, just make static files [18:54:02] one per host/instance with a standard naming scheme [18:54:07] ok [18:54:10] puppet will certainly be faster that way ;-) [18:54:17] New review: Ryan Lane; "Seems I was confused and this isn't used for ssh." 
[operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4792 [18:54:19] if you ever decide you need arrays, you can always pull back this code [18:54:20] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4792 [18:55:02] also easier to edit for everyone, no stupid quoting needed [18:55:29] now you need quotes for puppet manifest language, then udpfilter config language, then bash, then awk/sed/whatever [18:55:35] multiple levels deep [18:55:38] it gets quite horrid ;) [18:55:48] fair [18:56:00] I've just gone mad with generating templates from hashes [18:56:15] oh you can do really cool stuff with that [18:56:22] so that's often useful [18:56:25] but not here (yet) [18:56:37] one area where I've recently done that is torrus [18:56:47] yeah, I've read through a lot of that [18:56:51] torrus has quite complicated XML files which i needed to edit every time we added/removed squids or varnish [18:56:56] New patchset: Demon; "(bug 35570) Make bug links clickable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4796 [18:57:09] with just a bit of templating that's now done fully automatically [18:57:12] Project does not begin with mediawiki/ [18:57:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4796 [18:57:13] (well, as long as torrus stays up ;) [18:58:42] heh [19:00:37] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4796 [19:00:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4796 [19:03:57] notpeter: yesterday I was pondering automatically setting up monitoring for all LVS services from the LVS configuration hashes [19:04:01] but meh [19:04:05] too much work ;) [19:05:08] until the next time you need to do it.... 
[19:05:18] at which point I'm sure you'll go to town on it [19:06:26] yes, usually I do that stuff when I need to touch it anyway and it annoys me [19:06:28] like NRPE [19:06:44] like udp2log [19:07:12] yes [19:07:26] I didn't even dare opening that manifest file anymore [19:07:41] New patchset: Pyoungmeister; "udp2log cleanup, new instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4782 [19:07:46] hah [19:08:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4782 [19:08:22] it seems like it only took me 6 hours to untangle. I'm sure you could have done it in 3 ;) [19:09:15] i'm now untangling your varnishncsa init script [19:09:41] perhaps I'll leave it as it's working, but it's itching [19:10:15] New patchset: Pyoungmeister; "udp2log cleanup, new instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4782 [19:10:18] at least it's no longer the version with puppet template logic mixed in with bash logic, that's something ;) [19:10:24] that was indeed, not one of my finer pieces of work [19:10:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4782 [19:10:50] New patchset: Pyoungmeister; "udp2log cleanup, new instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4782 [19:10:51] well I tried to replace it with a nice clean small upstart job [19:10:53] but that was fail [19:11:01] that sucked [19:11:06] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4782 [19:11:21] yeah. 
I was trying to do it with one puppet template [19:11:28] but asher convinced me to mix the logics [19:11:30] upstart has built in support for multiple instances of a program [19:11:36] but then you need to start the instances manually [19:11:40] oh, that's really sweet [19:11:42] and doing that with puppet was not so easy as it seemed [19:12:21] add in a couple of upstart bugs in lucid and some puppet annoyances and I got quite frustrated after hours of what should have been simple [19:12:42] New patchset: Pyoungmeister; "udp2log cleanup, new instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4782 [19:12:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4782 [19:13:05] bugs in upstart? [19:13:09] that's obnoxious [19:13:31] well yes, because you don't really want to roll a new backported version of 'init' across your cluster either ;) [19:13:59] upstart was just confused of all my different configuration attempts, after a few hours I rebooted the box and then it suddenly worked fine [19:15:53] it's mildly upsetting wen rebooting actually helps. *this is why I don't run windows...* [19:16:07] ok, you want to have one more look at my changes? I think I'm finally finished [19:16:10] ok [19:16:28] (editing in the terminal and a ide makes for missed saves...) [19:16:46] the system role [19:16:51] that's not really a mediawiki log server is it [19:17:22] demux.py and sqstat, I'm guessing those are just utilities used by some filters? I'd make a separate sub class for them [19:17:33] not every udp2log install will need them, and it's good to make that clear [19:17:48] ok [19:18:15] you have udp2log-${name} and udp2log_${name} [19:18:51] you should probably require udp2log::packages in ::instance [19:19:05] no need to depend on all the individual resources then [19:21:07] why is the log file called /var/log/squid/packet_loss.log? 
[19:21:10] it's a udp2log log file right [19:21:19] if it's just done because that's how it is now, add a FIXME: to change it later [19:21:24] it is, but this is to not mess with that's already there too much [19:21:28] kk [19:21:30] sure, just add a note [19:21:57] mark: +1 fixme. my guess on why - it's log content from the squids, as opposed to /var/log/udp2log/udp2log.log, which is log lines generated by udp2log itself. [19:21:59] after these changes, good enough I think [19:22:05] yeah I figured [19:22:24] why do we have udp2log instead of, say, syslog? [19:22:31] it's not for syslogging [19:22:43] it's for access logs by squids, varnish, nginx [19:22:43] what is it for? [19:22:47] in udp packets [19:22:49] home grown system [19:23:01] <^demon> Also mediawiki logging, right? [19:23:01] couldn't you redirect those to syslog? [19:23:01] udp packets with one or more log lines in them [19:23:09] that wasn't efficient enough [19:23:16] mediawiki uses it too now yeah [19:23:20] where was the bottleneck? 
[19:23:30] pipe buffers and stuff, years ago [19:23:39] http://www.squid-cache.org/mail-archive/squid-dev/200701/0042.html [19:23:40] sorry for the newbie questions, please bear with me :-) [19:23:43] I happen to have that tab open ;) [19:23:47] this is how it started [19:24:14] paravoid: it's quite a few hundreds of megabits of logging traffic [19:24:39] this patch is in squid upstream [19:25:04] we have a patch for varnishncsa as well, which isn't quite clean enough for upstream [19:25:12] I improved it a bit today [19:25:19] and there's an nginx patch as well [19:25:53] I was just wondering why this was needed [19:26:26] so tim wrote this as it was efficient and easy to deal with logging on the squids [19:26:32] and wrote a receiver program (udp2log) as well [19:26:45] that's what peter is now cleaning up [19:27:46] I wonder if this still stands nowadays, with more modern syslog implementations or whatnot [19:28:13] well it's nice to have access logging separate from normal syslogging anyway [19:28:18] request logging [19:29:35] and if it works I wouldn't dream to touch it anyway :-) [19:30:48] other thing we do with it is multicast it [19:31:03] just like our purge packets [19:32:01] the nginx code is still slightly broken [19:32:05] shit [19:32:15] I keep forgetting I need to deploy the fix that abe wrote [19:33:35] convert to git-buildpackage while you're at it ;) [19:33:54] is that patch upstreamable? [19:34:01] it's probably not very useful for other people [19:37:07] New patchset: Pyoungmeister; "udp2log cleanup, new instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4782 [19:37:15] * mark leaves quickly [19:37:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4782 [19:37:26] hahaha [19:40:14] alright, I think I shall go for it [19:41:06] mark: thumbs up? [19:44:08] Change abandoned: Hashar; "Following a chat with Ryan, this is unneeded. 
The correct repo is labs/private and not operations/pr..." [operations/private] (master) - https://gerrit.wikimedia.org/r/3899 [19:46:27] !log stopping puppet on locke and emery [19:46:29] Logged the message, notpeter [19:47:22] New review: Pyoungmeister; "at long last." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4782 [19:57:44] New patchset: Pyoungmeister; "Merge branch 'production' of ssh://gerrit.wikimedia.org:29418/operations/puppet into lucenehashin" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4802 [19:58:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4802 [20:06:22] Change abandoned: Pyoungmeister; "wtf" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4802 [20:35:10] New patchset: Demon; "Using wrong message, we don't want 'Lint check passed.'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4806 [20:35:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4806 [20:37:41] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4806 [20:37:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4806 [21:10:07] New patchset: Lcarr; "Switching new icinga(nagios) box to use naggen to generate conf files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4810 [21:10:23] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4810 [21:12:12] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4810 [21:12:14] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4810 [21:15:33] LeslieCarr: I've got an answer to your question [21:16:10] drdee: :) [21:17:14] aren't you curious/ [21:17:21] yes [21:17:23] ! [21:17:35] what percentage are from my badly constructed list of countries ? [21:17:36] :) [21:17:41] i shared a google doc with some specifics, but the answer is [21:17:53] 13.3B [21:18:12] yay :) [21:18:17] for the month of march [21:18:24] what's the total pageview # ? [21:18:29] man, we get a lot of pageviews [21:34:25] New patchset: Faidon; "Fix generate() calls for naggen" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4813 [21:34:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4813 [21:35:15] New review: Faidon; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4813 [21:35:18] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4813 [21:44:21] New patchset: Lcarr; "installing htop on all servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4814 [21:44:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4814 [21:45:04] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4814 [21:45:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4814 [21:55:57] New patchset: Lcarr; "Adding in naggen to /usr/local/bin on puppetmasters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4815 [21:56:13] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4815
[22:00:50] New patchset: Lcarr; "Adding in naggen to /usr/local/bin on puppetmasters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4815
[22:00:58] paravoid: want to check out above? ^^
[22:01:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4815
[22:08:01] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4815
[22:08:04] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4815
[22:13:22] doh, spence is dead ...
[22:17:39] New patchset: Lcarr; "no purge script needed!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4816
[22:17:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4816
[22:17:56] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4816
[22:17:59] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4816
[22:39:07] New patchset: Lcarr; "moving permission fixing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4818
[22:39:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4818
[22:44:56] New patchset: Lcarr; "fixing fundraising db nrpe checks" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4819
[22:45:12] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4819
[22:45:19] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4819
[22:45:34] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4818
[22:45:37] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4819
[22:45:38] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4818
[22:54:25] New patchset: Lcarr; "Fix double scoping of files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4821
[22:54:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4821
[22:55:02] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4821
[22:55:05] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4821
[23:11:35] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 0.824376299213
[23:12:11] RECOVERY - udp2log log age on emery is OK: OK: all log files active
[23:12:29] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[23:12:29] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[23:14:35] PROBLEM - MySQL Slave Delay on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:15:02] PROBLEM - MySQL Slave Running on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:15:02] PROBLEM - mysqld processes on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:15:29] PROBLEM - MySQL disk space on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:15:31] New patchset: Lcarr; "Revert "Fix double scoping of files"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4822
[23:15:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4822
[23:15:47] PROBLEM - MySQL Recent Restart on es1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:15:47] PROBLEM - jenkins_service_running on gilman is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:16:06] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4822
[23:16:09] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4822
[23:17:35] PROBLEM - SSH on es1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:19:35] New patchset: Lcarr; "fixing icinga files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4823
[23:19:51] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4823
[23:20:18] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4823
[23:20:21] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4823
[23:20:26] RECOVERY - SSH on es1004 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0)
[23:20:44] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[23:20:44] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[23:40:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:46:41] PROBLEM - SSH on ssl1004 is CRITICAL: Server answer:
[23:46:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.722 seconds
[23:53:35] PROBLEM - SSH on gilman is CRITICAL: Server answer:
[23:53:48] New patchset: Tim Starling; "set -e in scap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4827
[23:54:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4827
[23:55:00] New review: Tim Starling; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/4827
[23:55:03] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4827
[23:58:35] hey, is anyone checking out ssl1004 ?