[00:03:32] !log upgrading db1045 to precise for testing
[00:03:39] Logged the message, Master
[00:07:57] New patchset: Ryan Lane; "Fix nslcd config to only specify validnames for precise and above" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16461
[00:08:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16461
[00:09:04] New patchset: Ryan Lane; "Fix nslcd config to only specify validnames for precise and above" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16461
[00:09:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16461
[00:09:50] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16461
[00:10:30] Ryan_Lane: will that change ease booting servers?
[00:10:51] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 190 seconds
[00:11:46] binasher: :D
[00:17:09] PROBLEM - Host db1045 is DOWN: PING CRITICAL - Packet loss = 100%
[00:18:03] RECOVERY - Host db1045 is UP: PING OK - Packet loss = 0%, RTA = 26.89 ms
[00:23:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 22 seconds
[00:31:15] PROBLEM - Host virt1004 is DOWN: PING CRITICAL - Packet loss = 100%
[00:50:03] RECOVERY - Host virt1004 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms
[00:50:16] paravoid: got a sec for some puppet-on-labs questions?
[01:08:26] wow.. i in-place upgraded a lucid db to precise and it came up after first boot with the cfq io scheduler. i thought my mysql build must have been seriously screwed, the result was getting 1/10th the tps in sysbench benchmarks
[01:09:51] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 193 seconds
[01:13:18] PROBLEM - Host db1045 is DOWN: PING CRITICAL - Packet loss = 100%
[01:14:30] RECOVERY - Host db1045 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms
[01:15:37] domas: this one isn't stripped this time, heh
[01:22:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds
[01:32:30] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[01:41:12] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 205 seconds
[01:42:33] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 254 seconds
[01:49:18] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 657s
[01:52:09] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds
[01:54:51] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 2 seconds
[01:56:03] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 44s
[02:09:56] PROBLEM - LDAP on virt0 is CRITICAL: Connection refused
[02:43:50] PROBLEM - Puppet freshness on potassium is CRITICAL: Puppet has not run in the last 10 hours
[03:32:02] PROBLEM - Memcached on ms-fe1001 is CRITICAL: Connection refused
[03:33:05] PROBLEM - Swift HTTP on ms-fe1001 is CRITICAL: Connection refused
[04:44:52] New review: Tim Starling; "Wouldn't it be better as a non-redirect rewrite rule, like the rewrite rule for "/" directly above t..." [operations/apache-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/7772
[05:04:23] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[05:29:10] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[07:34:11] morning
[07:42:07] New patchset: Hashar; "redirect (302) /w/ to /w/index.php" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7772
[07:43:58] New review: Hashar; "Indeed it is better to redirect directly to the index :-]? I removed the [R=302,L] with patchset 3." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/7772
[08:02:18] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours
[08:05:18] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[09:03:56] Change merged: Nikerabbit; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16252
[09:24:27] morning
[09:38:54] it is afternoon already! :-P
[09:45:09] meh
[09:49:10] heh
[09:50:19] New patchset: Hashar; "redirect /w/ to /w/index.php" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7772
[09:50:46] New review: Hashar; "Patchset 4: removed the reference to 302 in commit message, that is no more the case since Patchset 3." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/7772
[09:51:57] paravoid: good afternoon :-) I got a few changes for you later on :-]
[09:52:09] sure :)
[09:52:35] and if you are in the mood, would love to migrate the labs out of the NFS instance to /data/project
[09:52:47] which will create havoc and madness everywhere :-]
[09:53:08] the very trivial changes are https://gerrit.wikimedia.org/r/#/c/14287/ (a basic comment)
[09:53:20] and there is a remote syslog change https://gerrit.wikimedia.org/r/#/c/14090/
[09:53:28] that got lost when test was merged in production
[11:33:48] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours
[11:52:02] New review: Hashar; "mediawiki::package is used by various class being applied on non apache host. For example misc-scrip..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/14162
[11:53:49] hashar: oh, here?
[11:54:02] yup finally managed to lunch :-]
[11:54:37] New review: Faidon; "Trivial enough." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/14287
[11:54:37] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14287
[11:54:56] \O/
[11:55:48] I don't understand the other commit
[11:55:52] 14090
[11:56:56] hm, wait
[11:57:41] * hashar reads 14090
[11:57:54] so on beta we need to have the logs send to a specific instance
[11:58:00] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14090
[11:58:02] in production we have a syslog server per site
[11:58:27] I'm not terribly excited at having if/then for all instance projects
[11:58:42] but we'll fix that when more projects want to add themselves there
[11:58:49] plus, that was already there :)
[11:59:01] yeah I think I did that change before you came in
[11:59:08] and instructed to use dedicated classes
[11:59:13] err: Could not retrieve catalog from remote server: Error 400 on SERVER: Cannot reassign variable syslog_remote_server at /var/lib/git/operations/puppet/manifests/base.pp:316
[11:59:17] oops
[11:59:21] ideally there should be a base::configuration::labs or something to let us configure
[11:59:24] grmblblb
[11:59:48] oh right, I should have seen that
[11:59:49] silly me
[12:00:28] tsk tsk tsk
[12:00:46] mark: I deserved that :-)
[12:01:13] for the record, this:
[12:01:15] class base::remote-syslog {
[12:01:15] if ($::lsbdistid == "Ubuntu") and ($::hostname != "nfs1") and ($::hostname != "nfs2") {
[12:01:18] package { rsyslog:
[12:01:20] is ḣorrible.
[12:01:24] but not the time
[12:01:32] very horrible
[12:01:33] mark: did you see my varnish module?
[12:01:40] yes
[12:01:52] you know, I hate the fact that we have to split up every class into a separate file
[12:01:58] New patchset: Hashar; "armor an escape to have it ignored by puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16487
[12:02:30] it's how the autoloader works
[12:02:35] i know
[12:02:36] maybe we could have some kind of generic configuration module
[12:02:37] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16487
[12:02:41] it's not a style thing I mean
[12:02:46] but I hate it
[12:02:49] and then have a configuration-prod and another configuration-labs-beta that would extend it ?
[12:03:38] sounds like the labs LDAP has an issue : Error 400 on SERVER: Failed when searching for node i-0000031a.pmtpa.wmflabs: LDAP Search failed
[12:03:38] that's the idea of realm
[12:03:53] oohhhh
[12:03:56] New patchset: Faidon; "Fix reassignment of $syslog_remote_server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16488
[12:03:57] but it's not quite done yet/done well
[12:04:12] so that $syslog_remote_server could have landed in realm.pp :(
[12:04:16] I keep forgetting about that one
[12:04:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16488
[12:04:58] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16488
[12:05:53] argh
[12:05:58] paravoid: I don't understand that change :/
[12:06:12] apparently that is the if( $::syslog_remote_server == '' ) {
[12:06:13] which is faulty
[12:06:25] yeah, I deserve another tsk tsk tsk
[12:06:27] should just check the local $syslog_remote_server has been set
[12:06:57] or we could roll out that change and insert it in realm.pp
[12:08:36] i've been thinking of giving class base a lot of parameters with defaults
[12:08:41] with the defaults working for 99% of servers
[12:08:57] and passing a parameter (like, 'no_rsyslog') for when you want something else
[12:09:05] i can't really think of a better way to do it, globals suck
[12:09:29] and you can't set default parameters for parameterized classes like you can with definitions :(
[12:10:22] ?
[12:10:24] of course you can
[12:10:27] no
[12:10:37] I mean like Resource { param => default_value }
[12:10:48] that would have been handy here
[12:10:51] I don't understand
[12:10:57] nevermind
[12:11:11] well
[12:11:14] say, in realm.pp
[12:11:15] class webserver( $vhost_dir = '/etc/httpd/conf.d', $packages = 'httpd' ) {} ? (form puppet doc
[12:11:17] you can have class foo::bar($port='80') { ... }
[12:11:18] I would have liked to be able to set
[12:11:20] yes yes
[12:11:33] please continue
[12:11:39] Base::Syslog { server => "whatever" }
[12:11:57] so that becomes the default value unless overridden for a specific server
[12:12:00] ah
[12:12:05] have you seen hiera?
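Editor's sketch of the "class base with parameters and defaults" idea discussed above: a class whose defaults suit most servers, with a per-node override where something else is wanted. This is only an illustration under stated assumptions. The class name `base::remote_syslog`, its parameters `enable` and `server`, and the `example.org` hostnames are hypothetical and are not taken from operations/puppet. As noted in the conversation, Puppet has no `Base::Syslog { server => ... }` style defaults for classes, so the default has to live in the class signature.

```puppet
# Hypothetical names throughout; not the actual operations/puppet code.
class base::remote_syslog(
    $enable = true,
    $server = 'syslog.example.org'
) {
    # Hosts that opt out pass enable => false instead of being
    # listed by hostname inside the class body.
    if $enable {
        package { 'rsyslog':
            ensure => latest,
        }
    }
}

# Most nodes simply take the defaults.
node 'web1.example.org' {
    include base::remote_syslog
}

# An exception overrides only the parameter that differs.
node 'nfs1.example.org' {
    class { 'base::remote_syslog':
        enable => false,
    }
}
```

The trade-off is the one raised in the discussion: the exception is declared at the node rather than hard-coded as a hostname comparison in the class, but each realm still cannot supply its own default without an external lookup.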
[12:12:08] no
[12:12:14] I've never worked with it, but it's supposed to help with this problem
[12:12:21] let's look at it
[12:12:23] and has been merged into puppet 3.0
[12:12:25] this would have bee nice
[12:12:29] different defaults for different realms
[12:12:48] http://projects.puppetlabs.com/projects/hiera/
[12:12:51] oh that
[12:12:52] Change abandoned: Reedy; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/14162
[12:13:26] I haven't tried it, so I don't have an opinion yet
[12:13:41] but looking at its description, it sounds a lot like what you're talking about
[12:13:46] i hate how they hide the open source puppet version so much on the web site
[12:13:59] like mysql :P
[12:14:47] let's buy Puppet Enterprise!
[12:14:48] * paravoid ducks
[12:15:02] you will get a graphical interface!!!
[12:15:12] just as slow, just much more enterprisy and expensive!
[12:15:54] because you know how much I loooove enterprisy stuff
[12:16:16] mark: I got a pending change to varnish/bits.inc.vcl.erb and your last comment confuse me ( https://gerrit.wikimedia.org/r/#/c/15445/ )
[12:17:17] * hashar grabs a coffee
[12:17:19] hashar: you're referring to a variable not even local to the template location in the manifest
[12:17:30] $test_wikipedia
[12:20:47] puppet 3.0.0rc3 is current
[12:20:57] we should test it in labs
[12:25:09] "Our customers manage millions of nodes with Puppet"
[12:25:16] and half a million puppetmasters to serve them...
[12:25:25] hahahahahahahaha
[12:26:11] Does it need some version of ruby that hasn't been packaged?
[12:36:44] mark: I understood why $test_wikipedia is wrong :-D it is not really there after all
[12:36:53] RECOVERY - LDAP on virt0 is OK: TCP OK - 0.001 second response time on port 389
[12:36:58] so
[12:37:06] mark: would you mind implementing the $cluster_options hash to varnish class ? :-
[12:37:07] you should introduce a new parameter to the varnish::instance class
[12:37:24] where we can pass arbitrary things to the manifests
[12:37:28] depending on the cluster
[12:37:39] so each varnish include file can pull from it what it needs
[12:37:55] lets try :-)
[12:37:57] ok
[12:38:14] PROBLEM - Puppet freshness on srv198 is CRITICAL: Puppet has not run in the last 10 hours
[12:45:17] PROBLEM - Puppet freshness on potassium is CRITICAL: Puppet has not run in the last 10 hours
[12:46:12] I'm fixing puppet now btw
[12:46:15] haven't forgot about it
[12:50:06] New patchset: Faidon; "syslog: do not use a localhost remote" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16495
[12:50:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16495
[12:51:01] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16495
[12:53:37] argh
[12:53:42] New patchset: Faidon; "Another syslog fix: use double quotes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16496
[12:54:06] come on gerrit2
[12:54:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16496
[12:54:29] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16496
[12:55:56] finally fixed
[12:57:18] New patchset: Faidon; "armor an escape to have it ignored by puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16487
[12:57:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16487
[12:58:02] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16487
[13:04:13] New patchset: Faidon; "Scope with root all $lsb* fact variables" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16497
[13:04:31] anyone wants to very quickly review this just in case?
[13:04:43] it's a simple sed, but it touches stuff all over the tree
[13:04:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16497
[13:08:11] New patchset: Alex Monk; "(bug 32516) Enable Narayam on pawiki." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16498
[13:08:18] New patchset: Hashar; "varnish::instance learned 'cluster_options'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16499
[13:08:54] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16499
[13:09:01] mark: I have added cluster_options to varnish::instance with https://gerrit.wikimedia.org/r/#/c/16499/ and migrated the enable_geoiplookup to use it
[13:09:21] New review: Hashar; "I have added cluster_options to varnish::instance with https://gerrit.wikimedia.org/r/#/c/16499/ ..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15445
[13:14:31] New patchset: J; "migrate wikimedia-job-runner to puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16501
[13:15:03] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/16501
[13:15:32] who's J?
[13:23:41] paravoid: ok
[13:25:03] mark: ?
[13:26:05] j^: ohhi :) did you see that your change has syntax errors?
[13:26:10] who the fuck merged manifests/timedmediahandler
[13:26:21] mark: was wondering that myself :)
[13:26:53] j^: you can fix it and amend your change
[13:27:09] j^: there's no need for a "revert", since the changeset was never merged
[13:28:07] hashar: do you more about timedmediahandler.pp?
[13:28:16] hashar: I remember you talking about it, no?
[13:28:35] j^: you mean "git rm manifests/timedmediahandler.pp"?
[13:28:42] i am doing that now
[13:28:46] it's not referenced anywhere in production, but I think it's used somewhere in labs
[13:29:49] mark: merge the $::lsb first? :)
[13:29:59] yeah you do that
[13:30:27] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16497
[13:30:46] paravoid: yeah what j^ said :-) he is TMH guru ;)
[13:30:56] ah, didn't know that
[13:30:57] New patchset: Mark Bergsma; "Remove timedmediahandler.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16503
[13:31:01] j^: have you managed to use puppetmaster::self on the video01 instance ?
[13:31:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16503
[13:32:16] New patchset: Mark Bergsma; "Remove timedmediahandler.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16504
[13:32:52] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16503
[13:32:52] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16504
[13:33:19] mark: merged on sockpuppet
[13:33:22] tnx
[13:33:59] mark: could you look at the varnish::instance( $cluster_options ) stuff https://gerrit.wikimedia.org/r/#/c/16499/ ? :-)
[13:34:05] yeah
[13:34:37] j^: shoot if you need any help with your change(s)
[13:34:51] New patchset: J; "migrate wikimedia-job-runner to puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16501
[13:35:05] oi, changes to varnish
[13:35:08] there goes my module stuff
[13:35:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16501
[13:36:18] New review: Mark Bergsma; "$cluster_options needs to be initialized to an empty hash by default." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/16499
[13:36:35] not quite yet ;)
[13:37:52] hmm apparently ubuntu wants to pass "elevator=deadline" to the Linux Kernel.
[13:39:15] I should relocate to The Netherlands
[13:39:27] so I could get you to look over my shoulder while I code :-]
[13:40:49] New review: Faidon; "Global variables (such as those $jobrunner_* variables) are deprecated, parameterized classes are th..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/16501
[13:44:16] mark: 6ee375 was merged when I was on vac, but I see it still doesn't work
[13:44:27] I can see at least one problem
[13:44:57] that lvs1/2 & lvs1001/1002 were made aggregators but have no $ganglia_aggregator=true
[13:45:00] which I'll fix
[13:45:04] but do you see any other problems?
[13:45:33] sure
[13:45:39] the lvs servers are not actually put in the lvs group
[13:45:48] they have $cluster = "misc" set in their node entries
[13:45:55] hahaha
[13:46:02] okay, not I feel completely stupid :)
[13:47:46] anything else?
[13:50:21] New review: Hashar; "> $cluster_options needs to be initialized to an empty hash by default." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16499
[13:50:21] I think it should work then
[13:50:26] but i haven't checked in detail
[13:50:50] New patchset: Hashar; "varnish::instance learned 'cluster_options'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16499
[13:51:13] great, thanks a lot
[13:51:25] New review: Hashar; "Patchset 2 let $cluster_options parameter an empty hash, default value is determined directly in the..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16499
[13:51:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16499
[13:51:33] mark: 16499 ^^^ :)
[13:52:53] read again hashar
[13:52:56] you didn't fix most of it
[13:53:34] so I guess I simply don't understand your comment
[13:53:37] :(
[13:53:42] hmm you did
[13:53:53] well, for the non bits clusters, don't include the enable_geoiplookup
[13:54:04] I guess you did fix most of it and forgot that one ;)
[13:54:26] ohhh
[13:54:27] and the blog one
[13:54:41] so the reason I have explicitly defined them was because they were defined previously
[13:54:52] ahhh
[13:54:54] yeah
[13:54:56] no longer necessary
[13:55:09] so that is merely cleanup ? :à
[13:55:13] yes
[13:55:29] this is good otherwise
[13:55:44] I tend to be as little disruptive as possible when doing the changes
[13:55:48] New review: J; "Will update to use Parameterized Class and amend." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16501
[13:56:00] * hashar amends
[13:56:14] you can skip the $varnish_cluster_options step
[13:56:20] you can refer to the parameter directly
[13:56:23] I know the rest doesn't do that
[13:56:28] but there's not really any good reason for that anymore
[13:57:31] New patchset: Faidon; "Actually move LVS to their own Ganglia group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16506
[13:58:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16506
[13:58:14] ok
[13:58:18] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16506
[13:58:38] haha
[13:58:40] paravoid
[13:58:42] $::hostname ;-)
[13:58:42] yes?
[13:58:46] hahaha
[13:59:07] to my defence, I just copied it from somewhere above
[13:59:14] yeah sure
[13:59:15] in my defence even
[13:59:24] it's wrong in tons of places
[13:59:31] but it's funny since you just did a large sed for that
[13:59:43] yeah :)
[14:01:26] New patchset: Hashar; "varnish::instance learned 'cluster_options'" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16499
[14:02:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16499
[14:02:45] New review: Hashar; "Patchset3:" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16499
[14:02:51] mark: updated :)
[14:04:25] New review: Mark Bergsma; "Good work hashar. :)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/16499
[14:04:25] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16499
[14:04:49] I can imagine mark throwing hashar a cookie :)
[14:05:04] \O/
[14:05:15] so now
[14:05:21] I have to figure out why I did that change :-]
[14:05:40] ohhh
[14:05:48] to rebase another change on top of it of course
[14:05:54] * hashar goes back to git rebase
[14:06:37] paravoid: whenever you can, could you review the beta moves from its nfs instance to the shared /data/project https://gerrit.wikimedia.org/r/#/c/15545/
[14:06:41] !log force puppet run on all LVS servers for the ganglia change
[14:06:50] Logged the message, Master
[14:07:07] mark: so, pmtpa & eqiad seems to work already; esams isn't, do we have any weird multicast setup or anything?
[14:07:12] uhoh bad
[14:07:19] it just removed geoip on one of the bits servers
[14:07:44] - -P ${PIDFILE} ${DAEMON_OPTS} -p cc_command="${CC_COMMAND}" > ${output} 2>&1; then
[14:07:44] + -P ${PIDFILE} ${DAEMON_OPTS} > ${output} 2>&1; then
[14:08:18] ah
[14:08:24] we missed one variable lookup
[14:08:26] i'll fix that
[14:09:49] New patchset: Mark Bergsma; "Replace remaining $enable_geiplookup variable" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16507
[14:10:22] mark: sorry about that :/
[14:10:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16507
[14:10:31] n/p
[14:10:40] * hashar gives back the cookie
[14:10:45] no I missed it too
[14:10:48] mark: so, how's the multicast group work across the atlantic?
[14:10:57] do we have any special relays or something?
[14:11:00] paravoid: it doesn't
[14:11:10] the ganglia server contacts over unicast
[14:11:12] the aggregators
[14:11:16] aha
[14:11:37] so esams has its own multicast group and it doesn't need to extend to the US
[14:12:16] although we're setting up tunnels for it, so we can soon
[14:12:54] * paravoid scratches head
[14:13:02] I wonder why it doesn't work for esams
[14:14:52] ah! found it
[14:16:00] shouldn't have blamed multicast, but can you blame me? :)
[14:16:15] New patchset: Faidon; "Fix hostnames for esams LVS Ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16508
[14:16:52] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16508
[14:16:59] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16508
[14:17:10] hehe
[14:18:04] so I think the current way is a bit silly
[14:18:17] we should have a list in puppet of aggregators/clusters
[14:18:35] so it can determine itself whether a host is aggregator or not
[14:18:44] note that the amslvs2 hostname is still wrong
[14:18:47] missing an 's'
[14:19:20] ...
[14:19:27] I think I should stop doing puppet changes for today
[14:19:45] don't worry
[14:19:53] by comparison, some people should never be doing puppet changes then
[14:20:40] New patchset: Faidon; "Fix another hostname typo in Ganglia" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16509
[14:21:07] brb
[14:21:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16509
[14:21:29] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16509
[14:22:35] I find the repetition a bit silly as well
[14:22:49] we tell puppet the aggregators twice
[14:29:45] New patchset: Hashar; "varnish config for bits.beta.wmflabs.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/13304
[14:30:21] New patchset: Hashar; "Fix bits VLC when enable_geoiplookup is disabled" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15445
[14:30:38] Logged the message, Master
[14:30:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/13304
[14:30:56] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15445
[14:31:21] New review: Hashar; "Patchset 4 rebase change to use the new $cluster_options parameter in varnish::instance. Explicitly ..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15445
[14:32:04] New review: Hashar; "Patchset 11 is a rebase." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/13304
[14:32:48] mark: moaar varnish change https://gerrit.wikimedia.org/r/#/c/15445/4 now reusing the $cluster_options to check if there is the site has a test host
[14:36:17] PROBLEM - MySQL Slave Delay on db12 is CRITICAL: CRIT replication delay 210 seconds
[14:37:38] RECOVERY - MySQL Slave Delay on db12 is OK: OK replication delay 0 seconds
[15:05:32] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[15:23:41] hashar: I forgot about beta/NFS again
[15:23:43] duh
[15:23:51] ;-]
[15:23:54] looking now
[15:23:59] sorry :/
[15:24:21] is it tested in labs?
[15:24:21] that is ok, aren't we all overwhelmed anyway ? :-]
[15:26:00] » » » device => "ms1.wikimedia.org:/export/upload",
[15:26:03] eh?
[15:26:14] ah, ensure absent
[15:27:08] hashar: has it been tested in labs?
[15:27:39] hashar: why create upload5? it wasn't before and it's not being used
[15:28:34] well it was there before :-D
[15:28:37] https://gerrit.wikimedia.org/r/#/c/15545/6/manifests/nfs.pp,unified
[15:28:39] line 145
[15:29:01] I just moved the mount {} part for production
[15:29:22] ahh
[15:29:25] the file creation sorry
[15:29:27] removing it
[15:29:45] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[15:31:08] New patchset: Hashar; "(bug 38084) uses /data/project instead of NFS instance" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/15545
[15:32:26] New review: Hashar; "Patchset 7 removes the /mnt/upload5 creation (that is ensure => absent)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15545
[15:32:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/15545
[15:36:27] paravoid: I can't remember if I tested it on labs let me redo it
[15:37:26] paravoid: I did on the deployment-integration instance i-0000034a
[15:37:34] rerunning puppet there
[15:40:49] New review: Hashar; "Tested Patchset 7 on depoyment-integration instance which uses both nfs::upload::labs and nfs::apac..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/15545
[15:50:46] New patchset: Hashar; "mail web-ui htdigest is only for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16514
[15:51:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16514
[15:51:25] New patchset: Hashar; "Making the htaccess production only" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16514
[15:52:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16514
[15:52:53] New patchset: Hashar; "Making the htaccess production only" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16514
[15:53:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16514
[15:53:36] New patchset: Hashar; "mail web-ui htdigest is only for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16514
[15:54:11] Change abandoned: Hashar; "such a mess. Will reapply cleanly." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16514
[15:54:21] New patchset: Hashar; "mail web-ui htdigest is only for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16515
[15:54:54] New review: Hashar; "Clean change is https://gerrit.wikimedia.org/r/16515" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16514
[15:54:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16515
[15:55:05] that is a bit messy :-D
[15:55:20] anyway you could abandon https://gerrit.wikimedia.org/r/#/c/6541/ I ported it from test branch to production branch with 16515
[16:08:27] !log srv278 hardware worked on by chris, placing back in service to see if its going to stay fixed
[16:08:36] Logged the message, RobH
[16:09:04] paravoid: i am out let s peak the nfs stuff tomorrow morning if that works for you
[16:10:15] hashar: yeah, sorry, had to do something else :/
[16:10:30] paravoid: I understand :-}
[16:10:35] dont worry!
[16:23:59] New patchset: Mark Bergsma; "Fix hostnames for ms-be1001-1005" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16517
[16:24:36] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16517
[16:40:36] New review: Mark Bergsma; "Don't you need to pass the test hostname to the template, or something? if so, that can be used to d..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/15445
[16:53:22] PROBLEM - Host analytics1001 is DOWN: PING CRITICAL - Packet loss = 100%
[16:56:01] New patchset: Matthias Mullie; "lower AFTv4 odds to display AFTv5 at 10% (inverse odds)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16521
[16:56:08] Change abandoned: Matthias Mullie; "will push new commit" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/14044
[17:04:03] New patchset: Bhartshorne; "disabling eqiad swift cluster stats until the cluster is functional." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16522
[17:04:41] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16522
[17:10:52] hi LeslieCarr, you there?
[17:11:53] hey ottomata i'm here now
[17:12:32] heya
[17:12:53] so, i recently rebooted analytics1001.wikimedia.org…and um, now I can't reach it anymore
[17:12:53] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16521
[17:13:03] I don't have any super powers, can you help?
[17:13:15] * Damianz hands ottomata some magic fairy dust
[17:13:36] * LeslieCarr steals the fairy dust
[17:13:40] let me check it out
[17:13:43] danke
[17:14:35] so, will the magic fairy dust make frank schleck not have doped ?
[17:15:14] ha
[17:15:21] shame on him
[17:15:31] make his doped self go fix the server
[17:15:57] good idea
[17:16:02] wierd, looks stuck on pxe boot ....
[17:16:09] let me try power cycling it
[17:24:11] any luck?
[17:25:32] oo, The disk drive for /var/lib/hadoop-0.20/name is not ready yet or not present
[17:25:39] skipped
[17:26:30] hmm
[17:26:32] ok that's fine
[17:26:34] weird, but ok
[17:26:39] shoudl boot beyond that, right?
[17:27:52] yeah…. theoretically
[17:28:09] it's just repeating "Ubuntu 10.04. . . .Ubuntu 10.04. . . ." right now ....
[17:28:23] maybe give it a minute … haven't seen that before [17:28:36] then again, haven't touched the ciscos much -- RobH or notpeter -- have you seen this before in ciscos ? [17:28:53] on boot? [17:28:57] i have not seen that [17:29:29] yeah, it went through booting, had a disk error, and then this [17:32:23] vit1004? [17:32:29] virt even, its a cisco with bad disk [17:32:32] but duno which your usin [17:33:18] if it says that on boot, it sounds it's pxe booting and something's wrong [17:35:40] analytics1001 robh [17:35:47] oh it's trying to pxe boot ? [17:35:49] hrmm, odd. [17:46:15] hmmmmm oooooooddddddoooooo [17:46:18] any updates? [17:51:18] nope, so i am guessing it should not be attempting to pxe boot … ? [17:51:27] maybe i can switch and re-reboot [18:02:59] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours [18:04:11] lesliecarr, any luck with switch and rereboot? [18:04:16] ottomata: it's slowly rebooting... [18:04:23] hmm, ok [18:04:31] so, the reason I rebooted [18:04:39] is I was noticing really strange behavior with hdfs stuff [18:05:20] hdfs file ops would take forever to run [18:05:32] New patchset: MaxSem; "Wiki Loves Monuments (RT#3221) definitions, version 0.0.1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16530 [18:05:37] i had been messing with iptables, but everything seemed fine [18:05:43] i had turned iptables off, but the problem persisted [18:05:59] so I decided to reboot and see if the problem still happened [18:05:59] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [18:06:07] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16530 [18:06:55] there's your problem, you iptabled the bios [18:06:56] ;) [18:07:03] skipping mounting again... 
[18:07:20] New patchset: MaxSem; "Wiki Loves Monuments (RT#3221) definitions, version 0.0.1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16530 [18:07:52] haha [18:07:53] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/16530 [18:09:18] ooo [18:09:20] login prompt [18:09:39] yay ottomata [18:09:54] trying... [18:09:59] New patchset: Pyoungmeister; "apache overhaul: round three of responding to mark's comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16532 [18:10:31] hm, can't connect [18:10:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16532 [18:10:36] can't ping either [18:10:37] wow it's messed up, fyi [18:10:41] i just ssh'ed in [18:10:46] but it can't ping anything [18:11:04] hm [18:11:20] are my iptables rules there? [18:11:21] i don't see any drops in iptables [18:11:22] iptables-save [18:11:29] ok phew, there really weren't many anyway [18:11:33] i only dropped by default INPUT [18:11:39] and allowed a few (including ssh) [18:11:41] yeah [18:11:44] but they aren't running anyway [18:11:49] sooo, more weird [18:12:13] New review: preilly; "Syntax error at 'bash'; expected '}' at ./manifests/misc/wlm.pp:16 err: Try 'puppet help parser vali..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/16530 [18:13:12] New patchset: MaxSem; "Wiki Loves Monuments (RT#3221) definitions, version 0.0.1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16530 [18:13:47] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16530 [18:14:08] New patchset: Andrew Bogott; "Temporarily repartitioning virt1001-9 for ceph testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16533 [18:14:17] can someone look at https://gerrit.wikimedia.org/r/16530 [18:14:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16533 [18:16:15] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16533 [18:16:33] ottomata: dude it's all sorts of insane [18:16:41] it shows eth0 but thinks it's unconfigured [18:17:24] hmmmmmm, so if you remember, this machine was given a public IP after being originally assigned a private one [18:17:30] not sure if that is related [18:17:38] but maybe there are crazy configs that were left around from that [18:17:44] and when it booted it got all confused [18:17:46] maybe? [18:17:48] iunno [18:18:38] ottomata: hrm [18:18:54] actually, it looks like it thinks its gateway and network identifier are impossible numbers [18:18:58] dunno how that could happen [18:19:05] let me fix that and see if everything gets fixed up [18:21:20] yeah, it's like it thought it had a /24 and a /26 at the same time [18:21:28] there we go [18:21:29] can you ssh [18:21:51] yes i am in! [18:22:01] weird [18:22:13] RECOVERY - Host analytics1001 is UP: PING OK - Packet loss = 0%, RTA = 35.74 ms [18:22:18] just for curiosity sake, what file did you find that in? [18:22:58] PROBLEM - Host virt1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:23:00] New patchset: Bhartshorne; "adding DHCP entries (and placeholders) for new eqiad swift backend hosts." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/16536 [18:23:07] PROBLEM - swift-container-updater on ms-be1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [18:23:07] PROBLEM - swift-object-updater on ms-be1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [18:23:26] /etc/network/interfaces [18:23:34] well i googled the error i got when i did ifup :) [18:23:35] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16536 [18:23:45] adn they suggested checking out that [18:23:52] PROBLEM - swift-object-replicator on ms-be1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [18:23:52] PROBLEM - swift-container-replicator on ms-be1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:23:57] how did we find out how to fix problems before google ? [18:24:13] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16536 [18:24:19] PROBLEM - swift-account-replicator on ms-be1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [18:24:29] New patchset: Ryan Lane; "Adjusting validnames regex to match - properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16537 [18:24:55] PROBLEM - Host virt1007 is DOWN: PING CRITICAL - Packet loss = 100% [18:25:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16537 [18:25:04] PROBLEM - Host virt1008 is DOWN: PING CRITICAL - Packet loss = 100% [18:25:53] yea right, i don't know how anyone did any of this stuff without google [18:25:54] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16537 [18:26:11] so what did you change [18:26:12] the netmask? [18:27:27] New review: preilly; "Have you run grep and looked for apache_site ?" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/16530 [18:28:31] RECOVERY - Host virt1005 is UP: PING OK - Packet loss = 0%, RTA = 35.41 ms [18:30:28] RECOVERY - Host virt1007 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [18:30:37] RECOVERY - Host virt1008 is UP: PING OK - Packet loss = 0%, RTA = 35.38 ms [18:32:25] PROBLEM - SSH on virt1005 is CRITICAL: Connection refused [18:33:55] PROBLEM - SSH on virt1008 is CRITICAL: Connection refused [18:34:31] PROBLEM - SSH on virt1007 is CRITICAL: Connection refused [18:36:23] New patchset: J; "migrate wikimedia-job-runner to puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16501 [18:36:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16501 [18:42:51] New patchset: J; "migrate wikimedia-job-runner to puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16501 [18:43:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16501 [18:45:10] RECOVERY - Host ms-be10 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [18:49:40] PROBLEM - NTP on virt1005 is CRITICAL: NTP CRITICAL: No response from NTP server [18:50:52] PROBLEM - NTP on virt1008 is CRITICAL: NTP CRITICAL: No response from NTP server [18:51:28] PROBLEM - NTP on virt1007 is CRITICAL: NTP CRITICAL: No response from NTP server [18:57:35] j^: here? 
[18:59:03] New review: Faidon; "As I said to my previous review:" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/16501 [18:59:25] j^: I posted it to gerrit, I was only looking for you to make this a bit more interactive :) [18:59:38] awjr_lunch: you were looking for me yesterday? [19:05:10] j^: I am :) [19:05:17] heh, sorry [19:05:25] I say so in my ircname though [19:05:36] /whois or /wii paravoid [19:07:41] phew, LeslieCarr, I think whatever you fixed also fixed my hdfs problem [19:07:45] things look much better now [19:07:49] i think networking was all screwy there [19:08:00] even before reboot [19:09:10] PROBLEM - Host virt1005 is DOWN: PING CRITICAL - Packet loss = 100% [19:09:24] oh cool [19:09:36] yeah, it would make sense if the networking got messed up, the reboot would have triggered it [19:09:44] since i am assuming you don't often do an ifup/down ;) [19:10:29] heh, aye [19:11:32] New review: J; "> 1) Do we really need jobrunner::packages and jobrunner::files? Isn't a "jobrunner" class enough?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/16501 [19:14:43] RECOVERY - Host virt1005 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [19:22:36] New patchset: Asher; "precise mysql pkgs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16538 [19:22:45] j^: I have seen one of your change about migrating the job scripts to puppet :-]]]]]]]]]]]]]]]]]]]]]]]]]] [19:22:45] cmjohnson1: sure, checking it out now [19:23:15] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." 
[operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/16538 [19:23:34] doh [19:26:13] PROBLEM - swift-container-replicator on ms-be1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [19:26:13] PROBLEM - swift-object-replicator on ms-be1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:26:40] PROBLEM - swift-account-replicator on ms-be1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:26:58] PROBLEM - swift-container-updater on ms-be1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:26:58] PROBLEM - swift-object-updater on ms-be1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [19:28:27] New patchset: Asher; "precise mysql pkgs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16538 [19:29:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16538 [19:30:51] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16538 [19:36:20] j^: it's a bit late for me (22:35) to continue the review; if you're in a hurry, maybe you should ask one of the SF folks [19:39:33] so many j names [19:40:05] New patchset: Asher; "fixing mysqlatfacebook package names for precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16543 [19:40:45] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16543 [19:41:00] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16543 [19:41:41] binasher: if you'd like a review on these packaging changes, I'm available -- I'd prefer doing it tomorrow though and that's probably too late for you [19:43:37] provide feedback whenever [19:46:28] PROBLEM - Host virt1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:46:28] PROBLEM - Host virt1005 is DOWN: PING CRITICAL - Packet loss = 100% [19:50:20] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16417 [19:51:07] RECOVERY - SSH on virt1007 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:51:16] RECOVERY - Host virt1007 is UP: PING OK - Packet loss = 0%, RTA = 35.43 ms [19:51:34] RECOVERY - SSH on virt1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:51:55] New patchset: Andrew Bogott; "Revert "Temporarily repartitioning virt1001-9 for ceph testing"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16544 [19:52:01] RECOVERY - Host virt1005 is UP: PING OK - Packet loss = 0%, RTA = 35.44 ms [19:52:32] Change abandoned: Andrew Bogott; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16544 [20:10:13] !log upgrading versions of xtrabackup and percona toolkit on all coredbs [20:10:23] Logged the message, Master [20:10:45] New patchset: Andrew Bogott; "Revert "Temporarily repartitioning virt1001-9 for ceph testing"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16546 [20:11:20] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16546 [20:11:40] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16546 [20:13:20] Ryan_Lane: SPDY at Facebook: http://lists.w3.org/Archives/Public/ietf-http-wg/2012JulSep/0251.html [20:13:40] and whoever else might be interested in HTTP/2.0 or SPDY :) [20:14:16] how are they implementing it, though? [20:14:18] New patchset: Asher; "xtrabackup is now percona-xtrabackup, precise hosts don't need a separate mysql repo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16547 [20:14:43] that info would be nice to have :) [20:14:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16547 [20:15:08] binasher: want me to merge your pending changes on sockpuppet? (I'm guessing these are yours.) [20:15:46] what are they? [20:15:51] hm. I have keystone at least partially working [20:15:53] I like their analysis [20:16:03] er? [20:16:13] did the openstack upgrade take priority over the migrations? [20:16:24] yes [20:16:27] why? [20:16:46] because we've started hitting bugs in our version of nova [20:16:56] but we've stared the migration process already [20:16:56] andrewbogott: ? [20:17:05] yeah, that's fine [20:17:16] we were waiting for hardware for ages, did all this work and then put it on hold? [20:17:24] the provisioning new nodes work etc. [20:17:24] binasher: udp2log-aft [20:17:25] who says it needs to be put on hold? [20:17:33] andrewbogott: not mine [20:17:35] binasher: Sorry, suffering from horrible net lag today [20:17:43] well, you said the upgrade took priority [20:18:02] I think I need to shift my priority to it [20:18:10] not necessarily everyones [20:18:22] I don't think we should make two migrations at the same time for starters [20:18:31] ? 
[20:18:39] and also, my impression from our mails was that we were going to share the load of the migrations [20:18:49] I have work that's blockers for upgrade [20:19:04] and I was waiting for you to make the first migration and send me the -possibly scripted- steps [20:19:04] I have to get keystone working and I have to modify openstackmanager [20:19:09] * Ryan_Lane nods [20:19:19] such that atm I can't really look at the patch [20:19:19] let me attempt one [20:19:23] and script it [20:19:30] so, I certainly wasn't working on the hardware migrations these days, because I was waiting for you [20:19:37] * Ryan_Lane nods [20:19:55] andrewbogott: I +2d that [20:19:56] andrewbogott: looks like that was hashar's change and notpeter merged it [20:19:57] it can go out [20:19:58] We just need 2 of ryan, who's get a chain saw [20:20:00] not that I didn't enjoy what I did, but I surely would like to finish with this thing :) [20:20:09] heh [20:20:14] * Ryan_Lane nods [20:20:19] let's get that going [20:20:24] the migrations can take a while [20:20:28] right [20:20:42] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16547 [20:20:47] so, if you occasionally do some migrations, and I occasionally do some while working on the upgrade stuff [20:20:50] they'll eventually get done [20:21:04] notpeter: So you want me to merge on sockpuppet, or are those changes suspect somehow? [20:21:11] oh, you already answered that :) sorry [20:21:25] andrewbogott: merge of sockpuppet is fine. I just forgot to.... [20:21:26] sorry! [20:21:29] i merged them [20:21:46] binasher: NNNNNNNNNNNNOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO [20:22:01] there goes the site [20:22:08] binasher: thanks [20:22:42] Ryan_Lane: right, agreed! [20:23:41] "there goes the site"? 
:) [20:24:56] paravoid: as much as you hate it, I think this is going to require agent forwarding [20:25:24] that's fine [20:26:06] bleh [20:26:30] I picked an instance that's already running on a migrated system [20:26:41] oh well, that's fine. I'll migrate it to another one of them [20:29:52] binasher: are bellin and blondel in use at this point? [20:30:00] can I close the "set them up" ticket? [20:30:16] or should I wait until service are migrated? [20:32:06] up to you.. i kind of think those hosts should be stripped of ssds and ram, and the rest thrown in a sacrificial fire [20:32:51] New patchset: Ryan Lane; "Another nslcd fix" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16549 [20:33:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16549 [20:33:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16549 [21:01:17] Lesliecarr: did you get a chance to see that ticket 3295? [21:01:34] doh sorry [21:01:36] looking now [21:03:36] cmjohnson1: i only have the ports set up until db62 [21:04:18] is there a ticket for the ports for db 63 and up ? [21:04:46] there should be...lemme look [21:05:04] New patchset: Pyoungmeister; "apache overhaul: round one of responding to mark's comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16532 [21:05:35] lesliecarr: rt 3161 [21:05:40] New patchset: Pyoungmeister; "Initial comments to app server manifests work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16122 [21:06:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16532 [21:06:15] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16122 [21:07:03] weird [21:09:49] New patchset: Pyoungmeister; "apache overhaul: round one of responding to mark's comments" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16532 [21:10:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16532 [21:10:31] i'm sorry cmjohnson1 -- don't know how i failed to do that :( [21:10:32] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 185 seconds [21:10:32] leslicarr: they should've been networked...i have already provisioned all of them except db64 [21:10:33] shouldn't they be on the network b4 i can do that? [21:10:33] cmjohnson1: should be all up [21:10:50] cmjohnson1: yeah, somehow the ones on csw1-sdtpa (my guess is i was not caffinated enough when i closed the ticket) hadn't been updated [21:11:02] since their ports had been db's before, most went up [21:11:12] lol! np..still can't ping db64 [21:12:36] New patchset: Alex Monk; "(bug 34135) Enable FlaggedRevs on cawikinews." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16558 [21:14:54] hrm, i see the port as up [21:14:55] i don't see a mac on it though, so it's not pushing any traffic out it [21:14:56] yeah..strange...the link lights are out [21:23:52] cmjohnson1: want to try swapping the interface? [21:24:07] next open port ? [21:24:39] there are no other open ports on that panel [21:24:49] oh :( [21:24:56] just reseat the cable ? [21:25:08] did that...swapped the cable already as well [21:25:11] hrm [21:26:31] port 10 is currently down, if we want to try plugging it in to see if it's the machine or the port? [21:27:28] I am not in DC right now...I will have to do that tomorrow. 
[21:28:02] i was hoping it was a simple fix...thought maybe the port was down [21:32:45] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 187 seconds [21:34:44] PROBLEM - Puppet freshness on maerlant is CRITICAL: Puppet has not run in the last 10 hours [21:39:46] ok [21:39:50] well it was down. .. [21:39:53] as well [21:39:55] hehe [21:40:00] is the machine up and installed ? [21:41:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 207 seconds [21:42:50] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [21:50:46] New patchset: Ryan Lane; "Fixing cron spam" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16566 [21:51:14] PROBLEM - Host virt1005 is DOWN: PING CRITICAL - Packet loss = 100% [21:51:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/16566 [21:51:36] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/16566 [21:52:08] RECOVERY - swift-container-updater on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:52:17] RECOVERY - swift-object-updater on ms-be1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [21:56:29] PROBLEM - swift-container-updater on ms-be1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [21:56:47] RECOVERY - Host virt1005 is UP: PING OK - Packet loss = 0%, RTA = 35.57 ms [21:59:20] PROBLEM - swift-object-updater on ms-be1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [22:09:59] PROBLEM - Host virt1007 is DOWN: PING CRITICAL - Packet loss = 100% [22:11:29] PROBLEM - Host virt1008 is DOWN: PING CRITICAL - Packet loss = 100% [22:15:05] RECOVERY - Host virt1007 is UP: PING OK - Packet loss = 0%, RTA = 35.42 ms [22:17:02] 
RECOVERY - Host virt1008 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms [22:18:08] New patchset: Asher; "fix symlink paths that conflicted with the distro mysql-common" [operations/debs/mysqlatfacebook] (master) - https://gerrit.wikimedia.org/r/16570 [22:19:53] RECOVERY - SSH on virt1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:19:58] Change merged: Asher; [operations/debs/mysqlatfacebook] (master) - https://gerrit.wikimedia.org/r/16570 [22:20:47] PROBLEM - SSH on virt1008 is CRITICAL: Connection refused [22:23:32] New patchset: awjrichards; "Fix footer logo for enwiki on MobileFrontend" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16573 [22:24:45] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16573 [22:37:35] PROBLEM - NTP on virt1008 is CRITICAL: NTP CRITICAL: No response from NTP server [22:40:03] PROBLEM - Puppet freshness on srv198 is CRITICAL: Puppet has not run in the last 10 hours [22:40:21] RECOVERY - SSH on virt1008 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:42:45] PROBLEM - Swift HTTP on ms-fe1002 is CRITICAL: Connection refused [22:46:12] PROBLEM - Puppet freshness on potassium is CRITICAL: Puppet has not run in the last 10 hours [22:57:54] RECOVERY - NTP on virt1005 is OK: NTP OK: Offset -0.0345107317 secs [23:05:21] New patchset: Alex Monk; "(bug 31754) Enable WikiLove on Swedish Wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/16577 [23:09:46] RECOVERY - NTP on virt1008 is OK: NTP OK: Offset -0.03513991833 secs