[00:12:25] TimStarling: I want to ask you a few things about swift/ms7 when you have the time [00:12:38] https://wikitech.wikimedia.org/view/Swift/Open_Issues_Aug_-_Sept_2012/Cruft_on_ms7 specifically [00:18:37] TimStarling: No, it's because Aaron added a $wgMathDirectory usage to filebackend.php [00:45:35] PROBLEM - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours [00:48:51] !log starting nagios on spence [00:49:01] Logged the message, Master [00:49:52] TimStarling: the original patchset was the script importing a sql dump [00:50:25] TimStarling: which would allow toolserver to run arbitrary commands on one of our production systems [00:50:26] ACKNOWLEDGEMENT - Puppet freshness on ms-be6 is CRITICAL: Puppet has not run in the last 10 hours daniel_zahn its a Dell C2100 [00:52:09] Ryan_Lane: that's still the case as far as I can see [00:52:37] it's using tab delimited data [00:52:52] or it should be anyway [00:54:07] how's DROP TABLE; ALTER TABLE ... RENAME TO isn't racy? [00:54:20] lol @ Daniel's acknowledgement "it's a Dell C2100" [00:54:29] what happens on the hits in the middle of the two statements? [00:54:42] apparently the app will handle that situation magically [00:56:12] what happens if toolserver.org gets hacked and attempts to fill our database with bogus data? [00:56:19] we're fucked [00:56:26] come on, this is terrible in so many ways [00:56:29] yep [00:56:41] we've complained a number of times about this [00:56:46] apparently it's only running for a number of months [00:56:50] that code is pretty dense [00:56:51] then it'll be fixed for next time [01:00:31] RoanKattouw: :) [01:00:43] RoanKattouw: does umask for wikidev users still matter after switch from svn to git? [01:01:07] re: "Simply running svn up with the wrong umask can put our SVN checkout in a nasty broken state" [01:01:08] I think so [01:01:17] I believe git pull can have the same effet [01:01:43] alright [01:02:19] I've seen screwed-up ownership in the .git directory at least once, but I don't remember if it was due to a bad umask or not [01:07:53] so the data imported by that script is actually used for something? [01:12:17] why is the toolserver used at all? 
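A minimal sketch of the non-racy alternative implied by the exchange above ("what happens on the hits in the middle of the two statements?"): load the tab-delimited dump into a staging table, then swap it into place with a single RENAME TABLE, which MySQL performs atomically, so no request ever sees a missing table. The database, table, and file names here are hypothetical, not the actual WLM schema.

    # stage the new data without touching the live table
    mysql wlm -e "CREATE TABLE monuments_new LIKE monuments;"
    mysql --local-infile=1 wlm -e \
        "LOAD DATA LOCAL INFILE 'monuments.tsv' INTO TABLE monuments_new;"
    # one atomic statement: every query sees either the old table or the new one
    mysql wlm -e "RENAME TABLE monuments TO monuments_old, monuments_new TO monuments;"
    mysql wlm -e "DROP TABLE monuments_old;"

A DROP TABLE followed by a separate ALTER TABLE ... RENAME leaves a window in which the table simply does not exist, which is the race being questioned.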
[01:13:42] my question exactly [01:14:24] because it's doing the computation on toolserver and it's somehow too much work to move that over too [01:14:43] I think (and have said) this is a bad idea [01:40:38] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 242 seconds [01:40:56] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CRIT replication delay 262 seconds [01:46:47] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 613s [01:59:23] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s [01:59:32] RECOVERY - MySQL Slave Delay on storage3 is OK: OK replication delay 10 seconds [02:00:17] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 8 seconds [02:00:35] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours [02:10:20] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [02:10:20] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [02:12:17] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: Puppet has not run in the last 10 hours [02:57:26] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [02:58:20] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [03:03:35] RECOVERY - Puppet freshness on nfs1 is OK: puppet ran at Thu Aug 30 03:03:25 UTC 2012 [03:05:19] "cd /var/wlm/data/ && rm.txt" [03:05:37] I think there's a typo in that comment. [03:38:22] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [03:51:24] PROBLEM - Puppet freshness on cp1023 is CRITICAL: Puppet has not run in the last 10 hours [03:56:22] PROBLEM - Puppet freshness on cp1022 is CRITICAL: Puppet has not run in the last 10 hours [04:05:21] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [04:06:55] paravoid: i think i addressed racy in my comment? 
maybe good enough, maybe perfectly (we could ask domas ;) ) [04:08:18] (that's gerrit 17964) [04:22:27] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [04:22:27] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [04:22:27] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [04:22:27] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [04:22:27] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [04:22:28] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [04:22:28] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [04:22:29] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [04:22:29] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [04:22:30] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [04:22:30] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [04:22:31] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [04:22:31] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [05:29:05] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [07:08:10] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [07:31:20] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours [07:42:17] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours [07:52:29] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [07:53:59] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [07:58:29] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [08:02:46] hello [08:22:11] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [08:22:11] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [08:22:11] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [08:22:11] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [08:23:23] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [08:28:11] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [08:34:02] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer: [08:38:41] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [08:40:11] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [08:56:14] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [09:06:08] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [09:46:07] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 185 seconds [09:46:07] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 186 seconds [09:46:25] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 199 seconds [09:46:25] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 199 
seconds [09:55:07] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [09:55:07] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [10:46:34] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [10:46:52] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [10:47:46] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [10:48:58] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [10:53:54] !log gerrit-wm irc bot is apparently no more sending anything to #mediawiki :( IRC user has been inactive since midnight UTC. Opened {{bug|39797}} [10:54:07] Logged the message, Master [10:54:12] !root [10:54:22] any ops around to restart irc echo on manganese ( https://bugzilla.wikimedia.org/show_bug.cgi?id=39797 ) [10:54:32] gerrit-wm is no more sending notification to IRC channel [10:54:35] started occurring at midnight [10:54:43] so might be a cron job that killed it / caused issue [10:54:55] apergos: paravoid: ^^^ [12:11:34] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours [12:11:34] PROBLEM - Puppet freshness on ms-be10 is CRITICAL: Puppet has not run in the last 10 hours [12:13:31] PROBLEM - Puppet freshness on ms-be4 is CRITICAL: Puppet has not run in the last 10 hours [12:21:26] apergos [12:24:20] mark: [12:24:50] that rsync cron spam [12:24:52] why is that happening? [12:25:24] which? [12:25:43] nice [12:25:46] so you don't even see it [12:25:52] dataset1001 [12:25:55] I looked at cronspam today [12:26:00] but I ddn't see a huge amount [12:26:08] i don't want any from that box [12:26:10] I can't do shit about it [12:26:15] and it's cluttering up my email [12:26:18] whereas you're not even seeing it [12:26:30] it makes me want to turn off the damn cron job every day [12:26:47] I just said I looked at cronspam today [12:26:51] I look at it every day [12:26:58] hi; can you please to restart ircecho on manganese? gerrit-wm is no more sending notification ( https://bugzilla.wikimedia.org/show_bug.cgi?id=39797 ) [12:27:00] well apparently you're not fixing it then? [12:27:11] the dataset messages are very few compared to everything else [12:27:32] yeah there's virt stuff which I'm complaining to ryan about [12:27:38] but I don't see the point of getting them [12:27:48] if it's something you can't fix, why not send it to devnull? [12:27:50] or yourself only [12:28:02] I can send em just to me, that's fine [12:28:13] it's not like anyone else is gonna do anything with it [12:28:19] I do want to get notifications, not for these files, maybe I can filer a little better [12:28:26] yeah that's true [12:28:34] so are these there because the files change during the rsync? 
[12:28:40] well they complete [12:29:02] or they are temp files that get tossed [12:29:10] I got rid of some of those but not all I guess [12:29:15] perhaps those shouldn't get rsynced then [12:29:20] or perhaps LVM snapshots would help there [12:30:20] we really don't need em rsynced, that's the best [12:30:54] but for now I'll probably mailifonly to the alias for dumps (which I think has only me on it :-P) [12:31:02] thanks [12:31:05] yw [12:52:56] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [12:53:41] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [12:57:53] PROBLEM - Puppet freshness on ms-be12 is CRITICAL: Puppet has not run in the last 10 hours [12:58:56] PROBLEM - Puppet freshness on ms-be11 is CRITICAL: Puppet has not run in the last 10 hours [13:16:17] woot sound the klaxon. first successful payment test through the new frack payments rig! [13:17:43] congrats [13:38:59] PROBLEM - Puppet freshness on ms-be5 is CRITICAL: Puppet has not run in the last 10 hours [13:52:02] PROBLEM - Puppet freshness on cp1023 is CRITICAL: Puppet has not run in the last 10 hours [13:57:07] PROBLEM - Puppet freshness on cp1022 is CRITICAL: Puppet has not run in the last 10 hours [14:06:16] PROBLEM - Puppet freshness on ms-fe4 is CRITICAL: Puppet has not run in the last 10 hours [14:06:27] apergos: here? [14:06:43] yes [14:06:53] are you staying or about to leave? [14:07:05] I'll be here for a while [14:07:08] okay [14:07:10] what's up? [14:07:19] we have a disk replaced in ms-be7 [14:07:30] that needs to be formatted and reenabled [14:07:41] ah ha [14:08:08] the box probably needs a reboot too; the disk was sdg but hte system now sees it as "sdo" [14:08:20] ugh [14:08:22] we can always mv the node, but I prefer clean solutions [14:09:08] !log authdns-update for helium/potassium (poolcounter servers) [14:09:21] Logged the message, RobH [14:09:22] I guess chris swapped it in last night? [14:09:28] yes [14:10:11] I think formatting etc. is being handled by puppet (scary) [14:10:22] and the rest is the ring builder, which I've done once before [14:10:23] so, [14:10:25] paravoid: apergos: mark: we have lost gerrit notification from ircecho on manganese ? Can one of you look at it please ? https://bugzilla.wikimedia.org/show_bug.cgi?id=39797 [14:10:28] would you like to do this one? :-) [14:10:32] yeah I was looking [14:10:34] (maybe I should get a simple shell access on that server) [14:10:44] but I don't see what's wrong yet [14:11:02] I restarted it and have been loking a bit at the code and at the logs [14:11:18] next I might shoot it and run it from the command line and see if I get anything useful (since it's broken anyways) [14:11:44] paravoid: sure, but it will be a little bit later. I wanna try to make headway on the irc bot (or reach the giving up point) [14:11:59] okay [14:12:02] I have a meeting in 20' [14:12:05] ok [14:12:15] and another one an hour after that [14:12:31] well do your meetings, it's fine [14:13:44] anything tricky about the ringbuilder piece? 
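A rough sketch of the two remedies floated in the dataset1001 cronspam exchange above: stop rsyncing the temporary files that change or vanish mid-run, and send the job's output to the dumps alias instead of the shared root mail. The schedule, paths, exclude patterns, and alias are invented for illustration.

    # crontab fragment; MAILTO applies to the entries that follow it
    MAILTO=ops-dumps@wikimedia.org
    # excluding in-progress files avoids the vanished/changed-during-transfer
    # warnings (rsync exit codes 24 and 23) that were mailing out every run
    30 2 * * * rsync -a --delete --exclude='*.tmp' --exclude='*.inprog' \
        /data/xmldatadumps/public/ dumps-mirror.example.org::dumps/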
[14:20:33] there are instructions [14:20:52] but I don't think they're complete [14:21:02] I don't remember the details; catch up with me when you're about to do it [14:21:09] ok [14:21:18] or when you hit a wall [14:21:57] great [14:23:13] PROBLEM - Puppet freshness on ms-be1001 is CRITICAL: Puppet has not run in the last 10 hours [14:23:13] PROBLEM - Puppet freshness on ms-be1003 is CRITICAL: Puppet has not run in the last 10 hours [14:23:13] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [14:23:13] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [14:23:13] PROBLEM - Puppet freshness on ms-be1009 is CRITICAL: Puppet has not run in the last 10 hours [14:23:14] PROBLEM - Puppet freshness on ms-be1002 is CRITICAL: Puppet has not run in the last 10 hours [14:23:14] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [14:23:15] PROBLEM - Puppet freshness on ms-fe1001 is CRITICAL: Puppet has not run in the last 10 hours [14:23:15] PROBLEM - Puppet freshness on singer is CRITICAL: Puppet has not run in the last 10 hours [14:23:16] PROBLEM - Puppet freshness on virt1001 is CRITICAL: Puppet has not run in the last 10 hours [14:23:16] PROBLEM - Puppet freshness on virt1002 is CRITICAL: Puppet has not run in the last 10 hours [14:23:17] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [14:23:17] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [14:25:02] apergos: looks like gerrit-wm notify again thx! [14:25:09] it is? [14:25:12] but uh [14:25:17] got a notification [14:25:26] have you done anything? [14:25:28] where? [14:25:34] in #mediawiki [14:25:35] oh [14:25:37] huh how weird [14:25:41] ok well um [14:25:48] got restarted 10 minutes ago apparently [14:25:54] I restarted it much before that [14:25:57] and it didn't help [14:26:13] guess some magic stuff appeared that fixd it :) [14:26:15] I closed the bug! [14:26:16] it's running directly from the command line and not in screen or anything so I'm going to shoot it and let it run again [14:26:18] thx! [14:26:22] ok [14:26:48] I found a bunch of [14:27:05] ERROR com.google.gerrit.server.git.PushReplication : Cannot replicate to gerrit2@formey.wikimedia.org:/var/lib/gerr [14:27:05] it2/review_site/git/operations/puppet.git [14:27:05] and [14:27:19] TransportException: gerrit2@formey.wikimedia.org:/var/lib/gerrit2/review_site/git/operations/puppet.git: session is down [14:27:20] in the logs [14:27:32] but I did not get anywhere close to figuring out what the cause was/is [14:28:59] I don't even know that this is the issue [14:45:41] hey makr [14:45:42] mark [14:45:45] can you look at this today? 
[14:45:51] it has been waiting since monday [14:45:57] https://gerrit.wikimedia.org/r/#/c/21749/ [15:01:15] hashar: I think it was premature to close the bug, sorry [15:02:47] dohh [15:03:10] i think formey is a backup / slave [15:03:10] I see reviews and patchsets being added to the logs [15:03:17] and nothing going to the channel [15:03:24] ^demon: apparently gerrit replication from manganese to formey has some issues [15:03:45] apergos: if it is written in the log, at least the gerrit hooks are working [15:03:50] so that let us with the irc bot [15:04:08] could be either ircecho that no more track the file [15:04:10] I do see that gerrit on manganese was restarted yesterday [15:04:13] hey opsen, is there someone who has some spare cycles to take care of https://rt.wikimedia.org/Ticket/Display.html?id=2970 (Redirect all .mobile requests to .m)? [15:04:26] well ircecho has them open, I checked that [15:04:32] <^demon> hashar: What sort of issues? [15:04:49] ^demon: ERROR com.google.gerrit.server.git.PushReplication : Cannot replicate to gerrit2@formey.wikimedia.org:/var/lib/gerrit2/review_site/git/operations/puppet.git [15:04:52] session is down [15:04:56] or at least it opens them and then does some inotify thing on them [15:04:58] <^demon> apergos: Could you please merge https://gerrit.wikimedia.org/r/#/c/21965/? I can't get at the gerrit logs until that goes back in. [15:05:03] spotted by apergos in the logs [15:05:14] bah, it doesn't let me look at it [15:05:17] oh [15:05:20] my client took the ? [15:07:56] what host do you need that on? [15:08:05] <^demon> manganese & formey [15:10:20] on formey: [15:10:22] err: Could not prefetch ssh_authorized_key provider 'parsed': Could not parse line "ssh-rsa gerrit2" at /var/lib/gerrit2/.ssh/authorized_keys:2 [15:10:32] dir perms are fixed [15:10:45] <^demon> Eh that's not surprising. Permissions is all I needed. [15:10:46] <^demon> Thanks. [15:11:34] ottomata: it's not strange that reviews take a long time if you do these huge commits eh [15:11:45] it's not something you can review in a minute [15:12:05] why do we need xinetd? [15:12:19] I didn't write this, this was taken from puppetlabs [15:12:24] paravoid suggested I commit the files directly [15:12:39] <^demon> hashar: This is why I want 2.5. I could just /stop replication/ by unloading the plugin :) [15:12:44] xinetd is a way to run an rsync daemon [15:12:45] https://github.com/puppetlabs/puppetlabs-rsync [15:12:48] ^demon: :)))))))) [15:12:50] the rsync module uses it [15:12:52] (but doesn't have to) [15:12:56] it can use init.d/ scripts too [15:13:05] apergos: so I guess the gerrit-wm not responding is due to ircecho :/ [15:13:07] <^demon> Oh herp derp. I know what's wrong. [15:13:17] so at least split those up in separate commits [15:13:20] <^demon> I'll have a fix for replication real soon now. [15:13:22] i don't think we need xinetd [15:13:26] what, rsync and xinetd? [15:13:28] yes [15:13:36] well, the rsync module uses the xinetd module [15:13:43] it doesn't need it [15:13:47] you can get around it, but you won't be able to just include rsync::server [15:13:51] you'd ahve to [15:14:01] class { "rsync::server": use_xinetd => false } [15:14:10] hashar: I'm sure but what I don't know is that happened to make it stop working [15:14:14] code is the same old code [15:14:26] i tested the xinetd stuff on my VM, and it works great [15:14:29] oohhh maybe demon's fix will help [15:14:35] doesn't mean we necessarily want it [15:14:48] is there a reason you don't want it? 
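On the authorized_keys parse error pasted above: "ssh-rsa gerrit2" is what a key line looks like when the base64 key material between the type and the comment has gone missing, which matches the later finding that only the private key had been installed. A hedged sketch of checking and rebuilding such a line by hand; the private-key path is a guess, and on these hosts the file is really managed by puppet.

    # a valid entry has three fields: type, base64 blob, comment, e.g.
    #   ssh-rsa AAAAB3NzaC1yc2EAAAADAQAB...snipped... gerrit2
    # regenerate the public half from the private key and append it
    ssh-keygen -y -f /var/lib/gerrit2/.ssh/id_rsa >> /var/lib/gerrit2/.ssh/authorized_keys
    # sshd ignores the file if it is group- or world-writable
    chmod 600 /var/lib/gerrit2/.ssh/authorized_keys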
[15:14:56] yes, yet more cruft in our repo [15:15:05] xinetd is cruft? [15:15:08] i think so [15:15:27] really? its a pretty standard thing, no? [15:15:31] we don't use it [15:15:42] is there a reason why not? [15:15:44] it's standard in redhat land I think [15:15:54] more like, is there a reason why we would use it? [15:16:22] <^demon> apergos: https://gerrit.wikimedia.org/r/#/c/22031/ + a puppet re-run on formey will fix replication and the puppet error [15:16:43] * ottomata googling for internet opinions on xinetd on debian/ubuntu [15:16:49] so how did it break? [15:17:02] <^demon> SSH public key wasn't installed on formey [15:17:04] <^demon> Just the private. [15:17:10] can you split it up in separate commits? [15:17:13] rsync might go in soon [15:17:22] we can add xinetd later if it's really needed [15:17:27] ottomata: I'm with mark on this, there is nothing "wrong with" xinetd per se but why use it if we don't need to, it's just one more thing to keep track of [15:18:06] i dislike big commits in general [15:18:19] incremental is good [15:18:43] yeah, i understand that, and what I would've prefered to do is not commit this at all in puppet [15:18:47] since I didn't write it [15:18:51] there are lots of small commits [15:18:59] you can go read history from upstream if you want to review all the small commits :) [15:19:22] as is, if you try to use rsync module without xinetd, you'll get puppet errors unless you try to get around it [15:19:33] you just said rsync doesn't need xinetd per se [15:19:41] right, but you have to manually do it [15:19:42] and [15:19:46] then do that? [15:19:49] the code is there in rsync module to use xinetd [15:19:50] or change the default parameter? [15:20:02] so it seems wrong to have the code existing in a non working state [15:20:05] ja was about to suggest that [15:20:10] i'm ok with that, if we default to init.d [15:20:19] init.d? [15:20:23] what's wrong with running rsync in daemon mode? [15:20:32] oh sorry [15:20:35] I think the more important question is what to do with external modules [15:20:36] i misread that as inetd [15:20:40] aye ja [15:20:46] yeah that is a bigger q [15:21:00] paravoid: just don't use em? :) [15:21:06] haha [15:21:11] import them as they come into our puppet repo with a danger of having it blow up in size [15:21:14] and cruft [15:21:19] you want us to rewrite everything ourselves when someone else has done hte work? [15:21:25] yes if it's not exactly what we need [15:21:59] i don't particularly like many 3rd party modules [15:22:05] have you looked at this one? [15:22:09] not yet [15:22:15] I looked at external rsync modules when I was going to write one (still plan to) [15:22:21] they didn't really cover our cases well [15:22:27] which is why I was going to write one :-/ [15:22:34] i was thikning about writing one too [15:22:34] I'm afraid we'll end up with a NIH syndrome [15:22:35] but if you start like "we need X extra modules because module Y requires it even if we don't need it" then I already start to dislike it ;) [15:22:39] our template duplication right now is really bad [15:22:46] but otoh I've seen a lot of crappy modules as well [15:23:03] yeah, for example, that's why I wrote the generic::mysql define (it should bea module) [15:23:10] so ^demon do we need to restart anything anywhere (on manganese)? 
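For the xinetd-versus-daemon question above: rsyncd can be spawned per connection by xinetd or run as its own long-lived daemon, and the standalone form needs nothing beyond a config file, which is roughly what the use_xinetd => false path boils down to. The module name and path below are invented.

    # minimal rsyncd.conf exporting one "module"
    cat > /etc/rsyncd.conf <<'EOF'
    uid = nobody
    gid = nogroup
    [dumps]
        path = /data/xmldatadumps/public
        read only = yes
    EOF
    # run standalone, no xinetd involved; Ubuntu's stock init script does the
    # same once RSYNC_ENABLE=true is set in /etc/default/rsync
    rsync --daemon --config=/etc/rsyncd.conf
    # client view of the exported module
    rsync -av server.example.net::dumps/ /tmp/dumps/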
[15:23:11] because there was nothign good that I could find [15:23:22] but this rsync one is really good, it does everything I want it too [15:23:35] specificically the ability to puppetize rsync daemon modules separately [15:23:57] <^demon> apergos: Shouldn't need to, but I'll check. [15:23:57] so you don't have to have an rsync.conf template for every machine type [15:24:19] so mark, ok, I will change this commit, I will remove xinetd and use init.d by default [15:24:42] <^demon> apergos: I'm not getting replication errors anymore. [15:24:48] ok well that's good [15:24:54] now I wonder about gerrit-wm [15:24:59] guess I'll restart it again :-/ [15:28:16] New review: Demon; "Test" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/21961 [15:28:22] <^demon> Seems to be back :) [15:28:35] yay [15:28:38] well that was painful [15:28:58] I was waiting for someting to show up in one of the logs. :-D [15:29:19] so why would formey replication being broken also break ircbot? [15:29:30] <^demon> Unrelated, probably. [15:29:32] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [15:29:45] but [15:29:50] I have restarted the bot twice already [15:29:53] ottomata: if you change the module in any way (like that default parameter), do that in a separate commit [15:30:05] no, three times [15:30:19] and now after you touch things it suddenly works? [15:30:20] <^demon> Dunno. [15:30:32] <^demon> I'm just that good? ;-) [15:30:36] :-D [15:31:08] ^demon knows where it's /special/ place is [15:31:34] <^demon> Oh, I know all of gerrit's special places. [15:31:37] ewww [15:31:41] this is a family channel [15:32:29] ok mark, I can do that [15:32:43] buuuut, gotta ask, is that really better? is it better to have a commit that is sorta kinda 'broken' [15:32:43] ? [15:32:53] upstream rsync module w/o xinetd? [15:32:53] it's not broken, that's bs [15:33:16] i mean, doesn't really matter since no one will revert to that commit, but meh? [15:33:19] but it can be helpful to have a commit which is exactly like upstream [15:33:20] seems weird to me [15:33:28] ja i could see that [15:33:35] yeah, i guess they have them in different repos anyway [15:33:36] ok ok ok [15:34:05] but the rsync module works, you just can't use it with certain parameters [15:34:16] it's unfortunate that that is the default parameter, but ok, we'll change that [15:34:27] we're essentially forking the module that way though [15:34:33] yes [15:34:42] means it's a one-off import, any further enhancements/bugfixes to that module we'll have to three-way merge [15:34:48] and it can be handy to be able to revert that fork commit [15:34:54] for this reason too [15:35:41] mark, just curious what you think about this, I had originally committed this as git submodules, instead of fully importing [15:35:55] in general, do you prefer manually commits of 3rd party modules like this? [15:36:05] or would a gerrit mirror + git submodule be better? [15:36:11] we're not setup for git submodules currently [15:36:16] ottomata: submodules with third-party repos (github) is not going to happen [15:36:18] so you can't use them now, it's as simple as that [15:36:21] and indeed [15:36:25] right, which is why I said gerrit mirror [15:36:27] 3rd party repos are completely out of the question [15:36:33] submodules within our gerrit, it's something we should talk about [15:36:37] how are we not setup? meaning people don't now how to use them? 
[15:36:37] yes [15:36:57] ottomata: our processes have "git fetch", which won't fetch them afaik [15:37:01] you need git submodule update or something [15:37:03] New patchset: Ottomata; "Manually adding modules/rsync for managing rsyncd modules." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21749 [15:37:26] but yeah, it might make sense to import modules like that in separate trees [15:37:35] agreed [15:37:38] we just don't have that today [15:37:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/21749 [15:37:52] I'm not 100% sure it's the right way to go, I have mixed feelings about that [15:38:03] i don't know exactly how git submodules work, so can't say [15:38:12] have you used svn:externals? [15:38:22] no [15:38:23] it's similar to that [15:38:24] ah :) [15:38:33] They don't work in a sane way like they should [15:38:45] yeah, I'm not entirely happy with them either [15:38:56] you effectively can't do tree-wide changes [15:38:58] yeah, they are more annoying than svn:externals [15:39:07] i think you are right, you have to do some manual stuff to get them [15:39:32] <^demon> We use submodules for deployment of extensions :) [15:39:35] besides that, I mean that I think you can't have a commit that ties one particular revision in a submodule with a commit in your parent tree [15:39:53] but it's been years since I've seen them, so things might have evolved since then [15:40:00] Importing 3rd party repos into gerrit is nearly forking them anyway - yes for security it's needed but if 1 commit is ever rejects for any reason (maybe just our use case) it can never be updated anyway. [15:40:01] <^demon> In the parent repo, updating submodules is a distinct commit there. [15:40:05] ^demon: feel free to pitch in, input is welcome [15:41:14] hmmm, i paravoid, i think submodules allow you to tie to specific commits [15:41:19] <^demon> Yep. [15:41:24] <^demon> https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=shortlog;h=refs/heads/wmf/1.20wmf10 - here's the current deployment branch. [15:41:48] http://git-scm.com/book/en/Git-Tools-Submodules [15:41:48] This is an important point with submodules: you record them as the exact commit they’re at. You can’t record a submodule at master or some other symbolic reference. [15:42:02] <^demon> Here's an example of a commit updating a submodule: https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=commit;h=4e9d1fd34580802f3cbbf6480b124a776d53bb28 [15:42:22] but still, we'd have to run git submodule update [15:42:23] or somethign [15:42:25] if someone does change them [15:42:31] and even with a new clone I think [15:42:37] <^demon> Yeah, we do that as part of the deploy cycle. [15:42:44] <^demon> `git pull && git submodule update --init` [15:43:00] mark: [15:43:00] https://gerrit.wikimedia.org/r/#/c/21749/ [15:43:02] better? [15:43:24] much better [15:44:16] cool, i'll go ahead and commit the other change [15:44:45] <^demon> ottomata: As far as the "updating to HEAD" problem -- gerrit has a feature called "submodule subscriptions" that can be utilized. It's what we do to keep mediawiki/extensions.git up to date. [15:45:19] I can't approve this: https://gerrit.wikimedia.org/r/#/c/21749/2/modules/rsync/files/motd [15:45:30] hahahahaha [15:45:48] hahah [15:45:59] lol [15:46:04] wtf [15:46:05] ahhhhhhh but mark! it is from upstream! 
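A short illustration of the submodule mechanics described above: the parent repository records one exact commit of the submodule, moving that pin is itself an ordinary commit, and after a fetch or pull the checkout stays stale until git submodule update runs. The URL and module path are placeholders.

    # add a module; the parent repo records the submodule's current commit SHA
    git submodule add https://gerrit.example.org/p/operations/puppet-rsync.git modules/rsync
    git commit -m "Add rsync module as a submodule"
    # move the pin later: update the submodule checkout, then commit the new SHA
    (cd modules/rsync && git fetch origin && git checkout origin/master)
    git add modules/rsync
    git commit -m "Bump rsync submodule to latest upstream"
    # what the deployment step quoted above does on each host
    git pull && git submodule update --init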
and you wanted a pure upstream commit :p [15:46:21] in his defence, he didn't say exactly that :) [15:46:25] hehe, true [15:46:26] want me to remove it in another? [15:46:34] !log mw8 shutting down for DIMM troubleshooting [15:46:44] Logged the message, Master [15:46:54] most definitely you will update that in a subsequent commit ;-) [15:47:02] motd is off by default anyway [15:47:05] mk [15:47:19] apergos: I'm about to run in a second meeting in 15', are you still planning to do the ms-be7 thing today? [15:47:30] yes I am [15:47:35] I know it's getting late for your schedule :) [15:47:40] go do your meeting [15:47:47] if I get stuck I'll check when you get back [15:48:28] okay, I was about to ask if you want to skim through the instructions so that we can find the undocumented parts, so that you can do it tomorrow when I'll be sleeping :P [15:48:44] tomorrow? [15:48:48] I'm gonna do it tonight! [15:49:02] whatever works for you :-) [15:49:06] heh [15:49:37] New patchset: Ottomata; "Removing xinetd support" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22036 [15:50:27] New patchset: Ottomata; "Removing default puppetlabs motd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22037 [15:51:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22036 [15:51:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22037 [15:51:39] ok, mark. so in order those are: [15:51:39] https://gerrit.wikimedia.org/r/#/c/21749/ [15:51:39] https://gerrit.wikimedia.org/r/#/c/22036/ [15:51:39] https://gerrit.wikimedia.org/r/#/c/22037/ [15:51:40] s'ok? [15:53:27] reviewed all 3 [15:56:28] ok, I don't have +2 powers [15:57:07] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [15:58:50] mark ^, I don't have +2, could you approve them too? [15:59:26] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22036 [15:59:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22037 [15:59:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/21749 [16:01:10] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [16:01:57] thank youuuu! [16:02:09] I will ahve another commit later using that stuff, but I think anyone can review that one [16:02:25] mark and paravoid, thanks for helping me get that one straight, much obliged :) [16:04:10] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time [16:18:52] PROBLEM - Host mw8 is DOWN: PING CRITICAL - Packet loss = 100% [16:19:55] RECOVERY - Host mw8 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [16:21:59] paravoid: have a second? [16:23:29] !log power cycle ms-be7 so it picks up the replaced drive with the right device name [16:23:39] Logged the message, Master [16:23:49] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [16:24:01] that was very silly, oh well, it was also harmless [16:26:11] oohhh noooo [16:26:13] we don't like it [16:26:16] not one little bit [16:26:40] billions of xfs errors [16:26:46] on all the drives :-( [16:27:10] apergos: ms-be7�shit!!! I am working w/ DELL the now on 6. [16:27:23] very intersting [16:27:42] if possible I would love to be kept up to date on ms-be6 (and/or 10 if you talk to them about it too) [16:27:50] you are not using virtual mounts ? 
[16:27:56] i will cc you on the emails [16:28:16] atm I am just watching it boot up or not [16:28:25] feel free to interject [16:28:46] PROBLEM - swift-object-server on ms-be7 is CRITICAL: Connection refused by host [16:28:46] PROBLEM - swift-container-replicator on ms-be7 is CRITICAL: Connection refused by host [16:28:54] yeah [16:28:55] PROBLEM - swift-account-replicator on ms-be7 is CRITICAL: Connection refused by host [16:28:55] PROBLEM - swift-object-updater on ms-be7 is CRITICAL: Connection refused by host [16:28:55] PROBLEM - swift-container-server on ms-be7 is CRITICAL: Connection refused by host [16:28:55] PROBLEM - swift-account-server on ms-be7 is CRITICAL: Connection refused by host [16:28:57] we know [16:29:13] PROBLEM - swift-container-updater on ms-be7 is CRITICAL: Connection refused by host [16:29:22] PROBLEM - swift-object-auditor on ms-be7 is CRITICAL: Connection refused by host [16:29:22] PROBLEM - swift-account-auditor on ms-be7 is CRITICAL: Connection refused by host [16:29:49] PROBLEM - swift-container-auditor on ms-be7 is CRITICAL: Connection refused by host [16:29:49] PROBLEM - SSH on ms-be7 is CRITICAL: Connection refused [16:29:49] PROBLEM - swift-object-replicator on ms-be7 is CRITICAL: Connection refused by host [16:29:58] PROBLEM - swift-account-reaper on ms-be7 is CRITICAL: Connection refused by host [16:31:27] New patchset: MaxSem; "WLM updater script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17964 [16:32:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17964 [16:45:50] !log removing srv281 from apaches pool because it's a broken piece of garbage. [16:46:00] Logged the message, notpeter [16:46:37] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [16:47:07] who knows how we get these things to boot (= the c2100s)when there's a bunch of failed mounts? [16:47:16] mountall: mount /srv/swift-storage/sdk1 [1618] terminated with status 32 [16:47:17] mountall: Filesystem could not be mounted: /srv/swift-storage/sdk1 [16:47:17] init: ureadahead-other main process (1652) terminated with status 4 [16:47:25] that's the last I got and it's unresponsive [17:01:51] cmjohnson1: now I am [17:02:03] notpeter: I told you it was cursed :) [17:02:07] notpeter: what happened? [17:02:24] paravoid: how can I skip the mounts of these disks? reboot and it hung :-( [17:02:36] with piles of cmplaints about lots of drives :-( [17:02:47] type 'S' [17:02:53] that's "skip mounting" [17:03:01] ah who would have guessed that >_< [17:03:02] also, that's exactly the symptom of ms-be-6 [17:03:04] ms-be6 [17:03:08] yeah I know [17:03:16] paravoid: debs were shat. the curse is back. haven't actually looked yet. [17:03:19] also, if it's down for a lot of time then it gets stale data and we have to empty it... 
[17:03:38] *now* it gives me a message [17:03:43] sigh [17:03:45] *beds [17:03:47] yeah I'm aware, the docs say up to a couple hours [17:03:50] so we'll see [17:04:31] paravoid: apergos has informed you of the problem�i am working on getting answers for 6 now�we'll see [17:04:43] RECOVERY - SSH on ms-be7 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [17:04:52] RECOVERY - swift-account-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [17:05:01] RECOVERY - swift-container-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [17:05:01] RECOVERY - swift-object-auditor on ms-be7 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [17:05:01] RECOVERY - swift-account-auditor on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:05:01] let's see what is in the logs [17:05:28] RECOVERY - swift-object-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:05:28] RECOVERY - swift-account-reaper on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:05:37] RECOVERY - swift-container-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [17:05:46] RECOVERY - swift-container-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [17:05:46] RECOVERY - swift-account-replicator on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:05:46] RECOVERY - swift-object-server on ms-be7 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [17:05:48] maybe it takes a while for the system to see the disks [17:05:58] we can add rootdelay= to cmdline [17:06:13] RECOVERY - swift-container-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [17:06:13] RECOVERY - swift-object-updater on ms-be7 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:06:47] Aug 30 17:04:27 ms-be7 kernel: [ 84.460315] sd 0:0:11:0: [sdn] Unhandled error code [17:06:47] Aug 30 17:04:27 ms-be7 kernel: [ 84.460321] sd 0:0:11:0: [sdn] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK [17:06:47] Aug 30 17:04:27 ms-be7 kernel: [ 84.460331] sd 0:0:11:0: [sdn] CDB: Write(10): 2a 00 b0 d1 a1 e8 00 00 08 00 [17:06:47] Aug 30 17:04:27 ms-be7 kernel: [ 84.460350] end_request: I/O error, dev sdn, sector 2966528488 [17:07:28] and lots of [17:07:30] Aug 30 17:04:27 ms-be7 kernel: [ 84.946590] mpt2sas0: log_info(0x3112011a): originator(PL), code(0x12), sub_code(0x011a) [17:07:31] too [17:08:46] sdk and sdn [17:09:28] Change abandoned: Parent5446; "(no reason)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21322 [17:09:31] PROBLEM - Puppet freshness on ms-be8 is CRITICAL: Puppet has not run in the last 10 hours [17:10:35] grumble grumble [17:11:01] not. liking. it. 
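On the boot hang being fought above: mountall on lucid/precise blocks boot (or prompts for skip/manual on the console) for any fstab entry it cannot mount unless the entry is marked non-critical. A sketch of what one swift entry could look like; the label follows the LABEL=sdg1 scheme mentioned just below, and the other options are illustrative rather than the puppetized ones.

    # /etc/fstab -- "nobootwait" tells Ubuntu's mountall not to hold up boot
    # when this filesystem fails to mount
    LABEL=sdk1  /srv/swift-storage/sdk1  xfs  noatime,nodiratime,nobootwait  0  0
    # after boot, retry by hand and watch for the mpt2sas/sd errors in the log
    mount /srv/swift-storage/sdk1; dmesg | tail -n 50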
[17:15:49] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [17:16:23] rootdelay is already 90 on these boxes [17:16:54] sigh [17:17:28] RECOVERY - Puppet freshness on ms-be7 is OK: puppet ran at Thu Aug 30 17:17:20 UTC 2012 [17:18:37] trying to decide what we could try next [17:19:57] looks like I won't be rebuilding rings anytime soon >_< [17:20:12] lovely boxes [17:21:02] yes [17:21:10] let's please get rid of them, at least partially [17:21:15] we're already 3 boxes down as we speak [17:21:29] in tampa that is [17:21:49] I really don' twant to be afraid to reboot one for fear that some disks will fail to show up [17:21:52] it's ridiculous [17:22:23] and tbh swapping the drive should not have meant that the new drive shows up as sdo anyways [17:22:51] for these it may be better to use drive identifiers instead of sda etc [17:22:57] /sys/block/by-id or so [17:23:24] New patchset: Pyoungmeister; "updating searchqa lib to reflect new architecture" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22050 [17:23:47] we use LABEL= already [17:24:11] that won't make the driver write to them any better though [17:24:16] but LABEL is sdg1 for example, and the mountpoint is /srv/swift-storage/sdg1 [17:24:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22050 [17:24:41] and puppet mkfs'es automatically those disks with the device node in puppet itself [17:24:49] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [17:24:54] (ugh @ puppet mkfsing...) [17:26:18] what's funny is that most of these mounts are fine... [17:26:38] I don't really understand how /srv/swift-storage/sdg1 is mounted but whatever [17:26:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22050 [17:27:30] apergos: puppet probably... [17:27:36] I'm merging all kinds of stuff that was left unmerged [17:27:41] looks like rsync module [17:27:43] is that safe? [17:27:43] yes by puppet [17:27:44] er [17:27:45] i wrote that stuff [17:27:48] ok, cool [17:27:50] just making sure [17:27:57] notpeter: that wasn't to you [17:28:00] but yes you can merge that [17:28:03] hehe, ok :) [17:28:37] if I reboot it again will we lose more disks? [17:30:06] apergos: try it? :) [17:30:12] hahaha [17:30:26] sure why not [17:30:32] mark: I don't know, mkfs'ing by puppet sounds a bit dangerous to me... [17:30:42] it feels more of a deployment step than a puppet thing [17:30:58] although my swift experience is a week old obviously. [17:30:59] heh maybe [17:31:14] I think I put some safety nets in, but i'm not sure I counted in this controller weirdness ;) [17:32:04] apergos: seriously, reboot it and let's see what happens [17:32:18] I am seriiusly going to [17:32:25] just noting which things are not mounted [17:32:37] PROBLEM - Puppet freshness on ms-be1 is CRITICAL: Puppet has not run in the last 10 hours [17:32:51] rootdelay=90, that's so ridiculous [17:32:56] missing disks *with* that is even more [17:33:02] that was for the thumpers [17:33:04] with 48 drives ;) [17:33:34] !log and rebooting ms-be7 again in the slim hope that we get back the missing drives... meh [17:33:45] Logged the message, Master [17:33:55] * apergos watches the reboot [17:34:22] I bet it gave me the 'S to skip' message the last ime but it scrolled waaay off the screen with the error messages... 
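A quick sketch of the device-naming checks suggested above: the stable aliases live under /dev/disk/ (by-id, by-label, by-uuid), so a replacement drive that comes back as sdo instead of sdg can still be tied to its label or its physical identity. Nothing here is specific to the swift boxes.

    # which kernel device currently carries a given swift label?
    blkid -t LABEL=sdg1
    readlink -f /dev/disk/by-label/sdg1
    # identifiers that survive sda/sdo reshuffles across reboots
    ls -l /dev/disk/by-id/ | grep -v -- -part
    # what the controller reports per SCSI slot (package: lsscsi)
    lsscsi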
[17:34:47] so, seriously, I'm all for uniformity, but perhaps we need to have others bits of hardware for swift, at least partially [17:35:07] three copies sounded like a lot, but not with these boxes [17:35:11] I'm for uniforimty. let's have different hardware for swift than these. [17:35:37] PROBLEM - Host ms-be7 is DOWN: PING CRITICAL - Packet loss = 100% [17:36:04] are all the siwft boxen c2100s? [17:36:44] seems that way [17:36:52] rats [17:36:55] apergos: yes [17:39:05] bummer [17:39:09] apergos: so? [17:39:18] patience grasshopper [17:39:28] it's still generating error messages and trying to mount things [17:40:38] New review: Demon; "If you could rebase this, we can test it out on gerrit-dev.wmflabs." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/11589 [17:41:01] so, failed again? [17:41:19] RECOVERY - Host ms-be7 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [17:42:16] comparing what we have this time to what we had last time [17:43:00] I'm finishing up what I did and joining you [17:43:01] missing one more [17:43:29] if I reboot 7 more times it won't see any of them :-D [17:43:34] PROBLEM - Puppet freshness on ms-be2 is CRITICAL: Puppet has not run in the last 10 hours [17:45:43] Aug 30 17:22:22 ms-be7 kernel: [ 3440.250709] sd 0:0:6:0: Device offlined - not ready after error recovery [17:45:45] these are nice too [17:48:27] apergos: Don't worry you have another 2 servers to go before taking down swift ;) [17:48:47] I'm not touching any more of em [17:48:50] forget that :-P [17:48:54] cause [17:49:05] "you break it, you own it" and no waaaay am I owning swift [17:49:38] Ah, delegation. I like the idea [17:53:06] * apergos tries mounting one of the failed nes by hand [17:53:14] yeah that hung :-/ [17:53:57] If you loose one on every reboot are you sure it's not the card they're connected to? Must be some crappy drives if not [17:54:26] oh I'm pretty sure it's not the drives [17:55:16] PROBLEM - Router interfaces on cr1-sdtpa is CRITICAL: CRITICAL: host 208.80.152.196, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr2-pmtpa:xe-0/0/1 {#6000} [10Gbps CWDM]BR [17:56:31] cmjohnson1: hey let's continue in this room [17:56:38] hmmm [17:56:47] how does puppet know to format a drive anyways? [17:56:53] lesliecarr: did you need anything in pmtpa? [17:56:54] paravoid: in squid.conf.php, I see "acl thumb_php urlpath_regex ^/w/thumb\.php", where is the code for what peers that has? [17:57:08] cmjohnson1: so yes, please move the link in cr2-pmtpa xe-0/0/1 to xe-1/0/0 [17:57:22] cmjohnson1: i should have said "link plus optic" [17:58:03] I can run fdisk on the drives it can't mount.. attempts to mount one of these just returns eventually with a cmplaint that it can't find the superblock (but I would expect i/o errors) [17:58:07] there is an optic in 1/0/0 [17:58:29] just remove/move it [17:58:41] we need to use the new optics as they are a different wavelength than the standard optics [17:58:57] and the adva system "expects" the certain wavelength in order to mix and unmix the channels [17:59:09] 507s! [17:59:20] GET /sdn1/21520/ ... and a 507 [17:59:22] i just realized that�i didn't change the optic downstairs [17:59:59] ah :) [18:00:06] no problem, we'll get that when you get back [18:00:33] * Damianz waves in some form at Leslie then runs for food [18:00:44] hi Damianz :) [18:01:15] LeslieCarr: do you know the power draw on the mx80s? 
[18:01:26] and 4200, and 4500 [18:01:44] if not i wont list them, but im listing the draw for the other servers for ulsfo [18:01:55] ah I see, there were errors (from the mount), just not the last things in the lof [18:01:56] log [18:01:59] RECOVERY - Router interfaces on cr1-sdtpa is OK: OK: host 208.80.152.196, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [18:02:04] *sigh* [18:02:30] RoanKattouw: i can find them [18:02:37] [[tab]] [18:02:46] meh, we wont need more pwoer. [18:02:54] LeslieCarr: I assume you meant to ping me ;] [18:03:06] oops yeah [18:03:06] doh [18:03:08] the damned 420s (20) can pull 8280W [18:03:14] so will have to move some to cab A [18:03:21] and have cross rack DAC cables. [18:03:35] (they are all 3M so they have some distance, if we can go inner rack cross rack) i will ask. [18:04:35] hrmm [18:04:37] we cannot do this. [18:04:43] lesliecarr: tokay all fixed [18:04:45] LeslieCarr & mark [18:04:51] damn autocorrect [18:04:52] the new ulsfo cannot handle all the servers we want. [18:04:58] RobH: MX80 - 376W, ex4500 - 650W, ex4200 - 320W [18:05:00] well, it can at idle i guess [18:05:04] but not at max [18:05:10] i wonder what the ul folks will say [18:05:25] the dell tool shows 5497W use [18:05:35] but thats not the MAX pulls, but that should be ok i think.... [18:05:35] yeah, that's above the max [18:05:39] yeah [18:05:43] :-/ [18:05:44] LeslieCarr: those the max or the data they have now? [18:05:58] those are the max ratings [18:06:02] hrmm [18:06:16] if we go by max we couldnt do but 75% of what we normally do in our racks [18:06:17] heh [18:06:29] we'll just be vague and see what they say ;] [18:06:39] the ex4500 i would expect to pull closer to the ex4200 due to not being fully loaded with optics which are the power draw [18:07:23] hrm [18:07:36] RobH: what about a 2nd 20A circuit in one of the racks [18:07:43] AaronSchulz: looking [18:07:45] or 2nd 30A circuit in the rack [18:08:16] I assume it routes stuff to the scalars [18:08:21] LeslieCarr: I think we are ok [18:08:26] the max draws are insanely high [18:08:43] we can also specifically set in the bios for them to stagger their power return for post [18:08:47] cool [18:08:52] AaronSchulz: two places it seems [18:08:56] so i think we are good. [18:09:25] lesliecarr: lmk if you have a link? [18:09:50] AaronSchulz: 1) text-settings.php, in turn used by the generator. 2) frontend.php, as an http_access deny to block external requests to that [18:09:59] AaronSchulz: you can also grep through generated/ to see the actual config [18:11:11] qi have link! [18:11:14] and traffic [18:11:15] huzzah [18:11:55] \o/ [18:11:59] =thumb_php' => 'rendering.svc.pmtpa.wmnet', [18:12:22] paravoid: so that sends thumb.php requests to scalars then right? [18:12:48] yes [18:12:58] RobH: note that we're not ordering the backup system yet [18:13:09] paravoid: can you also add ^/w/thumb_handler\.php to that regex [18:13:11] then we are fine for power [18:13:15] and I have my doubts as to whether we should try to cram that in there [18:13:17] ? [18:13:20] same. [18:13:33] it won't be expandable in any case [18:13:52] LeslieCarr: are you planning to use two MX80s for the caching site? [18:13:56] AaronSchulz: I can; can you explain what changed? [18:14:29] well the change was weeks ago, but thumb_handler wraps thumb.php and is used internally [18:14:33] ok, im going to go snag some lunch, back shortly. 
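A tiny sanity check of the ACL widening Aaron asks for above, extending the squid urlpath_regex from ^/w/thumb\.php to ^/w/thumb(_handler)?\.php so both entry points route to the image scalers; the sample paths mirror the commons URLs pasted a little further down.

    for p in '/w/thumb.php?f=Simon_Bolivar_Buckner_Sr.jpg&w=368' \
             '/w/thumb_handler.php/1/15/Simon_Bolivar_Buckner_Sr.jpg/368px-Simon_Bolivar_Buckner_Sr.jpg' \
             '/w/index.php?title=Foo'; do
        printf '%s  old:%s new:%s\n' "$p" \
            "$(echo "$p" | grep -cE '^/w/thumb\.php')" \
            "$(echo "$p" | grep -cE '^/w/thumb(_handler)?\.php')"
    done
    # expected: thumb.php matches both patterns, thumb_handler.php only the
    # widened one, and ordinary index.php requests match neither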
[18:14:38] mark: i figured use both, though i guess with power we can easily use one [18:14:53] since it can still be accessed externally, it should route to the scalars where its supposed to run, just like thumb.php [18:15:00] since they're free, maybe two yeah [18:15:09] otherwise I'd say, it's overkill hehe [18:15:11] no one gets URLs to thumb_handler.php, but just in case people hit it [18:15:15] exactly [18:15:22] what do you mean accessed externally? [18:15:26] could use one in another site for peering/transit [18:15:29] we don't allow that, frontend caches block it [18:15:35] it would be more expensive for dual 4500s but easier ;] [18:15:54] dont need to buy the rj45 to sfp things, since we have a bunch spare in tampa [18:16:00] (for the few non 10G servers) [18:16:05] would they work in the 4500? [18:16:09] paravoid: http://commons.wikimedia.org/w/thumb_handler.php/1/15/Simon_Bolivar_Buckner_Sr.jpg/368px-Simon_Bolivar_Buckner_Sr.jpg [18:16:14] (having cross rack 10g connections sounds horrible to me) [18:16:17] they should work in the 4500s [18:16:19] (but not sure) [18:16:27] we should try it out [18:16:33] cmjohnson1: has them and a 4500 in tampa [18:16:35] nah, we can do cross rack 10G in that setup [18:16:46] it's bad practice, but it's not like that setup will expand to more racks [18:16:51] i emailed asking if the partitions in rack can remove small acces panels [18:16:53] like in our racks in eqiad [18:17:01] so its easier to route the 3m DAC cables [18:17:15] if they have them, thats fine, if its over the top of the cabinet and back down, we may need longer. [18:17:19] oh [18:17:21] btw [18:17:24] they have access panels between the racks [18:17:28] we can move some squids from row a to row c now eh [18:17:30] New patchset: Pyoungmeister; "lucene: moving en search traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22055 [18:17:37] so 7 mounted, 9 not, and fresh out of ideas. anyone got any? [18:18:01] paravoid: http://commons.wikimedia.org/w/thumb.php?f=Simon_Bolivar_Buckner_Sr.jpg&w=368 [18:18:19] yeah, it seems to work, it's commented out for non-upload [18:18:20] Change merged: awjrichards; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/21753 [18:18:21] no idea why... [18:18:35] New patchset: MaxSem; "WLM updater script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17964 [18:18:46] mark: any insight? it seems you can hit imagescalers externally [18:18:48] I think its fine for them to work, but they should both go to the scalars [18:19:06] wikisource uses external thumb.php urls in one of it's extensions [18:19:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17964 [18:21:17] paravoid: sure, using en.wikipedia.org/w/thumb.php [18:21:25] or what do you mean? [18:21:55] that's what I mean, it's explicitly denied for upload.wm.org but allowed for the rest [18:22:03] sure [18:22:05] tons of people use it [18:22:07] for mobile clients and stuff [18:22:13] i referred to this in my mails last week [18:22:23] why is that blocked for upload then? [18:22:24] it seems to be the majority of traffic on the image scalers [18:22:41] because then there's no mediawiki instance mapped to it? 
[18:23:22] now it's just a normal mediawiki request for a certain project (wikipedia) and language (english) [18:23:32] and that's how mediawiki (even on the image scalers) gets initialized [18:23:35] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [18:23:35] PROBLEM - Puppet freshness on ms-be1011 is CRITICAL: Puppet has not run in the last 10 hours [18:23:35] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [18:23:35] that doesn't work for the upload domain [18:23:39] anyway [18:23:40] dinner's ready [18:23:42] aha [18:23:46] enjoy [18:24:01] I don't see why imagescalers need to know if it's english wikipedia, but okay, your explanation makes sense I guess [18:24:34] AaronSchulz: sorry for the interrupt, I try to take advantage of such questions to understand our setup better :) [18:24:41] because they run a regular mw instance like the other apaches [18:24:43] rendering can depend on wiki settings (including what extensions are running on that wiki) [18:24:52] so the only way they know the path to things is to do the mapping [18:24:55] project/langcode [18:24:58] -acl thumb_php urlpath_regex ^/w/thumb\.php [18:24:58] +acl thumb_php urlpath_regex ^/w/thumb(_handler)?\.php [18:25:03] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22055 [18:25:08] and there is already code for mapping site/lang in the host to a wiki ID and grabbing the settings [18:25:23] paravoid: yeah I saw it in git diff :) [18:29:35] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [18:36:44] there is one option I have left [18:36:51] meh [18:41:05] by switching search traffic from eqiad to pmtpa, traffic went down by 3 megs a second in eqiad, but up by 6 megs a second in eqiad... [18:41:08] interesting... [18:41:30] !log deploying squid config all for thumb_handler.php addition to thumb_php [18:41:35] PROBLEM - Puppet freshness on palladium is CRITICAL: Puppet has not run in the last 10 hours [18:41:40] Logged the message, Master [18:41:46] apergos: what's ms-be7's status? [18:41:51] just like it was [18:41:56] 7 disks mouonted, the rest not [18:42:04] for objects on those it serves 507s [18:42:47] PROBLEM - Apache HTTP on mw8 is CRITICAL: Connection refused [18:42:51] yeah, we should remove them from the rings then [18:42:56] so they can get back the 3rd copy [18:43:06] but let me login first, to have a look too [18:43:07] just in case [18:43:10] please do [18:43:27] I'm looking at things relating to the mpt2sas driver, in hopes I can turn up something [18:45:05] arghabharga. ubuntu-- [18:45:29] paravoid: can you help me make sense of a pbuilder snafu? [18:46:21] sec [18:46:25] thx [18:47:12] oh I think I may have finally found it [18:47:36] why oh why. just why. [18:48:45] apergos: sigh [18:48:50] broken controller I'd say [18:48:55] could be [18:49:00] note how this box also runs precise and a fairly recent kernel [18:49:04] while the others ran lucid [18:49:13] so we have 2.6.32 and 3.2 with the same symptoms [18:49:17] see what gets me is that we're using a real recent kernel on ms-be... forget if it's 10 or 6 [18:49:21] yeah [18:49:22] oh wait, that runs 3.2 [18:49:24] er, 2.6.32 [18:49:29] this is 2.6.32 [18:49:33] how that happened? 
[18:49:35] but there's one of em with 3.2.something [18:50:12] I'm sure fine with upgrading this kernel to whatever is the latest for lucid [18:50:21] I just don't expect it to make a difference [18:50:29] agreed [18:50:32] we've seen both [18:51:03] I guess these are h200s? [18:51:03] I'd say disable the 7(!?!) disks in the rings and open up a ticket for Chris [18:51:22] we could try some other controller with these that Sucks Less (tm) [18:51:24] *shrug* [18:52:05] :q! [18:52:07] err [18:52:08] apergos: ^ [18:52:15] Jeff_Green: same to you, buddy [18:52:37] apt-get install holy-handgrenade [18:52:43] which are you pointing to? [18:53:34] New patchset: Pyoungmeister; "search: moving pool 2 and 3 traffic to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22060 [18:53:42] 21:51 < paravoid> I'd say disable the 7(!?!) disks in the rings and open up a ticket for Chris [18:53:53] yeah, I'm already getting the device ids together [18:54:03] just wondered if there was something I missed [18:55:28] paravoid, apergos: do you know about owa1-3 status? per RT they are supposed to be repurposed and given to analytics, but they currently have swift stuff on them, like "swiftcleaner" and /srv/swift-storage [18:55:43] uh oh. er, no idea [18:56:05] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22060 [18:56:08] so is ms-be7 in the proverbial toilet as well [18:56:19] apergos ^ [18:56:41] mutante: they're part of the swift test cluster [18:56:50] cmjohnson1: yes [18:57:03] so that is 6, 7 and 10… not good! [18:57:13] nope [18:57:25] cmjohnson1: yes it's squarely in the sh-tter [18:57:38] PROBLEM - Puppet freshness on ms-fe2 is CRITICAL: Puppet has not run in the last 10 hours [18:58:09] paravoid: thanks, so we can't reinstall them yet. i am updating RT-2511 [18:58:26] because Diederik asked about that one [18:58:53] !log upgrading labsconsole to 1.20wmf10 [18:59:02] Logged the message, Master [18:59:55] mutante: I'm not sure what the status of the test cluster is though... [19:00:01] if it's still needed etc. [19:00:12] I've asked Ben about it before I got involved with swift and he said he needed it [19:00:17] that was a month or two ago [19:01:06] hmm, yeah, that's the thing i have been wondering about [19:01:07] why do the instructions have this [19:01:08] cp -a /etc/swift ~; cd ~/swift; [19:01:26] * Jeff_Green is disgruntled to have just wasted an hour undoing the malfeasance of boneheaded pbuilder package config [19:01:29] before removing the devices from the rings? [19:01:29] paravoid: are the drives part of an array? [19:01:49] and I run this on the particular host right? or do I have to be in sockpuppet in some weird directory? [19:02:14] apergos: you do that on one of the swift hosts (ms-fe1 has ~/swift/ already) [19:02:21] you build the rings [19:02:22] on a front end? [19:02:28] and then scp them to stafford volatile afaik [19:02:34] uggh [19:02:35] which then gets distributed to ms-be* [19:02:38] I see [19:04:07] apergos paravoid just forwarded email from dell tech [19:04:41] ok great, I'll look in a bit [19:06:52] PROBLEM - Puppet freshness on ms-be9 is CRITICAL: Puppet has not run in the last 10 hours [19:06:57] oh I see, I rebalance in this local copy and then push it to sockpuppet [19:07:28] can I just do it for the one ring file I altered?
(the other two are untouched) [19:08:24] seems so [19:16:01] New patchset: Pyoungmeister; "search: moving cluster 4 and prefix cluster to pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22062 [19:16:05] no dice [19:16:48] Change merged: Pyoungmeister; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22062 [19:17:50] !log all search traffic now pointed back to pmtpa [19:18:00] Logged the message, notpeter [19:19:26] ah nm [19:19:36] they aren't weighted to zero, they are actually removed, that's ok I guess [19:21:16] RECOVERY - Puppet freshness on ms-fe2 is OK: puppet ran at Thu Aug 30 19:20:40 UTC 2012 [19:23:13] RECOVERY - Puppet freshness on ms-fe4 is OK: puppet ran at Thu Aug 30 19:23:09 UTC 2012 [19:23:53] yeah cause I'm running it :-P [19:24:43] RECOVERY - Puppet freshness on ms-be1 is OK: puppet ran at Thu Aug 30 19:24:24 UTC 2012 [19:24:43] RECOVERY - Apache HTTP on mw8 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [19:26:27] :-) [19:27:16] RECOVERY - Puppet freshness on ms-be2 is OK: puppet ran at Thu Aug 30 19:26:50 UTC 2012 [19:27:25] so along with running puppet [19:27:34] all those conf file changes in puppet? they are going around now [19:28:06] I've audited those, it's no problem [19:28:44] !log finished upgrading labsconsole to 1.20wmf10 [19:28:54] Logged the message, Master [19:29:31] so I think we can do this rebuild of the ring files' local copy on any host in the cluster actually. seems pretty straightforward [19:29:39] yep [19:30:25] so no one reboot any more hosts. for any reason :-P [19:30:25] RECOVERY - Puppet freshness on ms-be5 is OK: puppet ran at Thu Aug 30 19:30:15 UTC 2012 [19:31:10] PROBLEM - swift-account-reaper on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:31:29] and btw about the mail [19:31:35] seriously, replace each disk?? [19:31:46] RECOVERY - Puppet freshness on ms-be3 is OK: puppet ran at Thu Aug 30 19:31:21 UTC 2012 [19:31:46] PROBLEM - swift-account-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:32:04] PROBLEM - swift-account-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:32:35] ugh really? [19:33:21] hi. Who can tell me what an e-mail alias on the lists server (*-owner@lists) points to?
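For readers following along, the ring change being worked through above boils down to roughly the sketch below, run from the ~/swift copy the instructions mention. The device id d42 is a placeholder (the real ids come from listing the builder file), and pushing the rebuilt ring back out to the ms-be/ms-fe hosts happens via the puppet volatile area as described in the conversation, not shown here.

    cp -a /etc/swift ~ && cd ~/swift
    swift-ring-builder object.builder               # list devices; note the ids of the dead disks on ms-be7
    swift-ring-builder object.builder remove d42    # placeholder device id; repeat for each unmounted disk
    swift-ring-builder object.builder rebalance     # writes a new object.ring.gz next to the builder file
    # then copy object.ring.gz to the distribution point so puppet pushes it to the cluster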
[19:33:34] PROBLEM - swift-object-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:33:43] PROBLEM - swift-container-server on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:35:04] RECOVERY - swift-account-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:35:04] RECOVERY - swift-object-auditor on ms-be2 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:36:25] RECOVERY - swift-account-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:36:43] RECOVERY - swift-container-server on ms-be2 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:38:40] RECOVERY - Puppet freshness on ms-be4 is OK: puppet ran at Thu Aug 30 19:38:19 UTC 2012 [19:43:46] RECOVERY - Puppet freshness on ms-be6 is OK: puppet ran at Thu Aug 30 19:43:32 UTC 2012 [19:44:09] hmm this was probably a mistake to run puppet on ms-be6 [19:44:10] oh well [19:51:43] RECOVERY - Puppet freshness on ms-be8 is OK: puppet ran at Thu Aug 30 19:51:14 UTC 2012 [19:52:46] RECOVERY - Puppet freshness on ms-be9 is OK: puppet ran at Thu Aug 30 19:52:31 UTC 2012 [19:53:42] apergos: ms-be6 is dead for all intents and purposes [19:53:50] even its main / disks don't work [19:54:00] well it's getting a puppet run anyways :-P [19:54:05] which will finish sometime [19:54:16] RECOVERY - Puppet freshness on ms-be11 is OK: puppet ran at Thu Aug 30 19:54:04 UTC 2012 [19:55:03] New patchset: Dzahn; "set umask 0002 for users in wikidev group, RT-804" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22111 [19:55:13] it has to try and fail to mount a bunch of things, then it will be done [19:55:37] couple of dpkg issues over there too, ignoring em [19:55:46] RECOVERY - Puppet freshness on ms-be12 is OK: puppet ran at Thu Aug 30 19:55:19 UTC 2012 [19:55:46] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [19:55:46] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [19:56:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22111 [19:56:28] New patchset: Dzahn; "set umask 0002 for users in wikidev group, RT-804" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22111 [19:57:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22111 [19:57:28] on ms-be11 there was some sort of issue with sdi, puppet was trying to partition it and it failed [19:57:51] it wanted to create a filesystem on there but of course that failed too [19:58:08] oh come on [19:58:22] can I pass the buck to you on that? [19:58:55] I'm thinking that now it is actually pretty late to take on a new issue. this ring change for ms-be7 was supposed to be a short deal :-P [19:59:11] yeah... [19:59:41] ah, flapped.
sdi is now sdo [20:00:05] * apergos runs puppet on ms-be10 just for fun [20:00:11] that's the last of em [20:00:43] RECOVERY - Puppet freshness on ms-be10 is OK: puppet ran at Thu Aug 30 20:00:27 UTC 2012 [20:01:25] ahazehaz [20:01:33] New patchset: Pyoungmeister; "mediawiki module: correcting l10nupdate user key" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22112 [20:01:50] any hint about where I could put a puppet class + files dedicated to the beta project, which is only on labs? [20:01:53] apergos: did you open an RT for ms-be7? [20:02:11] ugh [20:02:16] no I completely spaced it [20:02:21] I thought about manifests/misc/beta.pp and creating my class under the beta:: namespace [20:02:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22112 [20:02:36] and the files I need under files/misc/beta/ [20:02:50] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22112 [20:02:57] !log reinstalling cp1024-cp1028 [20:03:07] Logged the message, Master [20:03:08] apergos: okay, I will [20:03:10] no worries [20:03:17] oh, I was just in there [20:03:34] https://rt.wikimedia.org/Ticket/Display.html?id=3282 [20:03:38] I think this can remain the ticket [20:03:56] just needs updating with the new disks that don't show up [20:04:01] I have that list [20:04:15] no, let's open a new one [20:04:21] RobH requested that [20:04:28] and that ticket is also quite messed up [20:04:36] oh. ok [20:04:40] eh? [20:04:51] RobH: you said one ticket per hardware problem [20:04:54] yes please. [20:04:59] we do one ticket per case with dell [20:05:05] that's what I said :) [20:05:09] when folks make a single huge ticket, it becomes hard to track what's going on [20:05:10] yep [20:05:41] I really wish it could be the driver [20:05:54] *sigh* [20:06:26] I wonder how long it'll be until we start losing data [20:06:30] ok well should I copy over the ms-be7 info to the new ticket or a new ticket with just the new failures? [20:06:50] apergos: I say new ticket with new failures and mention the other ticket on the body/metadata [20:07:00] I'd say [20:11:58] https://rt.wikimedia.org/Ticket/Display.html?id=3500 not very exciting but the bare bones are there [20:12:25] okay [20:13:29] anyone familiar with netinet/in.h ipv6 structures? :) [20:14:01] i'm trying to cast a byte array containing a raw ipv6 address to an in6_addr, and I must be doing something really stupid [20:15:12] do we have these issues on any box without ssds? (just out of curiosity) [20:15:24] ah I should ask #tech, sorry, danke [20:19:57] notpeter: I'll exchange you srv281 for 24 fully populated C2100s! [20:22:11] I mean... can I have the c2100s for my home? [20:22:12] I'd do that. [20:24:38] notpeter: what did you do to srv281… you've turned it orange [20:24:53] sdi on ms-be11 has been serving up 507s for a while it seems [20:25:05] apergos: yeah at least 10 days [20:25:06] cmjohnson1: bucket o' orange paint [20:25:25] cmjohnson1: not sure. brought it up, started using it, it crashed. haven't investigated [20:25:29] it has a history of crashing. [20:25:33] it does [20:26:00] 281, 266, and 291 [20:26:10] are all unstable [20:26:15] cmjohnson1: iunno. I'm not really worried about 281.
we can put it in the "throw it into the bay" queue for all I care [20:26:47] i think it may be under warranty..so we'll see [20:27:04] it's a friend of 206 and 266:) [20:27:36] cmjohnson1: excellent [20:35:15] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [20:35:58] RECOVERY - swift-account-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [20:36:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [20:36:34] RECOVERY - swift-object-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [20:37:10] RECOVERY - swift-container-server on ms-be6 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [20:39:32] !log srv291 shutting down to reseat DIMM [20:39:43] Logged the message, Master [20:43:10] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:43:19] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:43:19] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:43:28] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:43:37] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:43:37] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:43:37] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:43:41] hmmm [20:43:46] PROBLEM - Varnish traffic logger on cp1031 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:43:55] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:44:01] !! [20:44:13] it's fine [20:44:13] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:44:13] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:44:22] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [20:44:31] PROBLEM - Host srv291 is DOWN: PING CRITICAL - Packet loss = 100% [20:44:59] Reedy. ok, good. i am reinstalling cp1021 to cp1028, was just wondering if this might be related since i rebooted 2 boxes in that second [20:46:03] New patchset: Ottomata; "Need newline at end of file for udp2log to read properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22117 [20:46:27] notpeter, could you do me a favor and approve that real quick? [20:46:27] https://gerrit.wikimedia.org/r/22117 [20:46:48] prolly lemme look [20:46:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22117 [20:47:17] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22117 [20:47:28] newlines considered dangerous. -1. [20:48:13] thank you! 
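On the "newline at end of file" fix for udp2log: one hedged way to detect and repair a missing trailing newline from a shell; the path below is a placeholder, not the actual file from that patchset.

    f=/path/to/udp2log-filter-file              # placeholder path
    tail -c1 "$f" | read -r _ || echo >> "$f"   # read fails if the last byte is not a newline, so append one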
[20:48:36] New patchset: MaxSem; "WLM updater script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/17964 [20:49:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/17964 [20:50:22] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [20:51:00] ottomata: nearly 100% of buggy code has newlines in it [20:51:02] true story [20:51:26] haha [20:52:04] notpeter: I'd like to see the buggy code that doesn't [20:52:04] New review: Hashar; "Being tested on deployment-integration labs instance. Will polish it up on Friday." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/22116 [20:53:35] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [20:54:07] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2600* [20:54:26] New review: Hashar; "PS2: add the beta.pp manifest" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/22116 [20:54:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [20:55:09] !beta applying beta::scripts to deployment-integration [20:55:10] !log deployment-prep applying beta::scripts to deployment-integration [20:55:13] grr [20:55:22] Logged the message, Master [20:56:55] AaronSchulz: cat /dev/random | xxd -ps > solve_all_problems.bin [20:57:00] no newlines, 100% bugs [20:58:06] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [20:59:54] PROBLEM - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is CRITICAL: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y CRITICAL - *2625* [21:01:06] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [21:02:54] RECOVERY - ps1-d1-sdtpa-infeed-load-tower-A-phase-Y on ps1-d1-sdtpa is OK: ps1-d1-sdtpa-infeed-load-tower-A-phase-Y OK - 1938 [21:06:30] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [21:07:06] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [21:08:00] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [21:08:45] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [21:09:30] RECOVERY - Varnish traffic logger on cp1031 is OK: PROCS OK: 3 processes with command name varnishncsa [21:10:15] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [21:11:00] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [21:13:17] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [21:14:05] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [21:17:07] !log removing srv194 from apache pool [21:17:17] Logged the message, notpeter [21:18:48] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [21:18:48] RECOVERY - Host srv291 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [21:20:18] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [21:30:35] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [21:31:26] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [21:39:04] about to run scap [21:46:00] New patchset: Hashar; "(bug 39701) beta: automatic MediaWiki update" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/22116 [21:46:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/22116 [21:47:34] New review: Hashar; "PS5 fix logging. stdin & stderr are now appended to /var/log/wmf-beta-autoupdate.log" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/22116 [22:27:06] !log ran a maintenance script on labsconsole to update all instance pages [22:27:16] Logged the message, Master [22:38:29] PROBLEM - Host virt1003 is DOWN: PING CRITICAL - Packet loss = 100% [22:41:56] RECOVERY - Host virt1003 is UP: PING OK - Packet loss = 0%, RTA = 27.25 ms [23:07:53] PROBLEM - check_job_queue on spence is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , ruwikisource (21916) [23:08:11] PROBLEM - check_job_queue on neon is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 9,999 jobs: , ruwikisource (22010) [23:23:39] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000 [23:24:24] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000 [23:28:01] well that's a lie [23:28:29] 15041 [23:30:47] Reedy: soo many "Invalid message parameter" exceptions, ugh [23:30:56] :( [23:31:00] and it looks like the enwiki master had problems ~6 hours ago [23:31:02] No useful information to go with them? [23:31:16] well there are backtraces :) [23:31:27] What's at fault? [23:31:43] Reedy: just ssh into fluorine [23:32:05] it would help if i could spell [23:35:59] TimStarling: so, how is ScanSet chugging along? [23:36:51] same as always ;) [23:41:29] TimStarling: what are you planning to do with it? [23:43:40] Reedy: thumbnail.log is a fire-hose [23:58:26] * AaronSchulz wonders where tim went [23:59:51] New patchset: CSteipp; "Adding GPG Keys for CSteipp" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/22158
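For context on that last exchange, chasing the exceptions looks roughly like this; the log directory on fluorine is assumed, not stated anywhere in this discussion.

    ssh fluorine
    grep -c 'Invalid message parameter' /a/mw-log/exception.log   # assumed path; count the exceptions
    tail -f /a/mw-log/thumbnail.log                               # the "fire-hose" mentioned above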