[00:05:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:06:29] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2729 [00:06:30] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2729 [00:11:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.044 seconds [00:15:06] New patchset: Diederik; "Improvements: 1) IP range filtering and regular expression now work. 2) Started with unit-tests 3) Major refactoring" [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2698 [00:19:08] New patchset: Diederik; "Improvements:" [analytics/udp-filters] (refactoring) - https://gerrit.wikimedia.org/r/2698 [00:27:53] how do we clear a cached dns miss again ? [00:27:57] (other than just waiting it out ;) [00:30:36] oh found it :) [00:34:24] !log reinstalling neon [00:34:26] Logged the message, Mistress of the network gear. [00:34:59] New patchset: Lcarr; "Bringing neon back to life" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2730 [00:37:45] New patchset: Ryan Lane; "Use the simple scheduler, rather than chance." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2731 [00:38:20] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2731 [00:38:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2731 [00:40:09] PROBLEM - Host cp1017 is DOWN: PING CRITICAL - Packet loss = 100% [00:41:57] RECOVERY - Host cp1017 is UP: PING OK - Packet loss = 0%, RTA = 26.43 ms [00:44:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:46:00] PROBLEM - Backend Squid HTTP on cp1017 is CRITICAL: Connection refused [00:46:45] PROBLEM - Frontend Squid HTTP on cp1017 is CRITICAL: Connection refused [00:50:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.045 seconds [00:55:20] New patchset: Bhartshorne; "syntax corrections, size correction, partition number correction" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2732 [00:55:47] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2732 [00:56:22] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2732 [00:57:45] hi maplebed, would this be a good time to install some software on stat1? [00:59:36] New patchset: Ryan Lane; "Making labstore1-4 LDAP clients" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2733 [00:59:39] I think I do have some time before the end of the day. [00:59:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2733 [01:00:00] drdee: I saw the request for virtualenv and pip again [01:00:14] drdee: is it possible for you guys to list the specific python packages you need? [01:00:17] as before [01:00:26] also, the maxmind c library should already be there... 
[01:00:27] Ryan_Lane: there's a wiki page and some RT tickets. [01:00:47] * Ryan_Lane hates virtualenv and pip [01:00:53] http://www.mediawiki.org/wiki/Analytics/Infrastructure/Stat1 [01:01:34] but I'm with you on virtualenv and pip. There's eternal conflict between languages wanting to manage their modules and distributions wanting to manage packages. [01:01:41] distributions win, when ops has to manage a system. [01:01:58] which means if you write software that depends on the language's package management, it's likely that we won't be able to install it easily. [01:02:09] yes. and virtualenv and pip are a great way to ensure we'll never be able to rebuild a system [01:02:32] at the same time, using the language's package management to speed development works fine, so long as it's converted to distribution-based package management before getting "deployed". [01:02:39] RECOVERY - Frontend Squid HTTP on cp1017 is OK: HTTP OK HTTP/1.0 200 OK - 27545 bytes in 0.126 seconds [01:02:50] (which is total fail when the dev system starts churning out "production" reports, as happens so incredibly frequently.) [01:02:57] it'll never happen, though [01:03:13] unlike ryan, I am an eternal optimist. [01:03:16] :D [01:03:22] I know better :D [01:03:51] RECOVERY - Backend Squid HTTP on cp1017 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.178 seconds [01:04:06] I'm going to add: misc::package-builder [01:04:15] since they need C, they'll likely need those too [01:04:39] hm. that may be excessive [01:04:58] heh. 
it doesn't even include C libraries [01:05:03] I'll just include the package class [01:05:05] for git [01:12:20] New patchset: Ryan Lane; "Adding git and mysql-client to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2734 [01:12:55] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2734 [01:13:18] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2733 [01:13:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2734 [01:13:19] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2733 [01:14:28] Ryan_Lane: virtualenv was requested by Andrew Otto; I understand his reasons but I also understand your concerns [01:14:54] we gain flexibilty and you will be more woried, that's the tradeoff [01:15:11] well, it's fine for development [01:15:15] but not for production [01:15:48] once we are to go to production we can puppetize the whole thing [01:16:03] why's it being developed on stat1, rather than labs, then? [01:16:07] does it require private data? [01:16:11] for development? [01:16:14] yes [01:16:21] hm [01:16:53] Ryan_Lane: you said it should already have the maxmind geoip library? is it getting that not-through-puppet? [01:16:54] it would be really really awesome if we can have private data on labs :) [01:17:05] maplebed: I think it was installed manually [01:17:13] drdee: we're discussing how to implement it [01:17:18] possibly [01:17:35] maplebed: we probably need to add classes for it [01:17:55] puppet on this system is fairly odd [01:17:55] there's already a generic::geoip class. [01:18:02] ah [01:18:20] that would be what's needed [01:18:21] it installs /usr/share/GeoIP/stuff. [01:18:26] that's the same thing? [01:18:27] ok [01:18:28] yep. 
adding it [01:19:05] New patchset: Ryan Lane; "Adding geoip libraries to ensure it's managed by puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2735 [01:19:29] which version of geoip is that? [01:19:41] whatever comes with ubuntu lucid [01:19:49] if you need a newer version, we'll need to backport [01:19:54] huh. https://github.com/rcrowley/puppet-pip [01:20:02] * Ryan_Lane pukes [01:20:06] no thanks [01:20:14] that's like managing ruby gems via puppet [01:20:38] we don't use third party repos for apt. do we really want to trust one for system libraries? :) [01:20:57] drdee: 1.4.6.dfsg-17 [01:21:02] that's the version in lucid [01:21:38] if precise has a newer version, we can upgrade to precise for it, or it can be backported [01:22:07] if we need an even newer version, it'll need to be backported into a package for whatever ubuntu version we want to use [01:22:33] it's perfect [01:22:36] great :) [01:23:22] does it also install the dev libs of geoip? [01:23:56] do you need them? was just about to ask that in the ticket [01:24:26] oh. it does [01:24:42] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2735 [01:24:42] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2735 [01:24:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:29:49] maplebed, ryan_lane: thanks for helping us out and my apologies [01:29:58] oh. 
no worries [01:30:01] we'll work something out [01:30:10] hopefully we can work out what you need with ubuntu packages [01:30:16] if not, we'll take a look at virtualenv and pip [01:30:24] I'd *really* like to avoid that, though [01:31:07] as long as very good documentation is kept, it may be acceptable [01:31:23] meaning, whenever someone calls pip, they document the package they added [01:31:42] one of the issues I have with pip is that it installs newer versions of python libraries than may be included with lucid [01:32:08] which then makes code dependent on things we'd have to backport [01:32:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.064 seconds [01:35:57] New patchset: Bhartshorne; "adding class to install pip for stat1 - dev use only." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2736 [01:36:32] maplebed: I hope you'll backport whatever they need, when we run into that problem, too ;) [01:36:34] Ryan_Lane: r2736 is a start down that road. [01:36:54] they need stuff, they backport it. [01:36:57] why would we need to backport stuff? [01:37:19] drdee: for instance, if you are using python-blah version 2.3 via pip [01:37:19] because of what Ryan_Lane just said - pip can get you newer versions than ubuntu. 
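Editor's note: the backport concern in this exchange (pip installing a newer library than the distribution ships) can be sketched as a simple version comparison. This is only an illustration, not a tool anyone in the channel used. The package names and the `needs_backport` helper are hypothetical, the 2.3 and 1.9 figures are the ones quoted in the discussion, and the parser assumes plain dotted numeric versions (a Debian-style string such as 1.4.6.dfsg-17 would need a real version comparator).

```python
def parse_version(text):
    """Turn a plain dotted version such as '2.3' into a tuple of ints, e.g. (2, 3)."""
    return tuple(int(part) for part in text.split("."))


def needs_backport(pip_versions, distro_versions):
    """Return the names that the distribution cannot satisfy as-is.

    A name is flagged when the distribution does not ship it at all, or
    ships an older version than the one installed through pip.
    """
    flagged = []
    for name, pip_version in sorted(pip_versions.items()):
        distro_version = distro_versions.get(name)
        if distro_version is None:
            flagged.append(name)
        elif parse_version(pip_version) > parse_version(distro_version):
            flagged.append(name)
    return flagged


# Hypothetical inventory: "python-blah" is the placeholder name from the chat,
# "python-foo" is invented here to show a package that needs no backport.
pip_installed = {"python-blah": "2.3", "python-foo": "1.0"}
lucid_packages = {"python-blah": "1.9", "python-foo": "1.0"}

print(needs_backport(pip_installed, lucid_packages))
```

Keeping a list like `pip_installed` up to date is essentially the documentation that the ops side asks for whenever someone calls pip.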
[01:37:31] and lucid had 1.9 [01:37:43] you'll be using features that don't exist in 1.9 [01:37:45] and depending on them [01:37:56] but that's why you use virtualenv [01:38:00] then, when we puppetize it, you are missing the packages [01:38:07] err, missing the correct version [01:38:29] drdee: we don't use third party repositories in production, as a rule [01:38:40] they can't be trusted, so we don't use them [01:39:08] so, when we puppetize the system, whichever pip installed python libraries you are using need to come from ubuntu, or they need to be packaged [01:39:22] if you are depending on a newer version of a library, it needs to be backported [01:39:52] this is why we recommend trying to avoid using pip as much as possible [01:40:08] first see if its available as an ubuntu package, then use pip if not, and document it [01:45:20] ok [01:45:30] I'm mentioning this in the ticket [01:47:19] New patchset: Bhartshorne; "adding class to install pip for stat1 - dev use only." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2736 [01:47:36] Ryan_Lane: if I "abandon" a change in gerrit, does it stick around for later viewing? [01:47:44] yep [01:47:45] or does it nuke it compeltely? [01:47:47] ok. [01:48:09] I'm going to abandon that change, but it's there for reference if we decide it's necessary later. [01:48:23] oh. you aren't adding it to stat1? [01:49:34] I thought you had convinced drdee not to use it. [01:50:00] well, it's possible they'll actually need it for some things, as much as I hate that [01:50:12] I suppose I can install it anyways and if they don't need it they won't use it. [01:50:31] I'm writing a comment to the ticket that's basically "please don't use pip unless you have to use it, and if you do, make sure to document what pip is installing" [01:50:45] also explaining why we don't want people to use it [01:50:58] New patchset: Bhartshorne; "adding class to install pip for stat1 - dev use only." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2736 [01:51:14] we need to work out the private data policy in labs :) [01:51:19] so that this wouldn't be an issue [01:51:36] well, I guess it really is one anyway [01:51:43] yup. [01:51:53] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2736 [01:51:54] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2736 [01:52:07] we should probably write development recommendations for people in labs [01:52:15] "how to not make ops' life hell" [01:52:19] heh [01:52:22] :) [01:53:02] well, this ticket can act as a starting point for some docs, I guess. [01:56:12] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 635s [01:57:06] PROBLEM - Misc_Db_Lag on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 689s [02:04:24] New patchset: Bhartshorne; "puppet exec enforces full paths; that's cool." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2737 [02:04:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:06:30] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2737 [02:06:31] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2737 [02:08:35] https://labsconsole.wikimedia.org/wiki/Help:Development_recommendations_for_easily_moving_to_production [02:08:52] drdee, maplebed: ^^ [02:10:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 5.475 seconds [02:14:00] anyone know where our favicons are stored? (specifically the foundation logo favicon) [02:15:50] normally at /favicon, no? 
[02:16:20] yep /favicon.ico [02:16:22] it's standard [02:16:32] Jamesofur: or do you mean on the filesystem on fenari? [02:16:39] or the filesystem on the apaches? [02:22:19] Ryan_Lane, I just want to grab the image to make it the favicon on the store. I may be able to grab it on fenari if that's the easiest spot or if it's somewhere in SVN or something where I can grab it [02:22:44] just /favicon.ico on any site you want to get it from [02:22:50] Jamesofur: eg http://wikimediafoundation.org/favicon.ico [02:23:11] thanks! [02:25:11] puppet on stat1 is spewing out 4700 lines of junk because of file ownership of /a. [02:25:18] ::sigh:: [02:27:00] New patchset: Bhartshorne; "trying to remove 4700 lines of spam as stat1 tries to manage /a for ezachte." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2738 [02:27:23] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2738 [02:27:23] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2738 [02:42:33] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [03:08:14] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 23s [03:09:17] RECOVERY - Puppet freshness on mw1002 is OK: puppet ran at Thu Feb 23 03:09:00 UTC 2012 [03:09:35] RECOVERY - Misc_Db_Lag on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 1s [03:12:26] RECOVERY - Puppet freshness on db46 is OK: puppet ran at Thu Feb 23 03:12:05 UTC 2012 [04:19:07] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 213 seconds [04:19:52] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 232 seconds [04:23:19] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 202 seconds [04:23:46] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT 
replication delay 228 seconds [06:08:29] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [06:14:29] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [06:14:29] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [07:42:06] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds [07:42:15] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 2 seconds [08:16:06] PROBLEM - Lucene on search9 is CRITICAL: Connection timed out [08:21:48] RECOVERY - Lucene on search9 is OK: TCP OK - 0.023 second response time on port 8123 [08:58:54] :o [08:58:58] someone can fix labs pls [08:59:25] it seems that someone "fixed" firewall or something like that [08:59:42] it's not possible to connect to public instances [09:00:06] ssmollett, mut-away... [09:01:24] I think you'd be more likely to get results if you pinged apergos or mark [09:07:36] petan|wk: when was this known to be working last? [09:12:40] ah, forget me :) [09:12:45] it was my fault heh [09:12:53] ok then :-) [10:27:42] New patchset: ArielGlenn; "verify objects; no-derive tag for uploads; success/failure message." [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/2739 [10:27:46] New review: gerrit2; "Lint check passed." [operations/dumps] (ariel); V: 1 - https://gerrit.wikimedia.org/r/2739 [11:33:54] PROBLEM - RAID on searchidx2 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:35:51] RECOVERY - RAID on searchidx2 is OK: OK: State is Optimal, checked 4 logical device(s) [12:43:47] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [13:01:09] can someone please have a look at gerrit and review/merge https://gerrit.wikimedia.org/r/#change,2677 (simple URL change) [13:01:24] does not need puppet to be run right now, it is not that urgent :-D [13:02:03] I have also submitted two changes to ignores python compiled files ( .pyc ) https://gerrit.wikimedia.org/r/#change,2587 and vim swap files (.swp) https://gerrit.wikimedia.org/r/#change,2587 [13:02:16] the python one is https://gerrit.wikimedia.org/r/#change,2514 sorry :D [13:48:54] PROBLEM - RAID on search7 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:50:42] RECOVERY - RAID on search7 is OK: OK: 1 logical device(s) checked [16:03:40] mark: hi [16:08:08] robh: rt 2497...please take a look when you get a chance [16:08:21] checkin now [16:09:55] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours [16:10:31] cmjohnson1: so its the master, and I guess asher would like us to migrate master to another one (so he would do that) [16:10:37] but lets see if i cannot get the disk id [16:10:57] cmjohnson1: disk 10 [16:11:33] cmjohnson1: so normally we can hot swap this, but since asher would prefer we swap master first [16:11:46] see if you cannot see what disk i mean, and then we can have asher do his thing before we replace [16:12:33] robh: now that the cable is done (thanks) I can't actually change stuff over there because the dreaded rsync error is back... so I gotta work it out with the other end before switching anything else up :-/ [16:12:41] soooo aggravating [16:12:54] i do not see any orange indicators on db22 [16:12:58] cmjohnson1: ct asked us to migrate master, not asher, my bad. [16:13:26] master is usually disk1? 
[16:14:00] cmjohnson1: no [16:14:12] cmjohnson1: ok, disk 10 MAY be flashing an identify led [16:14:20] but since the disk is bad, it may not be taking this command. [16:14:36] do not swap drives on a db master [16:15:16] mark: wasnt going to since ct asked us not to. [16:15:25] robh: all are flashing but one ...i am checking disk diagram to confirm disk 10. [16:15:28] just trying to see if the drive id command is working [16:15:35] cmjohnson1: dont pull anything though [16:15:55] mark: can you check and see if there are any connections needed to scs-c1 to row b in pmtpa. I am moving scs-c1 to d1 today [16:15:55] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours [16:15:55] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours [16:15:59] robh: no [16:16:11] ok, well, i assigned the ticket to asher [16:16:19] he can migrate which is master and assign back for further work [16:16:23] for now its going to wait [16:16:29] ok...so we'll standby [16:16:33] it seems it wont take the id command on the bad disk, so later when we can do it [16:16:39] i can id the good disk above it, then below it for you [16:16:51] ok..that is great... [16:16:53] then we can pull the bad disk while its up to see if we got it right, but it could cause downtime so it has to wait [16:17:03] cmjohnson1: yeah, to csw5-pmtpa [16:17:06] that can be a very temp run though [16:17:12] csw5-pmtpa is going away soon, but is still critical now [16:17:47] is it okay to relocate.?..i am going to have to disconnect the cables [16:20:23] yes [16:20:31] we only need it during emergencies [16:20:56] just run a quick cable and make sure it works, don't need to do it neatly [16:21:46] okay...thx [17:04:23] robh: rt 2484 and rt 2485....when you get a chance....memory issues..thx [17:17:31] New patchset: Andre Engels; "Porting my changes to the new version of the code. 
Main changes: * Including traits.py, containing a number of 'standard' traits, enabling them to be re-used without having to be recopied or rewritten * Including selectors.py, which does the same for " [analytics/reportcard] (andre/mobile) - https://gerrit.wikimedia.org/r/2740 [17:22:54] cmjohnson1: so for this stuff, always check here first [17:22:58] http://noc.wikimedia.org/dbtree/ [17:23:13] this will let you see if its in the server rotation on mysql, in this case it is a slave on s7 [17:23:28] slaves should be able to be shutdown at any time for testing, as long as an ops level person bring it down gracefully [17:23:40] okay [17:25:07] robh: can you bring them down for me. [17:25:16] you have copies to run on both right? [17:25:23] cuz if its one at a time, db15 can go first [17:25:43] go ahead and bring 15 down [17:25:53] and i will take a look at 18 later [17:26:07] ok, since 18 is in rotation, its best to keep it up until you are ready to test [17:26:15] since it means it wont fall far behind in replication [17:26:23] !log db15 shutting down for memtest [17:26:26] Logged the message, RobH [17:26:38] cmjohnson1: db15 is all yours [17:27:36] ok...robh can you send me wikitech link for the pin out for the orange serial cables for scs [17:28:38] http://wiki.wikked.net/wiki/OpenGear#Cisco_-_Opengear_cable [17:35:10] morning, RobH, mark! [17:35:28] mark: the installer is creating a GPT partition table. [17:35:29] hiya [17:35:45] maplebed: i didnt put in my email, but not having a partman on those isnt the worst [17:35:53] sicne it means if they pxe boot they wont wipe out any data [17:36:00] which is the drawback to pxe partman ;] [17:36:22] for the longest time we didnt keep individual server names in there on databases (before they were db# and such) [17:36:26] RobH: ::sigh::. we have 4 now, we'll probably get 5-10 more during the next year, and grow by a few a year beyond that. [17:36:51] but yeah, we could do them by hand. 
[17:36:54] well a quick look at it makes me think what you did should work [17:37:12] so it means a lot of partman work, which both myself, daniel, peter, and mark can tell you, sucks. [17:37:16] ;] [17:39:03] and me and leslie too. [17:39:27] yea, would be nice to have it automated =P [17:39:39] ok, well, assuming we do give up, I've never done an install without partman; are there docs? [17:40:38] we do not have any ourselves nah, its basic ubuntu installer stuff, so they would have it documented. basically you want to pxe boot and attach to console, then when it gets to the partitioning prompts it will stop for your input [17:40:44] you want to choose the manual partition option [17:41:20] then it will show you all the disks, you want to then setup the bios 1m part on sda and sdb, then setup two identical partitions on those two disks at 120GB for the / [17:41:30] and choose 'physical whatever for software raid' as the type [17:41:55] then setup your swap and choose the raid setup option [17:42:06] which then is pretty self explanatory but ping me when yer there if you want help =] [17:44:56] oh, so do all the normal building a server stuff but just have no entry for the host in netboot.cfg? [17:48:49] hey guys. can i get a password reset on Mobile-feedback-l Moderator please [17:48:59] now that i'm back id like to clean up that list [17:49:08] if its not a quick fix then i'll file a ticket [17:51:25] * tfinc goes to cut a ticket [17:51:34] http://rt.wikimedia.org/Ticket/Display.html?id=2508 hopefully one of you can take a look [17:52:19] tfinc: I think you can request a password reminder. [17:52:20] * maplebed verifies [17:53:28] thanks maplebed [17:53:40] i'm not seeing it but could easily be not looking in the right place [17:56:16] ok, I have it narrowed down to the OAI harvester never starting on the new search indexer [17:56:19] progress! [17:56:39] but why.... [17:56:41] win! 
[17:57:11] it's not an access thing, as the requests just go to the eqiad LBs [17:58:02] it's actually that it never starts [17:58:05] things like [main] INFO org.wikimedia.lsearch.oai.IncrementalUpdater - Authenticating ... [17:58:09] should show up in the logs [17:58:19] but even on debug, the letters oai never show up in the logs [17:59:29] notpeter, where is the log for the OAI process? [18:00:08] rainman-sr: on searchidx2 it just goes to log-all [18:00:32] yep, that i know [18:00:38] where is it on searchidx1001 [18:00:38] ah [18:00:51] that's a very good question... [18:00:57] how is that configured? [18:00:58] the incremental updater is a special process [18:01:31] i.e. it's a standalone java program [18:01:44] ah, ok [18:02:12] e.g. look at /home/rainman/scripts/search-inc-all [18:02:52] and search-restart-indexer-searchidx2 [18:02:59] oh! [18:03:02] which is calling it to start/restart indexing jobs [18:03:02] ok [18:03:18] ok, this is making a lot more sense! [18:03:19] woo! [18:03:23] ;) [18:03:30] is it reasonable to just include that in the init script? [18:11:42] * AaronSchulz wonders what keeps periodically breaking graphite [18:14:24] maplebed: any luck ? [18:19:23] LeslieCarr: any idea why noc.wikimedia.org/cgi-bin/report.py is slow? [18:20:36] tfinc: sorry I got distracted. [18:21:18] AaronSchulz: not at all :) i can look [18:23:36] wow that's incredibly slow [18:23:39] noc is slow [18:23:53] yeah, but there's no real reason it should be like that [18:24:05] cpu's are low, free memory [18:24:10] quick, let's all click on it and see if it goes faster! 
[18:24:20] no crazy errors in logs [18:24:41] oh [18:24:48] usually it is slow when someone sends crapload of profiling sections [18:24:57] well this could be why Timeout waiting for output from CGI script /usr/lib/cgi-bin/report.py, referer: http://noc.wikimedia.org/cgi-bin/report.py?db=1.19 [18:25:07] which is not the case now either [18:25:41] oh wait [18:25:43] it isn't spence nowadays [18:25:49] it's fenari now [18:26:23] i'm going to guess something is wrong with the script (ms obvious!) [18:26:27] no [18:26:29] its professor [18:26:37] and no, there's nothing wrong with the script [18:27:46] root@professor:/tmp# du -sh stats.db [18:27:46] 556M stats.db [18:27:47] that would do it [18:27:48] :) [18:27:51] ah :) [18:27:53] hehe [18:28:02] * AaronSchulz tried to clear profiling [18:28:10] it can't run through that big of a db ? [18:28:12] template profiling is enabled again [18:28:14] I had it hacked out [18:28:19] someone decided it is good idea to enable it again [18:28:20] yep [18:28:21] in 1.19 branch [18:28:24] I was looking at that [18:29:11] we need fair sampling for that stuff [18:29:16] full profile won't work [18:33:48] bonvenon qchris :) [18:34:55] http://live.wikimedia.in/ [18:35:15] maplebed: I believe you don't need to give grub a partition when installing to gpt, just the device [18:35:23] but not 100% sure... but I think it may be failing for some other reason [18:35:33] perhaps trying the install command manually in the installer will clear it up [18:35:47] yeah, I thought so too. [18:35:53] domas: any process you can nuke? [18:36:09] maplebed: perhaps I'll try it when I'll install "my" node [18:36:15] if you don't beat me to it [18:36:21] aaronschulz: what do you mean? [18:36:28] I won't; I'm going to build one by hand for now. [18:36:31] ok [18:36:35] domas: to make report.py responsive [18:36:56] aaronschulz: I already said what is wrong [18:37:04] template profiling is back on [18:37:14] which is sending profiling keys like... 
[18:37:15] not as of three minutes ago [18:37:30] 1.19:-:Parser::braceSubstitution-title-First_year_of_the_Czech_Municipalities_Photographs_grant/layR* [18:37:31] ah [18:37:37] just clear-profile then! [18:37:39] I think it needs a kick though [18:37:42] already tried, twice [18:37:53] hm [18:37:58] maplebed: thanks for helping out. i'll keep watch over the rt ticket to track progress. [18:38:17] tfinc: I was wrong; there is no reminder for the moderator address. [18:38:19] that was what I did 10 min ago, which made me think maybe it wasn't templates...but it actually just has no effect [18:38:21] just a reset. [18:38:31] probably something is to overloaded to respond or who knows [18:38:36] hahaha - do "host 217.199.212.245" [18:39:08] aaronschulz: fast now [18:39:36] lesliecarr: :-) [18:39:48] still hanging for me [18:41:13] woooooooooooooooooooooooo [18:44:41] ok, indexer is indexing [18:45:02] tfinc: where did you go? [18:45:14] I have a password for you. [18:45:15] \o/ [18:45:30] maplebed: R31 [18:45:42] i'll come by as soon as this is done [18:45:43] ah. stop by on your way out. [18:45:51] maplebed: can you update http://wikitech.wikimedia.org/view/Build_a_new_server with what you need to do for the new servers ? [18:46:08] I did a bit. [18:46:14] (references to ipmi for the c2100s) [18:46:24] cool [18:46:33] i was thinking of the "you need a grub partition" specifically :) [18:47:20] well [18:47:35] I believe it's standard for GPT partitions to have a boot partition as the first partition [18:47:40] of a few hundred MB [18:47:55] although grub probably needs far far less [18:48:05] aaronschulz: eh, indeed [18:48:06] RobH said 1MB. [18:48:40] aaronschulz: collector is CPU bound [18:48:48] aaronschulz: did someone change sampling rate or something like that? 
[18:48:53] or is that asher's code [18:48:54] hehe [18:48:55] ok [18:49:01] you arent making a boot partition [18:49:07] you are making a gpt partition 1mb bios [18:49:18] maplebed: ^ [18:49:27] I think I don't know what that means. [18:49:32] domas: oh, I guess we can revert r111694 [18:49:54] you keep saying 'bios partition' but bioses don't have partitions, so I think you mean something that I don't understand (and I mistakenly assumed something else). [18:49:55] whatisthat [18:49:59] http://en.wikipedia.org/wiki/GUID_Partition_Table#Partition_type_GUIDs [18:50:10] !r 111694 [18:50:10] on is http://www.mediawiki.org/wiki/Special:Code/MediaWiki/111694 [18:50:13] ok [18:50:14] Number Start End Size File system Name Flags [18:50:14] 1 1049kB 2097kB 1049kB bios_grub [18:50:15] 2 2097kB 120GB 120GB ext3 raid [18:50:15] 3 120GB 121GB 1000MB linux-swap(v1) [18:50:15] 4 121GB 2000GB 1879GB xfs swift-sda4 [18:50:29] I don't think it is the problem [18:50:30] RobH: that's the output of what command? [18:50:42] thats parted print all on ms-be1 [18:50:44] domas: yeah, but its not needed now :) [18:50:47] shows all disks and all info [18:51:06] or print sda for just the one disk [18:51:10] frankly, we could assign some large memory buffer to the profiler process [18:51:18] now, swift takes care of all the disks except sda and sdb [18:51:26] but still I think profiling rate is up [18:51:29] or code is more expensive :) [18:51:45] so on sda and sdb, you create the following partitions in the installer BEFORE setting up raid, 1mb primary set to type bios_grub [18:51:50] EFI actually does use partitions [18:51:59] but I don't think our servers boot with EFI yet [18:52:07] then 120gb partition for software raid [18:52:27] then i put a 1gb swap on sda and sdb [18:52:44] then leave the rest of the disk for yer swift stuff. then you have to setup the software raid [18:52:54] tie the two 120 gb into a ext3 / [18:52:58] raid1 [18:53:08] maplebed: that make sense? 
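Editor's note: the layout described above (a roughly 1 MB `bios_grub` partition first, then the 120 GB RAID member, swap, and the remainder for swift) can be expressed as a small sanity check. This is a sketch under stated assumptions, not something run during the conversation. The rows are transcribed from the pasted parted output, reading `bios_grub` and `raid` as the Flags column, and the field names and the `check_layout` helper are this sketch's own invention.

```python
# Rows transcribed from the parted output pasted in the log. parted uses
# decimal units, so 1 MB = 1000 kB and 1 GB = 1_000_000 kB.
SDA_LAYOUT = [
    {"number": 1, "start_kb": 1049, "end_kb": 2097, "flags": {"bios_grub"}},
    {"number": 2, "start_kb": 2097, "end_kb": 120_000_000, "flags": {"raid"}},
    {"number": 3, "start_kb": 120_000_000, "end_kb": 121_000_000, "flags": set()},
    {"number": 4, "start_kb": 121_000_000, "end_kb": 2_000_000_000, "flags": set()},
]


def check_layout(partitions):
    """Return a list of problems; an empty list means the layout passes."""
    if not partitions:
        return ["no partitions defined"]
    problems = []
    first = partitions[0]
    # grub on a GPT disk with BIOS boot needs a small partition flagged bios_grub.
    if "bios_grub" not in first["flags"]:
        problems.append("partition 1 is not flagged bios_grub")
    size_kb = first["end_kb"] - first["start_kb"]
    if not 900 <= size_kb <= 1100:
        problems.append("partition 1 is %d kB, expected about 1 MB" % size_kb)
    # Each partition must start at or after the end of the previous one.
    for previous, current in zip(partitions, partitions[1:]):
        if current["start_kb"] < previous["end_kb"]:
            problems.append(
                "partition %d overlaps partition %d"
                % (current["number"], previous["number"])
            )
    return problems


print(check_layout(SDA_LAYOUT))
```

The same check would apply to sdb, since both disks are meant to carry identical partitions before the two 120 GB members are tied into the RAID1 root.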
[18:53:10] aaronschulz: if we wouldn't have all these packages and stuff all around, I'd probably fix it :) [18:53:18] lol [18:54:04] domas: it's 1/50 [18:54:14] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2730 [18:54:14] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2730 [18:54:26] wtf [18:54:27] root@professor:/tmp# netstat -l | grep 3811 [18:54:27] tcp 0 0 *:3811 *:* LISTEN [18:54:27] udp 534878664 0 *:3811 *:* [18:54:28] I don't think that rate changed [18:54:46] I never saw 512MB network queue before [18:55:04] hos is that even possible? [18:55:06] how [18:55:07] domas: AaronSchulz: can I turf you to #wikimedia-tech? [18:55:15] maplebed: why? [18:55:30] maplebed: can I turf you to #wikimedia-tech? [18:55:44] haha [18:55:53] because interleavingc two conversations when we have a quiet channel doesn't make sense and talking about mediawiki vs. drive partitions seem more apt to wikimedia-tech than -operations. [18:55:59] so given the two, that was my suggestion. [18:56:04] New patchset: Lcarr; "Changing neon to public machine" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2741 [18:56:13] maplebed: we're not talking about mediawiki [18:56:17] we're talking about critical service [18:56:20] that is running on wikimedia [18:56:21] that we have to fix [18:56:22] ok, nevermind. [18:56:39] where's the gerrit bot? 
[18:56:40] PROBLEM - Packetloss_Average on locke is CRITICAL: CRITICAL: packet_loss_average is 12.6867269565 (gt 8.0) [18:56:41] New patchset: Ryan Lane; "Increasing max-cores for the simple scheduler" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2742 [18:56:43] ah [18:56:44] heh [18:57:12] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2741 [18:57:12] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2741 [18:58:39] domas: [18:58:40] root@professor:~# cat /etc/sysctl.d/99-big-rmem.conf [18:58:40] net.core.rmem_max = 536870912 [18:58:40] net.core.rmem_default = 536870912 [18:59:10] ah, stupid me, I was looking for wmem [18:59:18] I guess I need either food or coffee of both [18:59:19] or [18:59:22] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2742 [18:59:23] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2742 [19:00:57] RobH: do you know how to get the gpt partition table printed from teh install shell? [19:01:01] it doesn't have parted. [19:01:23] i dont get why you want to [19:01:35] you can see the partition in the installer partman menu [19:01:37] I want to see whether partman created a partition map that matches what you pasted abov. [19:01:46] ok, but you can see that on screen [19:01:50] If I go back to the intaller and try and repartition it goes into some lame loop. [19:02:08] at least it did last time, though maybe that wasn't when grub was failing. [19:02:14] * maplebed tries again. 
[19:02:23] aaronschulz: hah, didn't know this, apparently can change without recompile [19:02:25] if grub fails it wont have an os and wants to load installer [19:02:37] basically you need to be on the partitioning screen of the installer [19:02:40] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 2.28160850877 [19:02:42] and i can confirm if its right [19:02:44] yeah, but one option in the installre is to drop into a shell. [19:02:51] ok. [19:03:00] i have no idea if parted is workign int he installer shell [19:03:04] I figured that would be easier. [19:03:06] you can drop and try [19:03:07] but no. [19:03:40] i dunno what to tell ya, sorry [19:03:52] not sure what the error you are seeing is [19:04:07] but if you dont have the bios_grub 1mb parttion installer will fail in the grub setup [19:04:11] domas: \o/ [19:04:32] maplebed: what you can try is, the menu option "install extra installer components" (or something like it) [19:04:35] perhaps it has a udeb for parted [19:04:55] ah, lies, probably in newer version [19:04:59] maplebed: if you would like, get it to the partitoining section [19:05:06] do the partitions you think are right, and ping me [19:05:09] RobH: the story is that I think I do have the 1MB partition and grub failed to install. I want to verify I have the 1MB partition. choosing 'partition disks' has now hung. [19:05:11] or modified one [19:05:16] and i can hop on to the console [19:05:27] maplebed: did you do 1mb parttion then a 120gb partition for raid? [19:05:34] I think so, but I can't verify. [19:05:36] aaronschulz: lol, it was apple-modified version [19:05:42] you are in the installer now? [19:05:46] and now it's hung so I think I need to reboot. [19:05:51] ok [19:06:24] mark: I did find parted under /target/sbin/, but it didn't have the devices in the right place. [19:06:27] ::sigh:: [19:06:40] yeah but if there's a udeb it'll isntall it in the installer environment [19:06:42] (so then it does work) [19:06:44] ah. 
[19:06:48] of course you can mess with chroot too [19:07:40] http://ubuntu.wikimedia.org/ubuntu/pool/main/p/parted/ [19:07:44] looks like there is a udeb [19:07:51] domas: maybe this is why the graphite graphs are missing data [19:08:19] ok, rebooting ms-be3 to try again. [19:09:29] aaronshulz: of course it is [19:09:35] aaronschulz: at least now it is not using 100% cpu [19:09:55] and profiler is somewhat instant [19:11:16] and socket is properly consumed [19:13:14] aaronschulz: we may need to reduce the rate to 1/100 eventually, maybe [19:13:48] is there a reason the owa machines don't have the puppet standard info on them ? [19:13:58] they're alerting because they don't have the post-puppet hooks [19:14:11] owa1-3? [19:14:15] yeah [19:14:23] they're currently in a limbo. [19:14:26] you can do whatever you want to them. [19:14:40] okay, i'm just going to include the standard package so they stop alerting us ? [19:14:43] I was using them for the swift test cluster; that need is now mostly over but I haven't wiped them yet. [19:14:47] oh [19:14:51] in that case, decom ? [19:15:01] yes, but... reuse [19:15:03] they're not old [19:15:11] cool [19:15:24] is there a ticket ? (or else i'll create one) [19:15:30] I don't think there is [19:15:52] AaronSchulz: http://www.mediawiki.org/wiki/Special:Code/MediaWiki/112227 [19:16:28] LeslieCarr: they were supposed to go bakc to nimish when I was done [19:16:53] I've been hanging on to them in case I needed extra capacity for the production swift cluster. [19:16:56] okay [19:16:57] but I think that need is now past. 
[19:17:56] they've been unused for a year [19:18:41] http://rt.wikimedia.org/Ticket/Display.html?id=2511 [19:19:23] maplebed, LeslieCarr: diederik has taken over tech-analytics stuff, so he'd probably be the guy to talk to about the owa stuff's future [19:19:41] thanks nimish_g [19:20:34] domas: OK, but I don't see any changelog change ;) [19:21:07] * AaronSchulz was looking at what ncache param did [19:21:17] AaronSchulz: I'm lazy [19:21:48] heh, Special:Book is 15% of cluster cpu? [19:21:51] I've been away for too long [19:22:14] (well, joy of clean profile) [19:22:24] wfStreamThumb 25%? [19:22:27] of time [19:22:48] wow, so the book creator is 15% of our cpu right now ? [19:22:51] pediapress ? [19:23:22] FileBackendStore::fileExists average 60ms? [19:23:24] and tcpdump lets me get the login to archive.org right :-/ [19:23:25] jesus [19:23:43] thanks to all that Cite is now just under 4% [19:24:13] domas: stream is 25% of the 'thumb' section I assume [19:24:31] aaronschulz: looking at 'all' [19:24:42] it is 90% at thumb section [19:25:00] thumb.php fileexists is 128ms [19:25:52] where are 1.18 entries coming from? [19:26:00] * AaronSchulz doesn't see those in StartProfiler [19:26:34] hm [19:26:44] why are count entries so high [19:26:51] ah, looking at wrong one [19:26:57] gah [19:27:06] RobH: you want to take over the console and poke around or shall I try Mark's suggestion of installing extra packages (parted)? [19:27:16] PPFrame_DOM looked plenty [19:27:16] I see [19:27:47] maplebed: i can poke [19:27:58] maplebed: ms-be2? [19:28:09] RobH: nope. 3. do you mind joining my screen session so Ican watch? [19:28:23] maplebed: sure, how i do that? [19:28:41] RobH: log into bastion1001, get root, run 'screen -x ben/ms-be'. [19:29:07] RobH: then switch to window 3 (my escape character is `) [19:29:12] I'll adjust to your terminal size. [19:29:32] bleh. 
[19:29:38] i have no idea what my default esc character is [19:29:40] aaronschulz: now streamthumb is 45% of 'all' [19:29:44] unless there's profiling bias.. [19:29:52] RobH: when you join my screen session, you're forced to use my escape character. [19:29:59] but the default is ctrl-a, fwiw. [19:30:17] well im on it [19:30:17] mind if i join as well ? curious minds want to know [19:30:19] and i have no idea wtf to do [19:30:21] domas: there is [19:30:24] what screen size should i do ? [19:30:27] so cant you just put it to the right thing? [19:30:33] thumb.php is 1:1, others are 1:50 [19:30:36] heh [19:30:39] ;) [19:30:44] that would explain it [19:30:48] } elseif ( strpos( @$_SERVER['REQUEST_URI'], '/w/thumb.php' ) !== false ) { [19:30:50] LeslieCarr: 200x50 [19:30:55] no mod 50 ther [19:30:57] *there [19:31:12] maplebed: so i have no idea what i am supposed to do on your screen session [19:31:13] thanks, on it now [19:31:16] aaronschulz: *sigh*, can you fix it? :) [19:31:20] i dont use screen [19:31:21] sure [19:31:24] merci [19:31:24] because its confusing and annoying [19:32:03] RobH: we all share input to the window, so (when you're on `3), you just use it like it's your own terminal. [19:32:12] i cannot get to 3. [19:32:22] RobH: to get to 3, type `3 [19:32:23] im trying your sequence and its not working fo rme, so im not doing it right [19:32:24] backtick 3 [19:32:31] (not single quote) [19:32:50] you made it! [19:32:52] ok [19:32:52] :D [19:32:54] domas: ok so we have 3 custom StartProfilers, one in wmf-config and one in each MW [19:33:02] now it'll hang. [19:33:16] it hangs going into the partitioning screen? [19:33:41] yup. [19:33:53] that's what I was trying to describe before. [19:34:08] actually, it might not be hung, just interminably slow. [19:34:08] heh, the last two clones except one has 1.19 in one spot instead of 1.18 :p [19:34:19] I did see it change from 0% to 1% after 5-10 minutes last time. [19:34:33] meh, rebooting. 
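Editorial aside: the profiling bias found above (thumb.php sampled 1:1, everything else 1:50) can be undone by reweighting each group by the inverse of its sampling rate. The sketch below is a rough illustration only. `true_share` is a hypothetical name, and it assumes the 45% figure quoted for streamthumb can be read as the thumb.php share of the sampled profile and that sampled requests are representative of their group.

```python
def true_share(observed_share, rate_a, rate_b):
    """Estimate group A's real share from its share of the sampled profile.

    observed_share: fraction of the sampled profile attributed to A (0..1).
    rate_a, rate_b: sampling probabilities for A and for everything else,
                    e.g. 1.0 for 1:1 and 1/50 for 1:50.
    """
    # Scale each group back up by the inverse of its sampling rate.
    a = observed_share / rate_a
    b = (1.0 - observed_share) / rate_b
    return a / (a + b)


if __name__ == "__main__":
    # 45% observed, thumb.php at 1:1 and the rest at 1:50:
    # 0.45 / (0.45 + 0.55 * 50), roughly 1.6% of the real total.
    print(f"{true_share(0.45, 1.0, 1 / 50):.3%}")
```

Under those assumptions, a figure that looks like nearly half the cluster shrinks to a couple of percent once both groups are sampled at the same rate, which is why the thumb.php branch needs the same mod 50.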
[19:34:51] ok, RobH can I drive for a sec to do the reboot? [19:34:56] aaronschulz: I guess fixing rate for thumb.php will reduce CPU a lot [19:35:01] maplebed: already sent it [19:35:04] its now 50% [19:35:11] maplebed: sorry, i thought i was driving. [19:35:15] oh, in a different window? [19:35:20] nah, that's cool. [19:35:21] yes [19:35:36] I was hoping we could do it in this noe so Leslie could watch. [19:35:49] since it's the fancy new ipmi_mgmt. thing. [19:35:51] next time. [19:36:15] well, i told it to set pxe boot and reboot. [19:36:19] i'm wiping scsi drives [19:36:24] I so do not miss SCSI [19:36:30] I brought one old server home to do it [19:36:35] but hot swapping doesn't really work [19:36:42] so I have to reboot after each pair of drives [19:36:51] mark: don't you miss the jumpering ?;) [19:37:00] fortunately there is no jumpering involved here [19:37:03] RobH: Im' not seeing anything to the console yet; do we need to reattach? [19:37:11] at least this is "modern" scsi with auto termination, and SCA and all that [19:37:22] maplebed: i dunno whats up [19:37:22] but yeah, I remember [19:37:25] shouldnt have to. [19:37:32] well, I'll try anyways. [19:37:43] and the old server is making so much noise [19:37:47] maplebed: god damn it [19:37:52] i hit the wrong escape key w [19:37:55] you don't really notice that in a data center [19:38:10] RobH: you probably didn't hurt anything... [19:38:48] ok, well [19:38:54] i dunno what the fuck its doin [19:39:03] RobH: we should get an old SAS enclosure and use it for wiping drives [19:39:04] i just wanted to look at the partiting screen before it was committed. [19:39:08] hook it up to an old server [19:39:12] New review: Ottomata; "Hey Andre, looks good. I've got a few comments and requests for changes. I put '-1' only because I..." [analytics/reportcard] (andre/mobile); V: 0 C: -1; - https://gerrit.wikimedia.org/r/2740 [19:39:13] mark: why? 
[19:39:16] and whatever you put in there, have it auto wipe [19:39:18] we have them already [19:39:22] how? [19:39:24] it looks ilke you're still attached to window 3 - did you se me detach and re-attach? [19:39:27] we have a sas enclosure [19:39:28] usb [19:39:38] you do? [19:39:40] RobH: it looks like it didn't get powercycled. [19:39:47] maplebed: ok, well, if you wanna show leslie [19:39:52] go ahead and power cycle it and set it to pxe [19:40:00] it's made it to 3% checking the swap aprtition. [19:40:05] mark: why wouldnt we wipe it in the machine? [19:40:18] ok robh, I'll do that. [19:40:27] RobH: i'm not doing it in the machines because that's annoying, time consuming, and some are broken [19:40:35] RobH: are you going to be around in an hour? [19:40:38] but it may be different for ou [19:40:39] you [19:40:42] maplebed: i should be yes [19:40:48] apparently lunch just showed up here. [19:40:58] mark: its just a lot faster to load up all the bays on one of the batch you are decommissioning [19:41:09] and doing it there, but in pmtpa and eqiad i also have a usb to sata drive toaster [19:41:18] well, eqiad i use the one i have at home. [19:41:22] i have a usb to sata thingy too [19:41:23] if it's ok with you, i'll detach and let you poke around, leslie and I will be back in an hour or so. [19:41:29] if i need to access a disk but i rarely use it [19:41:31] (but I'll reboot first) [19:41:44] doesn't work for SAS drives though does it? 
[19:41:46] ok [19:41:56] mark: no idea never tried [19:42:05] other way round should work [19:42:18] its rare that we have a disk that we cannot wipe in some machine in the batch beign turned off [19:42:27] I guess [19:42:30] this is the first time I'm doing it [19:42:30] then if its a dead disk instead i usually used to drill press them [19:42:48] the old V20z and knsq1-15 [19:43:01] super old [19:43:03] I used knsq15 for wiping most of those [19:43:08] but the v20z don't even boot from usb [19:43:13] and don't support usb keyboard in bios [19:43:21] and I didn't have a ps2 keyboard in the data center ;) [19:43:26] so I quickly got tired of that and just brought one home [19:43:28] RobH: I've rebooted ms-be3 and detached from the console. be back in a bit. [19:43:38] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 84, down: 2, dormant: 0, excluded: 0, unused: 0BRae3: down - BRae4: down - BR LeslieCarr not actual problem [19:43:48] ok [19:43:48] ' [19:44:33] ....why isnt it getting dhcp offers on the initial nic... [19:44:39] maplebed: ? [19:45:08] wtf, do we even use StartProfiler in wmf-config? [19:45:48] change it and see [19:46:00] maybe not? :) [19:46:52] I'd like to just use that one file and only use per-MW ones if there are breaking changes to the profiling api or such [19:55:32] maplebed: why does ms-be2 have no mac info in brewster? [20:15:26] who is working on neon? [20:15:27] leslie? [20:29:19] yes [20:34:58] LeslieCarr: please stop the mail bombing :) [20:37:57] she is upstairs having lunch... [20:44:41] i'm stress testing the mail servers [20:48:04] ping RobH I'm back. sorry for the confusion; my screen session was connected to ms-fe3, not ms-fe2. [20:48:07] s/fe/be/g [20:48:18] maplebed: ahhhhh [20:48:21] well shit [20:48:28] lemme go poke at that then [20:48:32] I'm sorry. I tried to say so, but it got caught in the crosstalk. 
[20:48:53] ms-be2 isn't in dhcp because I didn't know the MAC address when I set up DHCP. (I do now, thanks to cmjohnson1 but haven't updated brewster yet.) [20:49:37] RobH: do you want to fight screen again so we can play together or do you just want to beat on it on your own? [20:49:57] (or walk away and let me try out mark's suggestion of the udeb for parted) [20:50:34] maplebed: let me take a look at it on how it is now [20:50:55] i just pulled the mac off ms-be2 from serial bios [20:50:58] and updated brewster already [20:51:07] ms-be3 is rebooting into pxe now [20:51:22] New patchset: Lcarr; "Fixing neon cronspam and neon nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2743 [20:51:26] do you know if ipmi restricts us to a single person on the serial console? [20:51:38] maplebed: once i get it to the partitioning screen and review, i can detach so screen can reattach [20:51:50] you can only have a single device controlling the serial console [20:51:50] but if we both try and attach at the same time, it'll fail? [20:51:53] since its via comport [20:51:58] it'll deny whoever is last [20:52:08] its also why there is a consoleclose command [20:52:13] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2743 [20:52:13] "Info: SOL payload already active on another session" [20:52:14] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2743 [20:52:17] incase someone leaves it open and terminal crashes, or they go afk [20:52:48] basically the same thing as a racreset to kill all console sessions on drac [20:53:03] installer is running dhcp now [20:53:49] and now its in the two minute blank screen window (yay 10.04 redirection issues) [20:54:13] it also may be the rootdelay90 we introduce for disk detection, not sure. 
[20:54:43] New patchset: Lcarr; "Removing ensure => absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2744 [20:55:10] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2744 [20:55:11] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2744 [20:55:24] maplebed: these are still getting partman auto stuff? [20:55:31] its trying to confirm raid details i didnt put in. [20:55:32] so far, yes. [20:55:44] ....i thought we were trying to set this up manually now? [20:56:14] Are we doing this manually, or is this troubleshooting the partman auto stuff? [20:56:18] I was trying to confirm that it was creating the correct partition map. [20:56:31] then i misunderstood what we were doing. [20:56:35] so far, our assertion is that it's doing it wrong and that's why grub was failing. [20:56:42] I wanted to verify that before throwing in the towel, [20:57:07] i misunderstood completely then [20:57:08] all I was trying to do was print the partition map. [20:57:53] ok [20:58:12] maplebed: i just commented out the auto part, it can load installer into the partiting menu and we can see what the last installer run that completed did [20:58:13] I blame this channel having multiple simultaneuos conversations. we obviously failed to communicate. [20:58:28] which will let us see int he installer what was setup prevoiusly [20:58:39] robh: can you check the scs connections from scs-c3---now scs-d1...thanks [20:58:52] cmjohnson1: you can check that just as easily as me ;] [20:59:12] if its on the network, login to the scs, and try to connect. if it gives you a login prompt for the device, its good [20:59:18] its all the mgmt login info [20:59:20] ok [20:59:34] if it gives no prompt after attaching to a port, its a bad connection [20:59:36] RobH: I see the comment in netboot.cfg; that's what you mean, right? 
[20:59:53] maplebed: yea, i want the installer to load to partitioning screen and wait for my input [20:59:58] so i can see what it did last time [20:59:59] k. [21:01:44] wtf... asked for ubuntu mirror [21:01:54] bah, now its stuck trying to reach a bad mirror, rebooting again [21:01:56] can I see? [21:02:00] wait [21:02:03] I don't think it's a bad mirror. [21:02:10] RobH: ^^ [21:02:14] if it doesnt happen twice i dont count it [21:02:16] so i rebooted. [21:02:26] so if it happens a second time, then its an issue. [21:02:26] New review: Danakim; "Got it, will summarize the commit message and wrap it from now. Will also keep tabulations from now ..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2682 [21:02:41] it gives you that error when it tries to install the OS in the frist partition, which, being only 1M, runs out of space. [21:02:48] it confuses the out of space error with a bad mirror. [21:02:55] its not even gotten to that part of the install yet. [21:03:00] its dhcp, mirror, then partitioning [21:03:03] so nothign is on the disk yet [21:03:35] not sure how any hard disk should affect that part of the installer [21:04:14] all I mean is that yesterday I got an error about failing to connect to the mirror many times and that's what it was each time. [21:04:33] ok, i just do not understand why its asking this when its not installing to disk yet [21:04:38] but you may be at a different spot than where it failed for me. [21:04:40] it loads installer, gets dhcp [21:04:43] then says bad mirror [21:04:54] it has not prompted me for the disk partitions yet, its not even tried yet [21:05:28] it has to load the installer components into memory [21:05:32] and now its got an entirely different error [21:06:23] maplebed: i detached from ms-be3 [21:06:33] i am going to confirm the same behavior on ms-be2 [21:06:39] cuz thats strange and i dunno wtf its doing. 
[21:07:28] LeslieCarr: I wish people would stop stress testing the mail servers all the time ;-) [21:07:34] I can tell you, the mail servers are doing fine [21:07:38] my mail client, not so much [21:09:12] :) [21:09:21] sigh [21:09:33] maplebed: so why are these systems asking for a mirror when they dont have a partman recipie... [21:09:39] ms-be2 does it [21:10:41] me neither! [21:11:25] perhaps because squid on brewster is misbehaving? [21:11:34] nagios question: is there a place people routinely check, when they see nagios alerts, to get pointers on what to do about them? [21:12:01] New patchset: Lcarr; "Fixing type in nrpe.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2745 [21:12:13] i see that you can add a permanent comment to a nagios service, but do people ever look at those? [21:12:14] mark: brewster's responding fine on port 80. [21:12:38] Jeff_Green: there is no such place I routinely check. [21:12:40] but on port 8080? [21:12:44] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2745 [21:12:45] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2745 [21:12:55] 02/23 15:08:11| storeUpdateCopy: Aborted at 160143 (4096) [21:12:55] 2012/02/23 15:31:29| storeUpdateCopy: Aborted at 33156 (4096) [21:12:56] 2012/02/23 15:45:22| storeUpdateCopy: Aborted at 397 (-1) [21:13:00] this doesn't look good [21:13:17] in cache log, its where i am now [21:13:30] and I think I know why [21:13:33] so its failing to snag stuff to present [21:13:34] ? [21:13:37] brewster filled up quite a few times lately [21:13:42] and /var/spool/squid, the squid cache, is in the root fs [21:13:53] that's kinda silly really [21:14:04] well its not full now [21:14:12] does it have to be kicked to start pulling info again? 
[21:14:18] just a sec [21:15:11] maplebed: ok [21:15:25] RobH: ok try now [21:16:40] ok, trying again on ms-be2 [21:18:31] maplebed: also on ms-be3 now to confirm the drive details [21:18:48] k. [21:19:59] mark: still asks for mirror [21:20:25] then I don't know [21:20:34] =/ [21:20:39] mark: we're trying to ipv6, right ? or did someone sign us up in bad faith [21:20:53] mark: re: the world ipv6 day email from roberts@isoc.org [21:20:58] LeslieCarr: I saw [21:21:01] ok [21:21:05] hrmmm, wtf [21:21:06] not sure who signed us up though [21:21:15] LeslieCarr: no one would have at that email [21:21:15] using dns-admin is a strange address to use [21:21:18] so whoever did was spam [21:21:26] and until we're fully committed, don't wanna say we will [21:21:32] (didn't work so well last year) [21:21:41] but yes, I wanna do it this year [21:22:03] !log Moved /var/spool/squid to its own LV on brewster [21:22:05] Logged the message, Master [21:22:18] RobH: I'm going to re-enable the partman recipe for ms-be4 and try mark's suggestion of installing parted post-partitioning. [21:22:48] ok, off ms-be3 [21:22:56] try away, would be nice for an automatic result =] [21:23:04] update channel with results ;] [21:23:17] are you going to continue with ms-be3? [21:24:26] * maplebed stopped puppet on brewster so it doesn't wipe out my testing changes. [21:30:53] New patchset: Ottomata; "Adding David Schoonever account 'dsc', including dsc on stat1.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2746 [21:32:26] RobH: fascinating - I also got the mirror request. This wasn't happening yesterday. [21:32:34] (testing on ms-be4). [21:32:42] something has fubared our mirror since then [21:33:01] I'm going to use a us mirror and look at the syslog as soon as I get a chance. [21:33:08] try and get a better error message. [21:34:47] how? its an internal ip ;] [21:37:06] magical routing! [21:37:21] cheatin ;] [21:38:48] bah. 
[21:40:23] so, random labs instance can aptitude update just fine. [21:40:29] which implies that brewster's ok. [21:41:53] New patchset: Lcarr; "changing /usr/lib/nagios to /usr/lib/nagios3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2747 [21:42:27] RobH: so my conclusion is that, at the moment, we can neither build these hosts with partman nor manually. is that right? [21:42:40] New patchset: Ottomata; "Renaming my user 'otto'. Deleting 'aotto'. I'm currently only installed on stat1.wikimedia.org, so it is better to do this now rather than later. I hope this doesn't piss anyone off. Thank you!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2748 [21:42:49] or any host, i dont think its just affecting this [21:42:53] something is borked about brewster [21:43:38] New patchset: Lcarr; "changing /usr/lib/nagios to /usr/lib/nagios3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2747 [21:43:44] RobH: so, random labs instance can aptitude update just fine. [21:43:58] hrmm, lemme try just an internal host that isnt loaded with anything [21:44:00] (I haven't tried aptitude upgrade yet) [21:44:31] aptitude upgrade worked on test2 in labs. [21:44:54] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2747 [21:44:55] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2747 [21:44:55] (which pulls from ubuntu.wikimedia.org == brewster)) [21:46:11] !log rebooting cadmium for pxe test, not reinstalling it. [21:46:13] Logged the message, RobH [21:46:32] maplebed: so this should launch into installer and stop at partitioning screen for me [21:46:37] lets see if it gets mirror error [21:48:32] PROBLEM - Host cadmium is DOWN: PING CRITICAL - Packet loss = 100% [21:52:58] maplebed: cadmium gets it too [21:53:12] so its brewster [21:53:27] i am gonna look at some stuff, but i am far from expert on this [21:54:08] huh. 
brewster was rebooted 10 days ago. [21:54:35] well, this is strange as hell, would rebooting brewster be horrible? it may be some service isnt firing right [21:55:28] i need to make somethign to eat, i forgot lunch, afk for a few making grilled cheese [21:55:35] k. [21:56:47] RECOVERY - Host cadmium is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [22:02:31] !log fixing permissions in /var/spool on brewster [22:02:33] Logged the message, Master [22:07:47] !log enabling LiquidThreads on labsconsole [22:07:49] Logged the message, Master [22:08:24] New patchset: Lcarr; "ensuring /usr/local/nagios exists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2749 [22:10:01] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2749 [22:10:02] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2749 [22:19:32] what was the old topic? [22:25:54] it had a special char on the end. tripped this bug: http://developer.pidgin.im/ticket/12238 [22:26:16] I wasted a lot of time figuring that out. meh [22:28:36] nice [22:30:51] mark: if you're still here, can you tell me squid does in brewster? [22:31:14] does lighttpd forward to it? 
[22:33:00] New patchset: Lcarr; "ensuring directory exists + a bit of cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2750 [22:38:23] New patchset: Lcarr; "ensuring directory exists + a bit of cleanup" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2750 [22:38:58] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2750 [22:38:58] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2750 [22:45:39] PROBLEM - Puppet freshness on cadmium is CRITICAL: Puppet has not run in the last 10 hours [23:00:54] New patchset: Lcarr; "Changing static /etc/nagios to variable of config directory needed for nagios3 compatibility" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2751 [23:01:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2751 [23:02:05] from #wikimedia-tech: (03:01:30 PM) TimStarling: did anyone recently change the oggvideotools package? [23:02:52] probably not [23:04:47] RoanKattouw: you're breaking cadmium ;) [23:05:36] heh [23:05:49] I haven't touched it at all, how did I break it now? :) [23:06:13] he's *that* good [23:06:21] i know :) [23:06:28] you have an account on it but the group isn't created :) [23:06:29] I am really good at breaking things, it's true [23:06:34] ah :) [23:06:40] * RoanKattouw blames RobH [23:07:18] RoanKattouw: was me [23:07:26] i didnt see it runnign something, did i miss it? 
[23:07:40] Leslie claims you broke it [23:07:44] LeslieCarr you have an account on it but the group isn't created [23:07:47] i totally did [23:07:48] technically i claim you broke it ;) [23:07:52] you being Roan [23:07:54] well i rebooted it [23:07:56] err: Failed to apply catalog: Could not find dependency Group[500] for User[catrope] at /var/lib/git/operations/puppet/manifests/admins.pp:37 [23:08:02] on puppet attempts [23:08:17] bleh, yea it should get removed from puppet or corrected [23:08:21] its bad call [23:09:28] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2751 [23:09:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2751 [23:15:21] New patchset: Bhartshorne; "when the back end returns something other than a 404, use that HTTP status code rather than 404." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2752 [23:16:00] New review: Catrope; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/2752 [23:19:15] RobH: I fixed some things on brewster, but couldn't find anything wrong relating to the mirror [23:19:39] I see successful attempts in the log for that host to get its tftpd stuff, so it's clearly not something network related. [23:19:54] (as your test on cadmium also confirmed) [23:28:16] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 1; - https://gerrit.wikimedia.org/r/2752 [23:28:30] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2752 [23:31:38] AaronSchulz: http://docs.webob.org/en/latest/reference.html#id2 is what led me to use resp.status. [23:33:47] AaronSchulz: oh, and I tested that it works correctly. 
[23:33:49] :) [23:34:44] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2752 [23:36:34] maplebed: the lack of docs is frustrating [23:36:43] I guess if it worked in testing it must be ok [23:36:49] on webob? yeah. totally. [23:38:39] https://upload.wikimedia.org/wikipedia/commons/thumb/a/a0/Kaiserwagen_einfahrt_vohwinkel.ogg/seek%3D28-Kaiserwagen_einfahrt_vohwinkel.ogg.jpg [23:38:48] maplebed: ok, we still need to propagate the error msg [23:41:29] hrm [23:41:36] maybe "The resource could not be found" *is* the error [23:42:09] no it's not, nvm :) [23:57:17] AaronSchulz: it's ugly, but... http://copper.wikimedia.org:8080/wikipedia/commons/thumb/a/a0/Kaiserwagen_einfahrt_vohwinkel.ogg/seek%3D29-Kaiserwagen_einfahrt_vohwinkel.ogg.jpg [23:58:13] getting closer :)