[01:54:37] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours
[02:23:51] PROBLEM - MySQL replication status on storage3 is CRITICAL: CHECK MySQL REPLICATION - lag - CRITICAL - Seconds_Behind_Master : 1724s
[02:34:30] RECOVERY - MySQL replication status on storage3 is OK: CHECK MySQL REPLICATION - lag - OK - Seconds_Behind_Master : 0s
[04:17:50] blog.wikimedia.org is only intermittently accessible because of the heavy traffic -- is there anything that should be done about it or just leave it?
[04:18:09] RECOVERY - Disk space on es1004 is OK: DISK OK
[04:18:48] RECOVERY - MySQL disk space on es1004 is OK: DISK OK
[04:24:05] casey, we're trying to move it
[04:24:25] Yay. :-)
[04:29:31] ping Ryan_Lane
[04:29:41] howdy
[04:29:44] evenin.
[04:29:53] Erik asked me to come join the party.
[04:29:58] what can I do to help?
[04:29:58] we have a major blog event occuring, and the blog server is dying
[04:30:04] we need to move it to another server
[04:30:09] hooper has a dead disk
[04:30:16] can we put a cache in front of it?
[04:30:18] can you find me a server to move it to?
[04:30:23] sure.
[04:30:26] tampa?
[04:30:27] I was about to do that, before I saw the dead disk
[04:30:29] yeah
[04:30:31] Ryan_Lane: heya
[04:30:34] how about one of the OWA hosts?
[04:30:37] \o/
[04:30:38] so this is easy
[04:30:39] I have three for swift.
[04:30:44] I don't need all of them.
[04:30:44] one of the OWA will likely work
[04:30:46] we have a high performance server
[04:30:49] in tampa we can allocate to this
[04:30:53] RobH: that's even better
[04:30:58] can you get me that really quick?
[04:30:58] nice.
[04:31:04] so blogs are down now?
[04:31:05] we're going to have to steal the IP from hooper
[04:31:11] it's up, but won't be for long
[04:31:14] cuz we should be able to push blogs back up with mirror
[04:31:26] it eventually runs out of memory and dies
[04:31:39] Ryan_Lane: we can't change dns to point to the new machine?
[04:31:40] well, blogs is a CNAME for hooper
[04:31:47] it'll take an hour for it to move
[04:31:56] I guess we can keep rebooting hooper till it moves
[04:31:56] so spin up with new ip in public vlan
[04:31:58] that's ok, isn't it?
[04:31:59] and then do that
[04:32:01] it's not actually down now.
[04:32:10] ok, let me snag a server and get the install running
[04:32:16] maplebed: it's been dying pretty frequently
[04:32:40] I'm going to drop the TTL to 5m now
[04:32:48] so that maybe by the time the new machine is ready it'll be quick to shift.
[04:32:50] they also removed the links that tell people to comment there
[04:32:58] cool. thanks
[04:33:13] RobH: I added some stuff to WP to help
[04:33:20] W3 total cache and APC
[04:33:28] is it puppetized?
[04:33:35] only the webserver config
[04:33:39] but we can just rsync the rest
[04:34:57] Ryan_Lane: blog.wikimedia.org's TTL is now 5m. It will be 5m everywhere by 9:35pm.
[04:35:04] sounds good
[04:35:16] did you wipe it from the cache?
[04:35:26] RECOVERY - Puppet freshness on spence is OK: puppet ran at Tue Jan 17 04:34:59 UTC 2012
[04:35:30] no, but a dig against all three of our nameservers shows 5m.
[04:35:34] ah. cool
[04:35:37] have we picked a new server now?
[04:35:42] TimStarling: rob is on it
[04:37:23] I notice there is a server called harmon which is in pmtpa and unused
[04:37:36] Ryan_Lane: can you push dns changes i just made please, they are checked in
[04:37:41] sure
[04:38:30] done
[04:39:50] Ryan_Lane: to confirm, we're just waiting for rob-h's word for now, right?
[04:39:55] yep
[04:40:04] RobH: is that system already installed and all?
[04:40:15] no, working on it now
[04:40:18] ok
[04:40:19] its bare metal
[04:40:24] * Ryan_Lane nods
[04:41:30] Ryan_Lane: are you good with any networking stuff we'll need to do or should we ping leslie?
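(Editor's note on the TTL exchange above: the "TTL is now 5m. It will be 5m everywhere by 9:35pm" claim is plain old-TTL arithmetic — resolvers that cached the record just before the change hold it for the *old* TTL, so the drop is only fully visible after that interval. A minimal sketch; the one-hour old TTL is an assumption inferred from "it'll take an hour for it to move", and the function name is my own.)

```python
from datetime import datetime, timedelta

def fully_propagated_at(change_time: datetime, old_ttl_seconds: int) -> datetime:
    """A cached answer fetched just before the change can live for
    old_ttl_seconds more, so every resolver is guaranteed to see the
    new record only after change_time + old_ttl_seconds."""
    return change_time + timedelta(seconds=old_ttl_seconds)

# TTL dropped at ~20:35 local time with a 1h (3600s) old TTL:
change = datetime(2012, 1, 16, 20, 35)
print(fully_propagated_at(change, 3600))  # -> 2012-01-16 21:35:00, i.e. "by 9:35pm"
```

Once the 5m TTL has propagated, the later A-record switch to the new server converges in at most five minutes — which is exactly why the TTL was dropped first.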
[04:41:44] if it's junos, I should be able to do it
[04:41:57] otherwise, I'd prefer leslie do it
[04:42:23] wouldn't rob need the network stuff done before the system can be built?
[04:42:37] depends on what parts change.
[04:42:39] one of you login to asw-b4-sdtpa and set the vlan for me?
[04:42:48] * maplebed looks at Ryan_Lane for that.
[04:42:49] :)
[04:43:00] relabel port WMF3641 to marmontel
[04:43:05] its a ex4200
[04:43:12] if it was foundry, i could do it ;]
[04:43:16] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No
[04:43:23] hm. why can't I ssh into it?
[04:43:34] I've got leslie's number if we want her.
[04:43:35] the dns for those is odd
[04:43:42] it's .mgmt, right?
[04:44:05] maplebed: I'd say ping her
[04:44:05] asw-b4-sdtpa.net.mgmt.pmtpa.wmnet
[04:44:13] doing so now.
[04:44:21] you mean asw-b4-sdtpa.mgmt.pmtpa.wmnet ?
[04:44:28] it's timing out for me
[04:44:37] i mean i pulled that out of dns just now for 10.1.1.13
[04:44:45] asw-b4-sdtpa.net.mgmt.pmtpa.wmnet
[04:44:52] oh, reverse DNS
[04:44:58] yea, its not matching up
[04:45:04] but ip should be fine
[04:45:22] timing out
[04:45:23] but if not, also has serial
[04:45:25] Ryan_Lane: ^
[04:45:35] how do I access it?
[04:45:35] scs-a1-sdtpa.mgmt.pmtpa.wmnet
[04:45:43] pmshell to list ports
[04:45:47] then # of port
[04:45:54] disconnect from serial is ~~.
[04:46:07] leslie's on her way in - a few minutes.
[04:46:09] let me know when the vlan is tagged, i have dhcp setup for it now
[04:46:34] Ryan_Lane: if hooper can stay online, we can migrate this a hell of a lot easier than fresh install =]
[04:46:52] RobH: what system is this?
[04:46:56] and which port is it on?
[04:47:09] yeah. hooper isn't totally dead
[04:47:13] i do not know the port, but it should be labeled for WMF3641
[04:47:20] which need to be renamed to marmontel
[04:47:33] one port up from ms-fe2
[04:47:52] there's a few boxes in ganglia with very low utilisation if setting up a new box is going to take too long
[04:48:13] the os install is minutes.
[04:48:22] just getting to that is troublesome ;P
[04:48:36] TimStarling: they would need to be apache hosts with public ip space running misc tasks
[04:48:42] hey
[04:48:45] yvon and gurvin both claim to be IPv6/SSL proxies
[04:48:52] got called, we working on this channel ?
[04:48:56] found it
[04:49:02] they don't seem to be doing much
[04:49:03] TimStarling: neither are
[04:49:07] LeslieCarr: Ryan_Lane is working on tagging a vlan for a server deploy
[04:49:47] okay, which port/valn
[04:50:05] ge-1/0/2
[04:50:12] asw-b4-sdtpa, server wmf2642, relabeling to marmontel and setting to public vlan
[04:50:12] let me exit this, so you can get in
[04:50:26] ...
[04:50:37] Ryan_Lane: bah, you didnt do it, why did you take those classes ;p
[04:50:38] RobH: how do I get out of pmsell again?
[04:50:42] ~~.
[04:50:46] LeslieCarr will be much faster than me
[04:50:55] I was in the interface to do it, but I have to find crap
[04:51:02] ok. I'm out
[04:51:05] ge-1/0/2 on asw-b4-sdtpa, yah ?
[04:51:10] yep
[04:51:21] needs to go into public services, or publicservices2?
[04:51:29] RobH: ? did you assign an IP?
[04:51:35] yep
[04:51:42] as long as you pushed dns?
[04:51:44] publcserices or publicservices2?
[04:51:46] I did
[04:52:00] 208.80.152.150
[04:52:05] ah publicservices, then
[04:52:21] i think so yea
[04:52:32] hrm, ge-1/0/2 is marked as WMF3641
[04:52:40] !log another dns update for servermgmt
[04:52:43] Logged the message, RobH
[04:52:55] LeslieCarr: confirm, but relabel now to marmontel as it has a name.
[04:53:12] RobH: you said 3641 once and 3642 once
[04:53:22] err, 2642*
[04:53:38] huh?
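(Editor's note: the switch change discussed above — relabeling ge-1/0/2 on asw-b4-sdtpa to marmontel and moving it into the public services vlan — would look roughly like this on a Juniper EX4200. This is a sketch, not the actual commit: the vlan name `publicservices` is taken from the "ah publicservices, then" exchange and may not match the configured name exactly.)

```
set interfaces ge-1/0/2 description marmontel
set interfaces ge-1/0/2 unit 0 family ethernet-switching port-mode access
set interfaces ge-1/0/2 unit 0 family ethernet-switching vlan members publicservices
commit
```

The `commit` step matches LeslieCarr's "done, committing now" below; on JunOS nothing takes effect until the candidate config is committed.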
[04:53:48] you gave two server names
[04:53:53] done, committing now
[04:53:56] "server wmf2642" vs. WMF3641
[04:54:00] is it WMF3641?
[04:54:01] https://racktables.wikimedia.org/index.php?page=object&object_id=1414
[04:54:19] yep. 3641
[04:54:23] ok. dhcp
[04:54:27] the server with asset tag wmf3641 is marmontel.
[04:54:29] i did that
[04:54:33] all i need is the port.
[04:54:37] ah. cool
[04:54:39] it's committed, not pingable
[04:54:43] its not online.
[04:54:44] it isn't up yet
[04:54:47] so shouldnt ping
[04:54:51] okay that would be a good reason why :)
[04:55:38] heh
[04:55:58] just rsyncing and adding the puppet config should be good enough
[04:56:06] let me do puppet really quick
[04:56:17] cool, thanks
[04:56:28] Ryan_Lane or RobH did you already do dhcp?
[04:56:42] i'll still be online until this is done, just say my name and i will come running :)
[04:56:43] yes, its done
[04:56:50] cool.
[04:56:58] Ryan_Lane: so you are taking care of marmontel puppet manifests?
[04:57:02] yes
[04:57:09] good times
[04:57:11] or want me to get anything else or would i just be in the way ?
[04:57:13] LeslieCarr: thanks a ton
[04:57:16] I made some other puppet changes, so I should really be the one to do it :)
[04:57:19] err
[04:57:19] maplebed: np
[04:57:21] apache changes
[04:57:41] Ryan_Lane: so we should be able to rsync over all the stuff, as you said, and be ok with puppet runs
[04:57:45] but we will see ;]
[04:58:20] os install in progress
[04:58:37] New patchset: Ryan Lane; "Adding blog to marmontel and allowing .htaccess in blogs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1927
[04:58:47] this will be a more robust server than hooper was, plus not sharing with etherpad.
[04:59:01] claiming dns - I'm dropping all the other names (besides blog.) that point to hooper to 5m TTLs.
[04:59:10] sounds good
[04:59:12] you mean etherpad is down/slow too? oh noes
[04:59:17] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1927
[04:59:18] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1927
[04:59:37] heh
[04:59:46] I'm not worried about EP right now
[05:00:04] and racktables =P
[05:00:06] we should deploy EP-lite soon
[05:00:06] does etherpad store its stuff in MySQL or locally?
[05:00:09] mysql
[05:01:54] partitions formattin
[05:05:12] installing software part of os install.
[05:08:25] rebooting into os,
[05:10:16] crap
[05:10:24] I ran the puppet command for ssl wrong
[05:11:03] ?
[05:11:11] so i am ready to do its first puppet run, not yet?
[05:11:24] that should deploy the directories and the like for it.
[05:11:42] Ryan_Lane: ?
[05:12:02] (cert is awaiting signing on sockpuppet)
[05:12:06] ok. puppet is running
[05:12:17] ?
[05:12:23] you ran it, or you mean on sockpuppet?
[05:12:25] I ran it
[05:12:29] =P
[05:12:56] well, you doing rsync or shall i?
[05:13:11] i assume you since yer running puppet
[05:13:27] I got it
[05:13:49] ok, so i will hang around a bit if needed can ping
[05:13:52] ok
[05:13:55] but if you are doing that, and maplebed will move dns
[05:14:00] then nothing for third person to do now ;]
[05:14:15] I want to see tests work before moving dns.
[05:14:17] :)
[05:14:29] agreed
[05:15:24] well, my key worked on marmontel, so puppet's working.
[05:16:03] rsyncing
[05:16:30] !log installing php-apc on marmontel
[05:16:31] Logged the message, Master
[05:16:44] maplebed: you mean like hacking your localhost to push to it for blog?
[05:16:51] cuz it should work otherwise, same backend.
[05:16:58] heh. blog isn't puppetized properly
[05:16:59] I was gonna use telnet, but yeah, something like that.
[05:17:07] Ryan_Lane: nope, only does base setup
[05:17:16] and apache config
[05:17:19] Ryan_Lane: what a suprise.
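(Editor's note: RobH's earlier "i have dhcp setup for it now" step is a static host entry so the box PXE-boots the installer with the right address. A sketch in ISC dhcpd syntax — the MAC address and boot filename are placeholders, since neither was given in channel; only the name marmontel and the 208.80.152.150 address come from the log.)

```
host marmontel {
    # placeholder MAC -- the real hardware address was never logged
    hardware ethernet 00:00:00:00:00:00;
    fixed-address 208.80.152.150;
    # boot-loader filename is an assumption for illustration
    filename "pxelinux.0";
}
```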
[05:17:20] :P
[05:17:24] doesnt do the actual web frontend install
[05:17:35] heh
[05:17:45] you may also have to add permission for that host in the db
[05:17:51] since it should be set to the specific hosts
[05:17:59] * maplebed logs into the db to look.
[05:18:05] RobH: do you know offhand what the db host is?
[05:18:08] db9?
[05:18:26] it used to be, but i think it may have moved
[05:18:30] asher was moving services off it
[05:18:34] ok. time to test
[05:18:45] Ryan_Lane: do you know what host is the backend db?
[05:18:52] Error establishing a database connection
[05:18:53] db9
[05:18:58] seems we'll need a grant
[05:19:20] do you know the username?
[05:19:21] yea, its db9, confirmed on hooper
[05:19:44] nm, robh answered.
[05:20:01] alright, I'll set up the grant.
[05:20:16] ok
[05:21:02] hm. I wonder if it is missing php packages
[05:21:28] all the packages for it should be installed via puppet
[05:21:33] it did pass that part of testing
[05:21:39] hm. at least one missing
[05:21:41] not sure which one
[05:21:59] ....
[05:22:10] thats annoying, cuz puppet setup used to work for this
[05:22:18] ah. tidy
[05:22:27] Ryan_Lane: grants granted.
[05:22:27] due to new plugin?
[05:22:30] I installed some stuff
[05:22:31] yes
[05:22:36] so you broke it ;p
[05:22:50] still says error establishing connection to database
[05:23:02] oh, forgot to flush privsv.
[05:23:04] try again?
[05:23:13] still
[05:23:22] I can connect via telnet
[05:24:24] hm. wikidiff error
[05:25:57] crap. what's the fix for that again?
[05:26:07] !log installing the mysql client on marmontel to test connectivity to the DB
[05:26:08] Logged the message, Master
[05:26:14] heh
[05:26:16] damn
[05:26:21] =/
[05:28:32] fixed wikidiff2 issue
[05:29:06] why does hooper have mysql installed?
[05:29:09] server, tha is
[05:29:27] shouldnt, does etherpad do it?
[05:29:31] dunno
[05:30:05] hm
[05:30:09] I can connect via the client
[05:30:50] oh
[05:30:51] no I can't
[05:31:10] Ryan_Lane: mysql is fixed.
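(Editor's note: the grant work that resolved "Error establishing a database connection" above boils down to re-granting the blog's MySQL user for the new web host and flushing the privilege cache — the step that was initially forgotten. A sketch; the database name, user name, and host suffix are placeholders, only the hosts db9 and marmontel come from the log.)

```sql
-- on db9: allow the blog's user to connect from the new web host
GRANT ALL PRIVILEGES ON blogdb.* TO 'bloguser'@'marmontel.example'
    IDENTIFIED BY '...';

-- without this, the server keeps serving the old in-memory grant
-- tables and the new host is still refused
FLUSH PRIVILEGES;
```

As the log shows a few lines later, a second pitfall is copying an old grant statement that still names the old host (hooper) instead of the new one.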
[05:31:16] (access, that is)
[05:31:18] would you verify?
[05:31:19] indeed it is
[05:31:24] great.
[05:31:45] (I had left hooper in the second grant statement)
[05:32:05] it works
[05:32:10] http://marmontel.wikimedia.org/2012/01/16/wikipedias-community-calls-for-anti-sopa-blackout-january-18/
[05:32:21] loogs good to me.
[05:32:22] coolness
[05:32:24] * maplebed tests with blog
[05:32:28] seems to work for me
[05:32:33] marmontel that is
[05:32:34] let's switch DNS
[05:32:44] this has way more memory. should handle the traffic much better
[05:32:57] worked with blog.wikimedia.org for me.
[05:33:01] dual cpu 6core and a shitton more ram
[05:33:15] yeah, this should handle things much better
[05:33:25] all agree I should switch DNS now?
[05:33:26] ok, who's doing DNS? :)
[05:33:27] yeah
[05:33:30] ben is
[05:33:51] I'm only moving blog first (not racktables or communityblog or ...)
[05:33:57] only move blog
[05:34:02] the rest don't move
[05:34:03] communityblog?
[05:34:04] we'll do them tomorrow
[05:34:10] anything with blog
[05:34:17] or else the redirection for blog feeds wont work
[05:34:33] leave racktables and etherpad alone of course
[05:34:40] those will remain on hooper, hooper just needs repair.
[05:35:29] done with dns
[05:35:34] maplebed: So please also move all the whatever_blogs
[05:35:41] I didn't move the racktables or etherpad software
[05:35:44] if blog works, shouldn't we move *blog?
[05:35:44] or it breaks
[05:35:45] we can handle that later
[05:35:52] ....
[05:35:58] we need to move all blog names now
[05:35:59] right, what robh said.
[05:36:01] or shit will break
[05:36:17] those are just simple redirects for the blog feeds for departments on the blog server
[05:36:30] gotcha.
[05:36:34] prepping that change now.
[05:36:36] RobH: I only needed to move the blog directory, right?
[05:36:38] testblog can actually be dumped out
[05:36:46] Ryan_Lane: yea, wp needs nothin else
[05:36:50] cool
[05:36:55] that's all I rsync'd
[05:37:06] and the apache stuff, but thats puppet
[05:37:15] kind of puppet anyway
[05:37:27] I'll fix that soon
[05:37:36] I can't believe no one installed php-apc :)
[05:37:41] for shame! heh
[05:38:00] I still didn't use memcache. I configured w3tc to use apc for caching
[05:38:03] so i see no reason to move the other shit off hooper
[05:38:07] ok, pushing the change for all the other blogs now.
[05:38:12] its more than enough machine for whats left, and its just a bad disk right?
[05:38:16] yeah
[05:38:28] the blog was causing the server to swap death
[05:38:35] its under warranty so will be all good then
[05:38:40] yeah
[05:38:44] Ryan_Lane: did you want to drop a ticket for hdd replacement in pmtpa?
[05:38:52] sure
[05:39:10] do note in ticket that its not hot swap, and downtime needs to be scheduled when the replacement disk arrives
[05:39:31] done with dns for *blog
[05:39:50] just fyi for ops
[05:39:50] I haven't increased the TTLs back to 1H - I'd like to leave that for tomorrow.
[05:39:57] the r410s we get tend to NOT be hot swap
[05:40:00] yeah. thats a good idea
[05:40:03] as we want the cabled controller
[05:40:05] in case something goes wrong
[05:40:13] cool
[05:40:16] ok, im goin to bed.
[05:40:37] heh, my dns is already updated
[05:40:41] RobH: night!
[05:40:46] and i use google, so they are updated
[05:41:29] http://www.whatsmydns.net/#A/blog.wikimedia.org
[05:41:36] yay for maplebed's earlier ttl change
[05:41:43] ganglia is showing marmontel picking up traffic
[05:42:31] I'm not sure I should be glad it took us the hour it took to drop the TTL to get the server ready, but I suppose it's still a win... :P
[05:43:44] so, if someone would drop the mysql rights of hooper, or add a rt ticket for the cleanup atleast, that would rock
[05:43:57] cuz it needs to get cleaned off hooper completely
[05:43:57] heh. well, it's perfect timing
[05:44:01] great job guys
[05:44:10] indeed, night all =]
[05:44:34] g'night!
[05:44:43] LeslieCarr: night
[05:44:44] call me if anything else falls over
[05:44:46] will do
[05:46:18] here's good evidence the DNS change went as expected:
[05:46:19] http://ganglia.wikimedia.org/2.2.0/graph.php?r=hour&z=xlarge&title=&vl=&x=&n=&hreg[]=%28marmontel|hooper%29&mreg[]=bytes_%28in|out%29&gtype=stack&aggregate=1&embed=1
[05:46:30] It's cool to see the 5m color shift.
[05:47:31] !log marmontel has now replaced hooper as blog.wikimedia.org
[05:47:32] Logged the message, Master
[05:47:35] heh. indeed
[05:48:33] * maplebed creates an RT ticket to drop blog privs for hooper from db9
[05:49:49] Ryan_Lane: I'm going to sign off as well. We're all done for the night, right?
[05:49:55] yep
[05:49:57] thanks for the help!
[05:50:04] np. glad it went smoothly.
[05:50:22] me too
[05:59:39] thank u all!
[08:24:05] RECOVERY - Puppet freshness on brewster is OK: puppet ran at Tue Jan 17 08:23:40 UTC 2012
[09:12:58] New review: Dzahn; "about cron jobs running every minute. see:" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/1926
[09:15:24] hey mutante
[09:15:59] they're about to deploy the congresslookup ext to testwiki, this entails a new set of tables
[09:16:06] shouldn't impact much and I'm around but just a heads up
[09:25:02] New patchset: Hashar; "testswarm: explicitly set cron schedule" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1928
[09:25:18] New patchset: Hashar; "testswarm: job to wipe clients idling" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1926
[09:25:41] New review: Hashar; "Change https://gerrit.wikimedia.org/r/1928 explicitly define the agenda :)" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/1926
[09:44:05] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 451161 MB (3% inode=99%):
[09:45:55] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 445685 MB (3% inode=99%):
[10:06:49] New review: Dzahn; "yep, says "Done" after a few seconds when opening that URL" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1926
[10:06:50] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1926
[10:07:45] New review: Dzahn; "yea, as the commit message says" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1928
[10:07:46] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1928
[10:32:40] PROBLEM - ps1-d2-sdtpa-infeed-load-tower-A-phase-Z on ps1-d2-sdtpa is CRITICAL: ps1-d2-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2463*
[10:34:30] RECOVERY - MySQL slave status on es1004 is OK: OK:
[12:15:40] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours
[15:24:15] robh: morning
[15:24:26] heyas
[15:24:57] cmjohnson1: you may notice new ticket, hooper has a dead hdd, had to move the blog (well, ryan did the rest of us just backed him)
[15:24:59] I am going to take db7 down. Just want to confirm that it is decomm and good to go
[15:25:14] i think so but lemme grep though the files to confirm
[15:25:31] yes...i did and db13 has bad drive (maybe) could be be raid related
[15:27:04] !log db7 shutting down for decom, not listed in db for any clusters, load .01
[15:27:06] Logged the message, RobH
[15:27:21] cmjohnson1: once it turns off, it can be pulled, you will need to set it aside to wipe the drives out of the rack
[15:27:27] or swap with db8 once its wiped
[15:27:48] got it ;]
[15:28:08] seems db13 also has a ticket for a bad fan
[15:28:13] i am confirming both its issues still exist
[15:28:28] of course, if it does, its old sun server
[15:28:41] cmjohnson1: of those hdds you sent me, they are the sun kind, i guess i should send them back eh? =]
[15:29:11] i meant just the drives not the trays so something is coming back, tomorrow when i am in the dc i will email you the capacity
[15:29:20] if you dont have any of those left, i will just send them all abck drive and carrier
[15:29:33] okay...i will check first
[15:34:57] robh: can you get me HDD size for hooper.
[15:35:19] yep, in middle of pulling drive info on db13 for ya
[15:35:22] will do that immediately after
[15:39:13] ok, checking hooper
[15:41:05] hrmm
[15:41:15] not sure why ryan said hooper disk is dead, it looks like its fine so far, still checkin
[15:43:51] cmjohnson1: hooper is fine, i stole ticket, updated, and assigned to ryan
[15:44:08] okay...thx
[15:44:41] since you are at it please check db43 !rt2170
[15:45:11] lol
[15:45:17] gotta love the informative rt ticket...
[15:46:17] unbelievable detail
[15:47:49] i think you are going to have to crash cart its mgmt is bein odd
[15:47:53] worked, but now doesnt
[15:48:06] i think its going to need a full power pull/reseat
[15:48:10] still checkin
[15:48:30] hi guys. did you see that curl error when connection to mgmt before?
[15:48:41] on db43?
[15:48:50] it happens on the 3 servers that recently got tickets to be reinstalled
[15:49:00] i put it in one of them
[15:49:19] mutante: not db43 though right?
[15:49:28] hold on,looking up details:)
[15:49:30] dell drac curl error is fixed by firmware update of the drac
[15:50:02] db43 is already a firmware revision up from the fix, so i hope it wasnt it ;]
[15:51:16] on mw1099 mw1081 and mw1108
[15:51:20] robhL db43 mgmt is working fine
[15:51:30] !rt 2252
[15:51:30] https://rt.wikimedia.org/Ticket/Display.html?id=2252
[15:51:34] it connects for me but wont read logs
[15:51:49] and now it works, odd
[15:51:52] checking it out
[15:52:01] nope, not related to db servers, just mw
[15:52:08] werid..i got logs
[15:52:18] yea its fine now for mgmt
[15:52:27] wont need power pulled, now troubleshooting why it crashed
[15:52:40] oh yea, its console is locked up
[15:52:41] i gave it a dirty look...it knows who is boss ;]
[15:52:45] os is borked
[15:53:08] i have to talk to asher about tickets
[15:53:15] this is assigned to the wrong person in the entirely wrong queue
[15:53:27] no one should be dropping tickets into a datacenter queue with 'fix this' without detailing whats wrong
[15:53:39] since the pmtpa queue is for onsite work, this isnt onsite yet.
[15:54:14] fyi, i moved one into to pmtpa queue today, "mw64 is dead" (1890)
[15:54:21] np...k..did u get drive for db13
[15:54:33] !log db43 rebooting
[15:54:35] Logged the message, RobH
[15:54:41] cmjohnson1: yea, updated the ticket,
[15:54:46] okay
[15:54:48] so if you have on site spare to swap to it, great
[15:54:54] if not, perhaps the ones you sent me will work
[15:54:56] and i send back
[15:55:09] i kept some for here so I should be ok
[15:55:11] just update ticket with what it says the disk is on the sticker on the front of drive (dont need to pull it)
[15:55:39] PROBLEM - Auth DNS on ns2.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call
[15:55:54] mutante: so what exactly should chris do?
[15:55:57] you moved to his queue and gave him no instructions ;]
[15:56:21] usually he pings me, or i see them, and I add the what to do to fix it
[15:56:33] but it means you guys are bottlenecking repairs waiting on chris to chat with me
[15:56:53] well, he has to track down someone with root and such that is
[15:56:56] not always me
[15:57:22] no one is doing things wrong or anything, just should change how we handle that queue
[15:57:32] (no one is doing it after being told differently ;)
[15:58:00] (plus mutante didnt make the initial ticket ;)
[15:58:06] yeah,ok,in this case i just moved it
[15:58:13] but let me add a comment
[15:58:24] right, but we dont know whats wrong
[15:58:29] so its not an on site issue yet ;]
[15:58:46] sweet, no drac errors for memory
[15:58:53] though that looks like a memory error to me
[15:59:03] luckily, its a cluster server, so its no big deal to pull it offline and run memtest
[15:59:16] RECOVERY - Host db43 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[15:59:22] mutante: since I went and pestered you about it, care to update ticket? (If we agree a memtest is the thing to do)
[15:59:31] we would wanna depool it and take it offline for chris to work on
[15:59:41] well, no need to depool, since pybal should do it
[15:59:49] \o/
[16:00:19] but I would then update the ticket with 'System will be powered down, run dell CD with memtest options on system)
[16:00:30] without mixing my punctuation quite so horribly =P
[16:01:43] heh, ok, my comment was "please run a memory test tool, like memtest86+ from a live CD iso from http://www.memtest.org/ or something to confirm it's a memory error,
[16:01:46] if that shows errors ask Dell to replace memory"
[16:02:31] yea, but i would advise using the dell cd (you wouldnt know they exist ;)
[16:02:41] but dell has utilities cd that has all hardware tests
[16:02:49] including the DEST test they require for warranty a lot
[16:03:00] but, now you know they exist =]
[16:03:09] I don't think you would need that much detail...a simple memtest would suffice. If it's a dell, i know to run dell utililty
[16:03:13] they ship with every system, plus when we do warranty stuff they send us updated iso links
[16:03:30] cmjohnson1: yea now i am just educating them on what we do for this stuff ;]
[16:03:30] robh: can u check to see if you have any SUN 73GB 15k rpm there
[16:03:44] I only have the 146GB 1ok
[16:04:41] i will check on site tomorrow
[16:04:46] if not, then we can use the larger
[16:04:54] but prefer we save them, indeed
[16:05:03] with the raid, you can always put in a larger, faster disk
[16:05:11] it is not faster ...slower
[16:05:15] ahh, 10k to 15k?
[16:05:24] smaller being faster rpm
[16:05:29] well, that sucks.
[16:06:38] so is DRAC firmware upgrade an on-site issue?
[16:07:24] not if you can access http mgmt
[16:07:30] you have foxyproxy or something setup?
[16:07:41] drac firmware can be done remotely
[16:07:47] RECOVERY - Auth DNS on ns2.wikimedia.org is OK: DNS OK: 5.599 seconds response time. www.wikipedia.org returns 208.80.152.201
[16:07:52] bios firmware has to be local or done with all kinds of crazy shit i dunno about
[16:08:07] RobH: ok, then i wont move those tickets to pmtpa
[16:08:20] unless we fail at them, i have the firmware locally on my laptop
[16:08:26] so if you wanna, just assign those tickets for that to me
[16:08:32] and i will shoot the update to them right away
[16:08:51] i got the firmware from dell direct, not off the website
[16:08:59] i need to find the online copy to link for tothers =]
[16:09:00] thanks, will do
[16:10:26] PROBLEM - Squid on brewster is CRITICAL: Connection refused
[16:15:53] mutante: actually, i uploaded it to nfs if you wanted to try it out
[16:16:07] but it will be slower for you, since its transcontinental then onto the slow mgmt network
[16:16:18] so i am fine to run all but 1 if you just wanna do the single one for experience
[16:16:36] since the drac update is via http interface, it uploads your local copy of the file
[16:18:41] mutante: did you push out the new rsvg?
[16:19:11] (I think I asked you to last week, but don't remember and got sucked into beta.wmflabs.org)
[16:20:07] hexmode: i keep sending thehelpfulone your way ;]
[16:20:22] we keep eventually fixing the issues, but i am sure you have gotten a few pings out of it =]
[16:20:23] RobH: yep, leave one for me
[16:20:32] hexmode: ehm. no.i didnt
[16:20:35] RobH: excellent! he has pinged me
[16:21:00] glad I can help with thehelpfulone ;)
[16:21:16] hexmode: was it really me you talked to about it? i might have forgotten too, but eh, dont really remember either
[16:22:31] mutante: I talked to someone about the new rsvg patches... I've since verified it according to tim's notes. Could you deploy? Or is there someone else I should ask?
[16:24:52] hexmode: ooh, sorry, i know now, the one where you tested. I see a new update by Tim now though
[16:26:06] mutante: I'm not getting email from RT!
[16:26:14] * hexmode hates on RT for this alone [16:26:41] hexmode: it's missing your comment , did you also try to add a comment via mail? [16:27:08] hexmode: the new "please do the following in three separate commits:"-part. is that something you could do? [16:27:22] mutante: I'll do those three seperate commits [16:27:27] and comment [16:27:28] hexmode: adding you to CC [16:27:38] and no, I didn't comment via email [16:28:46] hexmode: i made you a requestor.a ticket can have multiple requestors [16:28:54] that should fix mail issues [16:33:05] New patchset: Jgreen; "adjusted notification recipient for offhost_backups script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1930 [16:33:39] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1930 [16:33:40] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1930 [16:34:09] RECOVERY - Squid on brewster is OK: TCP OK - 0.000 second response time on port 8080 [16:40:10] maplebed: the c series shipped today [16:40:19] tracking isnt updated yet cuz it just happened [16:40:27] but its two day, so we should have it on thursday [16:40:48] cmjohnson1: I am going to assign a bunch of tickets for this system, when it comes in on Thursday it will be your top priority to get it racked and ready for access [16:45:37] !log upgrading drac firmware on mw1108 [16:45:38] Logged the message, Master [16:46:27] mutante: I think that is it... Could you check it out and compile it? [16:48:16] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:56:41] hexmode: i don't know. where from? [16:57:34] or did you mean a new attachment? 
(dont see updates) [16:58:03] mutante: sorry, I committed it [16:58:10] so in svn [16:59:01] bleh, maplebed & cmjohnson1 it seems that ms9 won't arrive until next monday [16:59:12] cmjohnson1: i dropped tickets to both you and leslie for the racking and network setup [16:59:14] what's ms9? [16:59:21] new dell c series [16:59:37] I mean what will it be doing, sorry :-D [16:59:40] basically the same thing as the R510 but with different series of server, more linux and open source friendly [16:59:46] yay for that [16:59:47] swift storage [16:59:51] good! [17:00:04] as soon as the sopa deployment crisis is over I gotta look at ms5 storage again [17:00:55] this server will also be poked at by me for a solid hour before ben gets it [17:01:01] since i wanna run the c series through some tests [17:01:14] it seems like it will be a better fit, on paper, from the R series we use now for everything [17:01:20] I'm definitely c-series curios [17:01:24] curious [17:01:34] more standard chipsets [17:01:39] cheaper systems by a slight margin [17:01:46] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 6.174 seconds [17:01:49] due to the stripping out of the r series management console stuff we dont use [17:01:52] and other things [17:02:03] mostly all the windows deployment crap isnt supported in c series [17:02:07] which we dont care about anyhow. [17:02:16] no we sure don't [17:02:40] mark pointed out that the storage controller and such that the C series is coming with is what rackspace uses [17:02:52] which is cool, since we want to do similar things for storage [17:03:22] sweet [17:11:23] robh: pmtpa-d1 is ready for relocating locke tomorrow. [17:11:40] awesome [17:24:51] PROBLEM - Squid on brewster is CRITICAL: Connection refused [17:35:17] !log also upgraded drac firmware on mw1081 & mw1099 (fixes mgmt console problem) [17:35:19] Logged the message, Master [17:45:57] hey we're getting watchmouse alerts on wikimedia blog still ? 
i'm seeing a lot of [Tue Jan 17 17:45:40 2012] [apc-warning] Unable to allocate memory for pool. in /srv/org/wikimedia/blog/wp-settings.php on line 70. [17:46:11] yeah [17:46:16] we're all looking at it, [17:46:19] should we try having multiple hosts, either lvs them or dns round robin for at least some stuff ? [17:46:20] !log aware of blog slowdowns, work is being done [17:46:21] Logged the message, RobH [17:46:23] it needs pagination, guillom's working on it [17:46:23] okay, i'm behind the times [17:46:23] sorry [17:46:26] no worries [17:46:32] LeslieCarr: its due to comment stuff and our theme not supporting it [17:46:46] but core does, so guillom_ is working on fixing the theme [17:46:47] it's about having 4800 comments download with each page view (and having a flood of new comments coming in) [17:46:52] ah [17:47:00] yea, this has more comments than all other blog postings combined. [17:47:06] sorta screws the cache :-D [17:51:55] my alarm came on npr this morning and what's the first thing i hear? wikipedia blackout :) [17:52:04] :-) [17:59:04] robh: question regarding srv199....the SATA port A is showing not available but during post i see that the HDD type and size is being recognized. [17:59:27] I opened the case back up and reseated the daughter card and checked cable ends. ....any thoughts? [17:59:49] gimme moment, middle of something else [18:00:19] take your time [18:00:22] cmjohnson1: showing not available where? [18:01:28] in post and in configuration [18:01:45] ok so it has an error about it? 
[18:02:02] i do not know what you mean 'showing unavailable' really, sorry [18:02:03] not really an error...just says SATA port A not available [18:02:06] ok [18:02:12] well, if its turned on in bios [18:02:21] and sees disk, it sounds really funky [18:02:37] sounds like bad controller, which on those i think means bad mainboard, i dont recall [18:04:48] i am going to call Dell [18:04:56] just wanted to see if you had any ideas [18:05:25] !log theme updated on blog along with setting limit back to 20 comments per page [18:05:27] Logged the message, RobH [18:05:28] guillom: ^ [18:05:37] oh thats much faster [18:05:37] ok, checking now [18:05:41] !log blog is instantly faster [18:05:43] Logged the message, RobH [18:05:51] :D [18:05:55] still slower than it should be [18:06:10] but getting better it seems to me [18:06:16] hmm 10 per page now [18:06:31] I'm logged in, should try it logged out [18:06:49] i see that, odd [18:07:17] apergos: last comment first is why [18:07:24] so the last 'page' holds up to 20 [18:07:25] but may have less [18:07:58] hmm last comment first means after every 20 ? comments the cache is invalidated? [18:08:04] for the other pages [18:08:11] after every comment. [18:08:16] it recaches the entire page i imagine [18:08:25] meh [18:08:33] guillom: whatcha think? [18:08:39] should we display first comment page by default [18:08:41] or last? 
[18:09:06] each page has the number of comments [18:09:17] so it's not going to help, that number changes with every comment approved [18:09:29] ah, I don't know; I think it's ok to list them chronologically if it's better for caching [18:09:30] anyways it's better, people will have a much smaller thing to regenerate and load [18:09:33] i dont think the cache invalidating is an issue [18:09:40] the performance is now just fine [18:09:43] imho [18:09:45] yup [18:09:56] so caching is no big deal, or the invalidation thereof [18:10:04] it was loading every comment for every direct article link [18:10:07] that was killing shit [18:10:23] i bet the blog could have continued to sit on hooper with the proper code thats now in place [18:10:27] ;] [18:11:19] the load just dropped to near nothing, heh [18:11:38] 1.17 [18:11:54] was over 10 at points before [18:13:13] well now you can relax :-S [18:13:15] :-D [18:23:16] RobH: what's ms9? [18:23:28] the new c series host, due in on monday [18:24:52] oh. I thought you were talking about two separate hosts (c series and ms9, one due thurs and the other mon). [18:24:56] this makes more sense. [18:25:06] i got shipment notification earlier [18:25:12] but i thought it would be here thursday [18:25:13] i was wrong [18:25:15] though to match the ms-fe$ hosts, should it be ms-be#? [18:25:19] its monday, tracking info is updated now [18:25:35] oh, figured ms was backend by default since we have the 45XX series as them [18:26:00] but we can rename them as they deploy into swift service i suppose (when they reinstall)? [18:26:27] all i know is mark hates renaming so I'd rather get it right first. I'm fine to rename them anytime. [18:29:23] so which ms servers are acting as swift storage hosts now? [18:30:43] ms-store1, ms-proxy1 (just brainstorming) [18:31:09] where will the other rings/servers live? 
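Stepping back to the comment-pagination fix logged above: a back-of-the-envelope sketch of why capping pages at 20 comments helped so much. The 4800-comment count comes from the chat; the average rendered size per comment is an assumption for illustration only.

```python
# Rough cost model for rendering one post's comment section, before and
# after pagination. COMMENTS is from the chat log; AVG_COMMENT_BYTES is
# a hypothetical figure, not measured on the blog.
COMMENTS = 4800
PER_PAGE = 20                 # the limit the theme fix restored
AVG_COMMENT_BYTES = 2 * 1024  # assumed ~2 KB of rendered HTML per comment

before = COMMENTS * AVG_COMMENT_BYTES   # every comment rendered on every view
after = PER_PAGE * AVG_COMMENT_BYTES    # only one page's worth rendered

print(f"before: {before / 1024:.0f} KB per page view")
print(f"after:  {after / 1024:.0f} KB per page view")
print(f"reduction: {before // after}x less work per regeneration")
```

Whatever the true per-comment size, the ratio is fixed at 4800/20 = 240x less rendering (and cache-regeneration) work per view, which matches the load on marmontel dropping from over 10 to 1.17.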
[18:31:13] well, in the past it was ms# [18:31:17] and we have those other places [18:31:20] right [18:31:23] now we also have ms-fe# [18:31:31] i kinda hate the - now that i have it [18:31:44] container, account servers will be where? [18:32:13] ? [18:33:49] RobH: swift has at least 4 different kinds of servers. I know some will be separate pools (proxy vs. object storage). idk where the other 2 types i can think of offhand (container, account) will live [18:34:04] my understanding is we will have frontend and backend [18:34:11] storage bricks being backend [18:34:20] RobH: (they could all be on every box but it was decided that proxy and storage would be different boxen) [18:34:22] but maplebed is setting it up [18:34:42] so renaming servers that are deployed is hell [18:34:44] and we dont do it [18:34:53] so i assumed we would keep ms# for the storage hosts [18:34:58] and ms-fe for the frontends [18:35:22] we can rename ms# swift storage hosts, but means reinstall and such (or someone digging in the files on the local host for renaming, which is painful) [18:35:26] so we dont do the latter [18:35:32] jeremyb: proxy gets its own host. object, account, and the other one will all live on the same backend storage nodes. [18:35:42] so we'll only use 2 different servers. [18:35:54] maplebed: on all backends or only some? [18:35:59] all. [18:36:01] k [18:36:08] (we're only starting with 4 backend nodes, so ...) [18:36:12] right [18:36:33] i expect more nodes soon (especially for a clone in another DC [18:36:35] ) [18:36:36] maplebed: for the existing ones that are online, we have some right? [18:36:46] for those, would you handle the renaming or reinstall on them? [18:36:55] cuz it means repuppet, and all that [18:37:10] jeremyb: I was intending to use the cluster syncing stuff for the other DCs rather than extending one cluster into multiple locations. 
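The layout maplebed describes above (proxy servers on dedicated frontends; object, container, and account servers colocated on every backend storage node) can be sketched as a simple mapping. The host names and the number of frontends here are illustrative assumptions, not the real allocation; only "ms-be1" and the ms-fe# naming appear in the log.

```python
# Hypothetical swift service layout matching the split described in the
# chat: proxies on their own hosts, the three storage ring services on
# each of the four planned backend nodes. Host names are illustrative.
frontends = ["ms-fe1", "ms-fe2"]
backends = ["ms-be1", "ms-be2", "ms-be3", "ms-be4"]

layout = {host: ["proxy"] for host in frontends}
layout.update({host: ["object", "container", "account"] for host in backends})

for host, services in sorted(layout.items()):
    print(f"{host}: {', '.join(services)}")
```

Only two server roles exist ("2 different servers", as maplebed puts it), which is what makes the ms-fe#/ms-be# naming scheme sufficient.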
[18:37:17] i dont wanna name half of them one thing, and half the other, thats even more confusing (the storage hosts) [18:37:32] so if we are going to rename the existing storage hosts and do the reinstall on them and such, thats fine [18:37:34] we can rename them [18:37:35] RobH: we're not using any existing hardware in the to-be-built production cluster. [18:37:39] maplebed: yeah, well i was just checking the current status on relevant features [18:37:41] oh, ok [18:37:44] then thats fine [18:37:47] we dont have to call this ms9 [18:38:04] the ms-fe boxes recently arrived (so I guess they're existing, but theyre new) [18:38:07] i thought we had existing ms hardware allocated [18:38:16] jeremyb: yeah, that. I don't think syncing is in the version I'm currently working with. [18:38:20] right, and they arent really in use if i recall [18:38:21] RobH: only for the testing cluster. [18:38:25] so we can change the name on those now as well [18:38:36] maplebed: ok, and test cluster will migrate back to other use after this is done right? [18:38:46] jeremyb: but it'll be a few months before I set up the second cluster and need it, so the newest version will likely have it by then and we'll upgrade. [18:38:50] RobH: correct. [18:38:52] cool [18:39:02] I'm going to build a new test cluster in labs (cuz we do need somewhere to test) [18:39:06] maplebed: so what do you want to call them? [18:39:12] ms-be# [18:39:26] happy enough with the ms-format? [18:39:32] maplebed: https://blueprints.launchpad.net/swift/+spec/cactus-multi-region [18:39:42] bookmarked. thank [18:39:44] thanks. [18:40:13] ok, I gotta bail to cross the bay. see you in a few. [18:40:27] bye [18:40:32] * jeremyb also bails [18:40:58] maplebed: tickets updated with ms-be1 [18:41:06] \o/ [18:44:58] Jeff_Green: you there? [18:45:03] yep [18:45:20] the fundraising nrpe checks are no longer in use, correct? 
[18:46:09] the junk that's in puppet/files/nagios/nrpe_local.fundraising.cfg [18:46:11] they're not used on silicon or payments*, but I believe they're still in use on grosley/aluminium/erzurumi [18:46:25] looking [18:46:43] also, holy shit spence is so slow.... [18:46:49] oh yes [18:47:20] I'm not seeing any of them in aluminium.cfg [18:47:28] this is stuff that would run in theory on aluminium [18:47:45] I believe that it was for grosley and aluminium [18:47:53] yeah it'd be same for both [18:48:19] fwiw I did not attempt to change any of this [18:48:34] at least not that I recall [18:48:45] I think we talked about this at one point, and I asked if you could do away with it [18:49:05] it was a number of subdirectory checking script [18:49:14] for jenkins [18:49:20] ya maybe so? I approached the problem the other way around--made a cron script to keep the subdir count sane [18:49:36] gotcha [18:49:36] we [18:49:51] I'm fine with you ripping this out if you want [18:50:08] well, it's looking to me like these checks are no longer in use, so I was going to delete nrpe_local.fundraising.cfg out of git [18:50:14] k [18:50:25] but wanted to at least check in first [18:50:42] cool. I shall do so without fear [19:09:51] notpeter: cool thx [19:10:59] unrelated question: I'm (finally) about to start imaging the new eqiad payments boxes and I'm stuck on whether to rename them first. the existing boxes are payments[1-4] and the allocated new boxes are selenium, bromine, etc. what to do what to do? [19:11:47] are we moving away from function-based hostnames? [19:12:02] Jeff_Green: that is the million dollar question ;) [19:12:36] damn. [19:12:40] isnt this like they have hostnames (bromine, etc..) AND service names? 
(function based) [19:13:14] the old ones do not have 'other' names [19:13:39] but yeah they're accessed from the public via payments.wikimedia.org and LVS [19:16:00] Jeff_Green: I think that usually if it's a cluster, it gets named after the service in some way [19:16:01] keep the hostnames and then also add DNS aliases payments1001 - payments1004? (it seems the scheme is to start counting from 1000 in eqiad, right) [19:16:34] whereas if it's a one-off service that hosts something that's accessible via a c-name, we use the misc names [19:16:54] i see [19:17:02] I'd actually say rename to payments1001-1004 [19:17:03] so yeah then rename does make sense [19:17:07] but I'd also ask rob [19:17:11] ok [19:17:12] they'll also need to be relabeled [19:17:23] both in racktables and irl [19:17:34] yup [19:46:57] RobH - we need to postpone tomorrow's work on Locke till a later date [19:47:31] woosters: ok, any reason? [19:48:07] because they want to make sure they capture the banner logs [19:48:10] for tomorrow [19:49:04] so anytime after tomorrow is fine. I will inform faulkner, nimish and erikZ [19:55:22] PROBLEM - Mobile WAP site on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:47] New patchset: Pyoungmeister; "cleanup. removing a nrpe.cfg that's no longer used." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1931 [19:58:04] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1931 [20:02:19] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1931 [20:02:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1931 [20:04:29] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [20:07:22] RECOVERY - Mobile WAP site on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 1642 bytes in 8.835 seconds [20:25:05] New patchset: Bhartshorne; "added new SOPA filter to emery" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1932 [20:25:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1932 [20:25:58] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1932 [20:25:59] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1932 [20:29:25] en.planet updates fixed [20:44:31] hey [20:55:15] hi [21:04:32] mutante: when are you flying? [21:05:19] mark: i'm not :/ [21:05:33] oh [21:05:39] wasn't it you who texted? [21:05:58] no, the other Daniel, WMF DE [21:06:03] but i talked to him and nosy [21:06:22] can you go to evoswitch as well on friday? [21:06:50] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 167 MB (2% inode=60%): /var/lib/ureadahead/debugfs 167 MB (2% inode=60%): [21:07:28] mark: yea, i was going to offer that in case you'd like me to, but they think they dont need anybody, and just access [21:07:47] since when is that up to them? 
[21:08:57] they either need an escort from the dc (fine for just swapping a disk or something similar short), or one of us present [21:13:54] yep, will talk to nosy again tomorrow [21:26:31] PROBLEM - Puppet freshness on spence is CRITICAL: Puppet has not run in the last 10 hours [21:28:22] RECOVERY - Disk space on srv223 is OK: DISK OK [21:57:43] New patchset: Jgreen; "adding file_mover@emery to logmover account class (used on storage3)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1933 [21:57:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1933 [21:58:16] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/1933 [21:58:16] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1933 [22:09:16] ok, didnt get lunch, delivery place fubared my order, afk a bit [22:16:30] RECOVERY - Puppet freshness on spence is OK: puppet ran at Tue Jan 17 22:16:16 UTC 2012 [22:20:00] New patchset: Ryan Lane; "Adding support to modify memcached's bind ip, and adding memcached to marmontel" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1934 [22:20:07] maplebed: ^^ review? [22:20:13] sure. [22:20:14] thanks [22:20:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1934 [22:21:15] New patchset: Asher; "prep for throwing varnish in front of single server blog" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1935 [22:21:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1935 [22:21:51] I really wish we could switch our default memcached port to 11211 (its default) rather than 11000... 
::sigh:: [22:22:21] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1935 [22:22:22] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1935 [22:22:52] Ryan_Lane: will 2G be enough memory for the cache? [22:23:03] I think so, yeah [22:23:11] It looks like marmontel has ~6 available. [22:23:20] varnish is going to eat some too [22:23:28] and we need to leave a decent amount around for apache [22:23:31] oh, you're running varnish on the same host? [22:23:35] yeah [22:23:39] huh. [22:23:49] I don't think we have time to run this through the cluster [22:23:50] I would have figured we'd separate them. [22:24:01] not the cluster, just a dedicated host. [22:24:08] but meh - I bet it'll be fine. [22:24:20] varnish will offload most of the cpu. [22:24:23] yeah, the amount of stuff that actually needs to be cached is pretty tiny over all [22:24:29] it should all fit in memory [22:24:40] PROBLEM - Puppet freshness on db1045 is CRITICAL: Puppet has not run in the last 10 hours [22:24:44] the only issue is comments purging the cache [22:25:08] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/1934 [22:25:14] Ryan_Lane: +1 commit. [22:25:23] cool. thanks [22:25:41] Ryan_Lane: are you setting a timeout on the cache? I think if it's 1 or 2 minutes, it'll take 90% of the load and comments will show up quickly enough. [22:25:43] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1934 [22:25:43] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1934 [22:25:51] it purges [22:26:08] I may be able to set expires and not purge, though [22:26:24] * maplebed looks at comment throughput [22:27:16] a sample of 2 minutes gives an average of 8 comments per minute. [22:27:28] setting a cache timeout of 1m might be nicer. 
[22:28:08] even if it is purging the cache 8 times per minute, that's probably still catching thousands of requests, so we win either way. [22:30:27] access.log shows 900 hits/min, of which 8 are comments. So we get ~100 cache hits for every purge, in theory. [22:32:39] New patchset: Asher; "reorg probes to prevent error on unused bits probe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1936 [22:32:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/1936 [22:33:00] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1936 [22:33:01] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1936 [22:41:22] New patchset: Asher; "blog: swap varnish and apache between ports 80, 81" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1937 [22:42:05] Ryan_Lane: just to double check.. the apache vhost change in ^^^ is all that's needed to move apache to 81 and not listen at all on 80? 
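The cache-hit arithmetic maplebed walks through above (900 hits/min from the access.log sample, ~8 comments/min, each approved comment purging the cached page) can be sketched as:

```python
# Estimate how many cache hits varnish serves between purges, using the
# figures quoted from the access.log sample in the chat.
hits_per_min = 900
purges_per_min = 8   # each approved comment purges the cached page

hits_per_purge = hits_per_min / purges_per_min
print(f"~{hits_per_purge:.0f} cache hits served for every purge")
```

The exact ratio is 112.5, on the order of the "~100 cache hits for every purge" estimate in the log; either way, the cache absorbs the overwhelming majority of requests even with purge-on-comment enabled.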
[22:43:48] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1937 [22:43:49] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1937 [22:47:09] !log ram only varnish instance now running on marmontel in front of apache/wordpress [22:47:11] Logged the message, Master [22:48:49] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/1918 [22:49:31] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1918 [22:49:31] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/1918 [22:55:43] * Jamesofur thanks binasher for the help with the blog improvements :) [23:18:58] Ryan_Lane: on marmontel -- curl -I 'http://127.0.0.1/2012/01/16/wikipedias-community-calls-for-anti-sopa-blackout-january-18/comment-page-265/#comments' === 404 [23:28:35] New review: Bhartshorne; "(no comment)" [operations/software] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/1908 [23:28:35] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/1922 [23:28:36] Change merged: Bhartshorne; [operations/software] (master) - https://gerrit.wikimedia.org/r/1908 [23:48:17] I'm out, see folks tomorrow