[00:09:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.015 seconds [00:10:44] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [00:16:45] !log Scap complete [00:16:53] Logged the message, Master [00:42:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:58:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [01:19:53] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [01:26:20] New patchset: Ryan Lane; "Initial commit of deployment module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27478 [01:31:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.021 seconds [02:20:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:35:29] PROBLEM - LVS HTTP IPv4 on wiktionary-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [02:36:59] RECOVERY - LVS HTTP IPv4 on wiktionary-lb.eqiad.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 62702 bytes in 0.140 seconds [02:56:49] TimStarling, I saw a report on WP:VPT about search updates being broken again on en.wp, are you aware of this / did you just fix it earlier? [02:58:34] I haven't looked at it or fixed it, I'll have a look now [02:58:38] thanks [03:04:04] TimStarling, fwiw, looking at the report at https://en.wikipedia.org/wiki/Wikipedia:VPT#Search_list_not_updating I am able to find titles from yesterday in the opensearch suggestions, but I'm not able to retrieve the text of those same pages in the fulltext search [03:09:32] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [03:12:09] so this report is from November 14? [03:14:46] no, the one I linked to is from nov 25 [03:16:57] https://en.wikipedia.org/wiki/Wikipedia:VPT#Search_list_not_updating ? [03:17:26] y [03:17:36] ah, I had an old cached version of that page somehow [03:18:01] the same section title was there, but from 11 days earlier [03:18:05] heh [03:30:24] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [03:45:25] I haven't had to look at this incremental index code before, but I'm getting there [03:46:38] we have this: [03:46:39] root@searchidx2:/a/search/indexes# cat status/enwiki [03:46:39] #Last incremental update timestamp [03:46:39] #Fri Nov 23 03:46:19 UTC 2012 [03:46:40] timestamp=2012-11-23T03\:44\:02Z [03:46:57] that is probably when the problem started [03:47:27] so around the time I did my upgrade [04:12:06] !log on searchidx2: restarted incremental updater [04:12:14] Logged the message, Master [04:13:02] I enabled INFO level logging temporarily, we will see what happens [04:13:24] but it seems to be working for now, it's up to arzwiki [04:44:03] TimStarling: apergos booted (or just started?) something on idx2. maybe it should have been done for 2 services and he only did one? [04:45:15] hrmmm, not seeing the !log for it [04:47:00] nope, i'm confusing with search13 apparently (based on my IRC log) [04:49:36] odd... 
you were looking at enwiki which was not one of the ones broken 3 days ago (AFAIK) [04:50:13] PROBLEM - Host srv238 is DOWN: PING CRITICAL - Packet loss = 100% [04:50:47] apergos was looking (3 days ago) at least at {fr,en}wikisource on searchidx2 and he said then that the files were recent. but that was 2 days after the timestamp tim pasted above [04:51:19] (i guess apergos meant the filesystem timestamps on the index files themselves) [05:16:07] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [05:46:39] jeremyb: there are two copies running now, one started at 04:09 and one at 04:24 [05:47:28] gerrit-wm: where's my lint check!? [05:47:31] !log on searchidx2: killed extra IncrementalUpdater daemon [05:47:38] Logged the message, Master [05:48:40] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27478 [05:48:47] screw it. no lint check [05:51:46] New patchset: Ryan Lane; "Add a role for the deployment module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27479 [05:51:55] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27479 [05:55:36] New patchset: Ryan Lane; "Add deployment code to sockpuppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35597 [05:56:14] -_- [05:56:23] seems the zuul changes means no more ops lint checks [05:57:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35597 [07:25:05] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [08:05:23] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 226 seconds [08:07:02] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 327 seconds [08:10:02] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [08:10:11] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 2 seconds [08:16:38] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 303 seconds [08:30:27] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 272 seconds [08:50:00] New patchset: Hashar; "Jenkins test please ignore." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35135 [08:54:53] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [08:54:54] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:54:54] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:55:33] New patchset: Hashar; "Jenkins test please ignore." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/35135 [09:10:56] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [09:10:56] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [09:25:47] PROBLEM - Puppet freshness on snapshot1002 is CRITICAL: Puppet has not run in the last 10 hours [09:29:32] doh parameterized class are killing me :-] [09:29:37] class { 'java::openjdk': version => '1.6', jdk => true, } [09:29:38] class { 'java::openjdk': version => '1.7', jdk => true, } [09:29:39] duplicate definitions :( [09:38:15] gotta use a define instead of class [09:44:51] New patchset: Hashar; "testing out multiple defines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35601 [09:52:03] New patchset: Hashar; "testing out multiple defines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35601 [09:58:58] New patchset: Hashar; "convert java::openjdk to a define" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35601 [09:59:13] solved [10:00:48] New review: Hashar; "This cause puppet to emit a duplicate definition errors for java::openjdk. The workaround is to use ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/34863 [10:11:23] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [10:13:06] hashar: oh yes, of course [10:13:11] should have caught that when I merged it [10:13:12] apologies [10:15:28] paravoid: I am not yet a pro :-] [10:15:35] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds [10:15:42] on labs I only tested installing one version, not both :( [10:15:53] paravoid: so half tested == half working [10:16:01] and good morning! enjoy your coffee [10:18:06] getting one myself, brb [10:20:07] New review: Silke Meyer; "So, what's next? Can we use this in mediawiki.pp even if parts if it are copied? Or shall we drop it..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/35173 [10:20:14] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 1 seconds [10:55:55] New review: Silke Meyer; "Works for mediawiki installation but breaks wikidata installation: wikidata.pp needs to be adapted, ..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/35293 [11:20:33] New review: Silke Meyer; "Cool, works!" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/35313 [11:20:39] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours [11:30:03] Silke_WMDE: hi; i just noticed your issue on github.com/atdt/wmf-vagrant, will answer tomorrow! [11:41:09] ori-l: you must sleepppp [11:41:21] ori-l: or you are never going to be productive tomorrow morning :-] [11:41:39] i've been sleeping since 10pm [11:41:46] no idea what you're talking about [11:41:47] i'm not here [11:41:48] etc [11:41:50] ah ok [11:41:51] sorry [11:41:52] :) [11:41:55] I though you were awake [11:42:12] seems I am dreaming myself [11:42:23] anyway, lunch time [11:42:55] enjoy! 
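On hashar's duplicate definition error above: a parameterized class may only be declared once per catalog, so the two class { 'java::openjdk': ... } declarations collide, while a define can be instantiated once per unique title. That is the workaround merged in change 35601. The sketch below is illustrative and not the actual contents of that change; package names assume Ubuntu's openjdk-N-jdk / openjdk-N-jre naming.

    # Minimal sketch of the class-to-define conversion. Each unique title
    # yields a new instance, so both Java versions can coexist in one catalog.
    define java::openjdk($version, $jdk = false) {
        if $jdk {
            $package = "openjdk-${version}-jdk"
        } else {
            $package = "openjdk-${version}-jre"
        }
        package { $package:
            ensure => present,
        }
    }

    # Both declarations now work side by side, no duplicate definition error:
    java::openjdk { 'openjdk-6': version => '6', jdk => true }
    java::openjdk { 'openjdk-7': version => '7', jdk => true }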
[13:10:07] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [13:28:45] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35601 [13:31:07] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [13:50:31] New patchset: Demon; "Reformat gerrit hooks to use more python-esque style" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35615 [13:55:56] RECOVERY - Puppet freshness on gallium is OK: puppet ran at Wed Nov 28 13:55:29 UTC 2012 [13:56:14] New review: Hashar; "ideally we would want to make them pass pep8 linter:" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/35615 [14:13:03] New review: Ottomata; "Interesting!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35601 [14:14:40] New review: Faidon; "A few days ago. I suggested a generic java module (instead ot the original idea) to Hashar esp. thin..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35601 [14:16:32] New review: Hashar; "I have introduced it with https://gerrit.wikimedia.org/r/#/c/34862/ a few days ago :-] I was not a..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35601 [14:17:18] hey hashar [14:17:20] https://gerrit.wikimedia.org/r/gitweb?p=operations/puppet.git;a=blob;f=manifests/misc/java.pp;h=52bbfcddd4e41fbfd0e60b407de174a1b1ea4114;hb=HEAD [14:17:46] hello ottomata :-) [14:18:02] morning! [14:18:04] yeah I have been looking at that one [14:18:12] ah ok, wasn't sure if you knew of it :) [14:18:19] I haven't even tried to look for an existing java class, so I just created my own :( [14:18:23] i saw your change to openjdk.pp, I wasn't aware of that one [14:18:46] i created this one back uhhmmmm, in the spring or early summer or something [14:18:55] as Faidon and I replied on change 35601, the java module has been introduced this week [14:18:58] or maybe last week [14:19:01] anyway, really new one [14:19:12] it is uberly simple so should not be too much of trouble :-] [14:19:18] oh sorry, missed you rreply there [14:19:48] so hmm I ended up reinventing the wheel [14:23:16] hm [14:23:31] maybe we should just move java.pp to the java module [14:23:46] and just put it in init.pp [14:24:15] mind if I make that change? I'll ask you and faidon for review [14:24:51] I'd more than happy to +2 a change that merges those two efforts [14:25:23] ok cool, i'll do that then, it shoudl be the same to hashar's stuff, but we'll need him to veriy [14:26:33] ottomata: I am happy with anything that move code from the main puppet dir to a submodule [14:26:50] ottomata: making your existing java.pp an init.pp to java module makes a lot of sense [14:26:58] + that is definitely not going to break anything [14:27:16] I don't think java.pp uses any template / file so that should be fine. [14:27:50] cool, which java version do you want to be the alternative default? [14:27:51] 6 or 7? [14:29:03] as in, which version do you want to invoke when you just run `java` [14:29:03] ? [14:29:29] looks like you have 6 as default on gallium right now [14:33:09] ottomata: we have some Oracle version installed to build the mobile application [14:33:20] ottomata: and going to need openjdk 6 and 7 [14:33:32] I guess the default should be whatever default is shipped by Ubuntu [14:33:34] oh, ok, so this should install all 3? [14:33:38] this is probably oracle 6 [14:33:45] i think so [14:33:47] install all the javas! 
[14:33:57] the default on gallium is openjdk6 right now [14:34:00] ottomata: I still have to check with the mobile team if openjdk is fine with them [14:34:04] I think we want to drop oracle [14:34:10] (don't quote me on this :-) ) [14:34:21] want me to install oracle too then? [14:34:45] ah Ubuntu has default packages: default-jdk and default-jre [14:34:53] I would expect a puppet "java" class to install those [14:35:13] that is 1.6 on precise [14:35:21] maybe the next ubuntu version will have 1.7 by default [14:35:22] one is just the runtime environment [14:35:25] the other is a development kit [14:35:32] yeah, those are just package aliases to the real package names [14:35:54] on gallium I would prefer we do not install Oracle from puppet [14:35:56] but the java define does choose a default for you if you don't specify [14:36:03] but if you are installing multiple versions [14:36:07] I am not sure which version currently runs nor what is going to be installed by puppet [14:36:08] if you do it manually [14:36:18] the default will change based on the last one you install (90% on that) [14:36:20] and I am afraid it might cause trouble with the existing jobs building the mobile apps [14:36:28] so, this define lets you pick which one you want to be the default java [14:36:34] by specifying alternative => true [14:36:43] ok [14:36:54] so cool, puppet will only install openjdk6 and 7 [14:36:57] and make 6 the default [14:37:04] (since that is what you have as default now) [14:37:22] yup :) [14:37:31] could find out later on if we still need the Oracle version on gallium [14:37:37] if it is no longer needed, I will drop it :-] [14:38:01] as for the java.pp manifest you wrote, you are checking the $::lsbdistrelease to find out the package prefix [14:38:21] I think that should instead rely on the default packages provided by Ubuntu ( default-jre and default-jdk ) [14:38:32] but that is not much of a problem right now since our latest Ubuntu version is Precise [14:38:51] would be a problem whenever we start installing new Ubuntu versions though since we will end up forcing an install of 1.6 [14:38:58] I guess we can fix that later [14:40:11] hmm, ok I will add that to the TODO [14:41:05] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35540 [14:45:02] New patchset: Ottomata; "Merging duplicated java installation efforts into java module." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35621 [14:45:17] <^demon> !log fenari:/home/wikipedia/common now reflects Ia319794c - rm'd all traces of submodule [14:45:26] Logged the message, Master [14:46:21] hashar, paravoid: https://gerrit.wikimedia.org/r/#/c/35621/ [14:46:55] ottomata: reviewing :) [14:49:10] New review: Hashar; "Looks fine, sorry for the earlier code duplication." [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/35621 [14:49:25] one more module, one less puppet manifest in global space [14:49:27] \O/ [14:49:54] <^demon> I can't seem to sync-dir multiversion :\ [14:50:25] wee, thank you, should I let paravoid merge it or shall I go ahead? [14:51:12] !g Ieea46ba6d92d32a83c8795992a6a6c4012a18d8d [14:51:12] https://gerrit.wikimedia.org/r/#q,Ieea46ba6d92d32a83c8795992a6a6c4012a18d8d,n,z [14:51:27] ottomata: fine to me. Merge is up to you :-] [14:51:38] ottomata: I guess since you have rights to merge you can go ahead and merge that in [14:51:53] oook [14:51:56] ^demon: does it complain about the path already existing or something ? 
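hashar's suggestion above, relying on Ubuntu's default-jdk / default-jre metapackages instead of deriving a version prefix from $::lsbdistrelease, could look roughly like this. The define and parameter names here are hypothetical, not the API of the java module merged in change 35621.

    # Hypothetical sketch: install the distro-default JDK/JRE unless a
    # specific version is requested, so a new Ubuntu release never gets a
    # forced openjdk-6 install. default-jdk and default-jre are real Ubuntu
    # metapackages; everything else here is illustrative.
    define java::runtime($version = undef, $jdk = true) {
        $suffix = $jdk ? {
            true    => 'jdk',
            default => 'jre',
        }
        if $version {
            $package = "openjdk-${version}-${suffix}"
        } else {
            $package = "default-${suffix}"    # tracks whatever the release ships
        }
        package { $package:
            ensure => present,
        }
    }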
[14:52:02] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35621 [14:52:02] <^demon> No, it was just slow. [14:52:03] New patchset: Hashar; "LBFactory_Multi setup for labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29344 [14:52:08] <^demon> paravoid: Who was doing the apache upgrades? [14:52:18] notpeter: [14:52:31] if we're talking about the same upgrades [14:52:33] what's the issue? [14:52:45] <^demon> I'm getting a ton of host authenticity errors from rsync. [14:53:02] ok, hashar, merged on sockpuppet, go ahead and try on gallium [14:53:09] <^demon> All in the mw* range, plus a few search idxs and hume. [14:53:11] i'm changing locations, i'll be back on in 15ish [14:54:34] New review: Hashar; "This has been on beta for quite some time, lets deploy it." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/29344 [14:54:34] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/29344 [14:55:49] I liked hashar's version better tbh :) [14:56:02] java { 'java-6-openjdk': version => 6, alternative => true } [14:56:22] is there a java-7-openjdk version = 6 [14:56:27] win [14:57:08] <^demon> There is a 7, but those aren't the package names anymore. [14:57:21] <^demon> It's openjdk-6-jre, openjdk-7-jdk, so on. [14:57:51] no [14:57:56] that is just a title for the define [14:57:59] it is arbitrary [14:58:15] you can refer to it later if something depends on it, for example: [14:58:32] <^demon> Indeed. But better to use the actual package name, rather than the old one that's obsolete. [14:58:39] <^demon> Otherwise you'll just confuse someone like me ;-) [14:58:44] package { "java-crazy-thing": require => Java["java-6-openjdk"] .. [14:59:03] so the define doesn't care what you put there, [14:59:14] hashar could change that to whatever he wanted [14:59:22] but, the reason the define doesn't use the package names explicitly [14:59:28] is because they change in different versions of ubuntu [14:59:50] https://gist.github.com/4161806 [15:00:23] New patchset: Hashar; "beta: use IP for database hostname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35623 [15:00:27] and not all versions and distributions are available on all OS versions [15:00:53] New review: Hashar; "beta is a bit broken without that. File content is properly safeguarded and not going to interact wi..." [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/35623 [15:00:53] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35623 [15:01:00] so, paravoid, if you like, we could change this line: [15:01:00] java { 'java-6-openjdk': version => 6, alternative => true  } [15:01:00] to [15:01:00] java { 'muffins': version => 6, alternative => true  } [15:01:13] <^demon> Muffins! [15:01:38] package { "breakfast": require => "muffins" } [15:01:49] require Java["muffins"] [15:02:00] they've already got java (coffee?) beans, eh? [15:02:08] ok and it is coffee time for meeeee [15:02:10] back in a bit [15:02:42] sbernardin: hey, can you do me a favor and see the brand name of the fiber raceway in rack d3-sdtpa? [15:02:50] chris and i are going to order some, or you can take a photo [15:02:52] or both. 
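To make ottomata's point concrete: the define's title is only a handle ('muffins' works as well as 'java-6-openjdk'), and other resources reference the instance by that title. The sketch below also shows one way an alternative => true flag could be honored via update-java-alternatives; the define name, the alternatives name (precise appends an architecture suffix such as -amd64), and the guard command are all illustrative, not the module's actual implementation.

    # Arbitrary titles, plus a sketch of selecting the default java.
    define java_demo($version, $alternative = false) {
        package { "openjdk-${version}-jdk":
            ensure => present,
        }
        if $alternative {
            # Point /etc/alternatives/java at this JVM; the 'unless' guard
            # keeps the exec idempotent. The alternatives name varies by
            # release and architecture, so this is illustrative only.
            exec { "set-default-java-${title}":
                command => "/usr/sbin/update-java-alternatives --set java-1.${version}.0-openjdk",
                unless  => "/bin/readlink /etc/alternatives/java | /bin/grep -q java-${version}-openjdk",
                require => Package["openjdk-${version}-jdk"],
            }
        }
    }

    java_demo { 'muffins': version => '6', alternative => true }

    # Dependents reference the instance by its title, whatever it is:
    package { 'java-crazy-thing':
        ensure  => present,
        require => Java_demo['muffins'],
    }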
[15:02:59] but brandname is what i really need [15:06:14] sbernardin: you may need to ask miguel...we got it from him [15:06:58] RobH: let me check on it [15:14:07] it may be panduit [15:14:13] usually is, but just confirm, thx [15:17:05] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [15:29:04] sbernardin: ms-be3 goes on b2...ms-be5 will replace the existing one..i need to schedule this w/apergos before taking ms-be5 offline [15:30:21] yep, I need to set weights to zero a couple days in advance [15:30:31] did you see the proposed schedule? [15:31:42] yes i did...so we are ready for be3 and be5 when you are [15:32:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:32:30] so do you have the info on the deadline for shipping back the c2100s? (if so can you stuff that on the wiki page?) [15:32:49] I should set the weights to zero today on those two [15:34:15] apergos: once i have the information, I will update the page [15:34:34] ah, I thought someone had that from talking with dell, my bad [15:34:56] okay, so do you want to plan for a Friday? [15:35:06] we did but our time frames have changed [15:35:13] do the slow shipping [15:38:05] PROBLEM - Puppet freshness on sockpuppet is CRITICAL: Puppet has not run in the last 10 hours [15:38:13] er for which boxes? are they not here? [15:39:11] apergos: for all of them...they didn't arrive like they should...and we are still waiting on 10 for eqiad [15:39:34] which ones did come in? [15:39:53] RECOVERY - Puppet freshness on snapshot1002 is OK: puppet ran at Wed Nov 28 15:39:40 UTC 2012 [15:40:00] all of Tampa is onsite...not all of it is racked because they need to be changed 1:1 [15:40:05] yup [15:40:11] eqiad is missing 10 swift still [15:40:20] racking 2 today [15:40:27] ok well we can start with ms-be3 and 5 on friday if you want [15:40:36] okay...lets do that [15:40:52] we'll follow your schedule [15:41:25] it might get stretched (assuming we don't run into the dell deadline) depending on cluster performance, hopefully not [15:41:29] how long after be3 and 5 are replaced should we wait until we do the next 2? [15:41:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [15:42:20] well, a few days, and the question is how few we can make it [15:42:26] if i recall correctly it was 45 days but since we haven't received the last group we have some extra time [15:42:32] great [15:43:03] ah do we have ssds for all the new boxes in tampa? [15:43:14] that was the other q I had on the wiki page [15:44:40] yes, we do [15:44:53] yay [15:45:14] ok I want them all to go in with ssds then (I know, after we said not, but then 4 boxes are being pulled for ceph testing so I want them all in) [15:45:38] okay...so all of them get ssds [15:45:49] yep [15:45:53] sbernardin: ^ [15:46:11] in tampa. [15:46:14] we will need to add the 2 ssd's to all of the 720's now [15:46:37] RobH: yes...It's Panduit [15:46:55] yes in tampa...we will deal w/eqiad once they all arrive [15:47:03] yeah, I don't know what is wanted there. [15:47:21] so we are set for ms-be3 and 5 for friday? 
[15:47:40] we will be [15:47:49] ok...sounds good [15:47:53] I have to head out in a minute but I have new rings ready to go for when I get back [15:48:11] and I'll watch over the next little while to make sure everything behaves properly [15:48:19] we'll chat friday mornign your time anyways [15:48:32] ok [15:49:44] sbernardin: make sure you check w/ apergos and myself before making any changes [15:53:18] backin a little while [16:02:14] RECOVERY - NTP on snapshot1002 is OK: NTP OK: Offset -0.002396702766 secs [16:14:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:23:17] does modification of Squid configuration a al http://wikitech.wikimedia.org/view/How_to_block_a_remote_loader require root privileges? [16:24:42] TIAS? [16:25:26] what if it explodes?:P [16:29:41] sooo? [16:33:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.025 seconds [16:41:26] New patchset: Dereckson; "(bug 42077) Namespace configuration for ba.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35642 [17:05:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:16:15] MaxSem: yes, squid config changes need root [17:16:59] paravoid, could you do a little change for me then: https://bugzilla.wikimedia.org/show_bug.cgi?id=40919#c10 [17:19:11] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.218 seconds [17:19:53] huh, didn't know we had mobile stuff in squid [17:21:08] I wonder if pushing squid configs is okay or not during the FR :-) [17:21:35] Ryan_Lane: any idea if it is? [17:21:36] the only mobile stuff squids do is redirect people to .m. domains [17:22:11] paravoid: I'd imagine it is [17:22:17] it's just the redirects, yeah [17:22:30] it's the redirects, but it's pushed on all servers [17:22:39] assuming no one fucked up the redirects it should be ok ;) [17:22:46] we don't show banners on mobile [17:24:01] I find it funny that ops is asked to not do anything during the fundraiser, but devs can still do whatever they want [17:24:10] they break the site just as often as us [17:24:50] <^demon> I suppose keeping gerrit up during the fundraiser is probably a good idea. Scheduling the 2.6 upgrade might be tricky :\ [17:24:55] so, the risk here is a) me messing up the squid config (it's a single line change, so difficult but not impossible), b) that "mobi" in the UA might match desktop browsers and we'll end up redirecting random people on their desktops to the mobile site [17:25:03] where there's no FR banner :-) [17:25:27] yeah [17:25:30] that would be bad [17:25:35] ^demon: meh [17:25:50] ^demon: they can live without it [17:25:52] :) [17:26:06] seriously, though, I want the upgrade :) [17:26:15] <^demon> Nobody should have to depend on gerrit since git's distributed ;-) [17:26:38] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [17:26:49] technically people can deploy without gerrit [17:26:57] it's more of a pain in the ass, but it's doablw [17:27:06] <^demon> I've got the uuid thing mostly ironed out--just one last bug I'm trying to resolve. [17:27:07] doable* [17:27:16] why is Squid config not under version control/puppet? 
[17:27:47] because we're eventually moving away to varnish [17:28:01] anyway, the changes you want are in puppet [17:28:09] the change is already in [17:28:19] redirects.conf [17:29:33] MaxSem: replied to the bug report [17:29:41] paravoid, thank you [17:29:51] I didn't do it, so don't thank me yet [17:31:09] clarity is always good;) [17:31:24] ^demon: are the host fingerprints in your personal known hosts> [17:31:24] ? [17:33:48] <^demon> notpeter: I don't believe so. I've only got 5 entries in my known_hosts [17:34:52] gotcha [17:35:18] that's really weird... that was finished a week ago and no one else is seeing that, afaik [17:36:08] I mena, puppet's not that slow.... [17:43:17] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused [17:51:14] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.004 second response time on port 11000 [17:53:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:09:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.046 seconds [18:14:41] notpeter: can you please merge — https://gerrit.wikimedia.org/r/35653 [18:16:47] New patchset: Jgreen; "adding aluminium fundraising-ssl apache vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35654 [18:17:49] New patchset: Dereckson; "Cosmetic code/README fixes for multiversion" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35655 [18:17:57] paravoid: ping [18:18:01] mutante: ping [18:18:05] heh [18:18:12] AaronSchulz: ping [18:18:21] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35654 [18:18:28] AaronSchulz: Just so you felt like you were part of the ping parade [18:18:31] isn't there an rt ticket to solve this? ;) [18:18:37] AaronSchulz: indeed [18:21:40] Ryan_Lane: ping [18:21:52] in a openstack-dns meeting [18:22:51] should be done in 20-30 mins [18:22:52] Okay, so I really need someone to merge — https://gerrit.wikimedia.org/r/35653 it's for Wikipedia Zero Partner IP Live testing [18:22:56] woosters: ^^ [18:23:15] ah [18:23:18] this is an easy one [18:23:18] See https://office.wikimedia.org/wiki/Partner_IP_Live_testing_schedule for more information [18:23:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35653 [18:23:40] Ryan_Lane: cool thanks [18:23:56] Ryan_Lane: can you merge it on sock puppet and force a puppet run as well? [18:24:06] hmm [18:24:11] there's something wrong with the repo [18:24:16] New patchset: Dereckson; "(bug 41877) Namespace configuration for bxr.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35656 [18:24:27] error: Ref refs/remotes/origin/production is at 3eade19b5b602d58ed87f26ac98a5fa20efbbf1b but expected 7153ca6b5989854f392c134de9dc95e990277813 [18:24:31] ^demon: ^^ [18:24:56] ya perilly ? [18:25:07] preilly i mean :-) [18:25:10] woosters: it's preilly [18:25:31] woosters: it looks like Ryan_Lane is addressing the issue — I appreciate your response [18:25:41] cool [18:25:53] <^demon> Ryan_Lane: Try `git remote update`? [18:26:38] well, it says it merged it [18:27:27] preilly: need me to run puppet anywhere? [18:27:39] <^demon> This is why I tell everyone to use `git pull --ff-only` rather than blindly pulling. [18:27:44] Ryan_Lane: on the Varnish boxes for mobile [18:27:53] <^demon> $5 says local history has diverged from the origin. 
[18:28:24] ^demon: we don't pull [18:28:28] Ryan_Lane: so is it merged and live now? [18:28:33] should be [18:28:35] well [18:28:38] not live [18:28:47] which varnish servers are mobile? [18:28:50] can never remember [18:28:55] cp10?? [18:29:13] I really need to set up salt grains [18:29:26] cp1041,2,3,4 [18:29:40] ok [18:29:46] Ryan_Lane: I'm basing that off of http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&s=by+name&c=Mobile%2520caches%2520eqiad&tab=m&vn= [18:30:35] Ryan_Lane: cp104{1,2,3,4} [18:31:38] salt 'cp104[1-4]*' cmd.run 'puppetd -tv' [18:31:38] \o/ [18:31:38] yeah. I checked puppet [18:31:38] it's correct [18:31:39] ok. it's live [18:32:29] PROBLEM - SSH on ms1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:32:36] Ryan_Lane: Okay cool thanks! [18:32:43] yw [18:32:50] Ryan_Lane: can we rely on salt now? [18:33:02] I just used it for this [18:33:08] I used it to upgrade itself yesterday [18:33:14] Ryan_Lane: Okay cool [18:33:38] I need to add grains so that we can target clusters, but it's working otherwise [18:33:59] can I get a little help unfouling my git checkout? [18:34:06] I'm confused about its state [18:34:17] git reset --hard origin/production [18:34:21] I'm not sure what fucked it up [18:34:41] didn't work [18:34:51] what error are you getting? [18:34:52] "You are not currently on a branch, so I cannot use any 'branch..merge' in your configuration file." [18:35:24] i tried to amend a change, without thinking I did a git pull between [18:35:31] Jeff_Green: what does git branch tell you? [18:35:42] * (no branch) [18:35:42] production [18:35:42] test [18:35:52] git checkout production [18:35:57] Jeff_Green: so do a git checkout production [18:36:10] Ryan_Lane: jinks [18:36:24] oh ho. ok, then I do a stash and I'm back to square one! [18:36:24] heh [18:36:26] thanks! [18:36:29] yw [18:36:32] Jeff_Green: np [18:37:17] whoa that's interesting--the file I just tried to stash has inline diffs now. [18:37:27] err the file that I had modified before stashing [18:38:22] Errors running git rebase -i remotes/gerrit/production [18:38:50] * Jeff_Green face-desk [18:42:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:48:17] !log temp stopping puppet on brewster [18:48:26] Logged the message, notpeter [18:48:52] good morning guys :-) [18:49:27] I would like to add the "pep8" packages, I thought about creating a puppet module named python which will have a python::pep8 class simply requiring the package [18:49:49] is it worth having a module or should i just use the usual package { 'pep8' : ensure => present } ? 
:-] [18:52:28] New patchset: Jgreen; "fixed typo" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35661 [18:53:07] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35661 [18:55:35] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [18:55:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [18:55:35] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:57:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.162 seconds [18:58:21] New patchset: Pyoungmeister; "correcing mac on mc1016" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35662 [18:59:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35662 [19:00:14] New patchset: Dereckson; "(bug 42511) Namespaces configuration for ru.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35663 [19:08:27] New patchset: Hashar; "python module just containing pep8 for now" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35665 [19:08:27] New patchset: Hashar; "install python pep8 on contint server" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35666 [19:09:06] so....still no logmsgbot_ message [19:09:10] *messages [19:09:38] i guess file an RT ticket? [19:11:36] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [19:11:36] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [19:12:12] Reedy: you ready to deploy 1.21wmf5? [19:18:02] preilly: is there a way to straighten these monitor stands? [19:18:22] I almost got used to it being crooked and then remembered how annoying it is [19:18:39] AaronSchulz: what do you mean? [19:19:42] I guess lots of people have this monitor tilt then [19:20:59] Ryan_Lane: so the "The authenticity of host can't be established" [19:21:05] agh [19:21:30] eh? [19:21:42] AaronSchulz: where are you getting an error like that? [19:21:47] on sync-dir [19:21:53] ah [19:22:41] maybe salt will fix all this one day magically ;) [19:23:13] Ryan_Lane: gerrit-wm seems to be broken? [19:25:08] lemme restart the bot [19:25:53] why's irc such a pita? [19:29:31] probably because it's a 500 year old protocol that was never meant to be used as long as it has been [19:30:39] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:51] anyone seen Reedy around today? [19:39:21] New patchset: Demon; "Moving all non-pedias to 1.21wmf5" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35677 [19:39:26] robla, /me [19:40:41] MaxSem: Chad's filling in for him on the deploy now. 
just wondering if he gave any hints about being afk [19:41:01] nope, but he was here today [19:41:42] ok...thanks anyway [19:45:46] New patchset: Dzahn; "move misc::monitoring::htcp-loss to ganglia.pp and misc::zfs::monitoring to ganglia.pp (RT-720)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35680 [19:48:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds [20:07:34] New patchset: preilly; "Fixed init stop to wait for the process to stop" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/35345 [20:08:04] Change merged: preilly; [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/35345 [20:08:26] it is too soon for me too push out new rings, I am going to look at it tomorrow morning and see how network activity is on the new hosts again [20:08:38] we may have to move the schedule back, I hope not [20:08:48] ( cmjohnson1, sbernardin ) [20:09:20] apergos: okay, let me know [20:09:26] I will [20:09:56] New patchset: Ottomata; "Including Ori and Evan on analytics machines" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35687 [20:10:14] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35687 [20:12:39] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [20:19:53] mutante: so when might those fingerprint prompts going to be fixed? [20:20:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:11] <^demon> AaronSchulz: We were discussing it in -tech a bit ago. [20:23:10] how long has this been happening? [20:23:25] AaronSchulz: you're getting them too then? [20:23:32] <^demon> Nobody else had complained until I noticed it earlier today. [20:23:36] <^demon> Which makes me think "recently" [20:23:49] I was trying to sync an extension hours ago and was getting that [20:24:04] <^demon> Yeah, I noticed this morning when trying to sync-dir something. [20:24:06] woosters: this is a pretty big problem. we can't deploy code until this is fixed [20:24:08] <^demon> So at least since this morning. [20:24:22] is anyone looking at this at the moment? [20:24:38] <^demon> mutante was, but he had to leave I think. [20:24:41] robla - on it [20:25:02] thanks [20:27:54] New patchset: Demon; "Moving all non-pedias to 1.21wmf5" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35677 [20:30:16] mutante: your fix didn't take? [20:30:31] have any changes been made to deployment system? [20:30:42] New review: Demon; "PS2 removes commonswiki pending resolution of bug 42512." [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/35677 [20:30:43] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35677 [20:30:56] mutante is out .. notpeter [20:31:07] I'm looking at the deployment issue [20:31:34] wow, crap [20:31:40] so [20:31:46] no changes have been made to the hosts [20:31:48] The authenticity of host 'mw14 (10.0.11.14)' can't be established. [20:31:49] RSA key fingerprint is 2f:09:54:54:91:9b:10:4e:8f:d7:af:ad:6c:af:ce:34. [20:31:51] Are you sure you want to continue connecting (yes/no)? [20:31:52] Lots of these, for various hosts [20:31:57] Including hume and searchidx1001 [20:32:00] <^demon> Yes, this is what I've been complaining about. 
[20:32:07] And -- WTF -- fenari itself [20:32:09] and the apaches are all in /etc/ssh/known_hosts [20:36:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [20:39:15] ok [20:39:17] should work now [20:39:28] fenari's giving me a known good fingerprint... [20:39:31] ^demon: RoanKattouw robla [20:39:45] so [20:39:47] yeah, I tried ssh from fenari command line to spence, I know I've done this multiple times. and yet today I get the warning message. can't tell what's changed, see nothing in the config files nor puppet runs [20:39:52] what did you fix? [20:39:55] OK seems to be working now [20:39:59] /etc/ssh/known_hosts is generated by puppet [20:40:09] Except that logmsgbot seems to be borked [20:40:14] it does work now, you must have updated it [20:40:37] and puppet sometimes, just sometimes, doesn't set the perms correctly on huge-ass crazy generated files [20:40:38] like the spence shit [20:40:47] /etc/ssh/known_hosts turned out to be 600 [20:40:51] ah gee [20:40:53] which meant that it worked just fine for root [20:40:55] <^demon> Yay, no more key errors. [20:40:55] RoanKattouw: https://rt.wikimedia.org/Ticket/Display.html?id=3982 <-logmsgbot ticket [20:40:56] but no one else [20:41:04] Test [20:41:09] binasher, what else is needed to move GeoData forward? [20:41:10] !log Restarted logmsgbot [20:41:10] !log aaron synchronized php-1.21wmf5/extensions/SwiftCloudFiles 'deployed d50d754971991bb2362455fe47eaadb20aff3a80' [20:41:17] Logged the message, Mr. Obvious [20:41:21] notpeter: now if only someone could fix the searchidx1001 warning [20:41:23] xlnt! [20:41:26] Logged the message, Master [20:41:45] <^demon> snapshot1002 is giving permission errors. [20:41:56] <^demon> I've got no problems with searchidx1001 [20:41:59] ignore it [20:42:02] not set up yet [20:42:18] sorry but it's in the middle of 'move to precise' and there are unbuilt packages and any kind of thing going on [20:42:24] the rest should be ok [20:42:32] <^demon> k [20:45:13] sorry about that, all. also... yay puppet.... [20:45:50] yeah, another reason to love it [20:51:08] notpeter: no worries, thanks for figuring it out! [20:57:26] ^demon: I'm about to head into a meeting, but it's ok for the deployment to go past the appointed time [20:57:40] <^demon> Well we've already finished 1.21wmf5 [21:08:06] RobH, around? [21:08:12] Yep [21:09:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:09:52] RobH, how can I make Yttrium inaccessible from teh interwebs? is it something doable in puppet? (we don't need it accessible in its future Solr server role) [21:10:23] ideally we would reinstall its os after putting it from external to internal vlan [21:10:37] we can do that without reinstalling, but its a major pain in the ass and i dont wanna do it. [21:10:45] with puppet its extremely painful to do that shit without reinstall [21:10:56] MaxSem: can we reinstall it or does it need to stay online as is? 
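On the incident notpeter diagnosed above (puppet leaving the generated /etc/ssh/known_hosts at mode 600, readable by root only, so everyone else got authenticity prompts): pinning the mode explicitly in the manifest is the obvious belt-and-braces fix. A minimal sketch, not the actual change applied that day; the file's contents are assembled elsewhere.

    # Keep the global known_hosts world-readable so non-root users' ssh
    # clients can verify host keys; a bad run can then no longer leave it
    # at 0600. Content is generated/collected by other resources.
    file { '/etc/ssh/known_hosts':
        ensure => file,
        owner  => 'root',
        group  => 'root',
        mode   => '0644',
    }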
[21:11:11] either way we can move it from external to internal vlan no problem [21:11:21] but if no reinstall, im leaving the making the OS work to someone else ;] [21:13:23] !log kaldari synchronized php-1.21wmf5/extensions/UploadWizard/resources/mw.UploadWizardDetails.js 'fixing uplaodwizrd for ie8' [21:13:30] Logged the message, Master [21:21:11] RobH, it's currently accessible only as yttrium.wikimedia.org and I've already salvaged everything that needed to be salvaged so feel free to do whatever you want with it [21:25:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [21:26:51] New patchset: Demon; "Moving commonswiki back to 1.21wmf5" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35796 [21:27:38] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: Commonswiki back to 1.21wmf5 [21:27:45] Logged the message, Master [21:28:05] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35796 [21:29:37] Tim-away: ping [21:29:56] * preilly — is aware that it's 8:29am Thursday (EST)  [21:30:38] Tim-away: I'm keen to get https://gerrit.wikimedia.org/r/#/c/35345/ deployed [21:31:34] * AaronSchulz lols at the commit summary [21:31:58] AaronSchulz: what part of it? [21:32:26] the bit about soldiering on with useless results [21:33:07] AaronSchulz: ha ha yeah, "This apparently is not a serious enough problem for the daemon to fail to start, rather it just soldiers on, with empty result sets returned for queries that need RMI on the affected host." [21:35:10] On kaulen (Bugzilla), could somebody tell me the output of "ls /usr/share/bugzilla/extensions" (or wherever it is located) please? [21:35:57] Asking because I don't trust the SVN/Git repository that it's up-to-date when it comes to installed Bugzilla extensions. [21:37:04] <^demon> Needs a root. [21:37:11] <^demon> I can't access /srv/org/wikimedia/bugzilla/ [21:37:12] ah ha [21:37:55] root@kaulen:~# ls /srv/org/wikimedia/bugzilla/extensions/ [21:37:55] BmpConvert create.pl Example OldBugMove Sitemap Voting WeeklyReport Wikimedia [21:38:37] apergos, thanks! [21:39:21] ^demon: I have pinged you as a reviwer on https://gerrit.wikimedia.org/r/#/c/35799/ (need to build debian package for precise, those old deb files from 5 years ago are long since dead) [21:39:33] Hmm, I need to find out what Sitemap does. It's not in Git. [21:39:38] if you are not a good person, feel free to point me to someone else [21:39:42] andre__: sure thing [21:40:00] ah. Sitemap = Google indexing. [21:40:47] <^demon> apergos: So this change is just nuking the crappy build files? [21:41:06] yep and taking out the related cruft form the Makefile [21:41:38] I left the rpm stuff in though heh [21:41:52] <^demon> merged. [21:41:56] thanks [21:42:04] <^demon> yw [21:44:28] yeah andre__ I think it's this: http://bzr.mozilla.org/bugzilla/extensions/sitemap/trunk/files [21:44:47] hmm. I thought it's this: https://code.google.com/p/bugzilla-sitemap/ [21:44:49] (also here http://code.google.com/p/bugzilla-sitemap/) [21:44:55] same thing :-D [21:44:55] ah. phou :) [21:45:24] okay. I don't mind if that breaks. (I was asking all this only to get an idea about risks when upgrading Bugzilla) [21:45:32] ok [21:45:32] plus we cannot find out directly anyway :) [21:46:10] should be exciting :-D [21:55:39] MaxSem: Sorry, middle of a few things [21:55:43] thats good news then [21:55:50] apergos: yeah, I'm lovin' the planning already! 
;) [21:55:53] yes we can move it easily from external to internal and we prefer it! [21:56:05] MaxSem: i'll drop a network ticket for it later today and once thats done we can reinstall [21:56:12] thanks =] [21:56:20] RobH, thanks! [21:56:24] (always good to reclaim ipv4 addresses when we can ;) [21:57:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:58:02] MaxSem: On that note (mobile solr) [21:58:08] i just ordered your new cluster [21:58:29] looks like the estimated ship date is 12/19 [21:58:42] so in time for christmas, though they may not be racked by then ;] [21:58:55] RobH, cool! [21:59:02] tfinc: ^ [21:59:26] now that i did that, ill make that ticket for ya [21:59:33] :) [22:00:37] MaxSem: So yttrium is in nagios for monitoring on the external ip [22:00:53] im going to put it in the decommissioning.pp file so it parses out of nagios over the next 24 hours [22:00:59] !log disabled page counters on labsconsole [22:01:06] Logged the message, Master [22:01:12] MaxSem: we can manually pull it but its a pita, so i rather let it auto pull [22:01:26] but that means no nagios monitoring on it until we put it back in tomorrow, that ok? [22:01:34] (i assume yes but i wanted to let you know) [22:02:00] yes, it's unused right now [22:03:33] awesome [22:08:42] cmjohnson1: i have to update decomissioning.pp [22:08:46] want me to remove db42? [22:08:50] your note says 11/27 [22:08:59] yes please [22:09:03] thx [22:09:06] notpeter: so we have all the mw servers in here [22:09:11] from when we turned them off before i assume [22:09:19] should i pull them so as they add back we monitor ok? [22:09:29] (decomissioning.pp) [22:11:24] New patchset: RobH; "yttrium moving from external to internal vlan" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35806 [22:11:37] cmjohnson1: in an effort to do as i say when i do [22:11:45] wanna review that for me? i know you cannot +2 it [22:11:53] k' [22:11:55] but you can +1 which is good enough, if you think its fine its fine, and i can commit it [22:11:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.264 seconds [22:13:01] RobH: ? [22:13:05] we shouldn't have to do anything to them [22:13:14] robh: looks good to me [22:13:16] well, they are in the decommissioning.pp file [22:13:23] so when they are put back into service, they wont stay in nagios [22:13:30] New patchset: Andrew Bogott; "Replace keep_up_to_date with ensure present/latest." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35293 [22:14:11] robh: i am going to put labsdb1001 /1002 in 10.64.20.0/24 - labs-hosts1-b-eqiad [22:14:32] New review: Andrew Bogott; "Hm... the wikidata class still doesn't work, maybe because I'm doing something wrong?" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/35293 [22:14:37] yep [22:14:50] k [22:14:53] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35806 [22:16:09] New review: Andrew Bogott; "You should move it into mediawiki.pp and rearrange the code so that it invokes git::clone. Catch me..." 
[operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/35173 [22:18:02] preilly: ok, but note that the bug will hit again when the package is upgraded, since the stop action from the old init script will be used [22:18:10] !log aaron synchronized php-1.21wmf4/extensions/SwiftCloudFiles [22:18:17] Logged the message, Master [22:18:25] oh [22:18:30] I didn't realise they're in there [22:18:31] ^demon: you said searchidx1001 worked for you? [22:18:36] sure, take them out, that's smrt [22:27:50] is it possible to use ensure=>present for lucene-search-2 instead of ensure=>latest? [22:28:05] ensure=>latest is a very scary way to do a software upgrade [22:31:24] TimStarling: sure [22:31:28] i don't believe there is anyone seriously in charge of lucene so i would say yes [22:31:34] as less insanity == better [22:31:42] indeed :) [22:31:54] patrick and asher and I have been doing some work on it [22:31:58] TimStarling: I'm all for everything you do to make our lucene setup better :) [22:32:18] [21:30] Tim-away: I'm keen to get https://gerrit.wikimedia.org/r/#/c/35345/ deployed [22:32:41] that is a lucene-search-2 change that I committed, I'm going to deploy it [22:33:57] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: Rolling wikiversions.dat back to de90c27d, undoing todays deploy [22:34:04] Logged the message, Master [22:39:29] paravoid: Unexpected response (): (curl error: 6) Couldn't resolve host 'ms-fe.pmtpa.wmnet' [22:39:39] I wonder if just using the IP would help [22:39:53] that's how it is for the dbs [22:40:36] New patchset: Tim Starling; "lucene-search-2 "present" instead of "latest"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35813 [22:41:48] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: metawiki to wmf5 [22:41:55] Logged the message, Master [22:42:34] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35813 [22:45:03] anyone has any idea about how many row we have in the parsercache ? [22:45:21] !log adding DNS entry labsdb1001/1002 .eqiad.wmnet [22:45:28] Logged the message, Master [22:46:06] !log running authdns update [22:46:13] Logged the message, Master [22:46:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:47:52] hashar: based on the bad estimates of show table status and eyeballing, I would guess somewhere around 250million [22:47:55] but tim could probably give a better answer [22:48:04] that is good enough thanks notpeter :-] [22:48:34] notpeter: does it gives the table size on disk ? [22:48:34] if you mean now, then I would have just done a show table status myself [22:48:51] just wondering, don't spend anytime on that request :-] [22:48:56] if you mean generally, well the parser cache goes through cycles of using up all disk space, crashing, being deleted and then starting again [22:49:04] ahah [22:49:07] heh [22:49:10] so it depends on how long since the last total failure [22:49:20] paravoid: can MW just hardcode 10.2.1.27? [22:49:41] TimStarling: I should relocate to AUS to enjoy your sense of humor :-] [22:49:45] TimStarling: speaking of which! what do you think of adding db sharding support? [22:50:25] depends on what you mean by sharding [22:50:33] also are we going to replace memcached with redis ? 
;) [22:50:44] no [22:50:53] you know I was ranting to domas, apparently "sharding" just means "distributing" now [22:51:05] thanks asher :-) [22:51:06] s/sharding/hashing [22:51:11] it used to mean splitting the rows of a table over multiple servers [22:51:40] now it means distributing, e.g. "I sharded some milk over my breakfast cereal in order to get it uniformly wet" [22:51:48] hahaha [22:51:51] lol [22:52:00] TimStarling: that's very fine grained sharding [22:52:10] i dunno, looks like you had a milk hotspot [22:52:41] so, if you mean the traditional sense, I would want to know what table [22:52:57] and if you mean the modern sense, I would just want to know what it is you want to distribute [22:54:00] TimStarling: I think he means the bagostuff sql tables [22:54:23] we currently write to 256 tables on one server based on the modulo of a hash of the key name [22:54:38] i want to spread that out over three servers [22:54:43] New review: awjrichards; "req.http.User-Agent ~ "MSIE (8|9|1\d)\." should be sufficient. UA's with IEMobile in them should wil..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/35298 [22:54:50] yes, we can have that [22:55:14] maybe we should have a class for doing that in a generic way, wrapping multiple BagOStuff instances [22:55:21] similar to the multiwrite class [22:55:45] hrm, what else would shard? [22:55:57] milk on my cheerios [22:56:02] I mean memcached and redis are already sharded properly [22:56:22] they only reason to do it manually it when you are using something kind of hacky [22:56:34] like mysql for caching ;) [22:57:05] *shrug* we can add it to SqlBagOStuff I guess [22:57:42] the configuration parameter is "server", we can add another called "servers" [22:58:04] * TimStarling shoots puppet [22:58:50] TimStarling: and use composition with RDBStore! [22:58:55] * AaronSchulz likes to push tim's buttons [22:58:59] yay RDBStore [22:59:10] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: Moving all non-pedias back to wmf5 [22:59:17] Logged the message, Master [23:00:30] aren't puppet runs meant to be staggered? [23:00:46] so puppet has to be drunk first? [23:01:42] there's 562 connections to stafford right now [23:03:12] normal apparently: http://ganglia.wikimedia.org/latest/graph_all_periods.php?h=stafford.pmtpa.wmnet&m=cpu_report&r=day&s=by%20name&hc=4&mc=2&st=1354143743&g=cpu_report&z=large&c=Miscellaneous%20pmtpa [23:04:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.014 seconds [23:06:06] splay [23:08:23] funny how there was a few hours when the splay seemed to be working [23:09:33] TimStarling: paravoid has a good explanation for this [23:10:45] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [23:11:18] I would be interested to hear that [23:17:00] New patchset: Aaron Schulz; "Enabled retries for jobs that fail to be acknowledged as done." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/35818 [23:19:56] how is it possible for select() to block on a pipe that is reading from a zombie process? [23:20:10] as is the case for many many puppet processes stuck reading from apt-get zombies? 
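Before Tim's select() walkthrough continues below, a note on the ensure switch he merged above (change 35813): ensure => latest upgrades the package on whichever puppet run first notices a newer version in the repository, a scary way to roll out a search upgrade, while ensure => present only guarantees installation and leaves upgrades as a deliberate operator step. A sketch of the idea, not the literal diff:

    # Install once; upgrades happen deliberately (e.g. via apt, host by
    # host) instead of fleet-wide whenever a new package hits the repo.
    package { 'lucene-search-2':
        ensure => present,   # was: latest
    }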
[23:21:07] "man 7 pipe" says that if the write end is closed, the read end will see an end-of-file [23:21:51] and "man select" says "Those listed in readfds will be watched to see if characters become available for reading (more precisely, to see if a read will not block; in particular, a file descriptor is also ready on end-of-file)" [23:22:29] so you would think that a select(4, [3], [], [], {9, 828134}) would not block when 3 is a half-closed pipe [23:22:31] but it does [23:24:47] ahhh, except that it is not half-closed [23:25:12] puppet must call pipe() and then neglect to close the write end in the parent [23:25:33] so when apt-get exits, the pipe stays fully open [23:29:48] !log kaldari synchronized php-1.21wmf5/extensions/Vector/modules/ext.vector.collapsibleNav.css 'adding forward-compat to Vector for bug 42452' [23:29:55] Logged the message, Master [23:30:27] !log kaldari synchronized php-1.21wmf5/skins/modern/main.css 'updating modern skin' [23:30:34] Logged the message, Master [23:30:55] !log kaldari synchronized php-1.21wmf5/skins/vector/screen.css 'updating vector skin' [23:31:01] Logged the message, Master [23:31:39] TimStarling: maybe the ignoreDuplicates() check for the db job queue should be a select before the insert to reduce writes [23:32:39] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours [23:33:12] yes, that should make it a bit faster [23:36:05] I mean it already has small race windows anyway, so it not much is lost there [23:36:55] the DELETE is missing a job_token = '' condition also, which would be a race if two runners to two identical jobs and crashed and claimTTL was on (they wouldn't get retried)...not likely but still amusing [23:37:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:45:40] so it seems select() on a self-pipe is just the normal way for puppet to spend its time [23:45:49] when it has nothing better to do [23:48:25] TimStarling: there's a joke to be made there, but I'm not going to make it. [23:48:49] puppet gets bored and hits the self pipe.. sounds like a root [23:48:50] I'm sure it would have been very funny [23:49:14] there we go [23:53:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [23:58:47] * AaronSchulz feeds puppet some graham crackers