[00:37:15] !log Gerrit has once again closed the Zuul socket, ssh event stream down [00:41:29] !log wake up morebots [00:41:37] now what [00:41:37] Logged the message, Master [00:41:42] !log Gerrit has once again closed the Zuul socket, ssh event stream down [00:41:49] Logged the message, Master [02:01:29] !log LocalisationUpdate completed (1.22wmf5) at Mon Jun 10 02:01:29 UTC 2013 [02:01:44] Logged the message, Master [02:02:14] !log LocalisationUpdate completed (1.22wmf6) at Mon Jun 10 02:02:14 UTC 2013 [02:02:22] Logged the message, Master [02:06:50] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jun 10 02:06:49 UTC 2013 [02:07:02] Logged the message, Master [02:25:14] !updated Parsoid to 956117df0e [02:35:09] gwicke: You want "!log" first. :-) [02:35:33] oops, good point [02:35:50] those bots should simply be come a bit more intelligent ;) [02:35:58] !log updated Parsoid to 956117df0e [02:36:06] Logged the message, Master [03:55:24] New patchset: Krinkle; "wgRC2UDPPrefix: Use hostname-".org" instead of lang.site" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47307 [07:33:35] !log updated Board Election translations on cluster [07:33:44] Logged the message, Master [08:09:52] !log nikerabbit synchronized php-1.22wmf6/extensions/UniversalLanguageSelector/ 'ULS to master' [08:10:02] Logged the message, Master [08:25:59] !log nikerabbit synchronized php-1.22wmf5/extensions/UniversalLanguageSelector/ 'ULS to master' [08:26:05] Logged the message, Master [08:44:45] Nikerabbit: so is gerrit stuck again ? [08:44:59] hashar: yep [08:45:29] New review: Hashar; "ping" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66906 [08:45:34] indeed :/ [08:45:51] apergos: good morning! would you mind restarting the gerrit service on manganese pleas ? [08:46:01] mrning [08:46:02] ok [08:46:10] apergos: it got stuck over and over for the last 3 or 4 days and restating it is the only way to restore the service unfortunately [08:46:17] I thought demon got it [08:46:29] it crashes again after a few hours :( [08:46:51] ah so it's dead again after last night, wow [08:47:25] we will talk about it with Chad when he connects 'in roughly 3 hours') [08:48:50] !Log restarted gerrit again, poor thing [08:49:00] Logged the message, Master [08:49:04] good bot! [08:49:07] New review: Hashar; "ping" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66906 [08:49:11] apergos: thanks :) [08:49:15] yw [08:49:17] Nikerabbit: solved for now [08:49:34] hashar: magnificenta [09:32:36] New review: Faidon; "I'm not sure I understand why a low hit rate results in PHP fatal errors. Do you know why this happens?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67551 [11:32:14] New review: Mark Bergsma; "Perhaps it makes sense to keep APC enabled, but clear the cache on every build (if that's possible)?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67551 [12:33:43] Gerrit is down. [12:33:53] "The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later." [12:34:28] <^demon> I had to restart it so zuul would pick up again [12:34:44] back up [12:34:56] ^demon: *nod* :( Twice already today. [12:35:01] <^demon> I know. [12:42:35] New review: Hashar; "ping" [operations/puppet/zookeeper] (master) - https://gerrit.wikimedia.org/r/66906 [12:47:07] New review: Siebrand; "manybubbles: Where are the settings you speak of found?" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/67252 [12:49:37] New review: Faidon; "(my previous review on PS1 still stands in its entirety)" [operations/debs/kafka] (master) C: -1; - https://gerrit.wikimedia.org/r/67442 [12:52:08] New review: Manybubbles; "I'm not really sure. Sometimes Java properties are burried in the chain of shell scripts used to ex..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67252 [12:53:31] New review: Manybubbles; "To be clear not setting the -Xmx parameter won't be a problem now, it'll be a problem if we move to ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67252 [12:54:20] New review: Faidon; "Hm, debian/patches seem to be iterations over the same (Make)files, it probably would be more readab..." [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/67442 [12:58:41] New patchset: Petrb; "improved sql script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67826 [13:02:23] New review: Siebrand; "If it's not blocking now, please don't block this patch set..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67252 [13:04:47] New review: Manybubbles; "Sorry, I'm not used to how we work." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/67252 [13:05:13] * siebrand grins at manybubbles  [13:05:38] New patchset: Faidon; "Varnish radosgw: only shard certain containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67827 [13:06:23] siebrand: so if this were my last job I wouldn't have let that go to production because it'll bite us eventually and we'd never have fixed it. Is the right thing to do here to file another bug about _maybe_ not setting max memory? [13:09:31] hashar, does integration-zuul-layoutdiff return non-zero if there's any diff at all? [13:10:47] New review: Faidon; "No, I'd rather not risk it. Please investigate what Manybubbles suggests and come back with an adjus..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/67252 [13:11:22] manybubbles: I don't know what the right thing is, but Id really like to know. I need a problem fixed that we have, and I'd like to get that change in. [13:11:35] siebrand: then find out what the right thing is :) [13:11:49] manybubbles: If there is more, I need to be able to ask someone what the right change is, so who can I ask? [13:12:00] siebrand: fair enough. can we figure out what the command running java is? [13:12:02] andrewbogott: probably :-) [13:12:12] manybubbles: commenting on patchsets explaining why something is a bad idea, is perfectly fine and *is* how we work [13:12:18] indeed [13:12:20] hashar, ok then… how does https://gerrit.wikimedia.org/r/#/c/67462/ look? [13:12:30] manybubbles: I do not know. I have no access to the machines, I do no know ops procedures. [13:12:32] andrewbogott: the build status is irrelevant, that is merely a convenience to easily diff the Zuul interpretation of the config [13:12:35] manybubbles, and people not liking it and complaining, also :) [13:12:51] hashar: It took me surprisingly long to figure that out :) [13:13:24] andrewbogott: sorry :( [13:13:29] siebrand: so we need that. flying blind is silly. It could be just fine but we can't find where the config is set. that has happened to me a few times. 
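When, as manybubbles says above, the Java settings are buried in a chain of shell scripts, the running process itself is the authoritative source. A minimal sketch, assuming only that the service is launched via jsvc (as the command line pasted later for vanadium shows); nothing else here is taken from the actual hosts:

```bash
# Print the live JVM arguments one per line. /proc/<pid>/cmdline is
# NUL-separated, so this sidesteps any shell quoting done by the
# layers of init scripts.
pid=$(pgrep -f jsvc | head -n1)
tr '\0' '\n' < "/proc/${pid}/cmdline" | grep -E '^-(X|D)'
```

If no -Xmx shows up, the JVM is running on its built-in defaults, which is exactly what the reviewers go on to discover below.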
[13:13:47] hashar, not your fault [13:13:48] siebrand: and if it is set then we shouldn't set it twice because then we'd just be confusing [13:13:56] manybubbles: okay, I'll add ops as a reviewer and ask who knows. [13:14:20] siebrand: sounds good [13:14:46] I already became a "reviewer" when I added my comment [13:15:05] hashar, anyway… does my regexp look OK? That test should pass now, and I want to make pep8 mandatory before we backslide :) [13:15:29] New review: Siebrand; "Faidon: I really have no idea, because you operate the machines, and they don't do what they need to..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67252 [13:15:32] andrewbogott: commented at https://gerrit.wikimedia.org/r/#/c/67462/13/layout.yaml,unified :D [13:15:38] thanks [13:16:06] andrewbogott: Imeant, the patchset 2 should be fine https://gerrit.wikimedia.org/r/#/c/67462/2/layout.yaml,unified [13:16:15] ok -- /me simplifies [13:16:25] andrewbogott: which is the run at https://integration.wikimedia.org/ci/job/integration-zuul-layoutdiff/128/console [13:16:25] uhm [13:16:30] so ttmserver runs on... vanadium!? [13:16:46] oh my... [13:16:54] yes, so here's how it works [13:17:08] some random group asks for a server to do whatever on, doesn't need any ops support whatsoever [13:17:14] so after some hesitation we give that [13:17:22] and then all kinds of services start appearing on that machine, since, you know, there are no barriers [13:17:27] and then suddenly we need to support that [13:18:01] mark: like when you added etherpad for testing? :D [13:18:14] different, but sure [13:18:32] of course noone in ops knows much about that solr install [13:18:37] much? [13:18:38] I'm sure LangEng and translators would have liked to be considered worth a dedicated machine [13:18:56] since we had nothing to do with it, and we're just planning to assign someone to do a support solr (re)install [13:18:57] and also about someone else taking care of setting solr up [13:19:18] Nemo_bis: afaik faidon helped with the packages. [13:19:24] what packages? [13:19:32] Nemo_bis: for solr/jetty [13:19:36] don't think so [13:19:50] okay then. Well, someone did... [13:20:09] so anyway [13:20:58] there's a jetty running on vanadium [13:22:07] jsvc.exec -user jetty -cp /usr/share/java/commons-daemon.jar:/usr/share/jetty/start.jar:/usr/share/jetty/start-daemon.jar:/usr/lib/jvm/default-java/lib/tools.jar -outfile /var/log/jetty/out.log -errfile /var/log/jetty/out.log -pidfile /var/run/jetty.pid -XX:+UseConcMarkSweepGC -Djava.io.tmpdir=/var/cache/jetty/data -Djava.library.path=/usr/lib -DSTART=/etc/jetty/start.config -Djetty.home=/usr/share/jetty -Djetty.logs=/var/log/jetty -Djetty. [13:22:21] manybubbles: that's the command-line, although I doubt it helps [13:22:36] PERFECT! [13:23:19] so we _don't_ set the -Xmx or -Xms parameters meaning we default the to something the JVM figures out based on ram size. [13:23:35] probably? [13:23:38] except that there are other services on the machine [13:24:52] paravoid: actually you did something to provide solr-3.6.0 package for us ;) [13:25:02] maybe? [13:25:15] backporting a package is very different than actually knowing what it's used for :) [13:25:47] be careful when you help someone faidon, someone might even mistake that for complete support and approval of everything they're doing ;) [13:26:22] mark: I don't think you're being helpful, mark. Cynicism won't solve a problem, here. [13:26:26] hashar, ok, restored version 2. 
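As manybubbles notes in the exchange above, with no -Xmx/-Xms the JVM picks "ergonomic" defaults from the machine's hardware; on 64-bit server-class boxes of this era that is roughly a quarter of physical RAM, which is where the 16G figure for the 64G solr hosts comes from later in the log. A quick way to see what a given box would choose (the flag is real HotSpot; the printed values are machine-dependent):

```bash
# Dump the final, ergonomics-adjusted flag values and pick out the
# heap bounds; sizes are reported in bytes.
java -XX:+PrintFlagsFinal -version 2>/dev/null \
  | grep -Ei 'maxheapsize|initialheapsize'
```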
[13:26:41] indeed it won't [13:26:44] it makes me feel better though [13:26:47] anyways [13:27:01] as for ULS... can we have some engineer working on ULS in next monday's engineering/ops coordination meeting? [13:27:17] paravoid: would you mind telling me how much memory that process is using just now? that'll give me a sane default to recommend for the parameters [13:27:27] mark: Would you mind sending a mail to localisation-team@wikimedia.org with what you need? [13:27:50] mark: We have sprint planning tomorrow, and so we can reply by Wed morning. [13:28:05] what I need for the meeting you mean? [13:28:25] mark: You're asking for a resource for something. I'm asking you do drop us a mail with that request, yes. [13:28:30] manybubbles: virt 3.3G, rss ~1GB [13:28:45] i'm not asking for a resource [13:28:51] paravoid: perfect - it is likely using about a 1G heap [13:28:52] siebrand: I think the point is that you are asking for a resource... :-) [13:29:06] we have a biweekly meeting to discuss things that require coordination between engineering groups and ops [13:29:35] <^demon> mark: When's the next one for that? [13:29:39] monday [13:29:40] next monday [13:29:42] paravoid: All communication I've been having up to now is not about ULS. mark brings up ULS, and I asked him to send a mail about that. [13:30:16] you will have a mail about that, it'll be my reply to erik [13:30:21] i'll make sure to cc language-team [13:30:33] mark: Ah, that… Great. [13:30:44] manybubbles: feel free to be bold and submit patchsets, you're probably more qualified than most of us and people who've previously done solr work [13:30:48] New patchset: Hashar; "jenkins validation of pep8" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67830 [13:31:05] * Nikerabbit rolls eyes [13:31:33] New review: Manybubbles; "paravoid was kind enough to post the command line that all the shell scripts collapse into in IRC:" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/67252 [13:31:56] I'll submit the patch [13:32:05] perfect [13:32:08] we have another solr setup [13:32:28] <^demon> Huh? [13:32:54] the geodata one [13:33:05] that's 9.5G virt, 1.8G rss [13:33:11] the jetty settings are likely the same [13:33:33] that's from solr1001, let me do a quick check on the other boxes too [13:33:39] andrewbogott_afk: deployed :D [13:33:54] 10107768 1926384 [13:34:08] 9843704 1841500 [13:34:34] Change abandoned: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67830 [13:34:36] manybubbles: so we either need a superset of defaults, or the puppet class should be parameterized to allow for tuning from the role class [13:35:01] or, I don't know, maybe we don't need multiple solr installations all over the place on random boxes *g* [13:35:25] <^demon> We need clouds, obviously ;-) [13:35:40] obviously [13:35:46] we have Labs [13:35:51] in near term we need something that works [13:36:05] be careful what you wish for: http://www.theregister.co.uk/2013/06/08/facebook_cloud_versus_cloud/ [13:36:06] paravoid: 1G is a pretty sane default anyway. [13:36:30] "I got a call, 'Jay, there's a cloud in the data center'," Parikh says. "'What do you mean, outside?'. 'No, inside'." [13:36:33] There was panic. [13:36:36] "It was raining in the datacenter," he explains. [13:37:48] paravoid: to be clear, you don't want me to hard set the default to 1G, you'd prefer I let that be configurable? 
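On the question just above — hard-coding 1G versus making it configurable — the usual shape is a puppet-templated defaults file. A sketch of what the rendered result might look like, assuming the Debian/Ubuntu jetty packaging, whose init script sources /etc/default/jetty and passes JAVA_OPTIONS to the JVM; the sizes are illustrative, not the real production values:

```bash
# /etc/default/jetty (as a puppet template would render it)
# Pinning -Xms to -Xmx avoids heap-resize pauses on a long-lived
# service; the collector flag mirrors the jsvc command line pasted
# earlier in the log.
JAVA_OPTIONS="-Xms1G -Xmx1G -XX:+UseConcMarkSweepGC ${JAVA_OPTIONS}"
```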
[13:37:49] manybubbles: so this extra gb rss that those solr instances use is just garbage that needs to be collected? [13:38:19] I'm just saying that we have another set of solr servers that seem to exceed 1G rss [13:39:11] ah ha [13:39:20] let me read go make sure I'm not confused [13:39:22] I pasted numbers above [13:40:17] there's two role classes related to solr, role::solr::geodata and role::solr::ttm [13:40:30] role::solr::ttm is vanadium [13:40:51] the geodata ones are -confusingly enough- boxes named solr1001.eqiad.wmnet etc. [13:41:00] MaxSem: ping. [13:41:07] pong [13:41:24] paravoid: by rss you mean the resident memory, right? [13:41:27] yes [13:42:20] New review: Hashar; "Ohh. Maybe we should just enable instant commons everywhere." [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/67407 [13:42:31] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67407 [13:42:53] New patchset: Siebrand; "Increase ramBufferSizeMB from 32 to 100 and set Xmx/Xms" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67252 [13:43:21] MaxSem: There was talk about the geo search. Thought I'd poke you.... [13:45:01] New review: MaxSem; "Can we make the total memory settings customizable? The solr* boxes have 64 gigs RAM total, it is co..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/67252 [13:45:30] they have 64G of ram of which they use 2GB [13:45:34] apergos: looks like you fixed up wikilove / instant commons :) [13:45:35] awesome capacity planning [13:45:40] paravoid: so the resident memory is how much the JVM has allocated for itself. My guess is the virtual memory has to do with reading and writing so many files. The 1926384 values mean that machine has _about_ 2G of heap usage. Well, more like 1.8G because there is other junk going on in there. [13:45:40] aww [13:46:00] paravoid: they actually do use quite a bit of page cache [13:46:02] not yet, instant commons is working but the fix for wikilove is awaiting someone to +2 it [13:46:23] also I didn't redo the media infrastructure yet, I am still in the documenting phase for that [13:46:36] I'll send you some pics soon [13:46:43] so would it be possible to get dedicated machine for ttmserver solr, perhaps access to it? [13:47:01] would it be possible to merge the two solr installations? [13:47:08] hashar: can't see the commit because gerrit's down, but fyi :p https://www.mediawiki.org/wiki/Talk:InstantCommons#Enable_by_default [13:47:37] i.e. merge ttmserver into solr100Ns [13:47:41] in any case, if all the machines running solr have like 64Gb of ram the JVM is defaulting the maximum number of ram to 16G, it just hasn't need it. [13:47:50] fyi, Mem: 64354 17564 46790 0 1593 8565 [13:47:55] as far as I know it is not straightforward with solr 3.6, but someone correct me [13:48:02] (overprovisioned) [13:48:04] Nemo_bis: no way we are going to make instant commons enabled in core by default :D [13:48:30] hashar: because of what Chris said or something else? but yes, something smarter would be needed [13:48:34] I've seen Solr use up to ~8G in other places, but I have no experience with instances using more. [13:48:40] paravoid, AFAIK these boxes were intentionally purchased in configuration matching the lucene boxes [13:49:02] I'm guessing from the name that noone expected them to be doing just geodata [13:49:08] wow, it isn't even getting close to using that much page cache. 
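For reference, the virt/rss pairs pasted above (e.g. "10107768 1926384", both in KiB) and the free output are the standard ps and free views of the same machines; only the jsvc process name is assumed here:

```bash
ps -C jsvc -o pid=,vsz=,rss=,args=   # virtual vs resident size per process
free -m                              # the "cached" column is page cache:
                                     # reclaimable, not owned by the JVM
```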
[13:49:11] <^demon> Nemo_bis: I don't see the reason to enable it by default either. It's already configurable in the installer. Since it makes external requests, an admin should have to explicitly turn it on. [13:49:14] who knows, maybe this will soon be the case [13:49:15] the difficulty of running multiple cores is one of the reasons ^demon has been proposing the cloud stuff I think [13:49:44] in any case, setting the max memory to something massive won't hurt us then. 4G would be conservative. [13:50:06] <^demon> Nikerabbit: Well, that's part of the reason, but the main reason is the ability to scale out without thinking about it & not having a SPOF in a master -> slave setup. [13:50:13] <^demon> Easier core management is a nice ++ [13:50:15] Nemo_bis: replied there (I will most probably not follow up though) [13:50:17] multiple cores isn't too bad but you have to think about it and write your queries using it. solr cloud stuff all has to by definition though. [13:50:23] hashar: aw, thanks :) [13:50:32] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67827 [13:50:34] ^demon, not nice but a friggin huge ++++++++++++++++++++++++++ [13:50:36] ^demon: ok, I'll check better how the installer looks like [13:50:47] solr cloud also lets us shard so we don't need 64Gb of ram everywhere [13:52:00] proposal: give solr 8 gigs of ram on the machines we have. That is twice what we've seen it use and it won't hurt with speed. [13:52:10] New patchset: Faidon; "Swift: get rid of test setup configs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67831 [13:52:16] sorry, 4 gigs of ram is twice what we've seen it use. [13:52:21] manybubbles, and like 1 gig on vanadium [13:52:44] ^demon, how far is solr 4? [13:52:46] MaxSem: is vanadium still a 64 gig ram box? [13:52:57] vanadium has 8GB and runs a ton of other things [13:52:58] MaxSem: not far, really. [13:53:05] no, 8 G and shared with EvenLogging collector [13:53:18] New review: Mark Bergsma; "+2+2+2+2+2+2+2..." [operations/puppet] (production); V: 2 - https://gerrit.wikimedia.org/r/67831 [13:53:22] paravoid: k. Then it should have 1G if it can afford it. [13:53:34] manybubbles: btw, ganglia.wikimedia.org can help you answer questions like that even without having access [13:53:45] heh - I forgot [13:53:58] wasn't sure if you knew already :) [13:54:26] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67831 [13:55:59] making puppet changes now [13:59:36] <^demon> MaxSem: Like manybubbles said, not that far off realistically. In theory, we could possibly start moving you guys over to it now(ish) if we redid the existing Solr boxes with Solr4. [13:59:59] <^demon> Something has to be the guinea pig :) [14:00:12] mmm, should I start working on Solr 4 schema? [14:00:39] ^demon: so long as they understand multicore then solr cloud would work. [14:00:51] <^demon> yup yup [14:01:38] ok. is there a sane package already? [14:02:09] <^demon> Nope, ubuntu only has 3.x :( [14:02:41] then I'll wait [14:04:16] ttmserver works fine with solr4 [14:04:44] <^demon> Yeah, this isn't like a "let's do it next week" thing...but definitely something to keep in mind. [14:05:57] <^demon> I hate freenode. [14:06:28] New patchset: Ottomata; "Installing OpenJDK Java 7 instead of Sun/Oracle Java 6 on newly reinstalled analytics nodes." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/67832 [14:07:17] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67832 [14:07:19] ottomata: \o/ [14:07:41] pretty [14:07:46] :) [14:09:24] User<|title == otto|> { groups +> [ "stats" ] } [14:09:33] lovely syntax [14:09:40] hah, yeah [14:09:54] was the best way I could think to do that (that was a while ago, this commit only changed tabs to spaces there) [14:10:47] New patchset: Ottomata; "Prepping analytics1020 for reinstall." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67834 [14:10:56] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67834 [14:13:17] New patchset: Faidon; "Swift rewrite.py: get rid of shard_containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67835 [14:13:42] siebrand: sorry this is taking so long, because this has to be configurable I having to spin up a new local puppetmaster machine so I can test it [14:14:06] manybubbles: No problem. In all honesty, this is what I consider quick. [14:14:31] hashar: pep8 failed? [14:14:46] paravoid: in ops/puppet ? [14:15:10] paravoid: andrew boggott has worked on linting all the .py files so we are now enforcing pep8 *grin* [14:15:15] I have enabled it a couple hours ago [14:15:38] paravoid: dashboard : https://integration.wikimedia.org/ci/job/operations-puppet-pep8/3640/violations/? [14:15:45] New patchset: Faidon; "Swift rewrite.py: get rid of shard_containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67835 [14:15:48] paravoid: all the issues are reported on https://integration.wikimedia.org/ci/job/operations-puppet-pep8/3640/violations/file/files/swift/SwiftMedia/wmf/rewrite.py/? [14:15:56] manybubbles: there used to be a labs instance for ttmserver but it must be outdated by now if it still exists [14:16:24] not bad [14:16:26] surprisingly the pep8 dashboards works fine [14:16:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67835 [14:20:34] <^demon> Nikerabbit: Is there any chance you could take a look at https://gerrit.wikimedia.org/r/#/c/67531/? I tested this locally and it seems to work. [14:22:17] <^demon> paravoid: https://gerrit.wikimedia.org/r/#/c/67642/ is a one-line fix for gitblit. [14:23:02] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67642 [14:23:33] The page you requested was not found, or you do not have permission to view this page. [14:23:43] ^demon: are you doing antimony or should I? [14:23:50] <^demon> I'm already logged in, can do it. [14:23:59] thanks [14:24:06] <^demon> Nikerabbit: Did the ? maybe get appended by your client? [14:24:09] yes [14:24:32] ^demon: looks good, but I don't have an easy way to test, should I just +2? [14:26:08] <^demon> Nikerabbit: I installed LU locally and ran its maintenance script. [14:26:16] <^demon> Seemed to work ok. [14:26:29] off for shoppin [14:26:30] g [14:45:43] paravoid, i think I can't do openjdk right now [14:45:46] http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Release-Notes/cdh4rn_topic_2_2.html [14:45:58] • MRv2 (YARN) is not supported on JDK 7 at present, because of https://issues.apache.org/jira/browse/MAPREDUCE-2264. This problem is expected to be fixed in an upcoming release. 
[14:46:03] also from Snaps: [14:46:49] i had thought I'd seen that openjdk 7 is supported, but its not really [14:46:49] 10:46 [14:46:50] ottomata [14:46:51] people try it, but then report problems, and it is not recommended [14:46:53] 10:46 [14:46:54] ottomata [14:46:55] oop [14:47:59] also, we have had the exact same discussion about openJDK vs Oracle back in August and we agreed we would migrate as soon as CDH would support it [14:49:30] New patchset: Ottomata; "Reverting change to install OpenJDK 7 on analytics nodes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67837 [14:50:00] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67837 [14:57:03] New review: Ottomata; "> CLASSPATH doesn't really belong in default, looks like init script material." [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/67442 [14:58:20] New patchset: Ottomata; "First version debian kafka package" [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/67442 [15:05:12] anyone know why nothing ever PXE boots for me? :/ [15:05:17] is it me? [15:05:27] maybe brewster doesn't like the way I smell? [15:06:02] notpeter you up for helping? :) [15:06:17] ('up' here means awake and willing) [15:10:00] sigh [15:10:12] how about openjdk 6? [15:10:42] sun java 6 is EOLed and with known security bugs [15:15:53] paravoid: [15:15:54] Note*: OpenJDK6 has some open bugs w.r.t handling of generics (https://bugs.launchpad.net/ubuntu/+source/openjdk-6/+bug/611284, https://bugs.launchpad.net/ubuntu/+source/openjdk-6/+bug/716959), so OpenJDK cannot be used to compile hadoop mapreduce code in branch-0.23 and beyond, please use other JDKs. [15:25:42] New patchset: Manybubbles; "Increase ramBufferSizeMB from 32 to 100 and make maximum heap size configurable." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67252 [15:29:21] ottomata: grumble grumbele [15:29:36] New review: MaxSem; "role::solr::ttm shouldn't have more than 1G, looks good otherwise." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/67252 [15:32:21] New patchset: Manybubbles; "Increase ramBufferSizeMB from 32 to 100 and make maximum heap size configurable." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67252 [15:33:36] maxsem: ^ all better? [15:34:15] +1 [15:35:05] MaxSem: now that I'm one of the authors of the patch do I remove myself from the review list of +1 it? [15:38:12] don't remove, no point in voting yourself:) [15:56:34] New review: Akosiaris; "Uploading new patchset solving most of these." [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/67442 [16:03:32] ^demon: hey, over in openstack i think we need to set up a dedicated gitweb/git server to take some load off of gerrit [16:03:48] ^demon: hashar pointed me at https://git.wikimedia.org/ [16:04:13] we have no gitweb! ;) [16:04:30] <^demon> jeblair: Yeah, there's no real secret magic to that. Using standard gerrit replication to keep repos in sync on the second box, then just running gitblit with apache acting as reverse proxy. [16:04:31] Reedy: you have a gitblit though :) [16:04:52] <^demon> Then configured gerrit.config to point to gitblit. [16:05:01] ^demon: i'm not familiar with gitblit -- how's it compare with gitweb or cgit? [16:05:13] ^demon: (it looks nicer, i can see that right away. :) [16:05:33] <^demon> I think it's got some nice features like statistics. The lucene search is kind of disappointing though. 
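The setup ^demon outlines here — standard Gerrit replication keeping a second box in sync, gitblit plus an Apache reverse proxy serving the reads — is driven by git-config-format files under Gerrit's site directory, so it can be sketched with plain git config. The remote name, host and paths below are assumptions for illustration, not the production values (the real gitweb/gitblit link settings live in the puppet template ^demon links a few lines further down):

```bash
# replication.config: mirror every ref to the gitblit box on each push.
# ${name} is expanded per-project by Gerrit's replication, hence the
# single quotes keeping the shell away from it.
git config -f etc/replication.config remote.gitblit.url \
    'gerrit2@git-mirror.example.org:/var/lib/git/${name}.git'
git config -f etc/replication.config remote.gitblit.mirror true
```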
[16:05:44] <^demon> Definitely *looks nicer* than gitweb or cgit. [16:05:54] <^demon> And is generally faster than gitweb [16:05:55] and faster as well isn't it ? [16:05:57] New review: Akosiaris; "Comments inline. One extra point is the overloading of JMX_PORT variable. I would rather there was a..." [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/67442 [16:06:16] jeblair: one use case was to download a tar ball of HEAD, which AFAIK was slow in gitweb [16:06:43] <^demon> That's part of it. I didn't want people's requesting of $randomTar to affect gerrit's performance. [16:06:50] <^demon> Hence a 2nd box. [16:07:39] ^demon: do you know whether the Gerrit configuration stored in the git repo is replicated as well ? And is it available via git blit? [16:07:54] oh wow, impressive. people used to do that with github a lot and it failed all the time, so we started auto-building tarballs of each branch and publishing them to known locations [16:08:13] <^demon> hashar: Well, all of refs/* is replicated, so gerrit config like refs/meta/config is there. [16:08:18] (eg http://tarballs.openstack.org/nova/nova-master.tar.gz ) [16:08:20] <^demon> Gitblit doesn't know or care about it much though. [16:08:51] I have been too lazy to generate tarballs :-D [16:09:48] ^demon: and the (gitblit) links in gerrit -- is that just a gerrit config option to specify the url and link name? [16:10:08] <^demon> Indeed, lemme find it in our puppet repo. [16:10:52] <^demon> jeblair: https://git.wikimedia.org/blob/operations%2Fpuppet.git/0f963e11c9342d1486d0de0eec82b92394ab364a/templates%2Fgerrit%2Fgerrit.config.erb#L53 [16:10:57] manybubbles: hmm now only need to find someone who can give +2 [16:11:23] ^demon: awesome thanks! [16:11:57] Nikerabbit: yup. Is Faidon the best for that? [16:12:23] Nikerabbit: should I +1 it or just leave it? What is normal for someone who submitted a patch? [16:12:26] I'm not sure about best [16:12:29] but I can give it a try [16:12:49] manybubbles: peter y. has helped me before, but doesn't seem around [16:13:12] manybubbles: for my own patches or patches I've updated I almost never give +1 [16:13:20] paravoid: please do. I've been running it against solr3-puppet-test.pmtpa.wmflabs to check saneness if that helps. [16:13:35] Nikerabbit: cool. I'll just leave it then. [16:15:09] ugh, the solr manifests are so horrible [16:15:55] manybubbles: tell you what [16:15:56] thanks, I think they were the first puppet code I wrote [16:16:27] I'll merge it for now [16:16:52] but when you start working on search 2.0, we'll rewrite role class + modules [16:17:43] Makes sense to me. I think we'll want to do for technical reasons any way. [16:18:28] depends on autocommit, so I'll merge that too [16:18:38] paravoid, want us programmers to write less horrible manifests? 
explain what's wrong so we can learn from our mistakes;) [16:19:07] I do when people put me as a reviewer [16:19:25] I can't really go back to all of our merged manifests and comment on them now, can I :) [16:19:35] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67249 [16:20:01] also, while I think it's commendable for developers to prepare puppet manifests and very much welcome [16:20:27] for important changes, such as building a new big piece of infrastructure I think ops people should be heavily involved [16:21:01] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67252 [16:22:24] I didn't get much help for writing those manifests ~ year ago when I was doing this... [16:22:33] yeah, but we don't have enough ops to pay equal and timely attention to all projects, so sometimes we devs have to get shit done sooner rather than later:) [16:22:43] lobby for more ops hires then. [16:23:10] I do constantly :-D [16:23:11] alternatively, you could put "puppet skills" in your job ads, but that's kind of crazy isn't it? [16:23:33] again, there's nothing wrong with leveling up devs on puppet [16:23:46] hashar for example has great puppet skills now [16:23:47] New patchset: Demon; "Utility manifest for building hiphop" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67120 [16:23:50] thx [16:23:51] :°D [16:24:08] Nikerabbit: MaxSem for puppet, feel free to ping me over IRC. [16:24:32] Nikerabbit: MaxSem I can surely give a first level of review, granted that is not an entirely new module for some entirely new software :-D [16:24:41] soo... if you tell me what's wrong I'll definitely refactor it;) [16:24:44] hashar: aren't you already overworked? :o [16:24:45] thanks hashar [16:25:08] MaxSem: no worries, I think manybubbles & ^demon have grand plans about creating a new unified solr architecture [16:25:32] (and as I was saying the other day I think this should be a cross functional team with someone from ops too) [16:25:43] (and I'll lobby for that :-) [16:25:59] I have this pet project of puppetizing translatewiki.net, but meh this conversation makes it feel even less interesting to me [16:26:00] Peter? [16:26:05] maybe? [16:26:13] who knows [16:26:23] New patchset: Demon; "Block another misbehaving spider" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67844 [16:26:30] Nikerabbit: I am overworked. So if i don't have anytime I will say so :-D But if your puppet change is "simple" enough, it is not going to take more than a minute to review it. [16:26:38] Nikerabbit: so just bring more work hehe [16:26:40] I wouldn't turn down another pair of eyes and a healthy brain [16:27:25] meanwhile, I am off for cooking / diner / daughter etc. Be back around 9pm (GMT+2) [16:27:27] <^demon> paravoid: Can you look at https://gerrit.wikimedia.org/r/#/c/67844/ and its parent? I've got a spider that's not listening to robots.txt [16:27:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67533 [16:27:47] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67844 [16:27:53] I was already on it :) [16:27:57] ^demon: got a chinese spider browsing Jenkins, I have basically ip route blackholed it :) [16:28:05] uyeah Sogou :) [16:28:18] damn I need to write the same patch for jenkins [16:29:00] <^demon> !log restarting apache on gerrit box [16:29:02] New review: Faidon; "What do you need this for? 
This feels something like that belongs in a Debian package's Build-Depend..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67120 [16:29:08] Logged the message, Master [16:29:09] |log having dinner [16:29:43] lol [16:29:49] bon appetit hashar [16:30:08] New review: Demon; "Because I'm lazy and wanted to get a manifest for building this in labs. Ideally we'd package it, ye..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67120 [16:30:25] Change abandoned: Demon; "(no reason)" [operations/debs/lucene-search-2] (master) - https://gerrit.wikimedia.org/r/60860 [16:31:16] manybubbles: jetty on vanadium running with -Xmx1G -Xms1G [16:31:21] New patchset: Akosiaris; "First version debian kafka package" [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/67442 [16:31:44] paravoid: sounds good to me [16:32:31] <^demon> paravoid: To package for 12.04, we'd have to backport 2 other pages (easy), plus forward port 2 others + their patches (pain enough as-is) [16:32:55] <^demon> As of 13.04, everything's in apt except 1 package which still has hacks. [16:33:06] <^demon> s/pages/packages/ [16:34:17] ^demon: context? [16:34:25] <^demon> HHVM [16:34:27] <^demon> https://github.com/facebook/hiphop-php/wiki/Building-and-installing-HHVM-on-Ubuntu-12.04 [16:34:27] ah [16:34:39] they ship some modified libraries iirc [16:34:50] that they embed in the source [16:34:57] <^demon> No, they don't ship with...you have to compile yourself + their patch. [16:35:01] right, libevent [16:35:12] yeah, we wouldn't ship the modified libevent for everything to use [16:35:15] <^demon> libevent and libcurl (libcurl is fixed upstream as of 13.04, which is nice) [16:35:28] we could set up a new section in our apt just for hhvm I guess [16:35:33] but libevent is too commonly used to risk this [16:35:59] <^demon> libcurl, glog and jmalloc3 are all in 13.04, so backporting should be pretty easy. [16:36:13] <^demon> It's just that damn libevent. [16:36:18] they embed other stuff too [16:36:21] but that's okay for now I guess [16:37:03] liblz4, libsqlite3, ... [16:42:12] yo notpeter, you around? [16:46:31] New review: Catrope; "APC has issues when it's not allocated "enough" memory. Because the code we're serving from gallium ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67551 [16:51:16] mark: ping [16:51:23] hm? [16:51:34] hi! can you review https://gerrit.wikimedia.org/r/#/c/67497/? [16:51:51] apparently I can't [16:51:54] [16:51:54] Not Found [16:51:54] [16:51:54] The page you requested was not found, or you do not have permission to view this page. [16:52:06] ah [16:52:08] without ? ;) [16:52:08] your client probably included the ? [16:52:12] https://gerrit.wikimedia.org/r/#/c/67497/ [16:52:36] silly irc client for including characters valid in urls [16:52:54] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67497 [16:52:55] it should really be more intelligent and do some look-ahead reasoning [16:56:26] gwicke: so RoanKattouw mentioned that mediawiki now does PURGE directly? [16:56:31] for parsoid I mean? [16:58:43] heya paravoid, quick q for you [16:58:55] i'm going to be working on hive and oozie puppetization next, some of it needs mysql databases created [16:59:02] do you thikn something like this is reasonable to do? 
[16:59:02] https://gist.github.com/ottomata/5750394 [16:59:05] oh god, more manifests [16:59:16] (this is old unused puppetization, don't worry about reviewing the content) [16:59:21] (just the idea) [16:59:37] just want to check in with you that execs to create mysql dbs and users is sane [16:59:54] I don't think we generally do that [17:00:13] i'd like to make the installation of these things as automated as possible [17:00:30] possible or sensible? :) [17:00:36] whichever! [17:00:56] i'm not tied to this, i'd like to be able to just apply a manifest to a new node and have everything work [17:01:00] but i know that's not always possibel [17:01:12] or [17:01:15] sensible. :) [17:01:19] how often would you need to do this? :) [17:01:47] in labs, maybe often as I dev, or in vagrant environments for analysts to dev in [17:01:59] in production, just once [17:02:09] (unless we reinstall it, of course) [17:02:34] there's also the risk that it causes problems of course [17:02:36] factor that in also [17:02:48] eh? [17:05:43] well, probably not with pure creates that fail if something already exists [17:05:56] but don't go anywhere near removes, upgrades, and things like that ;) [17:06:10] paravoid: the Parsoid extension does purge old revisions after edits, but that does not work in the current setup [17:06:24] it is entirely optional though- not purging at all would be correct too [17:06:51] the only reason for purging is to free resources quickly before LRU would get to it [17:07:55] different URLs? [17:07:57] yeah, mark, this is meant to only work on brand new install [17:08:28] mark: Yes, although we're going to start dropping the mtime from the URLs, once we do that we will need gwicke's purging [17:08:40] We still have the oldid in the URL though [17:09:08] no, we keep the current revision up to date with implicit refreshes [17:09:14] so no purges there [17:09:21] Right, yeah, that [17:09:21] aha [17:09:27] and what's the thing that you deployed last week? [17:09:29] GET with Cache-Control: no-cache [17:09:29] purging is only used for old revisions [17:09:31] then you also don't need to purge [17:09:37] because varnish does that automatically when it gets a newer one [17:09:37] I'm not sure I understood it [17:09:54] are you doing visual editor GETs on mediawiki save? [17:09:57] Sorry, I'm guilty of conflating purging with regeneration [17:10:04] The more common operation is regeneration [17:10:04] s/visual editor/parsoid/ [17:10:11] paravoid: no, the Parsoid extension does GETs to the varnishes [17:10:23] When an edit comes in, we regenerate the Parsoid URL for that page [17:10:34] by sending a GET to the Varnishes with CC:n-c [17:10:38] well, we generate it since the URL will have a new oldid [17:10:43] that takes a toll on saves though, no? [17:10:49] on template updates however we regenerate [17:10:56] since the URL will stay the same [17:11:00] the user will wait until parsoid parses the page, no? [17:11:23] paravoid: no, it is async in a bg job [17:11:33] using the job queue [17:11:35] ah [17:11:51] and if someone requests it before the job runs? [17:11:59] oldid takes care of it? [17:12:04] then it will be kicked off at that point [17:12:16] the job then gets the cached version [17:12:26] and concurrent requests are coalesced [17:12:29] what is the purpose of that job then? [17:12:39] trying to keep everything in the cache? 
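For the gist ottomata links above, the safe shape — which mark spells out just below — is create-only SQL that is harmless to re-run and can never clobber existing state. A sketch with hypothetical database and account names (the real ones would come from the hive/oozie puppetization):

```bash
# Both statements are idempotent: IF NOT EXISTS skips an existing
# database, and re-issuing an identical GRANT changes nothing. There is
# deliberately no DROP, ALTER or upgrade logic here.
mysql -NBe "CREATE DATABASE IF NOT EXISTS oozie;"
mysql -NBe "GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'localhost'
            IDENTIFIED BY '${OOZIE_DB_PASS}';"  # password via a puppet secret
```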
[17:12:42] to make sure our cache is up to date [17:12:43] yup [17:12:57] normally requesting a page for editing should just be a cache hit [17:13:04] that kind of assumes that articles that are edited will soon be edited again [17:13:10] !log catrope synchronized php-1.22wmf5/extensions/VisualEditor/ApiVisualEditor.php 'Live hack for debugging' [17:13:14] we might do some spidering [17:13:15] it does defeat LRU a bit [17:13:16] paravoid: How does it assume that? [17:13:20] Logged the message, Master [17:13:25] so you're assuming everything fits in the cache [17:13:28] and our cache is large enough to hold pretty much all pages [17:13:37] right [17:13:44] if everything fits in cache, that works nicely [17:13:54] eventually we don't want to use a cache, but proper storage [17:14:07] but for now it simplified and sped up the implementation [17:14:50] !log catrope synchronized php-1.22wmf5/extensions/VisualEditor/ApiVisualEditor.php 'Live hack for debugging' [17:14:58] what is proper storage? [17:14:59] Logged the message, Master [17:15:07] database? [17:15:28] ceph ;-p [17:15:33] paravoid: https://bugzilla.wikimedia.org/show_bug.cgi?id=49143 [17:15:42] that's the idea so far [17:15:58] so compound content type in db [17:16:52] we did not have the time to tackle that before July, but plan to attack that after the release [17:17:26] aha [17:17:45] erm, so in that case all those varnish boxes would not be used? [17:19:17] paravoid: all those two varnish boxes, yes [17:19:34] two? [17:19:46] yes, for failover [17:19:49] okay, I thought we were ordering more [17:19:53] probably misremembering [17:19:59] the request rate could be handled by a cell phone these days [17:20:09] heh [17:20:21] up to 100/s [17:20:23] jimmy's cell phone, specifically [17:20:33] not by parsoid, though ;p [17:20:58] with the current numbers it looks like we could actually handle 100, as calculated [17:21:27] !log catrope synchronized php-1.22wmf5/extensions/VisualEditor/ApiVisualEditor.php 'Undo live hack' [17:21:30] we were mainly concerned about the load that would create on the API [17:21:35] Logged the message, Master [17:21:39] hence all the expansion reuse optimizations [17:21:56] paravoid: We're ordering two new ones because we're currently using two misc boxes [17:22:09] So we're ordering real cache boxes and giving the misc boxes back [17:22:36] okay [17:22:50] thanks to both of you :) [17:22:56] Damn those fake cache boxes [17:35:26] mark, there's a bunch of stuff in rt's exim config that doesn't really fit in exim4.conf.SMTP_IMAP_MM.erb . Any thoughts how I should handle that? Does exim support multiple config files, or should I implement some kind of #include logic, or…? [17:37:32] !log restarting search indexers to let them know about new wikis, running import-db scripts for elwikivoyage, vecwiktionary, testwikidatawiki [17:37:44] Logged the message, Master [17:39:13] New review: Ottomata; "> I would rather there was a variable in /etc/default/kafka for every use and then set in the functi..." [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/67442 [17:40:56] hiyaa mutante! [17:40:57] you around? [17:41:04] could you help me PXE boot an analytics Dell? [17:41:24] its an R720 [17:41:36] ottomata: hey, yea, give me a min , creating search indices [17:41:37] it shoudn't be too hard, it just won't PXE boot for me [17:41:38] sure [17:41:41] ok [17:41:45] wha'ts the issue ottomata? 
[17:42:38] not sure beyond I tell it to PXE boot [17:42:38] New patchset: Faidon; "Debianize Kafka" [operations/debs/kafka] (master) - https://gerrit.wikimedia.org/r/67442 [17:42:39] and then it waits a bit, and then just boots the installer [17:42:45] i can see it talk to DHCP on brewster [17:42:46] but that's it [17:42:49] sirrt [17:42:51] sorry* [17:42:54] not 'boots the installer' [17:42:54] boots the installer? [17:42:56] ah [17:43:00] that's what it should do :) [17:43:03] meant to type "boots the OS" [17:44:26] ottomata: which server? [17:45:21] analytics1020 [17:45:30] it can be rebooted at will [17:45:40] New patchset: Cmjohnson; "updating dhcpd and netboot cfg for rdb1003-4" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67849 [17:45:47] !log killed and restared lucene on search indexers after imports, starting incremental updater [17:45:50] New review: Faidon; "PS diff: d/copyright fixes + whitespace fixes all over." [operations/debs/kafka] (master) C: -1; - https://gerrit.wikimedia.org/r/67442 [17:45:57] Logged the message, Master [17:46:39] ottomata: you turned off tftp on analytics [17:46:56] oh [17:47:33] cmjohnson1: how? [17:47:46] ? [17:47:50] I did? [17:47:50] !log rebooting analytics1020 [17:47:59] Logged the message, Master [17:48:47] Attempting PXE Boot [17:49:35] Initializing firmware interfaces... [17:49:57] Lifecycle Controller: Collecting System Inventory... [17:50:24] Scanning for devices. Please wait, this may take several minutes... [17:50:56] cmjohnson1: So did you ever find out what's up with wtp1008? [17:51:46] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67849 [17:51:59] ottomata: eh, yeah, confirmed, so it does the above, and then when its done, straight to OS [17:53:01] roankattouw: no i haven't looked yet [17:53:25] aye yah, thanks mutante. yeah no idea [17:53:26] OK [17:53:46] thanks for the reminder though�cuz i forgot to make a ticket [17:53:51] chrisj_: what did you mean.. about tftp? [17:53:58] it is possible what cmjohnson1 said is true, mark and I set up a network ACL that whitelists outgoing analytics traffic [17:54:04] No rush, just wondering [17:54:17] buuut, i thought we've been through this or something. [17:54:39] https://rt.wikimedia.org/Ticket/Display.html?id=4433 [17:55:43] just talked to Leslie, says it's quite possible it doesnt have a hole for that yet [17:55:49] creating ticket for it [17:56:36] you might want to ust reopen that ticket [17:56:46] so things stay easy to track with this [17:56:48] yea, ok [17:57:29] mutante: did you get that questions answered? [17:58:10] chrisj_: eh, yea, kind of, it's possibly the ACL in RT-4433 [17:58:21] tftp uses tcp port 69 and if that is not allowed than I don't think the server can connect to brewster to do an install [17:58:27] it's probably that [17:58:27] yep [17:58:29] that's my thought ^ [17:58:53] reopened 4433 [17:59:01] meeting time? [18:10:40] deployment time! [18:11:08] everything non 'pedia [18:13:24] yay [18:13:54] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikimedia, special, private and fishbowl to 1.22wmf6 [18:14:02] Logged the message, Master [18:15:54] Reedy: do the submodules still need update? [18:15:55] 2 Fatal error: Call to undefined method JsonSchemaContent::getHtml() in /usr/local/apache/common-local/php-1.22wmf6/includes/content/TextContent.php on line 216 [18:16:03] huh? 
[18:16:24] Might just be transitional [18:16:31] * aude knows nothing about JsonSchemaContent [18:18:08] Think I'm going to just ignore them [18:18:38] looks like the branch is up-to-date [18:18:42] aude: Submodules should be all up to date... [18:18:48] even localisation cache is good :) [18:18:52] :D [18:19:11] testwikidatawiki [18:19:13] * Reedy coughs [18:20:12] :) [18:20:41] much better this way [18:21:10] yeaah [18:22:00] Be even better if we can come up with a solution for https://bugzilla.wikimedia.org/show_bug.cgi?id=49392 too [18:23:34] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikivoyage and wiktionary to 1.22wmf6 [18:23:43] Logged the message, Master [18:25:33] !log mw1171 is running with readonly file system [18:25:36] reedy@mw1171:~$ df --si [18:25:36] Bus error [18:25:41] Logged the message, Master [18:25:44] Can someone at least depool it please? [18:26:33] mutante powered it off last week [18:26:52] lol, it's currently on [18:26:57] I figured that [18:26:58] depooling... [18:27:07] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiquote and wikisource to 1.22wmf6 [18:27:12] (we're in the ops meeting) [18:27:15] Logged the message, Master [18:27:33] Reedy: would it be too much trouble to run localisation update again for wikibase? :/ [18:27:45] or one sec.... [18:28:06] !log depooling mw1171, broken hardware #5231 [18:28:15] Logged the message, Master [18:29:28] how did it return to the world of living after being powered down? [18:29:41] who knows [18:29:42] That JsonSchemaContent::getHtml() error is indeed a bug [18:29:45] phpstorm knows it [18:30:37] localisation update won't help us so forget it [18:32:08] that's EventLogging [18:33:06] The JsonSchemaContent? [18:33:09] * Reedy prods ori-l [18:34:07] yup [18:34:15] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikinews and wikibooks to 1.22wmf6 [18:34:24] Logged the message, Master [18:34:32] New patchset: Reedy; "Everything non 'pedia to 1.22wmf6" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67852 [18:34:46] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67852 [18:44:52] maxsem: i had to turn mw1171 back on to get some log reports for dell tech support [18:45:07] heh [18:46:05] !log deleting files from /var/spool/snmptt to fix neon again [18:46:14] Logged the message, Master [18:49:07] New patchset: Asher; "sudo rule for mwdeploy to restart twemproxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67854 [18:52:08] New patchset: Aaron Schulz; "Added missing comment" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67855 [18:52:26] New patchset: Asher; "sudo rule for mwdeploy to (re)start twemproxy" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67854 [18:52:43] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67855 [18:53:26] RECOVERY - SSH on mw1171 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [18:53:26] RECOVERY - DPKG on mw1171 is OK: All packages OK [18:53:26] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.076 second response time [18:55:32] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67854 [18:57:04] !log Restarted Varnish on cerium and titanium [18:57:12] Logged the message, Mr. 
Obvious [18:58:33] New review: Hashar; "I suspect the APC cache corruption is caused by the files being hard linked and some weird issues in..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/67551 [19:07:01] akosiaris: you think I should just remove JMX_PORT defaults from kafka.sh altogether [19:07:01] ? [19:07:10] and just let people set that themselves if they want it? [19:07:27] !log powering down ms-fe1004 to relocate to different row [19:07:36] Logged the message, Master [19:13:22] New review: coren; "LGM" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/67826 [19:13:22] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67826 [19:14:07] New patchset: GWicke; "Forward the Cache-Control header from frontends to backend caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67861 [19:15:08] RoanKattouw, binasher ^^ [19:16:47] mutante, hmmmMMmMm [19:16:59] mark added a rule to allow tftp [19:17:02] New patchset: GWicke; "Forward the Cache-Control header from frontends to backend caches" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67861 [19:17:20] and the reboot this time took much longer, I thought it must have been loading the ubuntu installer image [19:17:25] but, still, eventually it OS booted [19:21:33] preilly, ping [19:23:37] RECOVERY - Host mw1171 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [19:23:50] !log deleting and recreating /var/spool/snmptt on neon [19:24:00] Logged the message, Master [19:25:48] PROBLEM - Solr on solr1001 is CRITICAL: Average request time is 406.19287 (gt 400) [19:25:50] PROBLEM - RAID on analytics1020 is CRITICAL: Timeout while attempting connection [19:25:58] RECOVERY - Host labstore4 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms [19:26:00] !log powercycling analytics1020 [19:26:08] Logged the message, Master [19:26:31] ohp, i just did that myself mutante [19:26:42] i was about to poke around in bios [19:27:02] Attempting PXE Boot [19:27:06] you want console? [19:27:11] got it already [19:27:18] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [19:27:18] ok cool, we can both view it then i guess [19:27:43] cool, theres the installer ":) [19:27:51] ottomata: there you go,, works [19:27:53] hey! [19:27:57] why didn'ti t do that for me [19:27:59] !? [19:27:59] not [19:28:06] !? [19:28:06] not [19:28:08] PROBLEM - Host mw1020 is DOWN: PING CRITICAL - Packet loss = 100% [19:28:12] heh. [19:28:24] ^ petan [19:28:24] ottomata: i did this by hitting F12 [19:28:32] yeah i did that too [19:28:34] ottomata: maybe you tried by using racadm commands? [19:28:38] i did both [19:28:38] hmm, then i dunno :p [19:28:38] PROBLEM - Disk space on labstore4 is CRITICAL: Connection refused by host [19:28:48] PROBLEM - RAID on labstore4 is CRITICAL: Connection refused by host [19:28:48] PROBLEM - DPKG on labstore4 is CRITICAL: Connection refused by host [19:28:53] hm, ok, so uhhhh, now that i'm looking at the installer, what [19:28:59] do I go into BIOS and tell it to netboot? [19:29:04] is this installer or just boot menu? 
[19:29:08] PROBLEM - Host ms-fe1004 is DOWN: PING CRITICAL - Packet loss = 100% [19:29:12] this is the installer [19:29:17] ideally you just wait [19:29:20] until it's done [19:29:22] Change restored: Hashar; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67830 [19:29:25] i have a menu [19:29:30] i don't see it [19:29:30] Dell Inc BOOT MANAGER [19:29:46] hmm, seems at this point we cant both see same thing anymore [19:29:47] sigh [19:29:56] ok escaping console [19:30:04] New patchset: Hashar; "jenkins validation of pep8" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67830 [19:30:05] getting back in [19:30:12] oh i guess I can't [19:30:14] i see what you're saying [19:30:14] ah [19:30:32] welp, I think the partman recipe for this should work [19:30:35] wait a bit longer [19:30:37] so hopefully the installer will just work, eh? [19:30:40] yea [19:30:52] ok, you've got the console up? i guess let me know when it looks ok [19:31:08] yes.i'll just let it run and tell you [19:31:22] be back in a little while, then we'll see [19:31:33] Change abandoned: Hashar; "reject invalid pep8 as expected!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67830 [19:31:39] ah, there it continues, it works ottomata [19:31:51] it's fetching packages now [19:32:08] great [19:32:12] thanks [19:32:35] eh, i mean additional installer components, creates ext4 fs [19:32:45] and now base system, so that worked too [19:33:12] !log reinstalling analytics1020 [19:33:18] RECOVERY - Host mw1020 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [19:33:20] Logged the message, Master [19:33:36] ext4? i thought the partman was using ext3 [19:33:42] for / partition i don't really care [19:33:47] !log olivneh synchronized php-1.22wmf6/includes/content 'Reverting change Ibfb2cbefe49398' [19:33:49] New review: Hashar; "that needs someone to poke ops about it :-) I got enough changes to babysit as is so I am not inves..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54692 [19:33:54] Logged the message, Master [19:35:49] !log Restarted Apache on gallium [19:35:57] Logged the message, Mr. Obvious [19:36:48] paravoid: I responded to your comments on the APC change, any chance it could be deployed soonish? 
[19:37:00] * RoanKattouw is getting a bit tired of having to restart Apache (----^) several times per day [19:37:06] PROBLEM - Disk space on mw1020 is CRITICAL: Connection refused by host [19:37:06] PROBLEM - twemproxy process on mw1020 is CRITICAL: Connection refused by host [19:37:14] PROBLEM - SSH on mw1020 is CRITICAL: Connection refused [19:37:24] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [19:37:24] PROBLEM - RAID on mw1020 is CRITICAL: Connection refused by host [19:37:24] PROBLEM - Puppet freshness on lardner is CRITICAL: No successful Puppet run in the last 10 hours [19:37:24] PROBLEM - Apache HTTP on mw1020 is CRITICAL: Connection refused [19:37:24] PROBLEM - Puppet freshness on tola is CRITICAL: No successful Puppet run in the last 10 hours [19:37:25] PROBLEM - DPKG on mw1020 is CRITICAL: Connection refused by host [19:37:25] PROBLEM - mysqld processes on db44 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [19:37:26] PROBLEM - RAID on db44 is CRITICAL: CRITICAL: Degraded [19:38:26] PROBLEM - swift-container-replicator on ms-be1 is CRITICAL: Connection refused by host [19:39:02] !log dns update [19:39:10] Logged the message, Master [19:39:26] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: No successful Puppet run in the last 10 hours [19:39:26] PROBLEM - Puppet freshness on amssq34 is CRITICAL: No successful Puppet run in the last 10 hours [19:39:26] PROBLEM - Puppet freshness on amssq40 is CRITICAL: No successful Puppet run in the last 10 hours [19:39:26] PROBLEM - Puppet freshness on aluminium is CRITICAL: No successful Puppet run in the last 10 hours [19:39:26] PROBLEM - Puppet freshness on amssq36 is CRITICAL: No successful Puppet run in the last 10 hours [19:40:46] PROBLEM - twemproxy process on mw1171 is CRITICAL: NRPE: Unable to read output [19:44:06] PROBLEM - Host stat1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:36] RECOVERY - Host stat1001 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [19:44:59] PROBLEM - NTP on labstore4 is CRITICAL: NTP CRITICAL: No response from NTP server [19:47:08] notpeter: around? can you look at https://gerrit.wikimedia.org/r/#/c/67544/? [19:47:16] RECOVERY - SSH on mw1020 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0) [19:49:51] binasher, mark or preilly, why does MobileFrontend disable caching for requests coming from frontend proxies? 
this looks like something from times prehistorical [19:50:56] PROBLEM - MySQL Replication Heartbeat on db34 is CRITICAL: NRPE: Unable to read output [19:50:57] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: NRPE: Unable to read output [19:51:29] PROBLEM - MySQL Replication Heartbeat on db39 is CRITICAL: NRPE: Unable to read output [19:51:46] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: NRPE: Unable to read output [19:52:17] PROBLEM - NTP on mw1020 is CRITICAL: NTP CRITICAL: No response from NTP server [19:53:16] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: NRPE: Unable to read output [19:57:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:58:16] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay seconds [19:58:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.168 second response time [19:59:16] PROBLEM - NTP on stat1001 is CRITICAL: NTP CRITICAL: Offset unknown [20:00:46] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: NRPE: Unable to read output [20:01:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:46] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay seconds [20:01:47] RoanKattouw: hey [20:01:53] Howdy [20:01:59] RoanKattouw: so my worry is -and I'll back down if you all agree that it isn't an issue- [20:02:05] that this: "If there is a difference in how code behaves with APC vs without, that's a bug in APC. Sadly, APC is buggy, which is why we want to disable it on gallium." [20:02:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.135 second response time [20:02:30] if APC is buggy, we should find out from CI and/or betalabs [20:03:04] i.e. it sounds to me a bit like "this test crashes because of a bug, let's disable the test" [20:03:16] RECOVERY - NTP on stat1001 is OK: NTP OK: Offset -0.006206035614 secs [20:03:57] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: NRPE: Unable to read output [20:04:00] paravoid, so here's the problem: the patterns of opcode cache usage are very different between CI and prod [20:04:01] paravoid: When I say APC is buggy [20:04:14] What I mean is "when it runs out of memory it causes random fatal errors with misspelled class names" [20:04:25] MaxSem: Just for WMF right? [20:05:20] preilly, and for other sites that have configured $wgSquidServers.
it originates from http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/MobileFrontend/MobileFrontend.php?r1=93404&r2=93405& [20:05:21] The kinds of bugs that we get from APC on gallium (and occasionally on prod as well) are not at all related to our code [20:05:55] Sometimes it just randomly corrupts its cache and complains that wfFooBwr() doesn't exist while the code is really calling wfFooBar() [20:06:29] the point is that "not related to our code" doesn't mean "we shouldn't care" [20:06:42] Look [20:06:46] APC is a broken piece of garbage [20:06:48] lol [20:06:53] And ideally it would be fixed and wonderful [20:06:56] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay seconds [20:07:03] But it's not, and we're not gonna fix it in the short term [20:07:14] RoanKattouw: it's much better in 5.5 [20:07:21] We need it in prod despite its warts for performance reasons [20:07:28] ok [20:07:34] But for CI it's just causing problems and doesn't provide much benefit [20:07:34] I wonder what triggered this now in CI... [20:07:42] This has been going on for a while [20:07:45] but okay, sure [20:08:01] It's probably just the amount of turnover with new code hitting CI all the time [20:08:09] New patchset: Faidon; "contint: Disable php-apc on gallium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67551 [20:08:42] I just feel uncomfortable merging workarounds for problems we don't fully understand, but meh [20:08:45] RECOVERY - Puppet freshness on mw1158 is OK: puppet ran at Mon Jun 10 20:08:43 UTC 2013 [20:08:45] RECOVERY - Puppet freshness on ms-be1003 is OK: puppet ran at Mon Jun 10 20:08:43 UTC 2013 [20:08:46] RECOVERY - Puppet freshness on mw1082 is OK: puppet ran at Mon Jun 10 20:08:43 UTC 2013 [20:08:46] RECOVERY - Puppet freshness on cp1035 is OK: puppet ran at Mon Jun 10 20:08:43 UTC 2013 [20:08:46] RECOVERY - Puppet freshness on mw113 is OK: puppet ran at Mon Jun 10 20:08:43 UTC 2013 [20:08:46] RECOVERY - Puppet freshness on db56 is OK: puppet ran at Mon Jun 10 20:08:43 UTC 2013 [20:08:46] RECOVERY - Puppet freshness on mw1121 is OK: puppet ran at Mon Jun 10 20:08:43 UTC 2013 [20:08:47] RECOVERY - Puppet freshness on arsenic is OK: puppet ran at Mon Jun 10 20:08:43 UTC 2013 [20:08:47] RECOVERY - Puppet freshness on mw1 is OK: puppet ran at Mon Jun 10 20:08:43 UTC 2013 [20:08:48] RECOVERY - Puppet freshness on wtp1006 is OK: puppet ran at Mon Jun 10 20:08:44 UTC 2013 [20:08:48] RECOVERY - Puppet freshness on db1057 is OK: puppet ran at Mon Jun 10 20:08:44 UTC 2013 [20:08:49] RECOVERY - Puppet freshness on mw1135 is OK: puppet ran at Mon Jun 10 20:08:44 UTC 2013 [20:08:49] RECOVERY - Puppet freshness on wtp1008 is OK: puppet ran at Mon Jun 10 20:08:44 UTC 2013 [20:08:49] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67551 [20:08:50] RECOVERY - Puppet freshness on db1056 is OK: puppet ran at Mon Jun 10 20:08:44 UTC 2013 [20:08:50] RECOVERY - Puppet freshness on srv298 is OK: puppet ran at Mon Jun 10 20:08:44 UTC 2013 [20:08:56] RECOVERY - Puppet freshness on mc13 is OK: puppet ran at Mon Jun 10 20:08:45 UTC 2013 [20:08:56] RECOVERY - Puppet freshness on cp1001 is OK: puppet ran at Mon Jun 10 20:08:45 UTC 2013 [20:08:56] RECOVERY - Puppet freshness on mw15 is OK: puppet ran at Mon Jun 10 20:08:45 UTC 2013 [20:08:56] RECOVERY - Puppet freshness on ms1004 is OK: puppet ran at Mon Jun 10 20:08:45 UTC 2013 [20:08:56] RECOVERY - Puppet freshness on sq49 is OK: puppet ran at Mon Jun 10 20:08:45 UTC 2013
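The APC failure mode described above has a recognizable signature: fatal errors referencing near-miss symbol names, like wfFooBwr() where the code really calls wfFooBar(). A minimal detection sketch in Python, assuming a hypothetical error-log path and an illustrative list of known function names; this is not how gallium was actually monitored:

#!/usr/bin/env python
# Sketch: flag "undefined function" fatals whose names are near-misses of
# known symbols -- the APC opcode-cache corruption signature described above.
# The LOG path and the KNOWN list are illustrative placeholders.
import difflib
import re

LOG = "/var/log/apache2/error.log"
KNOWN = ["wfFooBar", "wfGetDB", "wfMessage"]

FATAL = re.compile(r"Call to undefined function (\w+)\(\)")

def suspicious_fatals(path):
    with open(path) as fh:
        for line in fh:
            m = FATAL.search(line)
            if m:
                name = m.group(1)
                close = difflib.get_close_matches(name, KNOWN, n=1, cutoff=0.8)
                # A near-miss of a real symbol points at cache corruption
                # rather than genuinely missing code.
                if close and close[0] != name:
                    yield name, close[0]

if __name__ == "__main__":
    for bad, good in suspicious_fatals(LOG):
        print("possible APC corruption: %s (expected %s?)" % (bad, good))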
[20:09:22] ottomata: so, it's done installing, but i can't really login from sockpuppet, i guess that is also part of the ACL, can you do the puppet re-signing and stuff ? [20:10:05] RECOVERY - Puppet freshness on ms5 is OK: puppet ran at Mon Jun 10 20:09:54 UTC 2013 [20:10:05] RECOVERY - Puppet freshness on analytics1015 is OK: puppet ran at Mon Jun 10 20:09:55 UTC 2013 [20:10:05] RECOVERY - Puppet freshness on mw1124 is OK: puppet ran at Mon Jun 10 20:09:55 UTC 2013 [20:10:05] RECOVERY - Puppet freshness on mw38 is OK: puppet ran at Mon Jun 10 20:09:55 UTC 2013 [20:10:05] RECOVERY - Puppet freshness on mw119 is OK: puppet ran at Mon Jun 10 20:09:55 UTC 2013 [20:11:16] PROBLEM - Puppet freshness on kuo is CRITICAL: No successful Puppet run in the last 10 hours [20:11:25] PROBLEM - Puppet freshness on constable is CRITICAL: No successful Puppet run in the last 10 hours [20:11:25] PROBLEM - Puppet freshness on lardner is CRITICAL: No successful Puppet run in the last 10 hours [20:11:25] PROBLEM - Puppet freshness on tola is CRITICAL: No successful Puppet run in the last 10 hours [20:13:09] APC is a broken piece of garbage <--- I think 'A Piece of Crap' works better [20:13:27] hahaha [20:14:16] RECOVERY - Puppet freshness on mw1027 is OK: puppet ran at Mon Jun 10 20:14:07 UTC 2013 [20:14:16] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: NRPE: Unable to read output [20:14:27] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: NRPE: Unable to read output [20:15:15] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay seconds [20:16:55] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: NRPE: Unable to read output [20:17:25] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay seconds [20:17:55] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay seconds [20:24:45] RECOVERY - Puppet freshness on constable is OK: puppet ran at Mon Jun 10 20:24:40 UTC 2013 [20:25:25] PROBLEM - Puppet freshness on constable is CRITICAL: No successful Puppet run in the last 10 hours [20:26:26] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: NRPE: Unable to read output [20:26:51] yurik: hey [20:27:01] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67395 [20:27:27] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay seconds [20:28:56] RECOVERY - Puppet freshness on wtp1 is OK: puppet ran at Mon Jun 10 20:28:50 UTC 2013 [20:28:56] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Jun 10 20:28:50 UTC 2013 [20:29:35] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [20:29:45] PROBLEM - Puppet freshness on wtp1 is CRITICAL: No successful Puppet run in the last 10 hours [20:29:56] RECOVERY - Puppet freshness on lardner is OK: puppet ran at Mon Jun 10 20:29:51 UTC 2013 [20:30:05] RECOVERY - Solr on solr1001 is OK: All OK [20:30:25] PROBLEM - Puppet freshness on lardner is CRITICAL: No successful Puppet run in the last 10 hours [20:32:27] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: NRPE: Unable to read output [20:32:45] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: NRPE: Unable to read output [20:32:55] RECOVERY - Puppet freshness on tola is OK: puppet ran at Mon Jun 10 20:32:50 UTC 2013 [20:32:55] PROBLEM - MySQL Replication Heartbeat on db1019 is CRITICAL: NRPE: Unable to read output [20:33:26] PROBLEM - Puppet freshness on tola is CRITICAL: No
successful Puppet run in the last 10 hours [20:34:05] PROBLEM - Solr on solr1001 is CRITICAL: Average request time is 410.8793 (gt 400) [20:34:10] New patchset: Andrew Bogott; "Removed many unneeded scope.lookupvar calls." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67881 [20:34:11] New patchset: Andrew Bogott; "Remove switches that depend on enable_mediawiki_relay." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67882 [20:34:25] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay seconds [20:34:29] manybubbles: hey [20:34:42] paravoid: yo [20:34:55] RECOVERY - MySQL Replication Heartbeat on db1019 is OK: OK replication delay seconds [20:35:15] PROBLEM - MySQL Replication Heartbeat on db1003 is CRITICAL: NRPE: Unable to read output [20:35:24] New review: Andrew Bogott; "I tested this with a test role class to verify that the final exim4.conf output file is unchanged by..." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/67882 [20:35:42] manybubbles: see the solr alert above [20:35:45] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay seconds [20:35:58] New review: Andrew Bogott; "I tested this with a test role class to verify that the final exim4.conf output file is largely unch..." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/67881 [20:36:25] paravoid: are those in seconds or millis? [20:36:39] ms I hope :) [20:36:54] I think MaxSem added that check [20:37:01] could be the restart? [20:37:20] do we happen to graph request time? I'm checking [20:37:27] I don't think so [20:37:28] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: NRPE: Unable to read output [20:37:42] manybubbles, ms [20:38:17] RECOVERY - MySQL Replication Heartbeat on db1003 is OK: OK replication delay seconds [20:38:27] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay seconds [20:38:45] MaxSem: Looks like we're bumping against: $average_request_time = "400:600" [20:38:53] yep [20:39:05] lemme look if it's lower on slaves [20:39:17] PROBLEM - Host labstore4 is DOWN: PING CRITICAL - Packet loss = 100% [20:40:07] RECOVERY - Puppet freshness on kuo is OK: puppet ran at Mon Jun 10 20:40:04 UTC 2013 [20:40:17] PROBLEM - Puppet freshness on kuo is CRITICAL: No successful Puppet run in the last 10 hours [20:42:53] It'd be really nice to get some Java-specific graphs so we could measure stuff like the number of full and partial GCs and other good stuff. [20:43:24] I certainly don't see a smoking gun on ganglia. any chance this limit was recently lowered? [20:43:39] don't think so, but git log is your friend [20:43:57] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay seconds [20:43:57] RECOVERY - MySQL Replication Heartbeat on db34 is OK: OK replication delay seconds [20:44:05] manybubbles, looks like it was restarted due to your change, now it's barfing about a small number of samples [20:44:27] RECOVERY - MySQL Replication Heartbeat on db39 is OK: OK replication delay seconds [20:44:47] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay seconds [20:44:48] probably because the index was initially cold [20:45:01] when was it restarted? [20:45:14] puppet did it when it applied the Xmx change. [20:45:16] when puppet applied your change [20:45:17] also, what do you mean, barfing. [20:45:47] yeah, but what time was that? a couple of minutes after we merged it, I imagine.
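For context, the check being hit here compares Solr's average request time against the "400:600" pair above, and the earlier alert ("Average request time is 410.8793 (gt 400)") shows that exceeding the first bound already triggers CRITICAL. A rough Python sketch of that plugin-style evaluation; what the second bound does is an assumption here, not the real check_solr logic:

# Sketch of a Nagios-style check against the "$average_request_time = 400:600"
# pair. Per the alert above, exceeding the first bound is already CRITICAL;
# the second bound's role is not clear from the log, so it is only parsed.
import sys

OK, CRITICAL = 0, 2

def evaluate(avg_ms, spec):
    low, high = (float(x) for x in spec.split(":"))  # high's role: unknown here
    if avg_ms > low:
        return CRITICAL, "CRITICAL: Average request time is %s (gt %g)" % (avg_ms, low)
    return OK, "All OK"

if __name__ == "__main__":
    status, message = evaluate(410.8793, "400:600")
    print(message)    # matches the alert format seen in the channel
    sys.exit(status)  # Nagios interprets the exit code (0=OK, 2=CRITICAL)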
[20:45:48] let me run my load tester on it to see if the average decreases [20:46:01] puppet runs once per 30 minutes [20:47:07] RECOVERY - Puppet freshness on amslvs2 is OK: puppet ran at Mon Jun 10 20:47:01 UTC 2013 [20:47:13] !log restarting lsearchd on all pool4 search nodes [20:47:22] Logged the message, Master [20:47:24] heh, it's already 399.28494 [20:47:47] RECOVERY - Solr on solr1001 is OK: All OK [20:49:13] huh - it'd be really nice to graph those things we're alerting on so we can see if we just pushed it over the edge or if restarts do it or what. [20:49:54] yep [20:49:57] feel free :P [20:50:08] PROBLEM - search indices - check lucene status page on search1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:51:08] not true, i get results from search1016 with curl [20:52:59] okay, I've dropped it to 270, let's see if "normal" operations will raise the number above the threshold again [20:53:15] mutante how's an20 looking? [20:53:29] 13:09 < mutante> ottomata: so, it's done installing, but i can't really login from sockpuppet, i guess that is also part of the ACL, can you do the puppet re-signing and stuff ? [20:53:51] that shouldn't be part of the ACL [20:53:51] i don't know how you'd handle the puppetmaster part [20:53:58] do you have your own? [20:54:01] no we don't [20:54:04] it should be puppetized as is [20:54:12] just the usual puppet run should work [20:54:18] well, try to connect from sockpuppet [20:54:23] the ACL only whitelists outgoing connections [20:54:27] it's freshly installed [20:54:28] coming in from anywhere should be fine [20:54:30] k [20:54:47] RECOVERY - Puppet freshness on constable is OK: puppet ran at Mon Jun 10 20:54:39 UTC 2013 [20:54:56] should it respond to ping, mutante? [20:55:22] i don't know about the ACL regarding ICMP [20:55:27] PROBLEM - Puppet freshness on constable is CRITICAL: No successful Puppet run in the last 10 hours [20:55:59] naw [20:56:14] ACL only prevents new outgoing connections FROM analytics to others [20:56:16] root@sockpuppet:~# tcptraceroute analytics1020.eqiad.wmnet [20:56:16] incoming is allowed [20:56:27] i can't ping it from inside the cluster anyway [20:57:54] wait, why is it back at BIOS ?
[20:57:58] and now booting [20:58:05] it was all done and sitting at login [20:58:12] and i didn't touch it since then [20:58:38] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:58:44] there you go [20:58:47] RECOVERY - Puppet freshness on wtp1 is OK: puppet ran at Mon Jun 10 20:58:40 UTC 2013 [20:58:47] PROBLEM - Puppet freshness on wtp1 is CRITICAL: No successful Puppet run in the last 10 hours [20:58:47] RECOVERY - Puppet freshness on mexia is OK: puppet ran at Mon Jun 10 20:58:45 UTC 2013 [20:59:37] PROBLEM - Puppet freshness on mexia is CRITICAL: No successful Puppet run in the last 10 hours [20:59:57] RECOVERY - Puppet freshness on lardner is OK: puppet ran at Mon Jun 10 20:59:52 UTC 2013 [21:00:27] PROBLEM - Puppet freshness on lardner is CRITICAL: No successful Puppet run in the last 10 hours [21:00:57] PROBLEM - DPKG on analytics1020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:01:01] !log puppet re-signing analytics1020 [21:01:15] Logged the message, Master [21:02:02] i don't know how you'd handle the puppetmaster part [21:02:12] wrg, wrong clipboard [21:02:17] Server hostname 'sockpuppet.pmtpa.wmnet' did not match server certificate; expected sockpuppet.pmtpa.wmnet [21:02:29] doesn't create a new request [21:02:47] RECOVERY - Puppet freshness on tola is OK: puppet ran at Mon Jun 10 21:02:41 UTC 2013 [21:03:27] PROBLEM - Puppet freshness on tola is CRITICAL: No successful Puppet run in the last 10 hours [21:05:08] paravoid, pong [21:05:16] yurik: nvm [21:05:52] any time :) [21:05:58] heh [21:05:59] sorry [21:06:01] ottomata: fixed the puppet cert, info: Applying configuration version '1370898339' [21:06:18] no worries :) thx for merging btw [21:06:51] ottomata: notice: Finished catalog run in 51.61 seconds [21:07:55] !log dpkg --configure -a on analytics1020 to fix interrupted dpkg [21:08:05] Logged the message, Master [21:08:14] dpkg: error processing libjpeg8 (--configure): Package is in a very bad inconsistent state - you should reinstall it before attempting configuration. [21:08:21] Errors were encountered while processing: ca-certificates-java openjdk-7-jre-lib [21:08:55] ca-certificates-java : Depends: openjdk-6-jre-headless (>= 6b16-1.6.1-2) but it is not going to be installed or [21:08:58] java6-runtime-headless [21:09:01] openjdk-7-jre-lib : Depends: openjdk-7-jre-headless (>= 7~b130~pre0) but it is not going to be installed [21:09:04] the usual jdk crap [21:09:15] holy fuck [21:09:24] I've never seen the "very bad inconsistent state" message [21:09:35] ever, in the 10 or so years I've been managing Debian systems [21:09:39] heh :o [21:09:47] i don't think i have either [21:09:55] not the "very bad" part [21:10:24] right [21:10:51] RECOVERY - DPKG on analytics1020 is OK: All packages OK [21:11:05] wha, we shouldn't be doing jdk7 hm [21:11:09] apt-get -f install to fix it, setting up icedtea-7-jre-jamvm and openjdk-7-jre-lib [21:11:23] Setting up icedtea-7-jre-jamvm (7u21-2.3.9-0ubuntu0.12.04.1) ... [21:11:23] Setting up openjdk-7-jre-lib (7u21-2.3.9-0ubuntu0.12.04.1) ... [21:11:33] speaking of that [21:11:48] WARNING: The following packages cannot be authenticated! kafka-hadoop-consumer [21:11:49] ottomata: so what is the hadoop community using? [21:11:53] Install these packages without verification [y/N]? [21:12:07] why are those being installed?!
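The escape route used here, dpkg --configure -a followed by apt-get -f install, is the standard sequence for getting out of a half-configured package state. A small wrapper sketch; the two commands match the transcript, while the wrapper itself and the -y flag are illustrative:

# Sketch: the standard recovery from a broken dpkg state, as run on
# analytics1020 above. The subprocess wrapper is illustrative; the two
# commands are the ones actually used in the transcript.
import subprocess

RECOVERY = [
    ["dpkg", "--configure", "-a"],       # finish any interrupted configuration
    ["apt-get", "-f", "install", "-y"],  # then resolve broken dependencies
]

def recover():
    for cmd in RECOVERY:
        print("running: %s" % " ".join(cmd))
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Surface dpkg/apt's own complaint and stop; pressing on blindly
            # can make a "very bad inconsistent state" worse.
            print(result.stderr)
            return False
    return True

if __name__ == "__main__":
    raise SystemExit(0 if recover() else 1)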
[21:12:14] i haven't set that up yet [21:12:24] i don't know, all i'm trying is to give it to you in an "installed and dpkg not broken" state [21:12:25] i wanted a base install before I did that [21:12:27] ghrrr [21:12:28] heheh [21:12:28] thanks [21:12:30] hm [21:12:46] runs apt-get upgrade as well [21:13:00] but do you want kafka-hadoop-consumer? [21:13:06] nope [21:13:19] that package shouldn't even be installable [21:13:21] it's not in our apt [21:13:26] hrm, then it says Thanks and doesn't install any other upgrade either .. hrmm [21:13:34] PROBLEM - Parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:13:36] hmmm [21:14:05] ii kafka-hadoop-consumer 0.1.0 A Kafka Hadoop Consumer that uses ZooKeeper to keep track of Kafka brokers and consumption offset. [21:14:10] somehow it got on there though [21:14:26] removes it [21:14:39] all that sounds weird to me? [21:14:40] um [21:14:41] reinstall? [21:14:48] root@analytics1020:/etc/apt/sources.list.d# ls [21:14:48] cdh4.list kraken.list wikimedia.list [21:14:53] lol, this is right after reinstall [21:14:54] this should have been a fresh reinstall [21:15:07] why are these apt sources here after a reinstall?? [21:15:44] did the reinstall actually work :? [21:15:59] yeah [21:16:00] mutante [21:16:05] the reinstall didn't work, for sure [21:16:09] root@analytics1020:/etc/apt/sources.list.d# service hadoop-yarn-nodemanager status [21:16:09] * Hadoop nodemanager is running [21:16:11] i saw the installer run [21:16:23] i saw it create a filesystem [21:16:42] is the mgmt IP wrong or something? [21:16:45] maybe on a diff partition? [21:16:48] don't think so [21:16:57] i got kicked out of my ssh session on power cycle [21:17:22] why does a running service mean it didn't reinstall? [21:17:28] i ran puppet [21:18:02] welp, i guess it could be, since this is what i was about to puppetize, but none of these configs are actually applied in puppet yet [21:18:05] was going to do a base install [21:18:07] but [21:18:14] the /etc/apt/sources.list thing is crazier [21:18:16] because [21:18:21] no way those files are installed by puppet [21:18:26] what do you mean "doing a base install"? you wanted to change site.pp first? [21:18:31] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [21:18:32] how about this [21:18:33] -rw-r----- 1 syslog adm 7055 Jun 4 06:25 syslog.7.gz [21:18:46] root@analytics1020:/var/log# ls -l /var/log/syslog.7.gz [21:18:46] -rw-r----- 1 syslog adm 7055 Jun 4 06:25 /var/log/syslog.7.gz [21:18:49] June 4 date on that file [21:18:56] PROBLEM - DPKG on analytics1020 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:19:10] mutante: i mean that i just wanted role::analytics applied on this node [21:19:12] no hadoop stuff yet [21:19:14] just user accounts, etc. [21:19:26] and that is what is in site.pp right now [21:19:31] looks at what is applied now [21:19:41] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused [21:19:50] right, but I don't think installer did what it's supposed to do [21:19:55] maybe it installed on a different partition or something? [21:19:59] and this booted from the old one? [21:20:09] there are zipped log files from 6 days ago here [21:20:12] so this can't be a fresh install [21:20:13] right? [21:22:51] RECOVERY - DPKG on analytics1020 is OK: All packages OK [21:25:08] ottomata: are you still having issues ?
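The week-old syslog.7.gz is the smoking gun here: rotated logs cannot survive a real reinstall of the root filesystem. A quick forensic sketch along the same lines; the path and the one-day cutoff are illustrative:

# Sketch: verify a claimed reinstall by looking for files that predate it --
# exactly the /var/log/syslog.7.gz evidence used above. Path and cutoff
# are illustrative placeholders.
import os
import time

def stale_files(root="/var/log", max_age_days=1):
    cutoff = time.time() - max_age_days * 86400
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    yield path
            except OSError:
                pass  # file vanished or unreadable; skip

if __name__ == "__main__":
    found = False
    for path in stale_files():
        found = True
        print("predates the reinstall:", path)
    if found:
        print("conclusion: this filesystem was never wiped")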
[21:25:55] yes but not related to the ACL thing, we think [21:26:02] mutante actually got the installer to run [21:26:15] buuuut, it looks like it wasn't actually reinstalled [21:26:18] the machine still has all the crap from before [21:26:53] New review: GWicke; "Tested this on a labs VM, where it seems to be doing the right thing:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67861 [21:27:41] PROBLEM - RAID on analytics1020 is CRITICAL: Timeout while attempting connection [21:28:28] 12 Non-RAID Disk(s) found on the host adapter [21:28:28] New review: GWicke; "Argh, Gerrit ate the message.. The ab commandline is:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67861 [21:29:01] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [21:29:42] 35 analytics101[1-9]|analytics102[0-2]) echo partman/raid1-30G.cfg ;; \ [21:30:31] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.011 second response time [21:30:43] per partman recipe it's supposed to use the first 2 of those 12 disks [21:30:48] and no clue about the other 10 [21:30:57] so they wouldn't be touched by reinstall [21:31:09] yea, it's booting again, not getting to installer [21:31:16] i don't know what's going on here [21:31:37] you should try again [21:31:41] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [21:34:11] growl [21:34:21] yeah, the other 10 shouldn't be touched [21:34:26] we partition those manually for hadoop stuff [21:34:32] PROBLEM - Parsoid on wtp1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:34:46] ok, so mutante, you want me to try PXE booting again? [21:34:48] i should powercycle? [21:38:11] anyone know why bast1001 is in the mediawiki-installation dsh group? [21:39:03] presumably so it could be used somewhat like fenari at some point [21:39:07] running mw shell scripts etc [21:39:29] New patchset: coren; "Tool Labs: Use the newfangled local packages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67889 [21:39:39] that was probably pre-tin though, yeah? [21:39:40] PROBLEM - Parsoid on wtp1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:39:40] PROBLEM - RAID on mc15 is CRITICAL: Timeout while attempting connection [21:39:40] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.008 second response time [21:39:46] I'd say so, yeah [21:40:34] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [21:40:38] mutante^^, I should try PXE booting now? [21:40:41] New review: coren; "Simple package additions." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/67889 [21:40:42] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67889 [21:41:27] New patchset: Asher; "removing bast1001 from mediawiki-installation dsh group" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67891 [21:41:43] PROBLEM - Parsoid on wtp1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:42:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67891 [21:45:22] binasher: yay thanks for removing [21:47:50] PROBLEM - Parsoid on wtp1003 is CRITICAL: Connection refused [21:49:06] New review: coren; "Check inline comments, I'd rather not install ftp or telnet."
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/67055 [21:49:50] PROBLEM - Parsoid on wtp1010 is CRITICAL: Connection refused [21:50:32] Change abandoned: coren; "Made moot by https://gerrit.wikimedia.org/r/#/c/67055/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66273 [21:50:40] PROBLEM - Parsoid on wtp1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:50:59] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66328 [21:52:00] PROBLEM - Parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:53:30] RECOVERY - Parsoid on wtp1017 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [21:53:30] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [21:53:30] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [21:53:30] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.005 second response time [21:53:30] RECOVERY - Parsoid on wtp1014 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [21:53:37] New patchset: Andrew Bogott; "pep8 cleanup for udpprofile" [operations/software] (master) - https://gerrit.wikimedia.org/r/67893 [21:53:50] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time [21:53:50] RECOVERY - Parsoid on wtp1003 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.004 second response time [21:53:50] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time [21:54:10] !log updated Parsoid to 6e40256821 [21:54:17] Logged the message, Master [22:00:00] New review: coren; "This should be put in the misctools package from labs/toollabs rather." [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/66266 [22:01:27] New review: coren; "Moar statistics!" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/64511 [22:01:28] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/64511 [22:01:46] New patchset: Andrew Bogott; "pep8 cleanup for udpprofile" [operations/software] (master) - https://gerrit.wikimedia.org/r/67893 [22:02:06] ottomata: re.. yes please, and i just got disconnected [22:04:47] binasher: I have a review request.. [22:05:00] https://gerrit.wikimedia.org/r/#/c/67861/ [22:05:03] ok [22:05:13] k [22:05:53] binasher: that patch is tested in labs and only affects Parsoid [22:07:24] gwicke: looks fine, will merge [22:07:46] binasher: cool, thanks! [22:07:50] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67861 [22:08:05] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [22:09:15] gwicke: merged on the puppetmaster [22:09:53] binasher: thanks! 
Will ask Roan to restart the varnishes then (auto restart seems to be broken currently) [22:09:56] !log restarting lucene on search prefix hosts [22:10:05] Logged the message, Master [22:10:52] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.40 ms [22:10:59] gwicke: On it [22:11:10] RoanKattouw: thanks ;) [22:12:19] New patchset: Andrew Bogott; "pep8 cleanup of swiftrepl" [operations/software] (master) - https://gerrit.wikimedia.org/r/67898 [22:13:57] gwicke: Ran puppet and reloaded Varnish [22:14:22] PROBLEM - search indices - check lucene status page on search1018 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern found - 60534 bytes in 0.025 second response time [22:14:30] heh, need somebody from Greece :) [22:15:21] meh, i already see it doesn't seem to work.. search on el.wikivoyage [22:16:01] like last time, i can import the db's to search indexer and restart things but doesn't work yet.. or i just have to wait longer [22:17:28] New patchset: Tim Landscheidt; "Fix Puppet path to gridengine file." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67899 [22:24:20] New patchset: Andrew Bogott; "Pep8 cleanup for swiftcleaner and fwconfigtool." [operations/software] (master) - https://gerrit.wikimedia.org/r/67903 [22:24:44] New patchset: Asher; "adding restart-twemproxy script to scap" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67904 [22:25:23] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67904 [22:26:40] New patchset: Andrew Bogott; "Pep8 cleanup for swiftcleaner and fwconfigtool." [operations/software] (master) - https://gerrit.wikimedia.org/r/67903 [22:30:49] Reedy or someone around to deploy hotfix for us? [22:30:50] https://gerrit.wikimedia.org/r/#/c/67906/ [22:31:01] please? :) [22:31:02] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [22:31:45] greg-g: ^ [22:33:00] New review: coren; "That should do the trick." [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/67899 [22:33:01] Change merged: coren; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67899 [22:33:11] growl, mutante, you still around? [22:33:52] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [22:34:40] robla: is someone around who can deploy https://gerrit.wikimedia.org/r/#/c/67906/ for us? [22:34:57] it's kind of late for reedy [22:34:58] ottomata: yea, but kind of busy [22:35:08] hmmk [22:35:19] stupid booger an20 still won't pxe boot for me [22:35:49] ottomata: remove from netboot.cfg and see if you end up in installer with manual disk setup or not [22:36:17] that way at least we know which disks it tries to use [22:36:32] RoanKattouw: AaronSchulz ? [22:36:56] * RoanKattouw flees [22:37:00] :) [22:37:04] lol [22:37:07] You say that [22:37:08] if you have a minute, we need https://gerrit.wikimedia.org/r/#/c/67906/ [22:37:10] And it's later still for you ;) [22:37:12] it's reedy! [22:37:13] (But seriously, VE deployment in 3 days) [22:37:19] \o/ [22:37:20] ok, mutante, can I try that on brewster directly or do I need to commit to puppet? [22:37:24] serious? [22:38:30] ottomata: you could hack on brewster if you also stop puppet, but seems almost as much work as just submitting..
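For background, netboot.cfg picks a partman recipe per hostname using shell case patterns, like the "analytics101[1-9]|analytics102[0-2]) echo partman/raid1-30G.cfg ;;" entry quoted earlier; removing a host's entry is what drops the installer into manual disk setup. A Python re-expression of that dispatch using fnmatch, for illustration only:

# Sketch: the hostname -> partman recipe dispatch netboot.cfg performs with
# shell case patterns, re-expressed with fnmatch. The single entry mirrors
# the analytics line quoted earlier; illustrative, not the real file.
from fnmatch import fnmatch

RECIPES = [
    ("analytics101[1-9]|analytics102[0-2]", "partman/raid1-30G.cfg"),
]

def recipe_for(host):
    for patterns, recipe in RECIPES:
        if any(fnmatch(host, p) for p in patterns.split("|")):
            return recipe
    return None  # no entry: the installer drops to manual disk setup

if __name__ == "__main__":
    print(recipe_for("analytics1020"))  # partman/raid1-30G.cfg
    print(recipe_for("mw1020"))         # None -> manual partitioning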
[22:38:35] hm, ok [22:39:14] hmm, wait, but mutante, i haven't even seen the installer yet [22:39:27] the contents of netboot.cfg shouldn't make a difference for that [22:39:31] every time I reboot I just get the OS [22:40:11] i really saw the installer, that one time you were also on the console [22:40:49] PROBLEM - RAID on analytics1020 is CRITICAL: Timeout while attempting connection [22:41:04] it seems really strange, but remember how lucid worked but precise didn't, that other time [22:41:24] i think that was pure coincidence when we kept retrying [22:41:31] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [22:41:45] yeah, but hm, notpeter and binasher were saying that was because those machines we were trying were udp2log hosts [22:41:50] so they were being beamed a huge udp stream [22:41:59] PROBLEM - Solr on vanadium is CRITICAL: Average request time is 1011.0996 (gt 1000) [22:42:01] which was interfering with network stuff [22:42:09] that's not the case this time [22:42:32] can you still see requests from it on brewster now? [22:43:14] uhm, lemme check [22:44:09] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [22:46:44] does this mean anything? [22:46:45] mutante? [22:46:45] 0 Virtual Drive(s) found on the host adapter. [22:46:45] 0 Virtual Drive(s) handled by BIOS [22:46:51] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [22:46:59] RECOVERY - Solr on vanadium is OK: All OK [22:47:46] !log reedy synchronized php-1.22wmf5/extensions/Wikibase/ [22:47:52] ottomata: means you don't have RAID set up [22:47:54] Logged the message, Master [22:48:09] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [22:48:11] that's from the beginning of the PXE boot attempt [22:48:16] mw1171 is back online again? [22:48:40] and mw1020 has been reinstalled (?) [22:48:41] mw1020: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @ [22:48:41] etc [22:48:43] ottomata: since you are trying to have software raid afaik, this is expected [22:49:20] !log reedy synchronized php-1.22wmf6/extensions/Wikibase/ [22:49:28] Logged the message, Master [22:49:41] even on PXE boot? i shouldn't see that unless it's trying to read the disks, right? [22:49:52] and, i do not see anything in brewster syslog [22:49:58] (that's where i should be looking, right?) [22:49:59] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [22:50:02] do you know if you have a hardware raid controller and want to use it? [22:50:07] it is software [22:50:09] and no [22:50:11] :) [22:50:15] LeslieCarr: maybe the ACL fix didn't take? [22:50:32] then you shouldn't have to worry about "0 Virtual Drives" at boot [22:50:35] k [22:50:51] was wondering if that meant '0 virtual netboot drives something blabla' [22:51:06] so the acl was in there and should be correct [22:52:11] ottomata: i do not see the MAC address of it in brewster's syslog, ack [22:52:14] hm. [22:52:14] yeah [22:52:16] me neither [22:52:36] but i did see the damn installer earlier [22:52:54] it was a mirrage [22:52:56] mirage* [22:52:58] this keeps being an issue way too much [22:53:01] i know right! [22:53:12] is it on other servers too? or just the ones I need to work on?
:p [22:53:25] it's always something [22:53:35] just different enough to not show a pattern :p [22:53:41] but similar enough to feel familiar [22:55:39] gah [22:55:46] so if it gets through sometimes it's not the acl [22:55:50] since it would fail every time [22:56:41] computers + sometimes ..grrr [22:58:10] now i am on the console and i can "connect com2" or "console com2" and it neither shows it to me, nor displays an error [22:58:26] i'm in console still [22:58:32] try now [22:58:39] do it one more time, i'm watching brewster [22:58:44] ok [22:59:46] ottomata: what's the history of this server? did it work before? has BIOS been touched or hardware in any way? why reinstall now [22:59:51] PROBLEM - RAID on analytics1020 is CRITICAL: Timeout while attempting connection [22:59:58] New review: Faidon; "Can't you just install these manually in labs until we create packages? Alternatively, I can offer t..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67120 [23:00:45] ahha [23:00:50] i see now [23:00:57] the filter doesn't allow vrrp packets [23:01:09] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [23:01:30] mutante: they've been installed once, afaik, bios has not been touched. we are reinstalling now because they were puppetized by my old analytics puppetmaster and had things done to them not in ops/puppet [23:01:39] after reinstall they will be 100% ops/puppet puppetized [23:02:56] ottomata: hah, look at this https://rt.wikimedia.org/Ticket/Display.html?id=3429 [23:03:09] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [23:03:10] ha, hmmm [23:03:34] ok, LeslieCarr, so should we try again in a minute? [23:03:37] ok, got the install .. looks like https://rt.wikimedia.org/Ticket/Display.html?id=3367 was it [23:03:46] but no details [23:04:02] well, it says / is RAID 1 on disk 1/2 [23:04:07] that matches the partman setup [23:05:03] in a minute :) [23:05:29] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [23:08:52] cmjohnson1: ping? [23:09:17] Change abandoned: Demon; "That's what I did for now. Helping with packaging would be most appreciated." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67120 [23:10:23] damn, 2h away [23:10:33] I'm too late [23:14:23] <^demon> paravoid: Thanks for the offer to help package hhvm. [23:14:30] ottomata: need to go for now, ttyl [23:14:32] of course [23:14:54] ok, thanks so much! [23:15:08] mutante, hopefully whatever Leslie is doing will fix this and it will all work :) [23:15:35] trying to figure out why vrrp is not being nice [23:16:08] oh duh [23:16:10] AH [23:18:29] ottomata: ok, try now [23:18:35] that would also explain intermittent [23:18:49] honestly that makes me a bit surprised there weren't other network issues [23:18:57] hmm, what's the prob? [23:19:11] there are still issues with ganglia + icinga [23:19:16] not sure if that is related [23:19:28] icinga is still reporting STALE values from ganglia [23:19:45] both routers claiming to be the default gw [23:19:49] hmmm [23:20:34] New patchset: Hazard-SJ; "(bug 29902) Tidied up CommonSettings and InitialiseSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65860 [23:21:31] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [23:23:33] dohp, hm LeslieCarr, no change.
still booted OS, nothing on brewster [23:24:01] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [23:24:20] hrm [23:24:20] why is gadolinium trying to send to 233.58.59.1 ? [23:24:29] lots of udp [23:25:17] multicast? [23:25:22] yeah [23:25:35] gadolinium is the multicast webrequest relay [23:25:41] frontends all send to it [23:25:50] it has a socat multicast relay on it [23:26:08] ah, it's all being rejected fyi [23:26:12] ? [23:26:18] on analytics cluster? [23:26:18] to that specific address [23:26:22] yeah [23:27:00] wait, at what point is it being rejected? [23:27:21] at the router [23:28:12] i can join the group on analytics1026 [23:28:14] and get plenty of data [23:30:10] LeslieCarr: ^ [23:30:43] outside the vlan is not getting data though, right ? [23:31:28] sure, it's being used both by analytics and udp2log machines [23:31:32] oxygen, for instance, [23:32:48] !log disabling asw-c-eqiad ge-4/0/7, enabling ge-4/0/6, moving description ms-fe1004 from one to the other and fixing VLANs [23:32:58] Logged the message, Master [23:36:49] hrm, interesting that ip is getting data from oxygen when i see all the drops [23:37:01] i mean oxygen is getting data from that ip [23:37:04] hm [23:37:06] lemme check oxygen... [23:37:11] ah yep [23:37:14] i see lots of traffic [23:37:48] yeah [23:44:45] New patchset: Faidon; "Ceph: switch 3rd monitor ms-be1005 -> ms-fe1004" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67926 [23:45:35] New patchset: Hoo man; "Run db list tests for Wikivoyage as well" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/67927 [23:45:42] LeslieCarr: are we stumped? :) [23:45:42] New patchset: Faidon; "Ceph: switch 3rd monitor ms-be1005 -> ms-fe1004" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67926 [23:45:51] i am a bit stumped [23:46:06] what's up guys and girls? [23:46:17] hiya paravoid [23:46:26] welp, i'm trying to PXE boot and reinstall analytics1020 [23:46:41] i have not yet seen the installer, and when I PXE boot I don't see anything relevant in /var/log/syslog on brewster [23:46:42] okay [23:46:42] New patchset: Faidon; "Ceph: switch 3rd monitor ms-be1005 -> ms-fe1004" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67926 [23:46:45] my next step is to tcpdump on the port [23:46:54] on brewster? [23:46:58] did you have udp2log streams going there? [23:47:01] mutante says he got the installer once [23:47:02] on the switch [23:47:03] no [23:47:09] aye k [23:47:16] but I've never seen it [23:47:24] so mutante says he went through the installer [23:47:30] is it on the same lan as oxygen? [23:47:33] but when the machine came back up, it was the same as before [23:47:38] no, [23:47:45] want to reboot analytics1020 ? i'm tcpdumping on the switch [23:47:45] or [23:47:46] is it on a subnet where a multicast stream is present? [23:47:47] don't think so [23:47:48] sure [23:47:54] udp2log stream that is? [23:48:06] hmm [23:48:09] yes..... [23:48:12] there's an issue that notpeter was facing when reinstalling oxygen [23:48:20] lemme double check, there are 2 subnets, but i'm not sure the difference [23:48:23] the multicast stream floods the port and fills PXE's buffers [23:48:28] ah [23:48:42] and PXE TFTP fails [23:48:54] LeslieCarr, PXE booting in one sec...
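The manual check being done here, watching brewster's syslog for the machine's MAC to show up when it PXE boots, is easy to script. A sketch with a placeholder MAC and log path:

# Sketch: follow a syslog and flag DHCP lines carrying a given MAC -- the
# "do I see its MAC in brewster's syslog" check done by hand above.
# MAC and LOG are hypothetical placeholders.
import time

MAC = "aa:bb:cc:dd:ee:ff"
LOG = "/var/log/syslog"

def follow(path):
    with open(path) as fh:
        fh.seek(0, 2)  # start at the end of the file, like tail -f
        while True:
            line = fh.readline()
            if line:
                yield line
            else:
                time.sleep(0.5)

if __name__ == "__main__":
    for line in follow(LOG):
        if MAC in line.lower() and "DHCP" in line.upper():
            print("PXE/DHCP request reached this host:", line.strip())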
[23:48:57] interesting [23:49:25] tftp is basically random UDP ports [23:49:36] so maybe the firmware is doing strange things [23:49:37] attempting PXE boot now [23:49:54] New patchset: Faidon; "Ceph: switch 3rd monitor ms-be1005 -> ms-fe1004" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67926 [23:50:25] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/67926 [23:50:37] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [23:51:06] so, there are no regular udp2log instances joining the multicast group on 10.64.36.0/24 [23:51:12] that is an20's subnet [23:51:27] but there are some on 10.64.21.0/24 [23:51:31] which is the other analytics subnet [23:51:45] but, sometimes I use an26 (on an22's subnet) to test multicast stuff [23:51:50] so occasionally I join the group there [23:53:06] ok LeslieCarr, no PXE boot, it just booted OS [23:53:07] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [23:53:14] nothing on brewster logs, did you see anything in tcpdump? [23:53:26] i saw the dhcp requests [23:53:39] ok, that's interesting then, I didn't see them on brewster... [23:53:52] analytics acls? [23:53:54] from the analytics1020 [23:53:59] i think it's not in the common-infrastructure [23:54:05] which really confuses me as to how it was working sometimes [23:54:11] what, tftp? [23:54:16] or that node in particular? [23:54:28] on the analytics subnet in particular [23:54:52] tftp on analytics subnet? this is the first time we've tried to reinstall an analytics node since the ACLs went up [23:55:10] mark said he did something to fix the tftp issue in the ACL earlier today [23:56:00] let's try again ? [23:56:09] tftp is fixed, but dhcp wasn't i believe [23:56:42] paravoid: are some backends also monitors? [23:56:49] AaronSchulz: not anymore [23:56:54] ok :) [23:56:56] AaronSchulz: as of 10' ago :-) [23:57:11] AaronSchulz: we didn't have any frontend on row C, I asked Chris to move ms-fe1004 today [23:57:32] AaronSchulz: the goal was to have one monitor per row [23:57:48] ok, trying again... [23:58:18] gwicke: fwiw, verified that hash_always_miss always implies hash_ignore_busy, there's no way around it [23:58:37] PROBLEM - RAID on analytics1020 is CRITICAL: Timeout while attempting connection [23:58:48] binasher: too bad, but somewhat understandable [23:58:50] LeslieCarr: attempting PXE boot [23:58:56] thanks for checking! [23:59:25] paravoid: were those solr changes for translate deployed? [23:59:30] AaronSchulz: yes
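The "join the group and see if data arrives" test mentioned above boils down to an IP_ADD_MEMBERSHIP subscription to the udp2log multicast group. A minimal Python receiver for the 233.58.59.1 group seen earlier; the UDP port number is an assumption:

# Sketch: subscribe to the webrequest multicast group (233.58.59.1 above)
# and count packets -- the "can I join the group and get data" test.
# The port number is an assumed placeholder.
import socket
import struct

GROUP = "233.58.59.1"
PORT = 8420

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Join the group on all interfaces (imr_multiaddr + imr_interface = INADDR_ANY).
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

for i in range(10):
    data, addr = sock.recvfrom(65535)
    print("packet %d from %s (%d bytes)" % (i + 1, addr[0], len(data)))
# Any output proves multicast reaches this subnet; silence suggests the
# stream is dropped upstream (e.g. by the router ACL).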