[00:01:43] jdlrobson: How do I run the Mantle unit tests? [00:01:53] They don't appear to register themselves with the MW unit tests? [00:02:39] RoanKattouw: mmm? [00:02:40] * jdlrobson looks [00:02:42] RoanKattouw: thanks for the deploy. I wrote https://www.mediawiki.org/wiki/Extension:Mantle#Developer_features but mostly focused on templates since that's what Flow's using, but it says "object-oriented JavaScript code, similar to OOjs: Class.js, eventemitter, View.js. As used by Extension:MobileFrontend." [00:02:50] Cool thanks [00:02:56] I'm writing a bug report about the event emitter [00:03:00] RoanKattouw: they should do... [00:03:06] But in it I want to claim that OO.EventEmitter is fully compatible [00:03:16] And in order to claim that I want to monkey-patch it into Mantle and run the tests [00:03:32] so that I can say "look I dropped OO.EventEmitter into Mantle and all the tests still pass" [00:03:35] it https://en.wiktionary.org/wiki/warm_the_cockles_of_someone%27s_heart that y'all are talking about this [00:03:45] <3 :) [00:03:53] RoanKattouw: Hm.. 1 month old extension, 25 commits. [00:04:22] Krinkle: A lot of code copied from other repos without history [00:04:42] Doesn't look like production material yet. Ah, beta labs [00:04:49] RoanKattouw: if the tests are not running i'll take a look tomorrow and see what's going on [00:04:55] Krinkle: it's a carbon copy of the code in mobile [00:04:59] that's been there for 2 years [00:05:25] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly [00:05:29] ohh i see what's going on RoanKattouw the tests only run with the mobile target :) [00:05:36] lolwut [00:05:51] I can see how that happened [00:06:01] i think that's what is happening will double check 2mo [00:06:26] OK found it [00:06:28] Testing fix [00:07:35] jdlrobson: I mentioned the mobile target in gerrit. At the time Mantle's RL dependencies needed the mobile stuff but that's fixed now [00:08:10] Hmm [00:08:11] Uncaught Error: Unknown dependency: mobile.tests.base [00:08:29] * jdlrobson facepalms [00:08:36] guess that's what happens when you carbon copy code :) [00:08:39] RoanKattouw: that can be removed [00:08:48] it shouldn't be needed [00:08:57] I thought so [00:08:59] Already trying that [00:09:04] (03CR) 10Springle: [C: 032] allow setting mysql_mode though mysql::config [operations/puppet] - 10https://gerrit.wikimedia.org/r/142162 (owner: 10Rush) [00:09:11] * jdlrobson hopes [00:09:21] actually that may not even be needed in mobile anymore. [00:09:41] OK *now* they run [00:09:44] Let me submit this to Gerrit [00:11:53] thanks RoanKattouw appreciated [00:12:14] RoanKattouw: i assume they pass? :) [00:12:23] Yes they do [00:12:31] Haven't tried with OO yet [00:15:28] (03CR) 10Rush: "removing myself as it seems gtg?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141452 (owner: 10Alexandros Kosiaris) [00:20:54] hahaa [00:21:13] There's a one-character API incompatibility [00:21:16] .one() vs .once() [00:24:04] (03PS10) 10Rush: Phabricator module [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [00:25:14] RoanKattouw, Krinkle : thanks for OOjs' ES3/ES5 support work, that removes an impediment [00:25:46] Dude it works in IE6 now :O [00:25:48] spagewmf: It's (still) not merged, so don't thank them yet. [00:25:48] Allegedly [00:25:50] I was impressed [00:26:13] * Jasper_Deng wonders why RoanKattouw is still on IE6 [00:26:24] Jasper_Deng: Krinkle tests impressively. :-) [00:27:54] James_F : ?
https://gerrit.wikimedia.org/r/#/c/140436/ is merged, bug 63303 is fixed [00:29:02] spagewmf: That makes OOjs parse in MSIE6+. [00:29:04] (03CR) 10Rush: [C: 032] "need this to get going, talked it over with mukunda. seems good. I will sync up with daniel when he is back." [operations/puppet] - 10https://gerrit.wikimedia.org/r/132505 (owner: 10Dzahn) [00:29:59] spagewmf: However, https://gerrit.wikimedia.org/r/#/c/139308/ is the patch to bring the ES5 shim into MW-core so OOjs actually /works/ in production in IE6+. [00:52:34] (03PS1) 10Springle: Expose sql_mode to templates. [operations/puppet/mariadb] - 10https://gerrit.wikimedia.org/r/142174 [00:52:59] (03CR) 10Springle: [C: 032] Expose sql_mode to templates. [operations/puppet/mariadb] - 10https://gerrit.wikimedia.org/r/142174 (owner: 10Springle) [01:01:56] (03PS1) 10Springle: Update mariadb to 6abaf07092a6304a41db456cb8ed177b9a8a214b [operations/puppet] - 10https://gerrit.wikimedia.org/r/142175 [01:02:21] (03PS2) 10Springle: Update mariadb to 6abaf07092a6304a41db456cb8ed177b9a8a214b [operations/puppet] - 10https://gerrit.wikimedia.org/r/142175 [01:04:36] (03CR) 10Springle: [C: 032] Update mariadb to 6abaf07092a6304a41db456cb8ed177b9a8a214b [operations/puppet] - 10https://gerrit.wikimedia.org/r/142175 (owner: 10Springle) [01:08:20] for wikimedia git stuff, we make branches right? [01:08:27] (03PS1) 10Springle: m3/phabricator role with strict sql_mode [operations/puppet] - 10https://gerrit.wikimedia.org/r/142178 [01:08:28] or is that taken care of by git review? [01:09:57] dogeydogey: gerrit will create a branch, yes [01:11:27] dogeydogey: The expected workflow is you create a local branch off master, commit your code into that branch, then run git review [01:11:44] git review manages uploading and downloading of changes but does not manage your branches beyond that [01:11:51] So it doesn't e.g. create a branch for you [01:12:11] (unless you said "please download change XYZ" in which case it'll create a branch called review/nameofauthor/XYZ to put it in) [01:15:06] (03PS2) 10Springle: m3/phabricator role with strict sql_mode [operations/puppet] - 10https://gerrit.wikimedia.org/r/142178 [01:18:23] (03PS3) 10Springle: m3/phabricator role with strict sql_mode [operations/puppet] - 10https://gerrit.wikimedia.org/r/142178 [01:20:17] (03CR) 10Springle: [C: 032] m3/phabricator role with strict sql_mode [operations/puppet] - 10https://gerrit.wikimedia.org/r/142178 (owner: 10Springle) [01:21:48] RoanKattouw usually when I commit a branch for review I do git push master branch [01:22:01] but the wiki just says git commit -a then git review [01:22:13] kinda confused where I push the branch [01:25:38] git review pushes it for you [01:25:41] To a special place [01:25:48] ah, got it [01:26:01] If you really want to you can run git push origin HEAD:refs/for/master instead but that's hard to remember which is why we have tools [01:26:19] (and also sometimes needs to be different in some cases, which is why we have a config file for the tool, in .gitreview) [01:26:22] <^demon|away> Hard to remember? [01:26:23] <^demon|away> {{cn}} [01:46:27] hey the rakefile says "Welcome #{ENV['USER']} in WMF wonderful rake helper to play with puppet." [01:46:41] Should it be "Welcome #{ENV['USER']} to WMF's wonderful rake helper to play with puppet."? [01:46:43] or what? [01:46:48] it doesn't make sense to e [01:46:51] me [01:58:05] (03PS1) 10Springle: Support monitoring single and multi-source replication configurations.
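
The Gerrit workflow RoanKattouw lays out above, as a minimal shell sketch (the branch name and change number are made up):

    git checkout -b my-fix origin/master   # local branch off master
    # ... edit files ...
    git commit -a                          # commit into that branch
    git review                             # uploads for review; roughly equivalent to
                                           # git push origin HEAD:refs/for/master

    git review -d 12345                    # download change 12345 instead; creates a
                                           # branch like review/nameofauthor/12345
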
[operations/puppet/mariadb] - 10https://gerrit.wikimedia.org/r/142183 [02:09:55] (03PS1) 10Scottlee: Fixed text formatting and grammar. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142186 [02:11:33] anyways, I pushed [02:11:39] someone please let me know if they see it [02:31:34] (03PS1) 10Springle: Expose read_only parameter to templates. [operations/puppet/mariadb] - 10https://gerrit.wikimedia.org/r/142187 [02:32:54] !log LocalisationUpdate completed (1.24wmf9) at 2014-06-26 02:31:50+00:00 [02:33:02] Logged the message, Master [02:39:54] (03CR) 10Springle: [C: 032] Expose read_only parameter to templates. [operations/puppet/mariadb] - 10https://gerrit.wikimedia.org/r/142187 (owner: 10Springle) [02:42:27] (03PS1) 10Springle: master/slave configuration for m3/phabricator [operations/puppet] - 10https://gerrit.wikimedia.org/r/142188 [02:43:49] (03CR) 10Springle: [C: 04-1] master/slave configuration for m3/phabricator (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142188 (owner: 10Springle) [03:02:46] !log LocalisationUpdate completed (1.24wmf10) at 2014-06-26 03:01:43+00:00 [03:02:51] Logged the message, Master [03:34:25] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 26 03:33:19 UTC 2014 (duration 33m 18s) [03:34:30] Logged the message, Master [04:09:45] PROBLEM - Puppet freshness on mw1014 is CRITICAL: Last successful Puppet run was Thu 26 Jun 2014 01:08:37 UTC [04:09:58] hmm [04:10:00] i'll look at mw1014 [04:10:39] not something that i've seen before: [04:10:46] Error: /Stage[main]/Sysctl/Exec[update_sysctl]: Could not evaluate: Read-only file system - /tmp/puppet20140626-4328-y3jal2-0 [04:10:46] Error: Could not get latest version: cannot generate tempfile `/tmp/puppet20140626-4328-1tc7kzy-9' [04:10:46] Error: /Stage[main]/Base::Monitoring::Host/Package[mpt-status]/ensure: change from 1.2.0-7 to latest failed: Could not get latest version: cannot generate tempfile `/tmp/puppet20140626-4328-1tc7kzy-9' [04:10:48] Notice: Finished catalog run in 70.61 seconds [04:44:56] !log mw1014 is sad, has filesystem issues: "Attempt to read block from filesystem resulted in short read while trying to open /tmp". Puppet can't run. Should be depooled. [04:45:05] Logged the message, Master [04:45:31] ^ springle [05:00:44] mw1014 isn't in fenari pybal [05:00:55] oh vcl? [05:01:10] ori: can't help with that [05:01:23] _joe_: might [05:10:20] (03CR) 10Springle: [C: 032] Additional labsdb federated tables for commonswiki_f_p, each already accessible via direct view on slice s4 commonswiki_p. [operations/software] - 10https://gerrit.wikimedia.org/r/140644 (owner: 10Springle) [05:15:36] (03CR) 10Matanya: "Nice first patch :)" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142186 (owner: 10Scottlee) [05:19:05] (03CR) 10Florianschmidtwelzow: [C: 031] Disable mobile upload CTA on wikisource [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142155 (https://bugzilla.wikimedia.org/66958) (owner: 10MaxSem) [05:25:44] twkozlowski: hey? around? wondering if I can ask for a favor :) [05:36:40] (03PS1) 10Springle: Add missing fa_major_mime field to labsdb views. The filearchive table was vetted by Legal in bug 57697 and fa_sha1 omitted, but not fa_major_mime, plus we already expose fa_minor_mime. [operations/software] - 10https://gerrit.wikimedia.org/r/142197 [05:40:24] (03PS2) 10Springle: Add missing fa_major_mime field to labsdb views. 
The filearchive table was vetted by Legal in bug 57697 and fa_major_mime was not a problem, plus we already expose fa_minor_mime. [operations/software] - 10https://gerrit.wikimedia.org/r/142197 [05:44:08] (03CR) 10Springle: [C: 032] Add missing fa_major_mime field to labsdb views. The filearchive table was vetted by Legal in bug 57697 and fa_major_mime was not a problem, [operations/software] - 10https://gerrit.wikimedia.org/r/142197 (owner: 10Springle) [05:49:05] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:49:06] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.007 second response time [05:50:39] !log on mw1014: stopped job runner due to bad /tmp [05:50:44] Logged the message, Master [05:54:54] !log on mw1014: reformatted the /tmp partition [05:54:56] RECOVERY - Puppet freshness on mw1014 is OK: puppet ran at Thu Jun 26 05:54:53 UTC 2014 [05:54:59] Logged the message, Master [06:26:40] !log ran operations/software maintain-replicas.pl and fedtables.pl on labsdbs for bug 59683 [06:26:44] Logged the message, Master [06:44:01] (03PS27) 10KartikMistry: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [06:45:38] (03CR) 10jenkins-bot: [V: 04-1] cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [06:49:10] eh [06:59:31] (03PS28) 10KartikMistry: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [07:00:59] (03CR) 10jenkins-bot: [V: 04-1] cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [07:01:22] (03PS1) 10Springle: Update coredb topology array which is out of sync after slave moves. Remove role::coredb::m2 as it is no longer in use. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142199 [07:19:21] (03PS6) 10Ori.livneh: Add a lightweight apache::site resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 [07:24:22] YuviPanda: ? [07:35:15] <_joe_> hey someone summoning me? [07:38:05] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 1.505 second response time [07:40:04] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.023 second response time [07:46:46] _joe_: is it permissible now to introduce puppet3-only syntax like <%= scope['::domain'] %> ? [07:47:40] i can't imagine we'd ever downgrade, right? [07:48:00] <_joe_> ori: I hope not [07:48:42] hello [07:48:46] cool [07:48:48] hey hashar [07:48:58] <_joe_> ori: I never used scope[] honestly, what's it for? [07:49:15] it's just syntactic sugar for scope.lookupvar('::domain') [07:49:23] if i were an adult i'd use the backward-compatible syntax [07:49:30] but i am an idiot for syntactic sugar [07:49:31] <_joe_> I'd use scope.lookupvar [07:49:34] everyone has their weaknesses [07:49:40] <_joe_> eheh [07:49:52] how does it relate to @domain ? 
[07:49:56] I never understood the difference [07:50:08] <_joe_> hashar: @domain catches variables in the local scope [07:50:24] <_joe_> that is, variables of the class you're in [07:50:25] @ is an instance var, which is how puppet passes local scope to erb [07:50:30] <_joe_> and node-scope variables [07:50:44] <_joe_> which are inherited in the local scope [07:50:52] because puppet is crazy [07:50:59] <_joe_> ori: every time I explain these things I do get how much puppet sucks [07:51:03] but the global variables aren't inherited/available right. [07:51:09] <_joe_> hashar: right [07:51:12] \O/ [07:51:24] <_joe_> hashar: any variable from another namespace should be looked up [07:53:11] puppet is like mcdonald's, if you stick to familiar menu items and avoid looking in the kitchen you're ok [07:53:27] that is the hardest part of puppet to explain, it took me a few days to pass it on to one of my teammates. [07:53:29] <_joe_> ori: well, I beg to differ [07:53:45] <_joe_> ori: I do agree that puppet is like mcdonalds [07:54:11] <_joe_> it is made with shit, tastes like it, still in some cases you've got no better option for your money [07:54:19] haha [07:54:21] well put [07:54:30] * _joe_ hates mcdonalds, loves hamburgers [07:55:06] <_joe_> and yesterday I discovered a place where you can eat a decent pastrami in Rome \o/ [07:58:17] <_joe_> ori: while you're here, I must add apc monitoring to the mediawiki webservers; I think the sensible thing to do would be to add a class under mediawiki::monitoring [07:59:01] kinda sorta. i'd like to get rid of the monitoring/ subdirectory and replace it with monitoring.pp [07:59:10] <_joe_> I disagree [07:59:14] <_joe_> strongly, also [07:59:21] _joe_: if you are in burger there is a very good burger place next to the SF office. Super Duper Burger iirc [07:59:32] i introduced monitoring/ so former me agrees [07:59:40] <_joe_> for instance, apc monitoring makes sense only on web servers [07:59:41] the only monitoring in monitoring/ is a hack that reports mediawiki errors to ganglia [07:59:52] <_joe_> ori: now we'll have good use for it [07:59:55] which should be rewritten as a python module [08:00:14] <_joe_> ori: but I'm a fan of small small puppet classes well organized in hierarchies [08:00:19] <_joe_> hashar: noted. 
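
A quick sketch of the three lookup styles discussed above, as a throwaway ERB fragment written from the shell (the file name is made up, and none of these resolve outside a real Puppet catalog run):

    cat > /tmp/scope-demo.erb <<'ERB'
    local/node scope only:            <%= @domain %>
    explicit lookup, any namespace:   <%= scope.lookupvar('::domain') %>
    puppet3-only syntactic sugar:     <%= scope['::domain'] %>
    ERB
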
[08:00:25] _joe_: don't get me wrong, having multiple classes is fine, but i just prefer having small classes in single files [08:00:33] module autoload layout be damned [08:00:37] <_joe_> hashar: given good enough burgers, I can be like man vs food [08:00:48] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Thu 26 Jun 2014 04:59:47 UTC [08:00:48] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Thu Jun 26 08:00:42 UTC 2014 [08:01:05] <_joe_> ori: I think the autoload layout is about one of the few things they got right [08:01:10] <_joe_> it resembles python's one [08:01:13] <_joe_> sorta [08:01:28] i think imports are one of the few things python got wrong :P [08:01:28] <_joe_> but again, it's a matter of taste :) [08:01:35] <_joe_> eheh [08:01:54] <_joe_> ori: OTOH, puppet will remove import in puppet 4 [08:01:55] I like python import [08:01:59] though it is sometime confusing [08:02:10] I had the issue of import jenkins giving me a lame jenkins object [08:02:14] surely better than unicode handling [08:02:14] that ganglia script (its output is referenced in the channel topic) really should be rewritten to report to graphite [08:02:24] it is all because my script was named jenkins.py so it self loaded instead of the module I wanted :/ [08:02:34] i have an rt for that [08:02:34] <_joe_> Nemo_bis: well, is there one language that handles unicode well? [08:02:50] <_joe_> ori: RT #? [08:02:52] No idea :) [08:03:06] https://rt.wikimedia.org/Ticket/Display.html?id=7699 [08:03:25] <_joe_> Nemo_bis: well, Erlang kind of does. All strings are bytestrings and you're done :) [08:03:57] ascii was good enough for jesus, etc. [08:04:18] <_joe_> joking of course, but my point is: the only languages I know that do not suck at unicode are the ones that surrendered any possibility to handle it [08:04:36] <_joe_> ori: well jesus most probably used either hebrew or lating [08:04:37] golang does all right [08:04:39] <_joe_> *latin [08:04:55] greek or aramaic actually! [08:04:59] <_joe_> ori: never coded in go [08:05:00] JAVASCRIPT IS CLEARLY THE BEST WE SHOULD COMPILE EVERYTHING TO IT [08:05:08] <_joe_> mmmh greek I doubt [08:05:16] <_joe_> YuviPanda: lol [08:05:22] YuviPanda: HELL YEAH UTF-16! sorry I mean UCS-2! [08:05:42] <_joe_> oh but we do have php6 [08:05:55] <_joe_> that will handle utf-16 well [08:05:59] <_joe_> oh, wait... [08:06:15] ori: android apps segfault if there's any codepoint outside of BMP in any string [08:06:17] http://en.wikipedia.org/wiki/Historical_Jesus#Language.2C_race_and_appearance [08:06:28] "The language spoken in Galilee and Judea during the 1st century amongst the common people was most frequently the Semitic Aramaic tongue. The Hebrew language was spoken by those educated in the scriptures and Greek was spoken by the upper class. Aramaic was the predominant language." [08:06:43] <_joe_> ori: Aramaic, right [08:07:28] <_joe_> I should've remembered mel gibson's snuff movie on the topic [08:07:41] haha [08:07:46] <_joe_> they speak aramaic there, and hollywood can't be wrong [08:07:50] you're on a roll tonight [08:08:07] i nearly woke up the house laughing [08:08:47] <_joe_> bd808|BUFFER: the apc stats script is neat :) [08:14:33] <_joe_> YuviPanda: are you still here? [08:14:41] _joe_: hey! yup [08:14:59] <_joe_> YuviPanda: you cherry-picked the PFS change for labs proxy, correct? [08:15:10] _joe_: nope, am waiting for it to get merged. [08:15:17] _joe_: it doesn't have its own puppetmaster... 
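
The cherry-pick being discussed here boils down to fetching the change ref straight from Gerrit; a sketch, assuming patchset 4 of change 132393 and a local operations/puppet checkout:

    # ref layout: refs/changes/<last two digits>/<change number>/<patchset>
    git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/93/132393/4
    git cherry-pick FETCH_HEAD
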
[08:15:25] _joe_: so I'll have to copy those settings into the proxy's config [08:15:29] <_joe_> mh so maybe beta? [08:15:47] <_joe_> someone did cherry-pick the patch [08:16:15] <_joe_> YuviPanda: that was you :) "We are using a copy of this for the labs proxy as well, so I'll keep an eye on this and copy things over again when this gets merged." [08:16:23] _joe_: indeed! :) [08:16:35] <_joe_> YuviPanda: if you can copy the revised version I posted, that would be great [08:16:44] _joe_: sure, I'll copy and post a patch! :) [08:16:57] <_joe_> https://gerrit.wikimedia.org/r/#/c/132393/ [08:28:17] so many mails [08:28:25] _joe_: got time to pair up on the zuul puppet changes? :D [08:30:12] <_joe_> hashar: give me some time sorry [08:30:39] I will break jenkins meanwhile :] [08:33:09] hashar: hey, I'm available too if _joe_ can't [08:35:12] (03PS7) 10Ori.livneh: Add a lightweight apache::site resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 [08:35:14] (03PS1) 10Ori.livneh: role::deployment: port apache::vhost to apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/142205 [08:35:16] (03PS1) 10Ori.livneh: mediawiki_singlenode: port apache::vhost to apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/142206 [08:37:06] godog: sure :-D [08:37:24] (03CR) 10Nikerabbit: cxserver configuration for beta labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [08:37:59] <_joe_> godog: oh so you're the new official zuul expert! [08:38:08] that is mean :-D [08:39:04] <_joe_> hashar: joking, only mark is allowed to give the "official expert" badge to ops [08:39:24] <_joe_> (which translates to, you will deal with that) [08:39:37] (03PS1) 10Yuvipanda: dynamicproxy: Tweak SSL config to match prod [operations/puppet] - 10https://gerrit.wikimedia.org/r/142207 [08:39:39] (03PS1) 10Yuvipanda: toollabs: Tweak SSL config to match prod [operations/puppet] - 10https://gerrit.wikimedia.org/r/142208 [08:39:46] _joe_: ^ [08:39:49] <_joe_> YuviPanda: \o/ [08:40:01] _joe_: first one is for the proxy that powers most of *.wmflabs.org except tools.wmflabs.org, second one is for that [08:40:13] _joe_: should we wait until your patch is merged, or merge one of these to test? :) [08:40:36] +1 to test labs now [08:40:50] <_joe_> that was my idea honestly :) [08:41:21] _joe_: heh :D first merge dynamicproxy? [08:41:47] godog: how do you want to proceed? Should we just hammer our keyboard there or wanna use modern video :D [08:41:47] <_joe_> YuviPanda: yes, but I'm deep in mediawiki's apache config [08:41:49] (03CR) 10Ori.livneh: "I ported three of the six apache::vhost resources to apache::site in the two follow-up patches" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 (owner: 10Ori.livneh) [08:42:11] _joe_: sure. whenever you want to :) I'll be around for a while... [08:42:30] godog: most patches are trivial enough. One might need a bit more attention and talk about puppet templating designing [08:43:42] hashar: haha I guess we can push for the trivial ones first [08:44:37] <_joe_> sigh, why do we use NameVirtualHost * instead of NameVirtualHost *:80 in the apache config? [08:44:40] godog: sure. First one is the python statsd module that got packaged. So I can remove a ugly hack that adjusted Zuul source code. 
Think about a puppetized quilt https://gerrit.wikimedia.org/r/#/c/141442/1/modules/zuul/manifests/init.pp :D [08:44:50] <_joe_> I don't think there is a good reason for that [08:45:16] godog: the original idea was to get rid of the statsd dependency because we don't install from pip but need debian package. [08:46:26] that will unleash two other child changes that got approved overnight [08:49:18] ori: I'm setting up a diamond collector in labs so we can modify the role to point to that in labs instead of just disabling it [08:49:38] adding graphite and the txstatsd roles since those seem to be the relevant ones in prod (tungsten) [08:50:01] YuviPanda: that was my previous life [08:50:08] ori: what, diamond? [08:50:12] hashar: ack, looking [08:50:15] that entire puppet stack [08:50:25] ori: all the montioring type stuff? [08:50:26] now that the fun is over and those systems have started to accrue real-world problems [08:50:30] aaah [08:50:33] i'm going to make myself scarce and hand it all off to ops [08:50:35] nice, no? [08:50:37] i'm awesome like that [08:50:44] <_joe_> ori: I found a first case where having apache-config separated from puppet is a PITA [08:50:45] ori: :D [08:50:51] <_joe_> ori: eheheh [08:51:03] <_joe_> ori: I think the fun is with real-world problems [08:51:09] indeed [08:51:12] <_joe_> that's why I do ops and not dev, I think [08:51:16] <_joe_> ;D [08:51:40] _joe_: i'm still in favor of deploying it via puppet [08:51:52] <_joe_> ori: I'm getting to that point tooo [08:51:53] we should all be devops and use mongodb for storing apache config [08:51:55] i even wrote a pretty detailed proposal about how that could work as i recall :P [08:52:03] <_joe_> :) [08:54:29] <_joe_> YuviPanda: devops does not mean "idiot that uses overhyped yet shitty document stores when he'll be best served by mysql" [08:54:52] _joe_: all buzzwords must mandatorily include using mongodb somewhere in the stack [08:55:00] godog: also the whole manifests are a bit of a mess. I mix installation from source with configuration and daemon startup :-/ The idea is to split them out. [08:55:20] <_joe_> YuviPanda: no mongo is now yesterday's news [08:55:25] godog: for example on some nodes I would have to install Zuul but not start any service [08:56:44] <_joe_> YuviPanda: https://github.com/NodeBB/NodeBB "a redis DATABASE" [08:56:51] oh wow [08:57:36] <_joe_> then again, given the average quality of forum postings, redis is perfectly fine as a storage for those [08:57:42] haha [08:58:02] <_joe_> it also limits the number of posts to the size of the given RAM, if not differently configured [08:58:12] <_joe_> so, it keeps your forum lean [08:58:19] hashar: ok! looking [09:00:22] _joe_: :D I don't think Redis supports anything that'll have content more than RAM, does it? [09:00:37] (03PS1) 10Yuvipanda: Enable diamond collection on labs as well [operations/puppet] - 10https://gerrit.wikimedia.org/r/142210 [09:00:52] ori: ^ :) I'll test it on some self hosted puppet thing in a bit. [09:00:59] <_joe_> YuviPanda: it does, but it's strongly discouraged I think [09:01:15] <_joe_> dunno, honestly, never used redis as a persistent store [09:01:30] _joe_: that was redis VM, was removed after 2.4 [09:01:36] _joe_: do you know why redis's default port is 6379? 
[09:01:38] see notice on http://redis.io/topics/virtual-memory [09:02:04] <_joe_> ori: no not really [09:02:37] _joe_: check out http://oldblog.antirez.com/post/redis-as-LRU-cache.html , "Appendix: how to remember the Redis port number" :D [09:03:27] <_joe_> ori: oh, my [09:03:29] hashar: when does python-statsd get installed? [09:04:09] ori: there's a particular link that's 404ing now [09:05:04] godog: it is already listed in the list of packages to install [09:05:18] godog: in modules/zuul/manifests/init.pp there is a $packages array [09:05:36] godog: ::zuul class is meant to install the dependencies and Zuul from source code + a few more file {} [09:05:57] godog: maybe I should later on refactor the installation to a zuul::install with a zuul::packages class [09:06:25] hashar: ok, I'm assuming I can merge at will? [09:06:31] yup :) [09:07:12] <_joe_> gotta love google. Yesterday I was searching for apache config values both in 2.2 and 2.4, now all searches for apache config values have the 2.4 manual as first result, the 2.2 as second [09:08:37] * YuviPanda looks for a puppetmaster [09:09:20] ori: oh god, I just understood the relationship between 'diamond' and 'graphite' as names [09:09:22] * YuviPanda facepalms [09:10:40] * YuviPanda creates charcoal.wmflabs.org [09:16:59] <_joe_> any preference on the location on disk of the monitoring vhost DocRoot? [09:17:21] <_joe_> I'd go with "/var/www/monitoring" [09:21:10] YuviPanda: also make sure the name is at http://namingschemes.com and switch randomly between naming schemes! [09:21:24] godog: :D will remember! [09:21:33] next up should be firewood, followed by vodka [09:22:26] _joe_: that's probably fine for a small amount of data [09:24:14] <_joe_> mark: it's going to be a few files, for now it's just apc_stats.php that bryan wrote [09:24:26] ok [09:24:29] (03PS2) 10Filippo Giunchedi: swift: trial icehouse upgrade in esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/141677 [09:25:32] (03PS2) 10Filippo Giunchedi: zuul: python-statsd has been packaged [operations/puppet] - 10https://gerrit.wikimedia.org/r/141442 (owner: 10Hashar) [09:26:08] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] zuul: python-statsd has been packaged [operations/puppet] - 10https://gerrit.wikimedia.org/r/141442 (owner: 10Hashar) [09:26:41] hmm, the graphite role doesn't seem to work on labs [09:26:43] godog: will rebase next one [09:27:21] YuviPanda: curious, it has been working for me, what doesn't work? [09:27:36] godog: the uwsgi app doesn't seem to start, am manually starting it to see if it works [09:28:35] (03PS2) 10Hashar: zuul: mv python-voluptous in the array of packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/141486 [09:28:37] (03CR) 10Filippo Giunchedi: swift: trial icehouse upgrade in esams (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141677 (owner: 10Filippo Giunchedi) [09:28:39] <_joe_> mark: while we're at it - My idea was to add an entry in /etc/hosts for something like mw-monitoring.local, and make the monitoring vhost use that as a ServerName [09:29:01] godog: ah, works now after a service uwsgi start [09:29:03] <_joe_> but i see we have never touched /etc/hosts on those machines. Is this a no-no? [09:29:09] (03PS2) 10Hashar: zuul: get rid of git_dir and zuul_url in server conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/141924 [09:29:15] am rebooting to check if it comes up automatically [09:30:36] _joe_: i'm not sure what the point of that is? 
[09:30:38] godog: comes up fine after a reboot. must be some random initial puppet run thing. [09:31:13] now to point something at it and test... [09:31:39] you want a hostname resolving to localhost? [09:31:55] <_joe_> mark: yes [09:32:31] (03PS3) 10Hashar: zuul: migrate merger definitions to merger.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/141487 [09:32:34] <_joe_> mark: I did not want to use 'ServerName localhost' as that may be interfering with something [09:33:01] (03PS3) 10Hashar: zuul: migrate server definitions to server.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/141488 [09:33:25] <_joe_> I wanted to make apache listen on another port, but we use NameVirtualHost * and everywhere :/ [09:34:08] could whatever is doing the monitoring send the appropriate Host: instead? [09:34:18] using /etc/hosts could work indeed, but we've never needed to manage that file before [09:34:54] <_joe_> godog: yes, not sure if apache would complain though if it can't resolve the ServerName [09:34:54] how about using the fqdn? [09:35:08] then it could theoretically be used outside of localhost too, if we don't restrict access [09:35:47] <_joe_> mark: well, we could also set the Host header by hand, if we want to :) [09:36:23] _joe_: good point, I think it whines but ultimately works, or maybe that was only for the default server/virtualhost [09:36:54] <_joe_> godog: I don't remember, reading the docs [09:37:58] <_joe_> godog: no in fact, shouldn't be a problem [09:38:54] we can set it by hand, but that's often a pain of course [09:39:11] is the fqdn problematic with config management? [09:39:29] seems to make sense to me to use the "server name" for gathering server specific stats [09:39:55] <_joe_> mark: yes, I may do things in puppet and leave apache-config alone in fact [09:40:24] <_joe_> mmmh but then again, we have UseCanonicalName On [09:40:39] <_joe_> which means using the fqdn in a vhost may interfere with something [09:42:16] why? [09:42:59] <_joe_> you're right, no interference [09:43:19] <_joe_> and honestly, yes, this is better than handling an /etc/hosts entry with puppet [09:44:16] i think it should work fine - but yeah, better test on a single box or something :) [09:44:16] <_joe_> ok, sold, we'll be using the fqdn :) [09:44:40] <_joe_> mark: of course, I'll use IfDefine guards around it [09:47:32] *.local.wmftest.net points to localhost too [09:47:54] <_joe_> ori: via DNS? [09:48:06] faidon helped set that up as a convenient way of enabling vhosts in mediawiki-vagrant without requiring that people edit their host files [09:48:07] yep [09:48:15] <_joe_> cool [09:48:55] <_joe_> I wanted to do something like that as an alternative, but I thought we did not have that already [09:49:12] <_joe_> this is even simpler :) [09:49:54] eep, 3 AM. bye! :) [09:50:14] <_joe_> bye! [09:50:49] <_joe_> mark: that was the other option, we can create an entry on the .wmnet zone pointing to localhost [09:51:16] <_joe_> for now, I'll go with the fqdn anyway [09:51:46] yup, that's what I was thinking too [09:53:28] _joe_: for CI I have a bunch of Apache vhost such as localhost.qunit localhost.mediawiki and so on. Then do 127.0.0.1 and pass a Host header [09:53:40] _joe_: might not be relevant to your use case, I have followed the discussion above [09:53:54] <_joe_> hashar: do they work? 
apache docs say they should not [09:54:25] (03PS3) 10Filippo Giunchedi: swift: trial icehouse upgrade in esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/141677 [09:54:46] <_joe_> hashar: http://httpd.apache.org/docs/current/mod/core.html#servername the gray box [09:54:52] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: trial icehouse upgrade in esams [operations/puppet] - 10https://gerrit.wikimedia.org/r/141677 (owner: 10Filippo Giunchedi) [09:55:12] _joe_: ah sorry i was wrong. We are using custom ports :-( i.e. localhost:9413 [09:55:15] _joe_: sorry [09:55:25] _joe_: are familiar with pass ? [09:55:30] *are you [09:55:37] <_joe_> hashar: pass ? no :) [09:56:11] _joe_: re : https://rt.wikimedia.org/Ticket/Display.html?id=6665 <-- http://www.passwordstore.org/ [09:56:34] <_joe_> oh I've seen that, it seems nice [09:56:37] (03PS3) 10Filippo Giunchedi: zuul: mv python-voluptous in the array of packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/141486 (owner: 10Hashar) [09:56:44] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] zuul: mv python-voluptous in the array of packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/141486 (owner: 10Hashar) [09:56:45] <_joe_> but never looked at the code [10:05:51] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: Fetching origin [10:12:07] <_joe_> again... [10:12:22] <_joe_> godog: how do you become root on palladium? [10:12:24] !log upgrading ms-be3001 to swift icehouse [10:12:27] <_joe_> sudo -s? [10:12:28] Logged the message, Master [10:12:32] sudo -i usually [10:12:46] <_joe_> ok, so that has definitely nothing to do with it [10:12:46] but I ran sudo puppet-merge [10:12:59] <_joe_> mh ok THAT is the problem probably [10:13:31] ah indeed [10:13:31] ! aedb131..553c2f9 production -> origin/production (unable to update local ref) [10:13:34] Connection to strontium.eqiad.wmnet closed. [10:23:18] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:24:18] le sigh, ms-be3003 has its root full again, taking a look /cc jgage [10:24:18] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 6.604 second response time [10:29:36] tonythomas: please don't spam me :) [10:30:03] anyone familiar with txstatds? [10:30:08] txstatsd even [10:30:51] ori: :) https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:assetcheck,n,z [10:31:02] YuviPanda: the statsd port to twisted ? [10:31:11] matanya: yeah, that's what we're using [10:31:27] _joe_: Are you doing the apc monitoring today (or already done?) [10:31:31] i used it for a short period, can i help ? [10:31:48] matanya: sure! want me to add you to the project or just describe issues? :) [10:32:06] I guess i can't help much [10:32:20] tell me an issue, and i'll see [10:32:37] matanya: so, charcoal.wmflabs.org is picking up stats from beta cluster right now [10:32:44] matanya: via diamond -> txstatsd -> graphite [10:32:49] ok [10:32:53] matanya: diamond is sending them (via diamond.log I can verify, also via tcpdump) [10:32:58] matanya: but they don't show up on graphite [10:33:14] does txstatsd receive them ? [10:33:25] matanya: yes, seems to. I even see replies in tcpdump [10:33:47] and how txstatsd is pushing to graphite in your case? simple push to port 2003 ? [10:34:23] where carbon picks it up? or something more sophisticated? 
[10:34:25] matanya: am just using the role [10:34:34] matanya: role::txstatsd and role::graphite [10:34:39] looking [10:35:31] YuviPanda: are 2003 and 2004 open on that end ? [10:35:53] matanya: yes. [10:36:14] and getting traffic ? [10:36:53] matanya: yeah, I'm tcpdumping on 2004 [10:37:11] what about 2003 ? [10:37:54] and moreover: don't show up on graphite <-- as in the web gui, or in carbon ? [10:38:29] matanya: web gui [10:38:32] matanya: how do I check carbon? [10:38:35] matanya: no whisper files either [10:39:09] (03PS3) 10Hashar: zuul: get rid of git_dir and zuul_url in server conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/141924 [10:39:24] (03PS4) 10Hashar: zuul: migrate merger definitions to merger.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/141487 [10:39:33] (03PS4) 10Hashar: zuul: migrate server definitions to server.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/141488 [10:39:35] YuviPanda: /var/lib/carbon/whisper/ (or what ever ubuntu uses) is empty ? [10:40:00] matanya: no, it has two folders, but those are just metrics from the host graphite/txstatsd is on [10:40:05] nothing from elsewhere [10:40:35] <_joe_> Krinkle: I am now [10:40:45] YuviPanda: add me to the project, i'll try to debug with you :) [10:40:50] matanya: moment :) [10:41:47] matanya: added. diamond-collector.eqiad.wmflabs [10:42:24] matanya: fully puppetized setup, has role::txstatsd and role::graphite applied. [10:42:25] nothing else [10:42:36] what is the graphite host name? [10:43:30] matanya: carbon.wmflabs.org [10:43:37] matanya: graphite-collector is the host it runs on [10:43:47] matanya: gah, not carbon. charcoal.wmflabs.org is the web frontend [10:43:53] matanya: username password is guest/guest [10:43:59] * YuviPanda hopes to make this fully public data only [10:45:28] YuviPanda: lets take this into query [10:45:40] matanya: are you sure? other people here might also be able to help :) [10:45:42] matanya: I'm ok with it [10:46:00] ok, just don't want to flood [10:46:09] matanya: we aren't icinga-wm :D [10:46:20] matanya: and nobody else is speaking anyway [10:46:56] so lets see i got it right, i should be able to telnet carbon.wmflabs.org 2003 from diamond ? [10:47:12] matanya: no no. [10:47:36] only localhost can access 2003 ? [10:47:55] when you're worried of flooding you can always use #wikimedia-tech :) [10:47:55] matanya: 1. diamond clients run on beta cluster hosts, 2. they send events to diamond-collector.eqiad.wmflabs, where they are received by txstatsd 3. txstatsd sends them over to graphite, on same host 4. charcoal.wmflabs.org is just the webproxy access to the graphite frontend [10:48:19] matanya: listen port is 8125 so stats are coming in from other hosts through 8125 [10:49:05] ok! so my question is where is the graphite host? what is the name/ip? [10:49:16] matanya: diamond-collector.eqiad.wmflabs is the graphite host [10:49:24] matanya: 10.68.17.169 [10:49:27] so localhost, thanks [10:49:33] matanya: ah, yes. [10:51:17] YuviPanda: seems like carbon is dead [10:51:26] oh? 
[10:51:49] ps -ef |grep carbon, and : [10:51:51] Jun 26 10:32:38 diamond-collector kernel: [ 3778.010179] init: carbon/cache (a) main process (10715) terminated with status 1 [10:51:51] Jun 26 10:32:38 diamond-collector kernel: [ 3778.010231] init: carbon/cache (a) respawning too fast, stopped [10:51:51] Jun 26 10:32:38 diamond-collector kernel: [ 3778.011198] init: carbon/relay main process (10724) terminated with status 1 [10:51:51] Jun 26 10:32:38 diamond-collector kernel: [ 3778.011238] init: carbon/relay respawning too fast, stopped [10:51:51] Jun 26 10:32:38 diamond-collector kernel: [ 3778.078333] init: carbon/cache (d) main process (10731) terminated with status 1 [10:51:54] Jun 26 10:32:38 diamond-collector kernel: [ 3778.078437] init: carbon/cache (d) respawning too fast, stopped [10:51:56] Jun 26 10:32:38 diamond-collector kernel: [ 3778.123130] init: carbon/cache (f) main process (10727) terminated with status 1 [10:51:58] Jun 26 10:32:38 diamond-collector kernel: [ 3778.123176] init: carbon/cache (f) respawning too fast, stopped [10:52:00] Jun 26 10:32:38 diamond-collector kernel: [ 3778.124290] init: carbon/cache (b) main process (10728) terminated with status 1 [10:52:02] Jun 26 10:32:38 diamond-collector kernel: [ 3778.124330] init: carbon/cache (b) respawning too fast, stopped [10:52:03] !log stopping swift on ms-be3003 [10:52:04] Jun 26 10:32:38 diamond-collector kernel: [ 3778.245433] init: carbon/cache (g) main process (10732) terminated with status 1 [10:52:07] Jun 26 10:32:38 diamond-collector kernel: [ 3778.245485] init: carbon/cache (g) respawning too fast, stopped [10:52:08] Logged the message, Master [10:52:09] oops [10:52:10] matanya: ok, :) [10:52:13] should have pasted the paste link [10:52:15] matanya: carbon-cache disabled in /etc/default/graphite-carbon when trying to start [10:53:03] matanya: but how is there traffic on localhost port 2004 then? [10:53:05] has acks as well [10:54:06] PROBLEM - swift-object-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [10:54:06] PROBLEM - swift-object-updater on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [10:54:17] PROBLEM - swift-account-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [10:54:17] PROBLEM - swift-container-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:54:17] PROBLEM - swift-account-reaper on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [10:54:17] PROBLEM - swift-account-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [10:54:20] YuviPanda: when not? txstatsd is receiving on that port, but can't push to carbon on port 2003 ? [10:54:26] *why not [10:54:36] matanya: no, I was tcpdumping port 2003. 
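
The tcpdump checks mentioned above, spelled out as a sketch (interface names are assumptions; txstatsd-to-carbon traffic stays on loopback in this setup):

    sudo tcpdump -i lo -A 'tcp port 2003 or tcp port 2004'   # -A prints payloads, so metric lines are readable
    sudo tcpdump -i eth0 -nn 'udp port 8125'                 # statsd packets arriving from the diamond clients
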
[10:54:36] PROBLEM - swift-account-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [10:54:45] matanya: also I see carbon-cache instances running [10:54:46] PROBLEM - swift-object-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [10:54:46] PROBLEM - swift-container-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [10:54:56] PROBLEM - swift-container-updater on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [10:54:56] PROBLEM - swift-container-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [10:54:56] PROBLEM - swift-object-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [10:55:20] <_joe_> godog: can you silence that host while you're playing with it? [10:55:21] <_joe_> :) [10:55:26] YuviPanda: you just started it ? [10:55:43] matanya: I tried a start but that didn't quite work, it errored out [10:55:54] matanya: and the tcpdump was taken a while ago, even before pinging you [10:55:58] oops, sorry :) [10:56:06] matanya: wah, but it's getting metrics now! [10:56:07] wat. [10:56:12] do you want to try again YuviPanda? [10:56:17] RECOVERY - Disk space on ms-be3003 is OK: DISK OK [10:56:26] it seems to better now :) [10:56:35] *to be [10:56:57] i rest my case. [10:57:24] it is so much easier to debug when you have shell! [10:57:43] matanya: yeah :D [10:57:58] matanya: let me restart the host to make sure [10:58:06] matanya: does restart trigger a puppet run? [10:58:21] YuviPanda: just run puppet agent -tv [10:58:36] if you will reboot carbon will stop again [10:58:39] matanya: yeah, but I want to restart as well so it's not working by accident [10:58:47] matanya: yeah, that shouldn't be the case. carbon shouldn't have to be started manually [10:59:35] matanya: puppet reported no changes. rebooting now [11:00:27] YuviPanda: let me know if you ran to any other issue [11:00:33] matanya: ty! [11:01:04] np [11:01:07] !log Replacing operations-puppet-validate job with operations-puppet-pplint-HEAD which is faster and can run concurrently on multiple boxes. 
{{gerrit|142223}} [11:01:12] Logged the message, Master [11:02:59] RECOVERY - swift-object-server on ms-be3003 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [11:03:08] RECOVERY - swift-object-updater on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [11:03:19] RECOVERY - swift-account-reaper on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [11:03:19] RECOVERY - swift-account-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [11:03:19] RECOVERY - swift-container-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [11:03:19] RECOVERY - swift-account-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [11:03:34] (03PS2) 10Hashar: Simple docker module [operations/puppet] - 10https://gerrit.wikimedia.org/r/139388 [11:03:38] RECOVERY - swift-account-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [11:03:38] RECOVERY - swift-object-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [11:03:48] RECOVERY - swift-container-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [11:03:58] RECOVERY - swift-container-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [11:03:58] RECOVERY - swift-container-updater on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [11:03:58] RECOVERY - swift-object-auditor on ms-be3003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [11:04:53] srsly icinga-wm ? I thought I silenced it [11:05:39] ah of course it'd be awesome to be able to silence icinga-wm from within icinga-wm [11:08:20] matanya: I still need to make sure that carbon cache starts on startup [11:08:27] (03PS1) 10Hashar: role::mha tabs to four spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/142225 [11:10:33] YuviPanda: line 210 in graphite.pp [11:10:45] you want me to prepare a patch ? [11:11:15] matanya: file { '/etc/apache2/sites-enabled/graphite': [11:11:16] ? [11:11:47] nope, sorry, wrong copy/paste [11:13:16] matanya: do you know where the problem might be? [11:14:42] i'm looking [11:15:38] (03Abandoned) 10QChris: Document that m2's configuration in coredb need not be accurate [operations/puppet] - 10https://gerrit.wikimedia.org/r/141899 (owner: 10QChris) [11:16:26] yes YuviPanda [11:16:39] found it: twisted.internet.error.CannotListenError: Couldn't listen on 127.0.0.1:2103: [Errno 98] Address already in use. [11:17:31] matanya: oh? I already started carbon cache, so if you start again that would happen, no? [11:17:43] i didn't touch it [11:17:52] matanya: oh? where's that error from? [11:18:32] it doesn't start at boot due to /etc/default/graphite-carbon [11:18:45] <_joe_> bbiab [11:18:47] matanya: right, but I switched that over to true and restarted, and didn't start either. [11:19:39] matanya: as in, i switched it to true and rebooted the machine [11:19:47] i see in the log [11:21:12] matanya: which log? 
[11:21:28] /var/log/syslog [11:22:00] aaah [11:22:30] see : Jun 26 10:52:46 diamond-collector puppet-agent[11746]: (/Stage[main]/Graphite/Service[carbon]/ensure) ensure changed 'stopped' to 'running' [11:22:30] Jun 26 10:52:46 diamond-collector puppet-agent[11746]: (/Stage[main]/Graphite/Service[carbon]) Unscheduling refresh on Service[carbon] [11:23:26] matanya: right. so should we manage the /etc/default/graphite-carbon file as well in puppet and set it to true? [11:23:39] yes [11:23:59] if you don't puppet and upstart seem to conflict [11:24:43] matanya: yeah. [11:24:50] matanya: can you submit a patch? :D [11:25:33] (03CR) 10Matanya: "dup of https://gerrit.wikimedia.org/r/#/c/140565/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142225 (owner: 10Hashar) [11:25:43] bah [11:25:46] yes YuviPanda, will [11:25:49] sorry hashar [11:25:51] they don't merge fast enough [11:25:53] matanya: woot, ty! [11:25:58] (03Abandoned) 10Hashar: role::mha tabs to four spaces [operations/puppet] - 10https://gerrit.wikimedia.org/r/142225 (owner: 10Hashar) [11:26:06] hashar: i say that all the time :) [11:26:15] matanya: I'll amend my diamond patch to set projectname as prefix in collector as well [11:26:42] YuviPanda: get your patch merged, i'll push one, hate rebasing for ever [11:27:04] actually, should conflict [11:27:10] matanya: I can rebase yours :) Plus I don't think they'd conflict [11:27:15] *shouldn't [11:27:55] YuviPanda: do you have access to prod to see the content of /etc/default/graphite-carbon ? [11:28:02] matanya: sadly not, no. [11:28:12] * matanya hunts an op [11:28:35] apergos: i choose you ^^ [11:31:45] ? [11:32:12] apergos: can you check contents of /etc/default/graphite-carbon on tungsten? [11:32:20] sec [11:32:24] what am I looking for? [11:32:26] what he said :) [11:32:31] apergos: it should be just two lines. [11:32:48] CARBON_CACHE_ENABLED= ? [11:33:13] yeah, if that's set to true or false [11:35:13] as soon as I get a prompt [11:35:16] :) [11:36:09] CARBON_CACHE_ENABLED=false [11:36:20] apergos: are there carbon-cache processes running? [11:37:40] 8 of these: /usr/bin/python /usr/bin/carbon-cache --config=/etc/carbon/carbon.conf --instance a --debug start from may 14 [11:38:04] apergos: ah, hmm. [11:38:07] matanya: ^ [11:38:11] apergos: thanks! [11:38:13] yw [11:38:20] these don't seem to autostart on labs even with the same configuration [11:38:30] I wonder if we're doing something wrong, or if there's some realm branching that's causing issues [11:40:23] i'm off for a few YuviPanda. i'll look at the code when back [11:40:35] matanya: ok! thanks for the help! \o/ [11:40:50] :) [11:45:01] (03PS1) 10Yuvipanda: diamond: Prefix metrics with project name [operations/puppet] - 10https://gerrit.wikimedia.org/r/142228 [11:49:28] YuviPanda: i'm playing a bit with carbon service, ok ? [11:49:33] matanya: ok :) [11:50:41] (03PS2) 10Yuvipanda: diamond: Prefix metrics with project name [operations/puppet] - 10https://gerrit.wikimedia.org/r/142228 [11:54:20] YuviPanda: i found out some of the problem [11:54:26] matanya: oh? [11:54:51] it is spawned right after being killed, but not sure by who [11:55:00] i guess upstart, still debugging [12:00:05] (03PS29) 10KartikMistry: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [12:03:09] YuviPanda: this respawn seems to be the root cause, init and puppet conflict [12:03:37] YuviPanda: can i reboot it for a test ?
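
A sketch of how to ask upstart directly who owns a respawn loop like this one (job names inferred from the syslog lines pasted above):

    initctl list | grep carbon          # upstart's view of the carbon jobs and their PIDs
    grep -R respawn /etc/init/carbon/   # a "respawn" stanza explains the instant restarts
    grep carbon /var/log/syslog | tail  # "respawning too fast, stopped" means upstart gave up
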
[12:05:37] and the source is /etc/init/carbon/cache.conf [12:09:19] YuviPanda: poke me if you want details. [12:17:47] matanya: yeah, feel free to reboot [12:18:13] matanya: also is carbon running right now? I guess not. [12:18:39] it was, i just rebooted [12:20:41] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Thu 26 Jun 2014 09:19:56 UTC [12:21:23] matanya: ah ok! :) [12:21:42] rebooting again [12:23:41] YuviPanda: please test your flow, i think it is sorted out now [12:25:52] matanya: ah, ok! [12:25:58] matanya: figured out the issue? does it need a puppet patch? [12:26:21] i figured it out, no need for a patch [12:26:40] you could have fixed the issue in the first place with puppet agent -tv [12:27:13] maybe it is a good idea to add /etc/default/graphite-carbon to be managed by puppet [12:27:54] matanya: oh, so just running puppet agent -tv fixed it? [12:28:35] basically, the issue was carbon died, and a puppet run would bring it back [12:29:03] you enabled carbon at boot, and caused a race condition between puppet and upstart [12:30:13] disabling that fixed the race. [12:31:40] matanya: aaah, right [12:33:56] you are probably already aware of this: Wikipedia seems to be down for several people, including me ;) https://twitter.com/search?q=%23wikipediadown [12:34:11] (03PS3) 10Yuvipanda: diamond: Prefix metrics with project name [operations/puppet] - 10https://gerrit.wikimedia.org/r/142228 [12:35:33] mglaser: nothing spurious [12:36:01] mglaser: we are not! [12:36:12] mglaser: what is the error that you are seeing? [12:36:16] and there is apparently no drop of traffic on varnish http://gdash.wikimedia.org/dashboards/reqsum/ [12:37:07] paravoid, I get network status "Aborted" [12:37:08] <_joe_> shit [12:37:21] got some italian "press" coverage http://tech.fanpage.it/wikipediadown-lo-strano-blackout-dell-enciclopedia-libera/?utm_source=twitterfeed&utm_medium=twitter [12:37:27] it seems to happen mainly in italy, according to twitter [12:37:31] <_joe_> paravoid: some peering issues [12:37:33] and a varnish error at https://twitter.com/Alexsius_t/status/434265559845195778/photo/1 [12:37:34] with whom? [12:37:44] hashar: that's from february [12:37:48] ha [12:37:58] stupid twitter [12:38:05] mglaser: could you tell us your IP? [12:38:11] in private, if you're worried about your privacy [12:38:18] <_joe_> for the record, http://www.ilsussidiario.net/News/Hi-Tech/2014/6/26/WIKIPEDIA-DOWN-L-enciclopedia-libera-e-offline-cosa-sta-accadendo-Oggi-26-giugno-2014/510348/ [12:38:22] mglaser: also, do you know how to perform a traceroute? [12:38:24] <_joe_> this is from now [12:38:28] <_joe_> and also for the record [12:38:32] 82.135.30.117 [12:38:42] <_joe_> I'm in italy with an italian ISP and I've no problems browsing wp [12:38:56] mutante: woot, works for now :) [12:39:07] _joe_: oh that is just "an" italian ISP. not "the" only one :D [12:39:07] mutante: http://charcoal.wmflabs.org/ has per-project namespaced things now [12:39:11] mh same here (no problems) [12:39:12] _joe_: i can access wp from my mobile [12:39:15] <_joe_> hashar: the main one [12:39:20] wrong ping, but yay! :) [12:39:33] matanya: augh, yay :) [12:39:37] I just wondered as there are several reports on twitter as well. [12:39:40] (wiki works for me too, from India) [12:40:25] mglaser: this IP is in Germany, though?
[12:40:35] paravoid, yes [12:40:40] <_joe_> mglaser: you're right, from the italian report seems they were getting an error message [12:40:47] <_joe_> mglaser: do you see an error page? [12:41:00] #Wikipedia #RIP -- Milan, Milan [12:41:04] so ISP related [12:41:04] <_joe_> mglaser: are you on linux/os x? [12:41:05] I guess [12:41:14] <_joe_> hashar: isp dns related I'd say [12:41:14] _joe_, no, windows [12:42:55] at least m-online.net (mglaser ISP) seems to be happy [12:42:55] http://lg.m-online.net/lg/ [12:43:03] (over IPV6) [12:43:24] tracert de.wikipedia.org [12:43:28] <_joe_> mglaser: are you confident with the windows command-line? [12:43:28] what do you need to know? [12:43:34] <_joe_> nslookup [12:43:36] I guess [12:43:42] <_joe_> nslookup de.wikipedia.org [12:43:45] but yeah a trace to our lb-text IPv4 address 91.198.174.192 timesout [12:44:07] Server: dc.hw.local [12:44:07] Address: 192.168.1.1 [12:44:07] Nicht autorisierende Antwort: [12:44:07] Name: text-lb.esams.wikimedia.org [12:44:07] Addresses: 2620:0:862:ed1a::1 [12:44:08] 91.198.174.192 [12:44:08] Aliases: de.wikipedia.org [12:44:09] wikipedia-lb.wikimedia.org [12:44:37] paravoid: seems there is some routing issue. http://paste.debian.net/106838/ [12:44:40] <_joe_> hashar: the thing that baffles me is - why that strange error page? [12:44:52] _joe_ I think that's older [12:44:59] who broke Italy... [12:45:31] paravoid: v4 versus v6 traces from mglaser ISP in germany: http://paste.debian.net/106840/ [12:45:40] gw-wikimedia.init7.net lost IPv4 [12:46:43] I took a peek at the LVS for that IP, things look healthy on our end [12:46:44] this is my tracert [12:46:45] http://paste.debian.net/106841/ [12:47:38] (03PS2) 10Yuvipanda: diamond: Enable diamond collection on labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/142210 [12:47:45] (03PS4) 10Yuvipanda: diamond: Prefix metrics with project name [operations/puppet] - 10https://gerrit.wikimedia.org/r/142228 [12:48:06] it's back again for me [12:48:06] !log Deactivated BGP session to AS13030 [12:48:08] is it better now :) [12:48:10] mglaser: I changed something, how does it look now? [12:48:12] Logged the message, Master [12:48:23] oh damn, we did two things at the same time [12:48:26] hehe [12:48:30] what did you do? [12:48:35] paravoid, I can access wp now [12:48:40] I + as-path Leaseweb-MOnline ".* 16265+ 8767+ .*"; [12:48:48] avoided that path [12:48:51] paravoid, thanks for the magic :) [12:48:52] 82.135.0.0/17 *[BGP/170] 47w6d 00:29:57, localpref 100 AS path: 38930 16265 8767 I [12:49:03] mglaser: we did two things, we have to find out what it was first :) [12:49:17] i can reinstate the bgp session [12:49:19] mark: init7 why? [12:49:31] mglaser: if you can twit to ask people to retry.. that might help [12:49:44] i'm not even sure it was init7 but disabling it was quick & easy [12:49:51] there was packet loss to mglaser's IP, i'm not sure if it was on the outbound or inbound though [12:49:54] probably on the return path then [12:50:07] ok, init7 session coming back up [12:50:28] mglaser: don't leave just yet, we need you to retry [12:50:36] my change would not fix the Italian issue though [12:50:48] unless there's some italian ISP buying from m-online, which I doubt [12:50:52] if you're looking at it i'll leave it to you [12:51:11] that funny email from Orange possibly-related? [12:51:28] _joe_/godog: can you ask italian friends of yours maybe? 
[12:51:32] irc channel or whatnot [12:51:34] <_joe_> I already did [12:51:37] Nemo_bis: ^ [12:51:38] <_joe_> no-one has issues [12:51:38] we could really use a traceroute [12:51:39] paravoid, I'm still there and asked some of the twitterers to retry [12:51:57] <_joe_> paravoid: if anyone had issues, I could help :/ [12:52:25] paravoid: yup I don't think anyone I know reported issues but asking again [12:52:32] _joe_: maybe ask in #wikipedia-it ? [12:52:49] <_joe_> someone reports me it's fastweb, but wait for confirmation [12:52:52] hmm [12:52:56] ripe atlas to the rescue? [12:53:03] logging in ;) [12:53:23] 194 probes in italy, hehe [12:55:34] !log Jenkins: updates jobs for extensions (phpunit and qunit) to use the mw-run-update-script.sh instead of update.php . That runs update.php twice, the first time logging sql to a file that can be archived. {{gerrit|141851}} [12:55:38] Logged the message, Master [12:55:46] got one of the twitterers to reply, it seems to work for him (in Italy) [12:55:48] Peeps say site's up again [12:55:56] weird [12:56:11] got someone who could not connect from his mobile phone [12:57:35] * twkozlowski tweets a random person on the Internet [12:57:37] <_joe_> ok I have confirmation [12:57:38] (03CR) 10Alexandros Kosiaris: cxserver configuration for beta labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [12:57:47] <_joe_> it's fastweb the italian ISP with problems [12:58:05] <_joe_> and yes it's up again [12:59:38] another trace from AS6762 http://paste.debian.net/106846/ [12:59:50] Ciao Italia, Wikipedia is working again! :) [12:59:57] I already checked seabone before, hashar [12:59:59] traffic is lost somewhere in telia [13:00:06] no it's not [13:00:21] well the trace ends at adm-b5-link.telia.net [13:00:59] yes it's back http://paste.debian.net/106848/ [13:01:48] Nemo_bis: you were on fastweb, weren't you? [13:02:23] sure, I duly reported the error this morning [13:02:36] both on #wikimedia-tech and to fastweb [13:02:43] it's been happening for weeks every now and then [13:03:06] <_joe_> Nemo_bis: it's the usual fastweb fuckup then? [13:03:38] Nemo_bis: sorry, I didn't see that :( [13:03:42] I usually do, but I missed it [13:03:53] please report such issues here (too), they have a better chance of being looked at [13:04:27] (03PS30) 10Alexandros Kosiaris: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [13:06:17] no worries, it was worse on May 22nd :) 16.|-- gw-wikimedia.init7.net 73.3% 15 24.8 56.0 20.2 99.3 39.5 [13:06:44] and the whole night of 5-6 May [13:06:48] (03CR) 10Alexandros Kosiaris: "Just added module unit tests. They fail unfortunately with" [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [13:06:56] according to my records [13:07:01] bleh [13:07:37] we peer directly with them [13:07:40] via the AMS-IX route servers [13:07:47] and we're not congested [13:08:10] so they're essentially congested at AMS-IX, that sounds pretty crappy [13:08:25] I can avoid that path and go via transit, as silly as that sounds [13:10:06] wouldn't it be nice if we had a few ripe atlas nagios tests [13:10:11] e.g. one per country or something :) [13:10:23] isn't it ridiculous we have to solve ISPs' connectivity problems ? [13:10:41] makes you wonder what they are doing anyway [13:10:43] from the point of the customer? absolutely not.
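On the "ripe atlas to the rescue" idea: RIPE Atlas exposes its probe inventory over a REST API, which is what would make the per-country nagios checks mused about above feasible. A minimal sketch of counting connected probes in a country; the endpoint and fields are assumptions from memory of the Atlas API, so verify against the current docs:

    import json
    import urllib2  # urllib.request on Python 3

    URL = 'https://atlas.ripe.net/api/v2/probes/?country_code=%s&status=1'

    def connected_probes(country='IT'):
        # status=1 is assumed to mean "connected" in the probe listing
        return json.load(urllib2.urlopen(URL % country))['count']

    if __name__ == '__main__':
        print('%d connected probes in IT' % connected_probes())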
[13:11:07] I suspect we're a bit more contactable than some of the ISPs [13:11:12] point of view; they just wanna get Wikipedia back, and now! :-) [13:11:12] "It's all fine here!" [13:11:21] * matanya thinks he has paravoid and akosiaris at the same time in the channel, wouldn't it be good timing to request review ? [13:11:28] Rather than speaking to script reading tech support [13:11:40] apergos: dataset1001 alert? :) [13:11:52] ottomata: an1013/an1020 alerts? [13:11:59] yes all that true, still ridiculous [13:12:12] again, makes you wonder what they are doing anyway [13:12:29] ottomata: also, elasticsearch alerts, still [13:12:31] and google fiber makes more sense [13:12:35] matanya: shoot [13:12:45] yeah, paravoid, nik and I are fixing elastic1017 today [13:12:47] akosiaris: https://gerrit.wikimedia.org/r/139836 [13:12:49] didn't want to do it end of day yesterday [13:13:01] Reedy: last time I called Fastweb I taught the operator about STARTTLS; the previous one they didn't believe me when I said Wikipedia was down, I had to take pasta out of water anyway [13:13:22] akosiaris: cache and swift as well, but they are touchy: https://gerrit.wikimedia.org/r/140678 and https://gerrit.wikimedia.org/r/140654 [13:13:25] and an13 and an20 are crying about gigs of space left [13:13:35] @twkozlowski yes, I'm browsing #wikipedia in the last 10 minutes. Thank you very much! [13:13:38] <_joe_> Nemo_bis: fastweb is so incompetent they have akamai down sometimes [13:13:40] and this one for hashar : https://gerrit.wikimedia.org/r/140565 [13:13:42] we don't have enough space, ja, the only thing I can do is delete more data really [13:13:45] <_joe_> not so rarely either [13:13:47] i'm trying to keep 31 days around [13:13:48] paravoid, hashar, that's for you guys ^^ [13:13:48] Saying that about ISPs, I really should move to http://aaisp.net.uk/ [13:14:08] although i'm not entirely sure why a couple of disks at times have more data in them than others [13:14:37] _joe_: I speculate that writing to peering@fastweb.it worked, the day Wikipedia has been down for 12 hours, but I can't prove it [13:15:24] (03CR) 10Alexandros Kosiaris: cxserver configuration for beta labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [13:15:36] as surprising as it may sound, it's possible that the only competent persons are reading abuse@fastweb.it :) so I was told once (no way to prove that either) [13:16:51] (03CR) 10Ottomata: Add backup role and scripts to wikimetrics (031 comment) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [13:16:53] (03CR) 10BBlack: "Yeah so I haven't found time to look at pulling data from OpenStack. But really, specifying the list here manually is no worse than speci" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138017 (owner: 10BBlack) [13:17:04] <_joe_> Nemo_bis: positions in telcos tend to change fast [13:17:36] i think that ironholds has a long running job, and a job picks a particular data directory in which to keep all of its logs [13:17:45] and, when most disks are almost full anyway [13:17:53] the long running jobs start to fill up certain disks [13:18:05] it's pretty annoying that icinga generates alerts based on percentages for these disks [13:18:33] (03CR) 10Hashar: "BBlack that is fine to me.
As long as we remember to:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138017 (owner: 10BBlack) [13:18:36] bblack: +1 :-) [13:18:57] bblack: i wasn't sure how much work would be needed to populate the DNS alias automatically. I agree that can be done later on [13:19:07] bblack: i will be more than happy to drop the beta::natfix hack [13:19:07] we could try alarming on how fast it is filling up ottomata [13:19:35] hm, naw, that would be fine too [13:19:40] i think for these datanode disks [13:19:49] i just want warnings when they get low [13:19:59] like, warn < 2G, critical < 500M or something maybe [13:20:00] (03CR) 10Hashar: [C: 031] Add public->private mappings for labs to dnsmasq aliases [operations/puppet] - 10https://gerrit.wikimedia.org/r/138017 (owner: 10BBlack) [13:20:47] hashar: let me at least refactor it a little to put the data in a puppet array and template that pair of files [13:20:54] but, yeah [13:21:20] ffs [13:21:24] what did I miss? [13:21:33] ACKNOWLEDGEMENT - Disk space on analytics1013 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/h 123296 MB (6% inode=99%): /var/lib/hadoop/data/j 4868 MB (0% inode=99%): ottomata 5G is fine, and this should mostly be a temporary state. [13:22:16] paravoid: nothing related to the net stuff [13:23:24] thanks [13:23:57] godog: ms-be300x isn't very happy [13:24:04] I think it's copying files around because of 3003 being down [13:24:08] and filling up 3004 [13:24:35] we have a [13:24:36] Unmerged changes on repository mediawiki_config [13:24:38] alert [13:24:50] paravoid: mmhh yeah 3003 is up but with sdk unmounted [13:24:56] I think that's _joe_'s check? Reedy know anything about it? [13:25:03] Just going to have a look [13:25:23] and I upgraded 3001 to icehouse this morning, before discovering that 3003 had its / filled up again and diagnosing that [13:25:29] we should rename icinga.wm.org to christmastree.wm.org [13:25:40] *g* [13:25:43] haha [13:25:49] Change made for labs [13:26:09] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly [13:26:30] twkozlowski Hey Tomasz, Wikipedia is down in Romania too... #wikipediadown [13:26:33] !log reedy Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/142142/ No-op for labs (duration: 00m 16s) [13:26:38] Logged the message, Master [13:26:39] What's up with Romania now... [13:26:40] (03CR) 10Reedy: "I note this was apparently merged, but not pulled/deployed (yes, I know it's a noop for production) on tin" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142142 (https://bugzilla.wikimedia.org/66094) (owner: 10Spage) [13:28:43] Looks like false alarm, that one ^^ [13:31:57] (03PS2) 10Ottomata: Add cp3015-cp3018 to list of esams upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/142087 [13:32:05] bblack, gonna do that now, [13:32:15] !log powering down dataset1001 -relocating to 10G rack [13:32:21] Logged the message, Master [13:32:31] (03PS3) 10Ottomata: Add cp3015-cp3016 to list of esams upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/142087 [13:32:33] that's just cp3015 and cp3016 [13:33:06] ok with you?
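Picking up the disk-alert thread ("warn < 2G, critical < 500M"): a minimal sketch of an absolute-free-space check with nagios-style exit codes, as opposed to the percentage thresholds that were called annoying above. The thresholds here are just the ones floated in the discussion:

    import os
    import sys

    WARN = 2 * 1024 ** 3    # 2 GiB
    CRIT = 500 * 1024 ** 2  # 500 MiB

    def check(path):
        st = os.statvfs(path)
        free = st.f_bavail * st.f_frsize
        mb = free // 1024 ** 2
        if free < CRIT:
            print('CRITICAL - %s: %d MB free' % (path, mb))
            return 2
        if free < WARN:
            print('WARNING - %s: %d MB free' % (path, mb))
            return 1
        print('OK - %s: %d MB free' % (path, mb))
        return 0

    if __name__ == '__main__':
        # nagios convention: worst status wins
        sys.exit(max(check(p) for p in sys.argv[1:] or ['/']))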
[13:33:27] (03CR) 10BBlack: [C: 031] Add cp3015-cp3016 to list of esams upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/142087 (owner: 10Ottomata) [13:33:28] yup [13:34:27] (03PS4) 10Ottomata: Add cp3015-cp3016 to list of esams upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/142087 [13:34:35] (03CR) 10Ottomata: [C: 032 V: 032] Add cp3015-cp3016 to list of esams upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/142087 (owner: 10Ottomata) [13:35:00] RECOVERY - Unmerged changes on repository puppet on strontium is OK: Fetching origin [13:35:29] PROBLEM - Host dataset1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.11) [13:38:30] <_joe_> hey, can someone working on dataset1001 ? [13:38:42] <_joe_> ehm s/can/is/ [13:39:15] _joe_: Yup, cmjohnson1 is moving it to 10G rack [13:39:33] <_joe_> oh ok :) [13:39:45] should have acked/scheduled that [13:40:42] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Thu Jun 26 13:40:36 UTC 2014 [13:44:49] hmm, bblack, the nodes that are set as ganglia aggregators for the esams upload_caches cluster have deaf = yes...which isn't right [13:44:58] i'm wondering why these new nodes are not in ganglia yet... [13:46:43] (03PS3) 10BBlack: Add public->private mappings for labs to dnsmasq aliases [operations/puppet] - 10https://gerrit.wikimedia.org/r/138017 [13:47:24] (03PS4) 10BBlack: Add public->private mappings for labs to dnsmasq aliases [operations/puppet] - 10https://gerrit.wikimedia.org/r/138017 [13:50:56] matanya was my commit okay? I responded to your comment about "automatically" -- I can add it back in if you want [13:54:22] !log remounted (broken) sdk1 on ms-be3003 [13:54:27] Logged the message, Master [13:58:14] (03CR) 10BBlack: [C: 032 V: 031] Add public->private mappings for labs to dnsmasq aliases [operations/puppet] - 10https://gerrit.wikimedia.org/r/138017 (owner: 10BBlack) [13:59:17] re: ms-be3003 / filling up, I don't think swift is expecting a disk to be unmounted by drive-auditor and then shortly remounted by puppet, with this ping/pong possibly happening many times [14:00:02] with that sdk unmounted now swift is trying to replicate the missing objects to ms-be3004 as paravoid described [14:03:09] (03CR) 10Nikerabbit: cxserver configuration for beta labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [14:04:18] dogeydogey: sorry, was away. I don't see your reply [14:04:30] to make things more interesting, 3001 and 3002 are in the same zone to swift's eyes [14:05:14] hashar: I pushed the dnsmasq thing, and it seems to have updated the config correctly. However, the dnsmasq daemon is started by nova with its own way of doing the args, not by initscripts (the system version of the service is disabled) [14:05:20] how does one tell nova to restart it? [14:06:33] bblack: i have absolutely no idea :-/ [14:06:56] does anyone know what TZ andrewbogott_afk is effectively in? [14:06:57] my best google hit is: https://ask.openstack.org/en/question/1442/how-does-dnsmasq-configuration-get-reloaded-when-dhcp_domain-is-set/ [14:08:07] I kinda wonder what all will get impacted by "restart nova-network" [14:08:24] (03PS2) 10Alexandros Kosiaris: Check puppet's last run [operations/puppet] - 10https://gerrit.wikimedia.org/r/141452 [14:08:44] I could also, of course, copy the current commandline from the process list and stop->start it manually myself. nova-net doesn't parent the process, so that's probably all it does, too. 
[14:08:55] let's merge and witness this biting us [14:09:37] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "Merging, let's see how it goes" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141452 (owner: 10Alexandros Kosiaris) [14:09:51] YuviPanda: https://en.wikipedia.org/wiki/User:Andrew_Bogott <-- utc -6 [14:09:59] Trminator: ah, cool :) [14:12:35] YuviPanda: yw. [14:16:09] hi papaul and welcome :) [14:16:32] Thanks [14:17:31] (03CR) 10Rush: [C: 031] "based on the previous template changes this should be gtg" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140022 (owner: 10Ori.livneh) [14:17:57] hello RobH [14:19:13] hey papaul :) [14:19:14] (03CR) 10Alexandros Kosiaris: cxserver configuration for beta labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [14:19:20] i don't think robh is around yet but he should be in an hour or so [14:19:30] papaul: welcome :-) [14:19:41] hello mark thank you [14:21:37] hi papaul welcome! [14:21:45] (03CR) 10Rush: diamond: Prefix metrics with project name (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142228 (owner: 10Yuvipanda) [14:24:08] (03PS1) 10Matanya: graphite: manage carbon service start time in puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/142245 [14:24:46] chasemp: can you please review this patch ^ ? [14:26:06] !log updated zuul cloner in git repo and deployed zuul ( tag is wmf-deploy-20140626-1 ) [14:26:11] Logged the message, Master [14:28:11] (03CR) 10QChris: Add backup role and scripts to wikimetrics (031 comment) [operations/puppet/wikimetrics] - 10https://gerrit.wikimedia.org/r/139557 (https://bugzilla.wikimedia.org/66119) (owner: 10Milimetric) [14:30:26] (03CR) 10Rush: [C: 04-1] "one comment and can you make the commit msg something like: manage graphite-carbon onboot status" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142245 (owner: 10Matanya) [14:30:44] matanya: reviewed :) [14:31:49] (03CR) 10Yuvipanda: diamond: Prefix metrics with project name (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142228 (owner: 10Yuvipanda) [14:33:00] YuviPanda: can you help me understand a bit better, do you mean to send stats to a graphite instance that is for _all_ of labs...or just toollabs? [14:33:07] chasemp: all of labs [14:33:10] PROBLEM - Hadoop NodeManager on analytics1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:34:54] chasemp: will then add other specific collectors (nginx for the proxies, grid engine ones for toollabs, users for -login and -dev, etc) [14:35:06] when you say the permissions on diamond-collector, do you mean some kind of global security group settings? please excuse my ignorance, I'm not in labs as much as I'd like [14:35:19] chasemp: ah, no. so permission group settings don't apply to outbound traffic, only inbound [14:35:54] !log restarted nova-network on labnet1001 [14:35:59] Logged the message, Master [14:36:24] chasemp: and the inbound security group on the graphite project (where diamond-collector lives) is set to allow inbound traffic on those ports [14:39:58] (03PS2) 10Matanya: graphite: manage carbon service start time in puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/142245 [14:40:10] thank you chasemp, i fixed it.
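Since the plan above is per-service collectors (nginx for the proxies, grid engine for toollabs, and so on): a diamond collector is a small Python class, roughly like the sketch below. The metric name and value are placeholders; a real collector reads from /proc, a status page, and so on:

    import diamond.collector

    class SketchCollector(diamond.collector.Collector):
        """Publishes one hypothetical gauge; diamond prepends the configured
        path prefix (e.g. the per-project prefix from the patch above)."""

        def collect(self):
            # A real collector would sample nginx, gridengine, /proc, ...
            self.publish('active_users', 42)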
[14:42:18] (03PS3) 10Matanya: graphite: manage graphite-carbon on boot status [operations/puppet] - 10https://gerrit.wikimedia.org/r/142245 [14:43:08] yurik: ping [14:45:07] RECOVERY - Hadoop NodeManager on analytics1020 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:46:45] ah, to establish my ignorance diamond-collector is a host yes? you setup for this purpose with statsd? [14:46:49] YuviPanda: ^ [14:46:56] sorry got 3 convo's at once going on [14:47:11] chasemp: yes. I looked at tungsten and applied the relevant roles (txstatsd and graphite) and that seems to work [14:47:20] chasemp: yes, diamond-collector is a host, yeah. [14:47:34] chasemp: :D 'tis ok! [14:50:21] ok assuming we use this as 'the' labs graphite, you are cool with everyone being able to see these results (I mean get access to some UI)? because they won't easily be able to stand up their own project specific stuff so it kind of all goes in this bucket [14:50:35] * anomie notes nothing is requested for SWAT this morning [14:50:43] chasemp: yeah, I'll probably remove the labs based auth in a followup patch [14:50:57] chasemp: there isn't anything sensitive here, and I don't think there'll be (prod's SQL is the reason it is private, IIRC) [14:51:14] I didn't mean private really :) more like peeps bugging you when it doesn't work [14:51:48] that is true for every service [14:52:01] chasemp: ah, yeah, that's fine :) [14:52:25] chasemp: I want to make it fully puppetized so everything mirrors production :) we already found some inconsistencies that matanya is fixing :) [14:53:07] PROBLEM - Hadoop NodeManager on analytics1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [14:53:17] chasemp: I also eventually want to setup grafana.org (there's already a patch in puppet) [14:53:18] (03CR) 10Rush: [C: 032] diamond: Prefix metrics with project name [operations/puppet] - 10https://gerrit.wikimedia.org/r/142228 (owner: 10Yuvipanda) [14:53:29] chasemp: there's a dependent patch as well :) [14:53:59] (03CR) 10Rush: [C: 032] graphite: manage graphite-carbon on boot status [operations/puppet] - 10https://gerrit.wikimedia.org/r/142245 (owner: 10Matanya) [14:54:41] bd808: is graphite-beta used at all? I don't see an instance... [14:54:58] YuviPanda: gotcha, well seems fine to me, good luck let me know if I can help with diamond minutia [14:55:16] YuviPanda: i hereby volunteer to help you maintain graphite service :) [14:55:18] chasemp: :D ty! I see that diamond has built in logrotate, so that seems to be fine (lots of labs instances have tiny /var) [14:55:22] matanya: woot, ty :) [14:55:34] YuviPanda: yes, and the large log file problem should be resolved [14:55:38] it was an oversight not a feature :) [14:55:39] yeah [14:55:41] :D [14:55:49] YuviPanda: It has a txstatsd that scap is sending data to, but something is wrong with that there and in prod. [14:55:51] although 5 days may be aggressive for labs [14:56:09] txstatsd is not great, hopefully we move away from it or fix it in the near future [14:56:14] chasemp: it doesn't seem to be taking too much space, so 'tis ok I think? [14:56:30] chasemp: diamond can send directly to graphite, why do we have txstatsd?
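On "diamond can send directly to graphite": all carbon needs is the plaintext line protocol, one "metric value timestamp" per line on TCP port 2003, which is why dropping txstatsd is even on the table. A minimal sketch (the hostname is the labs collector mentioned in the chat, used here only as a placeholder):

    import socket
    import time

    def send_metric(name, value, host='diamond-collector.eqiad.wmflabs',
                    port=2003):
        # Carbon's plaintext receiver: 'metric value unix_timestamp\n'
        sock = socket.create_connection((host, port), timeout=5)
        sock.sendall('%s %f %d\n' % (name, value, int(time.time())))
        sock.close()

    if __name__ == '__main__':
        send_metric('deployment-prep.example.heartbeat', 1.0)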
[14:57:11] the age old question :) [14:57:27] YuviPanda: The backend instance is deployment-graphite.eqiad.wmflabs if you want to mess with it [14:57:43] so assuming your interval is the same for flush from statsd and smallest retention in graphite, there isn't an obvious reason to put statsd in the middle [14:57:47] bd808: ah, hmm. no I was going to remove the auth stuff from the role, but then realized it'd affect that [14:57:59] chasemp: right. removing a moving part sounds... nice :) [14:58:15] but the advantages of statsd come down to the inherent datatypes http://statsd.readthedocs.org/en/v0.5.0/types.html [14:58:28] and the consolidation aspect that comes with meta-metrics within statsd [14:58:40] ah, hmm. but are we using those? [14:58:46] and then the issue of the moment you decide you want a more granular metric than your graphite bucket you are back to square one and statsd again [14:59:00] YuviPanda: yes in prod in places [14:59:22] my understanding is we are trying to use counters :) but it's funky in txstatsd [14:59:25] and then of course timers [14:59:35] (03PS1) 10Giuseppe Lavagetto: mediawiki: collect apc variables via diamond [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 [14:59:53] chasemp: ah, right. [15:00:05] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140626T1500) [15:00:11] chasemp: another thing I was wondering was just using this to somehow set up alerts, instead of setting up icinga again from scratch [15:00:23] bd808: can you please say what is that something in deployment-graphite.eqiad.wmflabs ? [15:00:24] chasemp: esp. considering the state of our icinga puppet code, which is... very prod specific, to say the least [15:00:51] YuviPanda: well that's certainly being done at some places. Check out http://codeascraft.com/2013/06/11/introducing-kale/ [15:00:53] <_joe_> Krinkle: apc metrics work submitted, if you want to take a look [15:00:57] YuviPanda: you can poll graphite for alerts [15:01:13] but it is somewhat funky [15:01:58] matanya: ? It's a carbon/graphite server with txstatsd. Similar to production tungsten.eqiad.wmnet host [15:02:16] chasemp: nice! [15:02:21] _joe_: yep, will take a look [15:02:24] It has a txstatsd that scap is sending data to, but something is wrong with that there and in prod. [15:02:25] honestly we have punted on a nice graphite dashboard so far and it's on the want-to-have list for prod as well [15:02:26] (later today) [15:02:30] so try out a few :) [15:02:45] I think this is in the lead? http://grafana.org/ [15:03:03] i think ori likes it [15:03:20] <_joe_> Krinkle: no rush, pinged you as you asked about it [15:03:22] matanya: https://bugzilla.wikimedia.org/show_bug.cgi?id=62667 [15:03:46] chasemp: there's https://gerrit.wikimedia.org/r/#/c/133274/ [15:03:49] matanya: TL;DR is that txstatsd is writing weird metrics [15:03:57] chasemp: matanya ^ this needs more love :) [15:04:00] chasemp: matanya but that should let us use grafana [15:04:12] RECOVERY - Hadoop NodeManager on analytics1020 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:04:33] bd808: since i don't have access to graphite it is somewhat hard to debug, but i can look [15:04:55] <_joe_> bd808: hey, thanks for writing apc_stats.php :) [15:04:59] heh YuviPanda even commented on that [15:05:08] matanya: :D [15:05:11] _joe_: yw. Did you find a place to put it?
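On polling graphite for alerts instead of standing up icinga again: the render API returns JSON datapoints, so a simple alert is just a threshold over the last non-null value (kale, linked above, is the statistical big sibling of this idea). A minimal sketch; the base URL and metric path are placeholders:

    import json
    import urllib2  # urllib.request on Python 3

    def latest(target, base='http://charcoal.wmflabs.org'):
        url = '%s/render?target=%s&from=-10min&format=json' % (base, target)
        series = json.load(urllib2.urlopen(url))
        # render API JSON: [{"target": ..., "datapoints": [[value, ts], ...]}]
        points = [v for v, t in series[0]['datapoints'] if v is not None]
        return points[-1] if points else None

    value = latest('deployment-prep.deployment-graphite.loadavg.01')
    if value is not None and value > 5:
        print('ALERT: loadavg %.2f' % value)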
[15:05:16] <_joe_> bd808: it's exactly what I was expecting as an output [15:05:32] <_joe_> bd808: yes, a separated vhost on mediawikis [15:05:38] bd808: does deployment-graphite have anything that can't be public and shipped to charcoal? [15:05:41] <_joe_> one that will be protected from the outside [15:06:08] YuviPanda: I don't think so. It's basically empty at this point. [15:06:15] <_joe_> bd808: we could add any type of endpoints exposing metrics that we can then collect to graphite easily [15:06:18] bd808: right. do you mind if we just kill that code from the role? [15:06:33] YuviPanda: Nope. Make it cool and useful [15:06:42] <_joe_> bd808: https://gerrit.wikimedia.org/r/142250, up for review [15:06:44] bd808: oh, to be more specific [15:06:47] bd808: I meant the auth code :) [15:06:57] ok, i can't even see the URLs, so i'm useless here :/ [15:07:04] YuviPanda: Yeah. I followed that. [15:07:28] * matanya apologizes to greg-g [15:08:44] (03CR) 10Reedy: mediawiki: collect apc variables via diamond (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 (owner: 10Giuseppe Lavagetto) [15:08:55] (03CR) 10Rush: "minor comment, otherwise good with me" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 (owner: 10Giuseppe Lavagetto) [15:10:29] chasemp: I want to cleanup the old data from graphite. is deleting the whisper files a good enough solution? [15:10:36] yes [15:10:49] chasemp: cool! :) [15:10:55] * YuviPanda does that [15:11:04] (03CR) 10Giuseppe Lavagetto: mediawiki: collect apc variables via diamond (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 (owner: 10Giuseppe Lavagetto) [15:11:17] matanya so the changes were okay? [15:11:19] chasemp: I'll be adding more metrics now (first nginx proxy metrics), can I ping you to review/merge? :) [15:11:48] YuviPanda: sure but I try to do the rounds on general stuff in the a.m. :) [15:12:00] chasemp: :D I'll also add the labs folks [15:12:00] dogeydogey: yes, but i didn't see your reply [15:12:11] chasemp: also I'm slightly confused about what we use ganglia for vs what we use graphite for [15:12:17] oh, I dunno if I replied correctly, I just replied directly to the email? [15:12:25] chasemp: from what I see graphite is a superset of ganglia, with a slightly(?) more confusing UI [15:12:27] And then i couldn't find that comment on gerrit [15:12:30] YuviPanda: they are duplicates, graphite is the future :) [15:12:37] no, on the web interface dogeydogey [15:12:38] chasemp: woot :) [15:12:46] andrewbogott: ^ chasemp graphite is the future, ganglia is being phased out [15:12:47] dogeydogey: https://gerrit.wikimedia.org/r/#/c/142186/ [15:13:12] PROBLEM - Hadoop NodeManager on analytics1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:13:33] matanya sorry if this is newb but i just see the comment nice first patch and then in my email i saw a question about the word automatic [15:13:34] ganglia will die slowly or quickly, but it's a long road as there is so much going on there [15:13:42] and I don't see that comment in gerrit [15:14:03] scroll a bit above and you will see next to the diff [15:15:08] matanya well do you want me to change it back? [15:15:16] yes [15:16:49] hashar, paravoid: can we chat about labs/swift a bit? [15:17:23] gimme a sec [15:17:40] I'm wondering about VMs vs.
bare metal and, if the latter, I need guidance about what kind of hardware [15:17:57] not urgent though, just trying to catch y'all before bed [15:18:00] andrewbogott: sure [15:18:09] I will not be there this evening [15:18:37] matanya git fetch ssh://scottlee@gerrit.wikimedia.org:29418/operations/puppet refs/changes/86/142186/1 && git checkout FETCH_HEAD -- correct? [15:18:40] just want to make sure [15:18:48] no need for that [15:18:58] you already have it locally [15:19:06] oh okay haha [15:19:19] fix that and push with git review -R [15:19:29] you have git review ? [15:19:41] yep, one sec [15:19:50] and in order to amend a patch when you commit, commit with git commit --amend [15:20:08] otherwise gerrit will think it is a new patch [15:21:00] bd808: oh, deployment-graphite is a project by itself? [15:21:00] what's the diff between git review and git review -R [15:21:20] bd808: hmm, it isn't. is that on a self hosted puppetmaster? [15:21:41] YuviPanda: It's part of the deployment-prep project (ie beta) [15:22:02] bd808: hmm, weird. it's sending data as deployment-graphite by itself instead of deployment-prep.deployment-graphite [15:22:28] dogeydogey: https://www.mediawiki.org/wiki/Git_review [15:23:15] shooot I messed up [15:24:35] RECOVERY - Hadoop NodeManager on analytics1020 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:24:36] !log shutting down holmium to replace disk [15:24:41] Logged the message, Master [15:25:28] (03CR) 10BryanDavis: "The basics look good but it should be tested (in beta!) to ensure that the vhost works as desired when nonexistent.conf from operations/ap" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 (owner: 10Giuseppe Lavagetto) [15:27:09] chasemp: ah, you have to +2 https://gerrit.wikimedia.org/r/#/c/142210/ as well for the patches to merge :) [15:27:45] didn't see that one, fyi I'm 'rush' in gerrit :) [15:27:54] chasemp: ah, ok :) I'll add you in the future [15:28:20] are you and bd808 gtg on where diamond stuff goes in labs? [15:28:27] seems like they have some stuff going on already [15:28:35] this isn't going to co-opt that? [15:28:40] chasemp: think so, yeah. he says the current one is pretty much empty.. [15:29:05] PROBLEM - Host holmium is DOWN: CRITICAL - Host Unreachable (208.80.154.49) [15:29:18] bd808: are you ok with a global "send diamond stats here" in labs? here being a specific project for it [15:30:22] chasemp: I think so. As long as we can find them and monitor beta I don't care where "there" is. hashar do you care? [15:30:33] (03PS2) 10Scottlee: Fixed text formatting and grammar. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142186 [15:30:41] matanya okay it should be good [15:31:41] dogeydogey: great, now lets find an op to review/merge [15:31:47] bd808: chasemp yeah YuviPanda was looking for a playground area to test diamond on labs. Beta cluster makes a lot of sense there [15:31:47] cool! [15:32:08] Hey, ops! who wants to review the first patch from dogeydogey? (Yay!) [15:32:19] * matanya points at ottomata [15:32:20] bd808: chasemp: YuviPanda: that will even let us test shiny new metrics such as diamond metrics for Varnish / Apache / MariaDB or whatever we have on beta. So yeah do it !
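Back to "is deleting the whisper files a good enough solution?" from a little earlier: yes, carbon keeps one .wsp file per metric, so cleanup really is just file removal. A minimal sketch of pruning files untouched for 30 days; the storage path is the common Debian location and both values are assumptions to adjust:

    import os
    import time

    WHISPER_DIR = '/var/lib/graphite/whisper'  # assumed carbon storage path
    MAX_AGE = 30 * 86400                       # 30 days, arbitrary

    def prune(root=WHISPER_DIR, max_age=MAX_AGE, dry_run=True):
        cutoff = time.time() - max_age
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if name.endswith('.wsp') and os.stat(path).st_mtime < cutoff:
                    print('pruning %s' % path)
                    if not dry_run:
                        os.remove(path)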
:-] [15:32:20] hashar: I think it's the opposite :) they are pointing diamond away from beta cluster to somewhere else :) [15:32:34] ah yes but beta cluster can still log there [15:32:38] * matanya turns to chasemp [15:32:38] chasemp: no, it's gathering metrics from betalabs right now :) [15:32:43] chasemp: just pointing to elsewhere. [15:32:45] additional beers granted if we get a nice dashboard that shows us the diamond metrics and replace the ganglia frontend I have never managed to fix (on labs) [15:32:57] chasemp: ah [15:33:12] chasemp: well it is probably fine to have a centralized labs diamond collector. That was the same for ganglia. [15:33:12] hashar: yeah, that's my next target [15:33:25] I don't think beta cluster needs its own separate graphite/diamond/whatever [15:33:39] ok then I think we have reached....I'm going to say it....wait for it.....consensus [15:33:46] :) [15:34:08] I thought consensus meant "decided not to change anything" [15:34:08] bd808: hashar does this mean that deployment-graphite can be killed? [15:34:26] dogeydogey == Scottlee? [15:34:36] ottomata yes [15:34:40] bd808: no no...consensus means we agree on who to blame :D [15:34:50] oh dear :) [15:35:39] YuviPanda: hmmm.... let's leave it for now. I may get around to using it to fix the statsd->graphite stuff for scap. [15:35:42] (03PS3) 10Rush: diamond: Enable diamond collection on labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/142210 (owner: 10Yuvipanda) [15:35:52] that is more change everything without anybody complaining (consequence: nothing changes :D ) [15:35:56] bd808: hmm, ok. [15:36:07] (03CR) 10Rush: [C: 032] "talked it over, seems everyone is ok with one graphite for labs atm" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142210 (owner: 10Yuvipanda) [15:36:12] (03CR) 10Rush: [V: 032] "talked it over, seems everyone is ok with one graphite for labs atm" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142210 (owner: 10Yuvipanda) [15:36:16] bd808: do you want deployment-graphite to be password protected? :) [15:36:24] \o/ [15:36:44] YuviPanda: I'm fine with the password being removed. [15:36:44] (03PS5) 10Rush: diamond: Prefix metrics with project name [operations/puppet] - 10https://gerrit.wikimedia.org/r/142228 (owner: 10Yuvipanda) [15:36:54] bd808: cool! \o/ [15:37:08] (03CR) 10Rush: [V: 032] diamond: Prefix metrics with project name [operations/puppet] - 10https://gerrit.wikimedia.org/r/142228 (owner: 10Yuvipanda) [15:38:26] ok all merged I think [15:38:57] chasemp: \o/ ty. [15:39:20] ottomata: you are looking at the patch ? [15:41:47] (03CR) 10KartikMistry: "Ping for review!" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [15:42:52] RECOVERY - Host holmium is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:43:37] matanya: in one place in that patch puppet is made lowercase and in two others it's made uppercase :) [15:44:00] I hate to go all full on pedantic but proper noun and all, but at least consistency [15:44:09] dogeydogey: ^ [15:44:29] have reviewed...in meeting now [15:44:51] chasemp can you be more specific? [15:45:02] when it's referred to standalone I tried to make sure it was capitalized [15:45:10] when it was referred to as a command i left it lowercase [15:45:12] PROBLEM - Varnish HTTP blog on holmium is CRITICAL: Connection refused [15:45:12] PROBLEM - HTTP on holmium is CRITICAL: Connection refused [15:45:36] dogeydogey: ah makes sense, let me give it another once over.
my brain is in two places now :) [15:46:09] chasemp sure, I mean it's totally possible I missed some so just let me know [15:47:56] (03PS1) 10Reedy: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142255 [15:47:58] (03PS1) 10Reedy: testwiki to 1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142256 [15:48:00] (03PS1) 10Reedy: Wikipedias to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142257 [15:48:02] (03PS1) 10Reedy: group0 to 1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142258 [15:48:07] (03CR) 10Reedy: [C: 032] Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142255 (owner: 10Reedy) [15:48:28] bblack, pong [15:48:55] paravoid, hashar, should we make an appointment for tomorrow? [15:49:26] can do now [15:49:30] (03CR) 10Rush: Fixed text formatting and grammar. (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142186 (owner: 10Scottlee) [15:49:30] i too [15:49:50] if this is about swift, we should probably involve godog as well [15:50:31] (03Merged) 10jenkins-bot: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142255 (owner: 10Reedy) [15:50:36] (03CR) 10Ottomata: Fixed text formatting and grammar. (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142186 (owner: 10Scottlee) [15:50:54] (03CR) 10Reedy: [C: 032] testwiki to 1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142256 (owner: 10Reedy) [15:51:10] paravoid, hashar: So, the main question on my mind is -- do we want a production swift cluster for labs, or just a cluster running on VMs? [15:51:26] I know that paravoid advised production before; can you recap the argument for that? [15:51:46] the idea for that was to provide this as a generic service for labs users [15:51:51] as an alternative to e.g. labstore [15:52:32] PROBLEM - Hadoop NodeManager on analytics1020 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:53:02] from beta cluster point of view, we would need a Swift service to phase out the files being saved on NFS and reflect what is in production [15:53:12] (yeah that does not answer) [15:53:13] paravoid: sure. But it's not obvious to me that people outside of beta would use it, only because an object store is a pretty unfamiliar tool... [15:53:31] Might be that if we build it they will come :) [15:53:52] it's not _that_ unfamiliar I think [15:53:57] (03CR) 10Rush: "I will defer to otto here since two ppl's nitpicking is probably not needed :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142186 (owner: 10Scottlee) [15:53:58] but yes, I think in the end it's your call [15:54:05] yurik: the fix for labs split-horizon DNS went in today, so that mobile-beta can reach the mobile-beta-portal thingy [15:54:05] I am not sure we can set up Swift on top of VMs, that might not work very well. Dedicated hardware would give ops a nice pre-production platform which is exercised by beta. Might have some uses [15:54:14] (03Merged) 10jenkins-bot: testwiki to 1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142256 (owner: 10Reedy) [15:54:32] curious why wouldn't swift work on a vm? [15:54:33] !log reedy Started scap: testwiki to 1.24wmf11 and build l10n cache [15:54:36] hashar: True, in a VM performance will probably be very slow and weird.
[15:54:39] Logged the message, Master [15:54:42] yurik: but the mobile-beta-portal thingy is returning an empty set of proxies as json '[]', which is invalid, so the update still fails. empty sets should be '{}' [15:54:43] it will work fine [15:54:57] yurik: (empty hash instead of empty array) [15:55:06] slow for what? :) [15:55:10] beta doesn't get all that much traffic [15:55:13] I'm sure it'll work fine [15:55:19] (03CR) 10Krinkle: mediawiki: collect apc variables via diamond (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 (owner: 10Giuseppe Lavagetto) [15:55:23] paravoid: swift hardware is quite expensive, right? Or do we have spare hardware for this that you know of? [15:55:28] I am not sure how Swift compares to the shared NFS server we have, but that eventually ends up being easier to scale [15:55:29] !log reedy scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="testwiki" --list-file="/a/common/wmf-config/extension-list" --output="/tmp/tmp.EjEynr9oww" ' returned non-zero exit status 1 (duration: 00m 55s) [15:55:29] bblack, the [] vs {} is a well known general API bug, i will try to fix it for our case [15:55:34] Logged the message, Master [15:55:37] it's not expensive, no [15:55:42] ah, ok. [15:55:47] yurik: either that or just define a proxy [15:55:48] the ones we buy for prod are, but that's because we get lots of disks [15:55:49] !log reedy Started scap: testwiki to 1.24wmf11 and build l10n cache [15:55:58] bblack: the bug is https://bugzilla.wikimedia.org/show_bug.cgi?id=10887 [15:55:59] swift doesn't need anything very special really [15:56:00] bblack, ok, will define a proxy [15:56:14] !log reedy scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="testwiki" --list-file="/a/common/wmf-config/extension-list" --output="/tmp/tmp.9SaYNRzegr" ' returned non-zero exit status 1 (duration: 00m 24s) [15:56:16] paravoid: Hm, ok then :) [15:56:27] it is just a regular server ? [15:56:28] I don't see any arguments from the performance side [15:56:30] so maybe a few commodity servers would do ? [15:57:03] (03CR) 10Krinkle: mediawiki: collect apc variables via diamond (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 (owner: 10Giuseppe Lavagetto) [15:57:03] I may see some arguments on the cost side, but this should be only put into perspective with whether this could be a useful service for the Labs platform [15:57:39] NFS isn't good enough, or it is wanted to replicate prod ? [15:57:42] it sounds good to me in principle but to tell you the truth I'm a bit detached from the labs community [15:57:45] so I can't say for sure [15:58:13] matanya: to replicate prod [15:58:29] that's for *beta*, not for the other thing [15:58:31] matanya: we can't test swift/mw interactions on beta [15:58:34] it's two separate goals really [15:58:43] yeah, I know that [15:58:46] I guess another question is -- even if we had production swift, would we want a VM swift inside deployment-prep for testing and staging purposes? [15:58:53] matanya how come the spacing on all the puppet files is so bad [15:59:02] means ?
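The '[]' vs '{}' exchange above is PHP's classic empty-array ambiguity (that is what bug 10887 is about): an empty PHP array has no string keys, so it serializes as a JSON list, and consumers expecting an object receive '[]'. A sketch of the defensive parse a consumer ends up doing, in Python for illustration; the actual netmapper fix that lands later in the log is in the C vmod:

    import json

    def parse_proxy_map(text):
        """Accept '{}' (correct) and '[]' (PHP's empty-array quirk)."""
        data = json.loads(text)
        if data == []:  # an empty PHP assoc array came out as a list
            return {}
        if isinstance(data, dict):
            return data
        raise ValueError('expected a JSON object, got %r' % (data,))

    assert parse_proxy_map('[]') == {}
    assert parse_proxy_map('{"OperaMini": ["1.2.3.0/24"]}')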
[15:59:10] andrewbogott: depends [15:59:21] bblack, just created a test proxy [15:59:22] deployment-prep is really a fuzzy thing around here [15:59:34] people claim it's for testing everything but in truth it's really about testing mediawiki :) [15:59:47] bblack, sorry, hold on, i'm being slow [15:59:50] in that sense, it could use the labs-in-production swift cluster just fine [15:59:51] I don't think any devs are involved with the Swift infra. So there might be no use for a Swift setup inside beta cluster. [15:59:52] that's not totally true, I think hashar tests puppet patches there too. [15:59:55] dogeydogey: https://wikitech.wikimedia.org/wiki/Puppet_coding [16:00:14] not really [16:00:26] if he does, I'm not aware of which ones those would be [16:00:36] andrewbogott: yeah we do apply puppet patches on beta cluster. We have our own puppet master there that attempts to self rebase every hour [16:00:38] (03PS1) 10Reedy: Update extension-list to match skin rearrangements [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142262 [16:00:47] I just know that whenever I look at the beta puppetmaster it has diffs :) [16:00:48] puppet patches for what? [16:00:54] (03CR) 10Reedy: [C: 032] Update extension-list to match skin rearrangements [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142262 (owner: 10Reedy) [16:00:56] !log reedy Started scap: testwiki to 1.24wmf11 and build l10n cache [16:01:04] (03Merged) 10jenkins-bot: Update extension-list to match skin rearrangements [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142262 (owner: 10Reedy) [16:01:13] let me look [16:01:14] !log reedy scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="testwiki" --list-file="/a/common/wmf-config/extension-list" --output="/tmp/tmp.XetXfk5RPi" ' returned non-zero exit status 1 (duration: 00m 18s) [16:01:33] * akosiaris hates puppet [16:01:41] paravoid: so what are the ramifications of setting up a VM swift cluster and then later migrating it to production hardware? That's an established use case, right ? Same as migrating between DCs? [16:01:45] current puppet patches applied on beta cluster http://paste.debian.net/plain/106880 [16:02:06] andrewbogott: would we be able to integrate a VM swift cluster with keystone? [16:02:11] with the regular keystone that is? [16:02:11] that might be some VCL tests, patches pending merging, scap related work etc [16:02:14] from a security perspective etc. [16:02:42] bblack, done [16:03:19] hm. [16:03:56] if yes, then yes, it'd be easy to move from VM to bare metal [16:03:57] (03CR) 10Krinkle: mediawiki: collect apc variables via diamond (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 (owner: 10Giuseppe Lavagetto) [16:04:21] paravoid: So there are two parts, right? The swift cluster itself talking to keystone, and users within labs authenticating via keystone in order to talk to the swift cluster. [16:04:25] The latter we have to do either way [16:04:31] RECOVERY - Hadoop NodeManager on analytics1020 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:04:37] (03PS1) 10Reedy: Move Nostalgia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142263 [16:04:50] the former… doesn't seem hard but maybe I'm overlooking something.
[16:04:54] (03CR) 10Reedy: [C: 032] Move Nostalgia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142263 (owner: 10Reedy) [16:04:55] if you set up something in beta, just for beta, then there's no reason to mess with keystone [16:05:00] (03Merged) 10jenkins-bot: Move Nostalgia [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142263 (owner: 10Reedy) [16:05:02] !log reedy Started scap: testwiki to 1.24wmf11 and build l10n cache [16:05:05] yep, true. [16:05:11] you can do the same as prod, just have a static account [16:05:17] (03CR) 10Krinkle: "I think it's the other way around. nonexistent.conf has to be the first, not the last. That's how apache-config/all.conf is structured. So" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 (owner: 10Giuseppe Lavagetto) [16:05:35] I definitely like the idea of having it available as a labs-wide service. Just don't want to set up a bunch of hardware and have it sit idle due to being unpopular :) [16:05:44] (03Abandoned) 10Reedy: Add apc.php [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141966 (owner: 10Reedy) [16:05:51] What is the spacing rule between classes and nested classes? [16:05:51] fair enough :) [16:05:57] (03PS2) 10Reedy: Wikipedias to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142257 [16:06:10] yurik: works! thanks. you should get your correct (beta) carriers/proxies data on beta now with updates [16:06:22] (03PS2) 10Reedy: group0 to 1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142258 [16:06:31] bblack, i copied opera proxy data since that is public anyway [16:06:37] ok [16:07:22] bblack, thanks!!! [16:07:52] np [16:07:56] FWIW, sounds like having a replica swift cluster (on VMs) for beta is needed anyways and it is relatively easy now to do, not sure how it is meant to be populated though [16:07:57] (03PS1) 10Reedy: Remove extension-list-1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142264 [16:08:57] So, now I'm back to a few questions ago: Do we really need /two/ swift implementations, one inside beta and one outside? If the labs-wide swift uses keystone, that's different from what mediawiki expects so might make for an ugly divergence in beta. [16:09:12] would it be? [16:09:43] why would keystone vs. simpleauth make a difference for mediawiki? [16:09:48] Hm... [16:09:53] dunno really [16:09:55] Oh, you're right. Either way it's just a single login/password [16:09:55] it might :) [16:10:00] so mediawiki wouldn't need to know the difference. [16:10:02] hopefully [16:10:52] (03PS31) 10Nikerabbit: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 [16:11:15] So -- ok. I will see if I can set up a cluster in labs that uses production keystone, and think harder about the security implications. That should get us what we need for beta in any case. [16:11:43] And if it needs to move to bare metal, that's probably something that I can sort out in Paravoid's absence anyway, since you don't seem to have strong feelings about hardware. [16:12:11] (03CR) 10jenkins-bot: [V: 04-1] cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [16:13:47] andrewbogott: so swift cluster inside the beta project? [16:14:07] hashar: No, it will be in a different project and use a public IP. [16:14:13] ah nice [16:14:17] And hopefully be useful to other projects as well. 
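On keystone vs. a static account: from MediaWiki's side both reduce to one login/password, which is why the divergence worry above mostly evaporates. A minimal python-swiftclient sketch of the two auth styles; every endpoint, tenant and credential here is a made-up placeholder:

    import swiftclient

    # Keystone-backed (the labs-wide idea): v2 auth plus a tenant.
    ks = swiftclient.client.Connection(
        authurl='https://keystone.example.org:5000/v2.0',
        user='beta', key='secret',
        tenant_name='deployment-prep', auth_version='2.0')

    # Static-account style (closer to what prod uses): v1 auth.
    st = swiftclient.client.Connection(
        authurl='https://swift.example.org/auth/v1.0',
        user='mw:media', key='secret')

    for conn in (ks, st):
        headers, containers = conn.get_account()
        print(headers.get('x-account-object-count'))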
[16:14:24] so beta just has to point to the service IP and done (at least for us) [16:14:29] But this is all pending me figuring out how to use keystone from w/in labs [16:15:13] andrewbogott: you might want to ping https://bugzilla.wikimedia.org/show_bug.cgi?id=62835 "Setup a Swift cluster to match production" that will ping Greg [16:15:44] 'k [16:15:45] the other part that may need a modification is the custom middleware that we use in prod [16:15:55] we may end up ditching that altogether, or not [16:16:32] but if we don't, and beta needs it (not obvious if it will), then it might need some mods to work only for the "beta" account [16:17:09] bblack, when you get a moment, could you fix netmapper so that it accepts an empty [] as a valid empty value? Just in case [16:17:21] obviously no rush on that one :) [16:17:31] yeah [16:17:51] paravoid: I'll have to read more code to understand all that. Is there more to it than just thumbnail generation? [16:18:12] it's url rewriting + 404 handler for thumbnails [16:18:38] the url rewriting is for thumbnails? Or are those two different things? [16:20:29] (03PS1) 10Reedy: Remove extension-list-1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142265 [16:21:38] (03PS1) 10Reedy: Remove extension-list-1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142266 [16:26:16] matanya: for? [16:26:34] not being able to fix the bug you reported due to lack of rights :) [16:28:53] matanya: ah, yeah, you better apologize :) [16:29:01] matanya: thanks for trying, though [16:29:37] PROBLEM - Puppet freshness on stat1002 is CRITICAL: Last successful Puppet run was Thu 26 Jun 2014 13:28:52 UTC [16:29:48] sure. I will fix it in the future i hope. [16:29:56] yeah :/ [16:30:32] it seems trivial from here, but can't know for sure until i check [16:30:44] andrewbogott: need to get out for groceries before my daughter lands back home :/ [16:30:56] andrewbogott: we can follow up by email or tomorrow [16:30:57] hashar: ok, I think I know how to proceed. [16:31:03] great! [16:31:04] And, anyway, won't start on this for a few days at best. [16:31:11] yeah I can imagine [16:31:22] at least it is starting :-] [16:32:23] !log reedy Finished scap: testwiki to 1.24wmf11 and build l10n cache (duration: 27m 20s) [16:32:28] Logged the message, Master [16:33:37] PROBLEM - Puppet freshness on stat1003 is CRITICAL: Last successful Puppet run was Thu 26 Jun 2014 13:33:10 UTC [16:34:34] bblack, any way I can tell if it is safe to pybal 2 servers? [16:35:05] ottomata: just as long as it's been half an hour since they went into the backends list in puppet, so that we know the puppet update went everywhere [16:35:14] (because it goes somewhere that tells analytics about it, I dunno where) [16:35:16] thanks _joe_ for the write up, and well done with marking it all up :) [16:35:19] ah ok, yeah been a couple of hours now [16:36:59] pybal is still fenari, right? [16:37:51] yeah [16:38:03] death to fenari [16:38:05] ok, saving... [16:38:14] http://noc.wikimedia.org/pybal/esams/upload [16:38:23] matanya: btw, see the ever growing list of fun times (and ways to make things less exciting in the future, hopefully): https://wikitech.wikimedia.org/wiki/Incident_documentation/QR201407 [16:38:31] fenari should be shipped to the office for nostalgia [16:39:22] greg-g: yeah, i followed your mail, and the work to compile this, great work! [16:39:45] so, ok bblack [16:39:48] how can I tell if it is working [16:39:51] varnishncsa -n frontend ?
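To unpack "url rewriting + 404 handler for thumbnails": the custom middleware maps public upload URLs onto swift object paths, and when a requested thumbnail is missing it asks an image scaler to render one (and would store the result back). A very rough sketch of the two jobs as plain functions; the URL layout and the injected callables are illustrative, not the real rewrite middleware:

    def rewrite(path):
        # Public upload URL -> swift object path (layout is made up).
        prefix = '/wikipedia/commons/thumb/'
        if path.startswith(prefix):
            return '/v1/AUTH_mw/thumbs/' + path[len(prefix):]
        return path

    def get_thumbnail(swift_get, scaler_render, path):
        """swift_get and scaler_render are hypothetical callables:
        fetch-from-swift and render-via-image-scaler respectively."""
        obj = swift_get(rewrite(path))
        if obj is None:              # the 404 case
            obj = scaler_render(path)
            # a real middleware would also PUT the result into swift
        return obj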
[16:41:30] andrewbogott: what check do you mean in : https://wikitech.wikimedia.org/wiki/Incident_documentation/20140618-Wikitech ? [16:41:48] akosiaris: any more feedback on apache::site? [16:42:26] fenari is not nearly old and legendary enough for nostalgia :P [16:42:53] matanya: there's a freshness check. But puppet is sometimes registered as 'fresh' even if a run fails. I thought I understood when it was and wasn't but now I'm not so sure. [16:43:37] andrewbogott: does that depend on the check return code? where is that code? [16:43:56] matanya: it doesn't check the return code. I don't know much about how it works, it's kind of a mess [16:44:11] !log reedy Purged l10n cache for 1.24wmf8 [16:44:12] Anyway, i think I know how to write a proper check. We have one on labs already [16:44:16] Logged the message, Master [16:44:19] the interesting part is just getting it to report back to icinga. [16:44:58] !log reedy Purged l10n cache for 1.24wmf7 [16:45:03] Logged the message, Master [16:45:32] !log reedy Purged l10n cache for 1.24wmf6 [16:45:38] Logged the message, Master [16:45:52] !log reedy Purged l10n cache for 1.24wmf5 [16:45:57] Logged the message, Master [16:45:57] Reedy: is bd808's purge of old (30+day) wmfXX versions still working? [16:46:03] I think akosiaris did something similar in https://gerrit.wikimedia.org/r/#/c/141452/2 andrewbogott [16:46:06] !log reedy Purged l10n cache for 1.24wmf4 [16:46:11] Logged the message, Master [16:46:29] greg-g: ? [16:46:50] godog: oh, maybe that's done then? [16:46:59] In which case… virt1008 should be screaming [16:47:10] Reedy: bd808 added a step in.... somewhere, that would purge old wmfXX versions from the apaches after the 30 day "just in case for varnish cache static assets" thing [16:47:10] it was merged today [16:47:20] ok then! [16:47:31] andrewbogott: not sure what's the status, I remember there was some discussion in an RT too [16:47:37] (03PS1) 10Ottomata: Add cp3017-cp3018 to list of esams upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/142276 [16:47:38] andrewbogott: isn't puppet on virt1008 disabled ? [16:47:44] bd808: where is that purge old wmfXX's step? [16:48:04] (03CR) 10Ottomata: [C: 032 V: 032] Add cp3017-cp3018 to list of esams upload caches [operations/puppet] - 10https://gerrit.wikimedia.org/r/142276 (owner: 10Ottomata) [16:48:04] !log reedy Started scap: l10nupdate for 1.24wmf11 for Skins I9395b0e1983122b12bedf003d6398da5ddfd5651 [16:48:07] Logged the message, Master [16:48:10] matanya: no, I merged an intentional double-definition [16:48:12] as a canary [16:48:16] so far icinga has told me nothing :( [16:48:37] greg-g: It's manual, but documented in https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#New_branch_.22Thursday.22_deploy [16:48:41] ahhh [16:48:53] i think akosiaris can clarify here [16:48:56] for some reason i thought it was automatic [16:49:12] I never figured out a safe way to automate it [16:49:19] We've currently still got 1.24wmf1 in play.. [16:49:53] could that be causing the apc crap? [16:50:08] I don't think so...
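On writing "a proper check" for puppet freshness: the agent records every run in /var/lib/puppet/state/last_run_summary.yaml, which covers both staleness and the failed-run case the current check misses. A minimal sketch; the key names follow puppet's summary format as I recall it, so verify against a real file:

    import sys
    import time
    import yaml

    SUMMARY = '/var/lib/puppet/state/last_run_summary.yaml'
    MAX_AGE = 2 * 3600  # two hours, arbitrary

    def check(path=SUMMARY):
        with open(path) as f:
            data = yaml.safe_load(f)
        last_run = data.get('time', {}).get('last_run', 0)
        failures = data.get('events', {}).get('failure', 0)
        age = time.time() - last_run
        if age > MAX_AGE:
            print('CRITICAL: last puppet run %d seconds ago' % age)
            return 2
        if failures:
            print('CRITICAL: last puppet run had %d failures' % failures)
            return 2
        print('OK: puppet ran recently with no failures')
        return 0

    if __name__ == '__main__':
        sys.exit(check())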
[16:50:15] Nothing *should* be touching the php files [16:50:20] kk [16:50:38] probably time to kill some of them off again [16:50:45] * greg-g nods [16:51:16] (03PS1) 10Ottomata: Making sure mariadb module is on Sean's last commit [operations/puppet] - 10https://gerrit.wikimedia.org/r/142278 [16:51:32] springle: is that correct ^ [16:51:35] Personally I'd vote for "add a branch; delete a branch" in the train deploys [16:51:44] somehow my previous commit changed that module sha [16:51:47] That'd probably make more sense [16:51:49] bd808: agreed [16:51:50] making sure that is the right one [16:51:52] I think 1-4 can go? 5 needs to stay? [16:52:13] * greg-g looks at a calendar [16:52:34] yeah [16:52:40] looks right to me [16:52:48] 5 went til 5/29 [16:53:07] (03PS1) 10BBlack: Allow empty JSON array as valid input [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/142279 [16:53:59] (03CR) 10Ottomata: [C: 032 V: 032] Making sure mariadb module is on Sean's last commit [operations/puppet] - 10https://gerrit.wikimedia.org/r/142278 (owner: 10Ottomata) [16:55:47] _joe_: apache::site? :) [16:56:00] (03CR) 10BBlack: [C: 032 V: 032] Allow empty JSON array as valid input [operations/software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/142279 (owner: 10BBlack) [16:56:09] akosiaris, godog, looks like that test isn't actually applied anyplace… is there a manual step to refresh icinga tests? [16:59:45] andrewbogott: mhh I don't know how that works, haven't been playing with icinga and puppet [17:00:16] ok, I will come back after lunch and investigate if it hasn't already fixed itself by then. [17:00:47] <_joe_> ori: what about that? [17:00:56] <_joe_> ori: I thought I gave +1 yesterday [17:01:21] <_joe_> didn't I? [17:02:01] you did? [17:02:22] he did? [17:02:30] oh yes, you did [17:02:31] my bad [17:02:37] sorry! [17:02:46] <_joe_> :) [17:03:00] i'm going to merge it then, since the only change between the last time akosiaris reviewed it and the current PS is that i took his suggestions [17:03:05] so i can't imagine he'd be unhappy with that [17:03:14] <_joe_> oh wait! [17:03:21] * ori oh waits [17:03:25] <_joe_> I was unhappy with akosiaris suggestions! -1! [17:03:27] <_joe_> :D [17:03:30] heheh [17:03:47] <_joe_> go on [17:04:13] there are two follow-ups if you're disappointed that my nag was about something you already reviewed ;) [17:04:27] <_joe_> (I'm convincing myself we should get rid of apache-config and use a more debiany config structure even for mw) [17:04:39] !log reedy Finished scap: l10nupdate for 1.24wmf11 for Skins I9395b0e1983122b12bedf003d6398da5ddfd5651 (duration: 16m 35s) [17:04:44] _joe_: i think we should! [17:04:45] Logged the message, Master [17:05:26] !log reedy Synchronized php-1.24wmf11/resources/Resources.php: I1237909d7e058137d55e5de9fa4d64fe1f7f9472 (duration: 00m 14s) [17:05:31] Logged the message, Master [17:05:47] <_joe_> ori: ok so with the excuse of HAT, I'll try to make our apache config fit into puppet and look much more familiar to any poor soul of a sysadmin seeing it for the first time [17:05:47] _joe_: actually can i ask you to merge it?
[17:05:52] aude: just replied to that email from Lydia_WMDE [17:06:09] <_joe_> ori: yep, just let me finish answering to comments on the apc patch [17:06:10] i'm allowed to merge it at this stage but it'd be nicer for a patch like this to have buy-in expressed in a +2 [17:07:03] <_joe_> ok [17:09:12] _joe_: there are some easy steps to take there, like replacing the ugly block of LoadModules at the top of modules/mediawiki/templates/apache/apache2.conf.erb (which is already managed by puppet) with include apache::mod::* statements in the manifest [17:09:32] (03PS1) 10BBlack: varnish (3.0.5plus-wm6) unstable; urgency=low [operations/debs/varnish] (3.0.5-plus-wm) - 10https://gerrit.wikimedia.org/r/142283 [17:09:45] <_joe_> ori: yes and put global confs under conf.d, and vhosts in sites-available [17:09:52] <_joe_> and... [17:10:06] yep! [17:10:09] <_joe_> ori: we can work on that step-by-step [17:10:29] <_joe_> I've transitioned much worse apache configs in the past [17:10:41] <_joe_> worse as in more complicated [17:10:42] hard to imagine but encouraging to hear [17:10:49] <_joe_> this is not so bad actually :) [17:11:02] <_joe_> ori: If I only could show you... [17:11:14] <_joe_> trust me, this is a *very fine* apache config [17:12:08] <_joe_> it just has things in unusual places, but you don't have gotos in the config made with rewrite rules [17:12:24] <_joe_> just to say one of the creative WTFs I've seen in the past [17:12:33] is mod_rewrite turing-complete? [17:12:55] <_joe_> not sure :P [17:13:00] (03PS5) 10Dzahn: deployment,replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137993 (owner: 10Rush) [17:13:18] ori: yes [17:13:18] <_joe_> we could rewrite mediawiki with mod_rewrite then :P [17:13:20] http://beza1e1.tuxen.de/articles/accidentally_turing_complete.html [17:13:24] <_joe_> lol [17:13:36] ahahaha [17:13:37] awesome [17:14:06] oh look , and here we are [17:14:12] MediaWiki Templates [17:14:12] In MediaWiki you can define templates. Since they provide recursion, you can apparently implement lambda calculus. 
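A rough sketch of the first step ori proposes above: replacing the LoadModule block in modules/mediawiki/templates/apache/apache2.conf.erb with includes in the manifest. The apache::mod::* naming follows the convention mentioned in the log, but the module list here is illustrative rather than the real set:

```puppet
class mediawiki::web {
    # Each include replaces a LoadModule line previously hardcoded in
    # the apache2.conf.erb template:
    include apache::mod::alias
    include apache::mod::rewrite
    include apache::mod::headers
    include apache::mod::expires
    include apache::mod::php5
}
```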
[17:14:15] <_joe_> ori: now you have your pet project for the next few years [17:14:33] (03PS1) 10Reedy: Remove 1.24wmf[1-4] symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142286 [17:14:50] (03CR) 10Reedy: [C: 032] Remove 1.24wmf[1-4] symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142286 (owner: 10Reedy) [17:15:05] (03Merged) 10jenkins-bot: Remove 1.24wmf[1-4] symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142286 (owner: 10Reedy) [17:16:00] known in some circles as https://en.wikipedia.org/wiki/Turing_tarpit [17:16:36] (03PS2) 10Giuseppe Lavagetto: mediawiki: collect apc variables via diamond [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 [17:17:00] "Recreational mathematics" :) [17:17:51] (03PS8) 10Giuseppe Lavagetto: Add a lightweight apache::site resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 (owner: 10Ori.livneh) [17:18:00] <_joe_> it sounds like "educational porn" [17:18:36] <_joe_> I think maths, as any form of hard thinking, can be ejoyable, not recreational [17:19:02] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add a lightweight apache::site resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/140242 (owner: 10Ori.livneh) [17:19:03] "recreational mathematics" sounds like something that would be in the movie Pi; it makes the practitioner experience drug like experiences [17:19:41] <_joe_> ori: here it is [17:19:44] _joe_: thanks! [17:19:51] <_joe_> and now, I'm off again :) [17:19:58] have a good night! [17:26:06] manybubbles: yt? [17:26:18] ottomata: yeah [17:26:26] let's do it? elastic1017? [17:27:07] sure! [17:27:11] that puppet patch is merged [17:27:14] i should just be able to run puppet [17:27:15] then deploy [17:27:22] then restart es, ja? [17:27:50] hey ori! saw http://charcoal.wmflabs.org/? [17:27:57] puppetized and running in all of labs :) [17:28:04] (guest/guest, need to remove the password soon) [17:28:52] didn't we use to have graphite.wmflabs.org? [17:29:19] ori: yeah, dunno what the status of that is. Need to figure that out and steal the domain [17:29:29] bd808 would know presumably? [17:29:40] it works in fact [17:29:42] he setup graphite-beta.wmflabs.org [17:29:47] oh [17:29:49] right [17:30:06] ori: which we'll leave alone for now, but take out of auth as well [17:30:17] RECOVERY - Puppet freshness on elastic1017 is OK: puppet ran at Thu Jun 26 17:30:12 UTC 2014 [17:30:58] ori: next step is to get https://gerrit.wikimedia.org/r/#/c/133274/ done, I guess. [17:31:40] ottomata: run puppet - deploy plugins, restart elasticsearch [17:31:42] that should be it [17:32:41] oh yeah [17:32:44] i should update it [17:32:53] k puppet is running [17:32:54] YuviPanda: i'm amenable to being nagged about this [17:33:56] * YuviPanda nags ori about https://gerrit.wikimedia.org/r/#/c/133274/ :) [17:36:56] RECOVERY - Disk space on elastic1017 is OK: DISK OK [17:37:16] RECOVERY - check configured eth on elastic1017 is OK: NRPE: Unable to read output [17:37:16] RECOVERY - RAID on elastic1017 is OK: OK: no disks configured for RAID [17:37:17] RECOVERY - check if dhclient is running on elastic1017 is OK: PROCS OK: 0 processes with command name dhclient [17:37:36] RECOVERY - puppet disabled on elastic1017 is OK: OK [17:37:56] RECOVERY - DPKG on elastic1017 is OK: All packages OK [17:39:04] plugins heading in [17:43:14] ottomata: jars showed up empty [17:44:38] haven't deploye dyet [17:44:41] had to get salt right [17:44:51] how about now? 
[17:44:57] wait one sec [17:45:02] ok now? [17:45:16] looks good, ja? [17:45:33] also good: [17:45:34] [2014-06-26 17:42:11,411][ERROR][bootstrap ] {1.2.1}: Initialization Failed ... [17:45:34] - ElasticsearchException[Missing mandatory plugins [analysis-icu, experimental highlighter]] [17:45:37] on initial start :) [17:46:09] ottomata: looks like it should start ok now [17:46:15] cool [17:46:16] starting [17:46:40] think its going [17:46:54] bd808: ok with "beta: Small scap fixes", to be merged ..like now [17:47:06] https://gerrit.wikimedia.org/r/#/c/140045/1,publish [17:47:08] mutante: Sure. [17:47:17] (03CR) 10Dzahn: [C: 032] beta: Small scap fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/140045 (owner: 10BryanDavis) [17:47:23] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2157: active_shards: 6470: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [17:47:56] bd808: there we go [17:48:14] ottomata: looks good - lets let it get some shards migrated to itself and then we can declare that method of setup great [17:48:46] it does that automatically, ja? [17:48:46] greg-g: replied [17:48:53] manybubbles: should I go ahead and do the same for 1018 and 1019? [17:48:56] or shoudl we wait? [17:48:58] ottomata: yeah - its doing it now. it has 5 shards on it now [17:49:04] cool [17:49:18] ottomata: it has the ssd, rgith? [17:49:30] yup [17:49:32] they all should [17:49:36] chris said they did... [17:49:43] how did you check before? [17:50:29] sudo hdparm -I /dev/sda [17:50:35] and, yeah, it has the ssd in sda [17:51:03] RECOVERY - NTP on elastic1017 is OK: NTP OK: Offset -0.00514960289 secs [17:51:19] ja looks good [17:51:20] all 3 do [17:51:25] sweet - go ahead [17:51:27] cool [17:51:53] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 0.38 ms [17:54:17] bblack, the other two hosts have been in backend conf for a while now [17:54:22] i'm going to go ahead and do pybal [17:54:52] aude: thanks much [17:54:53] PROBLEM - Host dataset1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.11) [17:55:05] ottomata: ok. might want to take those slow as well [17:55:07] aude: is the new table taken care of? do you need any help from Ops for that for today? [17:55:12] bblack? [17:55:25] like, do one? [17:55:27] rather than 2? [17:55:34] in pybal? [17:55:53] 2+2 should be fine, and just wait for the graphs to settle to stability between [17:55:53] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 0.62 ms [17:56:05] http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=network_report&s=by+name&c=Upload+caches+esams&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4 [17:56:32] not totally sure what you mean [17:56:33] RECOVERY - Puppet freshness on stat1002 is OK: puppet ran at Thu Jun 26 17:56:22 UTC 2014 [17:56:36] i've already done the first two in pybal [17:56:39] ok [17:56:43] that was done hours ago [17:56:50] ah, I misunderstood [17:57:02] yeah, 3015 and 3016 are in for both backend and frontend [17:57:04] greg-g: i can do it [17:57:09] aude: cool [17:57:14] i added 3017 and 3018 in puppet an hourish ago [17:57:20] in that case, I'm wondering why the net traffic discrepancy? [17:57:27] hm [17:57:27] did the tables for wikiquote etc... 
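The "Missing mandatory plugins" bootstrap failure manybubbles calls "also good" above is elasticsearch refusing to start until the deploy has put the jars in place, which turns a silently bad deploy into a loud error. It is driven by a single elasticsearch.yml setting; a sketch, with the plugin names taken from the quoted error message (the real config's spelling may differ):

```yaml
# elasticsearch.yml: refuse to start unless these plugins are installed.
# Names must match what the plugins register as.
plugin.mandatory: analysis-icu, experimental highlighter
```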
this one is trivial [17:57:28] (between cp301[56] and the older hosts) [17:57:41] i see [17:57:43] what you are saying [17:57:45] (03PS1) 10ArielGlenn: remove eth bonding for dataset1001, now has 10gb nic [operations/puppet] - 10https://gerrit.wikimedia.org/r/142294 [17:58:04] http://noc.wikimedia.org/pybal/esams/upload [17:58:07] they are weighted the same [17:58:20] I figured the 80M vs 200M thing there was backend-only vs backend+frontend [17:58:42] (03CR) 10Dzahn: [C: 04-1] "typo, "stmp" = "smtp" and some smaller comments" (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142059 (owner: 10Rush) [17:58:43] lvs don't need kicking, do they? the pybal conf change is dynamic, ja? [17:59:05] RECOVERY - Puppet freshness on stat1003 is OK: puppet ran at Thu Jun 26 17:59:00 UTC 2014 [17:59:38] (03PS3) 10Scottlee: Fixed text formatting and grammar. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142186 [18:00:05] Reedy, greg-g: The time is nigh to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140626T1800) [18:00:28] (03CR) 10ArielGlenn: [C: 032] remove eth bonding for dataset1001, now has 10gb nic [operations/puppet] - 10https://gerrit.wikimedia.org/r/142294 (owner: 10ArielGlenn) [18:01:54] bblack, lvs doesn't need kicking, right? [18:03:35] PROBLEM - Disk space on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:03:55] PROBLEM - check if dhclient is running on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:03:56] PROBLEM - puppet disabled on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:04:05] PROBLEM - DPKG on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:04:25] PROBLEM - Memcached on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:04:25] PROBLEM - check configured eth on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:04:34] ottomata: yeah it should be dynamic, I'm not sure yet [18:04:57] !log aware of holmium issue (old varnish), in process of repair, blog is down [18:05:02] Logged the message, RobH [18:06:12] it was schedualed wasn't it? [18:06:23] the downtime for the disk swap was yep [18:06:39] but now the reboot has shown we didnt upgrade its varnish instace, so gotta do that before it can run and blog can work [18:06:56] fair enough [18:07:26] and when raid rebuild triggered at same time as an apt pull [18:07:35] it seems to have killed the OS responsiveness [18:07:39] \o/ [18:07:52] everything that touches the wordpress server has to always be complicated ;] [18:07:54] (03PS3) 10Reedy: Wikipedias to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142257 [18:07:55] ok manybubbles lookin good on 1018 and 1019 [18:07:58] its cursed [18:07:59] ok to start elasticsearch there? [18:07:59] (03CR) 10Reedy: [C: 032] Wikipedias to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142257 (owner: 10Reedy) [18:08:06] (03Merged) 10jenkins-bot: Wikipedias to 1.24wmf10 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142257 (owner: 10Reedy) [18:08:20] ottomata: at this point if elasticsearch allows itself to start up and we have ssds we're probably good [18:08:43] ok,they are starting [18:08:47] RobH: just wondering, no hot swap for drives on servers? 
[18:08:57] !log starting new elasticsearch nodes 1017,1018,1019 [18:09:03] Logged the message, Master [18:09:05] PROBLEM - Host holmium is DOWN: CRITICAL - Host Unreachable (208.80.154.49) [18:09:09] matanya: depends on server [18:09:14] but for misc services, nope [18:09:20] those are smaller lighterweight servers [18:09:33] that may change though as we are investigating other options for misc services hosting [18:09:46] but right now its distriburted across small (relatively) cheap boxes [18:10:15] cmjohnson1: holmium was software raid right? [18:10:22] the 310 is bitching in reboot about ocnfig changes [18:10:23] no h/w raid [18:10:30] ..... arghhhhhhh [18:10:48] just continue through it [18:10:49] mistake for that on 310s [18:10:54] well that is suprising. swap bays are cheap [18:10:59] i dont wanna just ignore then it bugs folks every reboot [18:11:10] matanya: thats debatable, but not right now ;] [18:11:10] for some reason that was purchased with a raid card [18:11:18] sure. sorry [18:11:24] np [18:11:52] cmjohnson1: can you console it the output on serial is borked [18:12:00] and fix it so it doesn't prompt for the raid issue when it reboots? [18:12:16] give me a min [18:12:17] ottomata: yeah, LVS is not even checking health on cp301[56], so they're effectively not frontends right now. I'm not sure why yet [18:12:18] i cannot read all of the output clearly and i dont want to guess and lose data [18:12:18] RobH: eta on blog, per chance? [18:12:19] k [18:12:23] greg-g: none yet [18:12:28] kk [18:12:41] the downtime has resulted in a few issues with puppet/varnish and we're still trying to get the raid to work [18:12:50] unfortunately its a 310 controller which we hate [18:13:03] tmi for communications people ;) [18:13:05] yeah, bblack, i was really only seeing purge requests from varnishncsa log, even on -n frontend [18:13:06] so it means once we get the blog really back online i get to migrate it to another server to undo the 310 cruft [18:13:15] RobH: ah, "great" [18:13:18] but that migration wont be downtime [18:13:26] just my time =P [18:13:31] plenty of GETs + more on backend instance [18:13:52] tl;dr - server borked, we have a solution, but it's taking a while [18:13:58] bblack: :) [18:14:00] greg-g: You can tell communications that we have two opsen directly working the issue with the rest in standby to help if needed [18:14:01] indeed [18:14:12] word [18:14:45] (and the meaning of "a while" will be updated in another 2.43 whiles) [18:15:16] :) [18:15:40] !log reedy Synchronized php-1.24wmf10/includes/api/ApiQueryRecentChanges.php: Id9c316733896a27ce3f6c3e0e5efdf62f7d1ff1b (duration: 00m 14s) [18:15:45] Logged the message, Master [18:16:00] ottomata: those three machines have twice the cpu as the others [18:16:04] not a bad thing [18:16:07] just a thing [18:16:28] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf10 [18:16:33] Logged the message, Master [18:17:40] * RobH decides the blog will never actually migrate and puts it on his list of shit he has to really fix for real [18:17:54] damn blog! 
[18:19:02] (03PS3) 10Reedy: group0 to 1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142258 [18:19:06] (03CR) 10Reedy: [C: 032] group0 to 1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142258 (owner: 10Reedy) [18:19:14] (03Merged) 10jenkins-bot: group0 to 1.24wmf11 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142258 (owner: 10Reedy) [18:19:24] RECOVERY - Host holmium is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms [18:19:31] (03PS2) 10Reedy: Remove extension-list-1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142264 [18:19:38] (03CR) 10Reedy: [C: 032] Remove extension-list-1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142264 (owner: 10Reedy) [18:19:44] (03Merged) 10jenkins-bot: Remove extension-list-1.24wmf9 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142264 (owner: 10Reedy) [18:19:44] PROBLEM - NTP on holmium is CRITICAL: NTP CRITICAL: No response from NTP server [18:19:50] robh: are you on console with holmium [18:20:01] nope, was d/c but can go back [18:20:14] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf11 [18:20:16] oh wait, wasnt d/c [18:20:19] Logged the message, Master [18:20:19] cmjohnson1: off console [18:20:54] ottomata: 2014-06-26 16:38:57.222234 [uploadlb_80] Could not load configuration URL http://noc.wikimedia.org/pybal/esams/upload: invalid syntax (, line 1) [18:21:08] cmjohnson1: meh, im going to have to move it, so lets just get it booted and online. if it has to rebuild the raid to do so, it sucks but so be it [18:21:09] ottomata: there are missing single-quotes in the new lines, just before the word weight [18:21:23] i just couldnt read the screen to pick any option [18:21:26] it was all jumbled [18:21:33] (I found that in /var/log/pybal.log on lvs3002) [18:21:43] ideally we'd have it not prompt the raid error, but I thought that it wasn just configured jbod in 310 [18:21:44] not raid [18:21:49] it didn't come back [18:22:09] checking something [18:22:14] PROBLEM - SSH on holmium is CRITICAL: Connection timed out [18:23:01] AH [18:23:02] bblack [18:23:03] thanks [18:23:05] sorry [18:23:37] (03CR) 10Rush: [C: 031] "nice" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137993 (owner: 10Rush) [18:23:55] man i looked at those lines a bunch of times [18:23:57] didn't see it [18:24:33] me either until I saw the pybal message, that made me look harder :) [18:24:43] linting! [18:24:45] version control! [18:25:09] robh: taking this here..disk 0 is now in foreign config [18:25:44] PROBLEM - Host holmium is DOWN: CRITICAL - Host Unreachable (208.80.154.49) [18:25:47] so when doing import, if it blows up, take a bit of comfort in knowing all the blog post content resides in the database. [18:25:52] disk 0 is showing foreign I belive if i import the cfg it should come back [18:26:05] any uploaded media not linked from commons is going to be lost if it dies, but thats not as important as content [18:26:15] (plus folks should link from commons in their posts!) [18:26:26] for sure clearing it will obiliterate [18:26:31] ja lost more requests on frontend now [18:26:43] don't see so much more network usage yet... [18:26:44] okay going to import ... 
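The pybal failure bblack digs out of /var/log/pybal.log above is a Python SyntaxError: pybal evaluates its pool files (the ones served at noc.wikimedia.org/pybal/) as one Python dict literal per line, so one missing quote invalidates the whole pool. A sketch of a well-formed line, with illustrative host and weight values; note the quotes around every key, which is exactly what was missing before 'weight':

```python
{'host': 'cp3017.esams.wmnet', 'weight': 10, 'enabled': True}
```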
[18:26:47] cmjohnson1: yea, do whatever you think best to get it booting with its data intact [18:26:52] it doesnt even need to use both disks at this point [18:26:54] ottomata: it's starting to show up in the 1hr graphs now [18:27:01] i didnt realize it was using the h310 controller in hw raid mode [18:27:08] or i would have changed how we handled this, sorry =[ [18:27:11] oh i'm looking at 2h [18:27:19] or 4h [18:27:23] jaa thereit is [18:27:37] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=cp3015.esams.wmnet&m=network_report&s=by+name&mc=2&g=network_report&c=Upload+caches+esams [18:27:38] looks like it prolly worked...booting now [18:27:44] Now if we can simply get it propped up and back online, I'll setup a new blog server this afternoon [18:27:58] cool, lemme know [18:28:14] (ideally we'd use same server in non hw raid mode but thats more downtime, hence new server) [18:28:41] I may also relocate it from its own varnish service to the misc-web-lb [18:28:47] I dislike having a one-off odd varnish server. [18:29:44] all these things we(me, anyone, me) should have done awhile ago but didnt since blog has been 'in the process' of moving off our servers now for a year. [18:30:39] that was weird it went to boot installer [18:30:45] eww [18:31:04] * RobH starts to get worried and goes to pull a new server. [18:31:25] (03PS2) 10Ori.livneh: diamond: clean out collector enabled => true settings [operations/puppet] - 10https://gerrit.wikimedia.org/r/140022 [18:32:31] bios settings were set to pxe boot first [18:33:05] good times. [18:34:34] yep...good times...re-booting now [18:37:54] RECOVERY - puppet disabled on holmium is OK: OK [18:37:59] good times are good :-D [18:38:04] RECOVERY - Host holmium is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [18:38:04] RECOVERY - SSH on holmium is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [18:38:09] holmium is back but httpd dameons failed to start [18:38:14] RECOVERY - DPKG on holmium is OK: All packages OK [18:38:14] RECOVERY - Memcached on holmium is OK: TCP OK - 0.000 second response time on port 11000 [18:38:14] RECOVERY - check configured eth on holmium is OK: NRPE: Unable to read output [18:38:38] RECOVERY - Disk space on holmium is OK: DISK OK [18:38:48] RECOVERY - check if dhclient is running on holmium is OK: PROCS OK: 0 processes with command name dhclient [18:39:03] cmjohnson1: yep, known until i upgrade varnish [18:39:10] which i'll do now [18:39:16] so don't fix disk 2? [18:39:55] nope, butttt [18:40:00] it has to reboot now that i upgraded varnish [18:40:01] (03CR) 10Ori.livneh: [C: 032] diamond: clean out collector enabled => true settings [operations/puppet] - 10https://gerrit.wikimedia.org/r/140022 (owner: 10Ori.livneh) [18:40:03] said it has to [18:40:04] =P [18:40:08] RECOVERY - Varnish HTTP blog on holmium is OK: HTTP OK: HTTP/1.1 200 OK - 633 bytes in 0.001 second response time [18:40:09] (to apply all changes) [18:40:15] We're hosting the blog on Windows? :P [18:40:16] cmjohnson1: so i have to reboot, will it do the prompt? [18:40:18] RECOVERY - HTTP on holmium is OK: HTTP OK: HTTP/1.1 200 OK - 80800 bytes in 0.534 second response time [18:40:30] ureadahead will be reprofiled on next reboot [18:40:37] no it's good now [18:40:42] cmjohnson1: cool, rebooting [18:40:44] (03PS4) 10Hashar: beta: Add mediawiki/core/vendor to beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/137463 (owner: 10BryanDavis) [18:40:50] wait what? 
[18:40:59] bblack: well, too late now, damn it [18:40:59] heh [18:41:04] ureadahead will be reprofiled on next reboot [18:41:06] (03CR) 10Hashar: "Merge me! :) Manually rebase since [rebase] button did not work." [operations/puppet] - 10https://gerrit.wikimedia.org/r/137463 (owner: 10BryanDavis) [18:41:09] during the varnish update [18:41:13] figured i should do what it says. [18:41:22] but, perhaps i was overzealous [18:41:22] yeah I have no idea what that means [18:41:23] (03PS2) 10Ori.livneh: role::deployment: port apache::vhost to apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/142205 [18:41:25] me either! [18:41:26] heh [18:41:28] hahaha [18:41:34] bblack: thank you for making me feel not as stupid =] [18:41:44] I think ureadahead is just an optimization for boot times [18:41:50] (03PS2) 10Ori.livneh: mediawiki_singlenode: port apache::vhost to apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/142206 [18:41:51] so, we're rebooting to make sure reboots are faster [18:41:56] woooo [18:41:58] \o/ [18:42:03] <_joe_> RobH: no reason to reboot for that message, really [18:42:10] andrewbogott: can i ask you to review https://gerrit.wikimedia.org/r/#/c/142206/ and https://gerrit.wikimedia.org/r/#/c/142205/ ? [18:42:12] I'll have communications tweet that out "from Ops: 'I have no idea what that means.' 'Me either!' we're professionals, don't worry." [18:42:12] duly noted for the future =] [18:42:34] ignore ureadahead [18:42:38] PROBLEM - Host holmium is DOWN: CRITICAL - Host Unreachable (208.80.154.49) [18:42:47] cmjohnson1: if you are on console, holmium should be posting [18:42:59] i'm off ..all yours [18:43:10] paravoid: Will do in future, didnt know what it was before so was just playing it safe [18:43:38] cmjohnson1: i meant physical console, heh [18:43:54] oh..i left the cage [18:43:54] but its ok [18:43:57] ahh [18:44:00] in the conference room...not as cold [18:44:02] heh, no worries, i see os booting now [18:44:18] RECOVERY - Host holmium is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [18:44:38] ok, pushing puppet update for all the varnish stuff that failed before varnish upgrade [18:45:24] already running from reboot, and blog seems back [18:45:33] lets see when it finishes if its still good [18:46:44] greg-g: well, atleast you didnt get split [18:47:13] rehi [18:47:22] well, i was about to say blog looks good [18:47:31] but puppet is failing for new diamond collector stuff [18:47:37] so puppet cannot finish run, blog is online [18:47:43] but system isnt in ideal state yet [18:48:21] Jun 26 18:47:30 holmium puppet-agent[3426]: Could not retrieve catalog from remote server: Error 400 on SERVER: Must pass settings to Diamond::Collector[Network] at /etc/puppet/modules/diamond/manifests/init.pp:95 on node holmium.wikimedia.org [18:48:39] 11:41 < grrrit-wm> (CR) Ori.livneh: [C: 2] diamond: clean out collector enabled => true settings [18:48:41] at least we aren't in full downtime [18:48:42] ori: ^ [18:48:43] <_joe_> bblack: that's ori [18:48:56] indeed, i was just looking in commit history in email and saw [18:49:50] sorry [18:49:57] (03PS1) 10Ori.livneh: diamond::collector: make settings default to undef [operations/puppet] - 10https://gerrit.wikimedia.org/r/142308 [18:50:02] chasemp, bblack ^ [18:50:09] etc. 
[18:50:21] we arent in downtiem if you think you can tweak a fix for it in a short time [18:50:24] rather than just undoing [18:50:29] (your call =) [18:50:36] but i get not wanting to rush [18:50:38] that's the fix [18:50:42] ^^ [18:51:11] what does the template do with undef then just enabled=true since it's undeclared? [18:51:28] chasemp: Hash["enabled" => "true"].update(@settings.is_a?(Hash)? @settings : {}).sort.map { |k,v| [18:51:32] so, {} [18:51:47] ah ok [18:51:47] or rather, {enabled=>true} [18:52:00] (03CR) 10Rush: [C: 031] "let's give this a whirl" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142308 (owner: 10Ori.livneh) [18:52:08] (03CR) 10Ori.livneh: [C: 032] diamond::collector: make settings default to undef [operations/puppet] - 10https://gerrit.wikimedia.org/r/142308 (owner: 10Ori.livneh) [18:52:22] i'll merge on palladium [18:52:33] or osmeone beat me to it already [18:52:33] heh [18:52:37] Hash["enabled" => "true"].update(@settings.is_a?(Hash)? @settings : {}).sort.map { |k,v| [18:52:50] ^ and people say that perl code looks ugly :P [18:53:07] woo, puppet is applying stuff [18:53:17] sorry about that again [18:53:27] no worries [18:53:31] thx for fixing quickly [18:53:39] looks good on argon [18:53:40] holmium runs puppet now just fine [18:54:18] cmjohnson1: So its not setup to use both disks right now right? [18:54:25] i can break other things if you guys are disappointed [18:54:26] so i still need to have a high priority on migration [18:54:33] the 2nd disk is not configured [18:54:33] but we're out of downtime [18:54:35] ok [18:54:41] thats fine, im going to allocate a new system now [18:54:49] last time we set it to configure it locked the damn thing, heh [18:54:50] i can rebuild if you want [18:54:56] well, maybe was due to that and apt [18:54:57] hrmm [18:55:03] (03PS6) 10Dzahn: deployment,replace generic::systemuser with user [operations/puppet] - 10https://gerrit.wikimedia.org/r/137993 (owner: 10Rush) [18:55:04] cmjohnson1: yea, lets give it a shot now that we are good [18:55:16] if it can rebuild, it still needs migration but i can wait for mark's approval [18:55:18] rather than assume and do [18:55:37] (and possibly have to undo with said approval doesn't happen ;) [18:55:47] <_joe_> ori: don't break boring stuff like diamond [18:55:53] hello RobH [18:55:55] okay...doing it now so don't do anything [18:55:56] <_joe_> break something exciting like pybal [18:55:58] hiya papaul [18:56:04] what;s up [18:56:07] * ori sets to it [18:56:15] cmjohnson1: I'm logged off of holmium entirely [18:56:29] me too [18:56:53] papaul: blog downtime was going on, but nothing now. need anything? [18:57:32] <_joe_> ori: the condition is, you must find a subtle way to break it, with a change that seems legit and will defy basic debugging logic [18:57:40] cmjohnson1: hrmm, seems blog cannot do rebuild [18:57:44] cuz now its no longer loading in browser [18:57:47] _joe_: it's like you're describing my work [18:57:52] it's timing out [18:58:00] dapter: 0: Set Physical Drive at EnclId-N/A SlotId-1 as Hot Spare Success. hangs right here [18:58:11] ok, lets reboot it and get it bakc online [18:58:12] and not rebuild [18:58:14] <_joe_> ori: that' everyone's job description in our industry I think [18:58:19] i'll just make the migration a high priority item [18:58:25] and do it immediately once we're not down. [18:58:29] okay [18:58:41] you reboot and babysit it coming back pls? 
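ori's one-line fix above makes the collector's settings parameter optional, leaning on the quoted ERB expression to fall back to enabled => true when nothing is passed. A trimmed sketch of the shape of the define; parameter and path names are assumed rather than copied from the real module:

```puppet
define diamond::collector(
    $settings = undef,   # the fix: callers may now omit settings entirely
    $ensure   = present,
) {
    # The ERB template quoted above merges $settings (when it is a Hash)
    # over a {"enabled" => "true"} default, so undef is safe here.
    file { "/etc/diamond/collectors/${title}.conf":
        ensure  => $ensure,
        content => template('diamond/collector.conf.erb'),
    }
}
```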
i'll start allocating a new server [18:58:56] k [18:58:58] PROBLEM - check if dhclient is running on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:58] PROBLEM - Disk space on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:58:58] PROBLEM - puppet disabled on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:00] greg-g: ^ so we lied, its down for reboot but should come back [18:59:14] i think it's going to take another shit [18:59:22] and we'll be migrating it to a new server over the next day or two [18:59:22] yet another reboot? [18:59:22] PROBLEM - DPKG on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:22] PROBLEM - Memcached on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:26] greg-g: trying to rebuild the raid froze the system [18:59:30] PROBLEM - check configured eth on holmium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:59:32] gotcha [18:59:43] something that doesn't happen on decent hw raid controllers [18:59:55] but this system is using a h310, normally we set those to bypass and not use them [19:00:01] but this one was configured to use it (unfortunately) [19:00:09] So we're rebooting it again and won't rebuild the raid [19:00:25] I'll manually copy the data i may want and we'll migrate it to a new server over the next 24 to 48 hours [19:00:36] the migration will be purely dns switch witht he same backend database [19:00:50] so should be no effect to users [19:01:00] convey our apologies =P [19:01:04] hi ops, http://blog.wikimedia.org/ is giving me and Tomasz a 503. [19:01:13] bearND: we are aware and workign the issue [19:01:22] thx for the report though [19:01:24] RobH: awesome, thanks [19:01:50] oh right, android app blog post :) [19:02:10] man i hate the blog [19:02:18] and it hates me, see how it behaves? ;] [19:02:19] yup, thanks to our new release [19:02:30] ignored children do misbehave [19:02:44] * greg-g says as a father [19:02:50] PROBLEM - Host holmium is DOWN: CRITICAL - Host Unreachable (208.80.154.49) [19:03:06] if the servers are kids, then this is sparta and the blog should have been tossed in the pit at birth [19:03:09] bearND: ^^ that, holmium is the blog server [19:03:19] RobH: ..... wow [19:03:21] (03PS1) 10Odder: Add an Erasmus University domain to whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142309 (https://bugzilla.wikimedia.org/67120) [19:03:28] i only blog about important things...ya know cooking tips, how to clean with ketchup ...etc [19:03:30] greg-g: too strong? [19:03:35] RobH: :) [19:03:37] <_joe_> I see the blog as a perfect candidate for virtualization.... 
[19:03:40] RECOVERY - check if dhclient is running on holmium is OK: PROCS OK: 0 processes with command name dhclient [19:03:43] _joe_: +++ [19:03:50] RECOVERY - Disk space on holmium is OK: DISK OK [19:03:50] RECOVERY - Host holmium is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [19:03:50] RECOVERY - puppet disabled on holmium is OK: OK [19:03:51] yea, it only has its own server for security reasons [19:03:52] "too bad" we're getting rid of it [19:03:54] robh: coming back up now....let's not that again [19:04:00] indeed, no raid rebuild [19:04:10] RECOVERY - DPKG on holmium is OK: All packages OK [19:04:10] RECOVERY - Memcached on holmium is OK: TCP OK - 0.000 second response time on port 11000 [19:04:10] just prop it up and we'll migrate off the hw, then set it to not be hw raid [19:04:18] and holmium can go into my spare pool when its migrated off [19:04:20] RECOVERY - check configured eth on holmium is OK: NRPE: Unable to read output [19:04:32] ori, MaxSem: do you need me here to deploy that ^^ during the SWAT window later today? [19:04:38] or should I schedule it for Monday? [19:04:44] Ok, blog is back online [19:04:52] <_joe_> and this recovery line... [19:04:54] RobH: no more reboots? :) [19:04:55] and we're done trying to fix it so it should stay that way [19:05:05] its still just one disk system with no raid [19:05:09] so its not permanent. [19:05:17] ugh [19:05:18] * RobH will migrate it to another box shortly [19:05:27] its now my afternoon project [19:05:30] twkozlowski, nope [19:05:33] (03PS1) 10Andrew Bogott: Revert "Intentionally break puppet compile for virt1008" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142311 [19:05:54] jouncebot: next [19:05:55] In 0 hour(s) and 54 minute(s): Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140626T2000) [19:06:44] !log blog is back online after a number of reboots due to raid rebuild issues [19:06:48] Logged the message, RobH [19:06:55] thanks for working on the blog, guys! james just switched off the centralnotice banners linking to the ters of use blog post, which should remove a large portion of the current traffic, but yes, it would be great to e.g. have the android app post accessible ;) [19:07:06] (03PS2) 10Odder: Add an Erasmus University domain to whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142309 (https://bugzilla.wikimedia.org/67120) [19:07:08] HaeB: you can return it to normal use [19:07:13] it should be ok. [19:07:21] its caching is running [19:07:45] and when i migrate it, it'll be a dns change of the frontend, so no downtime when folks dns entries change over time [19:08:33] Unfortunately, the downtime went from a simple disk swap to upgrading services on the machine and dealing with a disk controller we didn't expect to be configured in the way it was [19:08:46] then every rebuild attempt afterwards locked the system [19:08:51] non-ideal. [19:09:40] RobH: lessons learned? [19:09:53] RobH: I'm judging outage report worthy, they're on my mind ;) [19:09:58] it is [19:09:59] worthiness, that is [19:10:09] it went over the window allotted, so its an outage [19:10:18] word [19:10:24] thanks [19:10:38] check disk controller for possible issues on rebuild [19:10:46] as we know the h310 is a problem controller and we no longer purchase them [19:10:52] RobH: great, thanks! 
[19:10:58] and we configure existing to bypass, this was just a system that was installed far too long ago [19:11:11] we should also check for puppet having run successfully on a system before taking it offline for repair [19:11:17] that would have solved the varnish issue without downtime. [19:11:49] mutante: do you understand icinga well enough to know why this test is not appearing on the icinga console? It looks like it should be on every host… https://gerrit.wikimedia.org/r/#/c/141452/2 [19:12:09] As we just complicated a hardware issue with an additional software upgrade [19:12:11] heh [19:12:47] <_joe_> andrewbogott: it's an exported resource [19:13:05] <_joe_> takes time to propagate [19:13:17] <_joe_> at least 40 minutes at the actual puppet run frequency [19:13:18] _joe_: hours? [19:13:25] <_joe_> between 40 mins and 1 hour [19:13:32] It's been three hours hasn't it? [19:13:36] <_joe_> andrewbogott: maybe puppet is failing on neon? [19:13:43] oh, good point [19:13:44] * andrewbogott checks [19:14:09] <_joe_> or, the icinga config is broken [19:14:11] !log reedy Synchronized php-1.24wmf11/extensions/OAuth/: (no message) (duration: 00m 14s) [19:14:15] <_joe_> that is also possible [19:14:15] Logged the message, Master [19:14:19] <_joe_> so it's not reloading [19:14:20] the outage report list for this quarter just keeps growing and growing and growing, i should cut it off soon [19:14:20] RobH: that last one (puppet run) is a good one [19:14:37] <_joe_> greg-g: :( [19:15:07] <_joe_> at least most are not serious outages [19:15:44] _joe_: yeah, learning experiences [19:15:44] i dunno, most blog outages tend to send communications into seizures ;] [19:16:06] <_joe_> RobH: then we should do HA for the blog? [19:16:29] the plan was to pay someone else to worry about it, not sure where that ended up [19:16:32] <_joe_> not saying we must, but if a problem on the blog makes people cry, we probably should [19:16:33] possibly but first step is not neglecting it as much [19:16:39] but as greg points out [19:16:41] RobH++ [19:16:43] it was supposed to migrate away [19:16:48] but i keep asking and its always pending [19:16:55] so im just going to pretend from now on its never going away [19:16:59] _joe_: Quick refresh on how I can test the icinga config load? 
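The propagation delay _joe_ describes above is the cost of puppet's exported-resource pattern: each monitored host exports its check definition on its own run, and the icinga host only collects new exports on its next run, hence "between 40 mins and 1 hour". A generic sketch of the pattern, with illustrative resource titles:

```puppet
# On every monitored node: export the check definition.
@@nagios_service { "puppet_last_run_${::hostname}":
    host_name           => $::fqdn,
    check_command       => 'nrpe_check_puppetrun',
    service_description => 'puppet last run',
}

# On the icinga server (neon): collect everything exported above.
Nagios_service <<| |>>
```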
[19:17:03] we need to choose, either migrate away *now* or take care of it, we're in purgatory right now [19:17:09] RobH: good plan [19:17:32] indeed, i plan to move to a new server with properly configured raid [19:17:44] and then put it behind misc-web-lb to eliminate the one-off odd varnish server [19:17:50] RobH: thanks, btw, for jumping in as god-father after the absentee parents, to continue the metaphor ;) [19:17:51] should cut down on possible neglected services [19:17:53] (03PS2) 10Reedy: wgMemoryLimit from 235 to 245MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137730 [19:18:02] (03CR) 10Reedy: [C: 032] wgMemoryLimit from 235 to 245MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137730 (owner: 10Reedy) [19:18:08] <_joe_> andrewbogott: try service icinga reload [19:18:10] (03Merged) 10jenkins-bot: wgMemoryLimit from 235 to 245MB [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137730 (owner: 10Reedy) [19:18:20] <_joe_> andrewbogott: it's in the manual however [19:18:21] greg-g: i think im potenially one of the absentee parents though ;] [19:18:30] potentially even [19:18:32] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 14s) [19:18:37] Logged the message, Master [19:18:38] RobH: ok, well then, thanks for cleaning up your act ;) [19:18:42] heh [19:18:42] RobH: can you just ping tilman once again first? [19:18:56] HaeB: ping [19:19:04] hah [19:19:09] I was going to do both in parallel, heh [19:19:10] thanks Nemo_bis :) [19:19:12] _joe_: Ah, yeah, I did that already; thought maybe there was a verbose option. But, I'll dive into the logs. [19:19:14] unless the blog moves like, in the next week [19:19:18] uh [19:19:19] ori: I suppose we can mess around with that osmium a little [19:19:20] i dont like it sitting on a system with only one disk [19:19:33] <_joe_> andrewbogott: if it reloaded, then the config is ok [19:19:42] so the server migration needs to happen either way, but if it is moving soon then i wont bother with the varnish fold into misc-web [19:19:44] why did freenode change our topic? due to the split? [19:20:05] interesting [19:20:17] paravoid: that sounds reasonable? [19:20:34] (move system due to disk, but dont migrate varnish from blog server to misc-web unless we're keeping it for awhile) [19:20:34] _joe_: Well… that brings me back to my original question then :) [19:20:50] are we keeping it for a while? [19:20:59] <_joe_> andrewbogott: which someone in a more comfortable TZ may assist with :) [19:20:59] dunno, going to email about it shortly [19:21:16] my point was the first step (move off holmium due to raid config) has to happen no matter what [19:21:25] imo [19:21:25] ping comms explaining we have a situation and it's been too long enough already, and let's see [19:21:38] as its sitting there with one really old disk for its entire state. [19:21:52] <_joe_> make backups now. 
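On andrewbogott's question about testing the config load: icinga inherits nagios's preflight flag, so the usual routine is to verify before reloading. A sketch, assuming the stock config path:

```sh
icinga -v /etc/icinga/icinga.cfg   # parse check; prints errors and warnings
service icinga reload              # only reload once the check passes
```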
[19:22:06] (03PS1) 10Odder: Disable local uploads on Malay Wiktionary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142318 (https://bugzilla.wikimedia.org/67152) [19:22:07] ori, MaxSem: That as well ^^ but it can wait till Monday, I guess [19:22:23] I wouldn't wanna stay up till 01:00 AM my time just to see it deployed [19:23:23] (03PS1) 10BryanDavis: Set wgGitInfoCache to point to scap managed location [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142320 (https://bugzilla.wikimedia.org/53972) [19:24:23] RobH, greg-g : well, the migration is definitely progressing, and i would say we are just weeks away from the relaunch... but the problem is that i have been saying that (based on the contractors' input) for months now ;) [19:24:44] (03PS32) 10Nikerabbit: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 [19:25:11] seems that code review isn't just a bottleneck in the mediawiki world only [19:25:20] can i quote you on that in ticket? [19:25:29] i just want to have a firm reasoning for me to spend time on this [19:25:55] "weeks away" sounds like "take care of it for real for now" to me [19:26:04] RobH: and you can quote me if you want more ammo [19:26:07] :) [19:26:21] andrewbogott: let me take a look now.. [19:26:22] RobH: yes [19:26:38] mutante: I guess it's possible that the test just doesn't work :/ [19:26:49] Although even then I'd expect it to show up on the icinga panel [19:28:09] twkozlowski, morning swats should be good for you? [19:28:10] andrewbogott: yea, it should show up and i actually run that script, i think rather the nrpe command isnt defined yet [19:28:40] Ah, so there's a server-side component we're missing? [19:28:59] MaxSem: morning SF time, yeah. I prefer that other patch to be deployed on Monday anyway, just added it to the deployment page on Wikitech [19:29:05] andrewbogott: it's called command[check_puppet_checkpuppetrun] [19:29:21] wait [19:30:23] -bash: /usr/local/lib/nagios/plugins/check_puppetrun: No such file or directory [19:30:40] HaeB: cool, i updated ticket with that info so mark knows why im suddenly setting up a new blog server, heh [19:30:43] thx for info [19:31:47] mutante: that's on neon or elsewhere? [19:31:49] andrewbogott: missing a file {} to actually put that on all servers. [19:31:54] ah! [19:32:02] andrewbogott: on a random other box [19:32:07] let me try a fix [19:32:20] thanks... and don't kill me if we're requesting that final migration dump in 2 weeks already ;) [19:34:10] (03PS1) 10Dzahn: add file{} definition for new puppet run check [operations/puppet] - 10https://gerrit.wikimedia.org/r/142361 [19:35:05] andrewbogott: no, that was already there ..arg [19:35:59] mutante: is it? I don't see check_puppetrun on my server either [19:36:20] where do you see it getting installed? I don't see the file definition anywhere [19:36:23] (03PS2) 10BryanDavis: Set wgGitInfoCacheDirectory to point to scap managed location [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142320 (https://bugzilla.wikimedia.org/53972) [19:36:33] andrewbogott: no, it's not , just confused, i still think we want that fix [19:37:40] (03CR) 10Dzahn: [C: 032] add file{} definition for new puppet run check [operations/puppet] - 10https://gerrit.wikimedia.org/r/142361 (owner: 10Dzahn) [19:38:02] hey can someone help me out? 
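The change mutante lands above ("add file{} definition for new puppet run check") is the missing piece that actually ships the plugin the NRPE command points at. A sketch of its shape; module name and source path are assumptions, not the real patch:

```puppet
file { '/usr/local/lib/nagios/plugins/check_puppetrun':
    ensure => present,
    owner  => 'root',
    group  => 'root',
    mode   => '0755',
    source => 'puppet:///modules/base/monitoring/check_puppetrun',
}
```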
I got this issue after trying to do a git review: http://pastie.org/9327818 [19:38:26] dogeydogey: look for the string "HEAD" in any file [19:38:47] mutante huh? [19:38:53] then manually fix it, git add , git rebase --continue [19:39:04] what do you mean any file? [19:39:05] dogeydogey: do you understand what's happening overall? Your local patch conflicts with some other patch that has been merged in the meantime. [19:39:13] nope [19:39:14] i'm lost [19:39:31] Do you know what I mean by 'conflict'? [19:39:40] andrewbogott: it's adding the file, fwiw Notice: /Stage[main]/Base::Monitoring::Host/File[/usr/local/lib/nagios/plugins/check_puppetrun]/ensure: defined content [19:39:40] yeah [19:39:49] mutante: so I see, I'm updating things [19:40:09] dogeydogey: so, that's all that warning means -- your change is running up against another change that happened at the same time. You have to resolve the conflict by hand [19:40:24] dogeydogey: so it failed merging some files automatically, then it just marks the places where it had problems with "HEAD" and ">>>>" etc [19:40:25] Basically, update your patch so that it is applied to the very latest upstream. [19:40:37] it's not showing me where the issues are [19:40:52] git status ? [19:44:37] (03CR) 10Andrew Bogott: [C: 032] Revert "Intentionally break puppet compile for virt1008" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142311 (owner: 10Andrew Bogott) [19:47:41] dogeydogey: Are you able to see the conflicts now? [19:49:09] (03CR) 10Andrew Bogott: [C: 04-1] mediawiki_singlenode: port apache::vhost to apache::site (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142206 (owner: 10Ori.livneh) [19:50:42] ugh https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=809cf422f5f9ac4f610d07d5a4cd672b&host=^db&user=wikiuser&schema=wik&hours=24 [19:51:17] ^d: https://gerrit.wikimedia.org/r/#/c/141890/ [19:52:15] andrewbogott yes [19:52:20] AaronSchulz: thanks a lot for the entity suggester review [19:52:25] dogeydogey: cool :) [19:52:47] mutante: ok, I have forced puppet runs on virt1008 and neon, and reloaded icinga, and yet... [19:52:48] dogeydogey: so once you fix that file(s), git add filename, git rebase --continue [19:52:49] still nothing? [19:53:07] andrewbogott: yea, i don't know why yet, it seems to be identical to the older "puppet disabled" check [19:53:30] nrpe::monitor_service should just add it to icinga config [19:53:39] the problem now is it's not in icinga config yet [19:53:52] root@neon:/etc/icinga# grep -r "puppet last run" * [20:00:04] aude: The time is nigh to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140626T2000) [20:00:13] :) [20:00:36] heh :) [20:03:06] mutante: what generates the icinga config? Not puppet? [20:03:31] mutante after git rebase --continue [20:03:37] do I git review? [20:06:01] dogeydogey: yes [20:06:15] i got another error again from a different file.... [20:06:20] andrewbogott: if it's not puppet, that would be new to me [20:11:00] mutante: don't we need something in checkcommands.cfg.erb [20:15:30] andrewbogott: no, i don't think so, "puppet disabled" isn't in there either, and since Akos revamped the NRPE setup into base module... [20:16:10] papaul: Are you about? [20:16:18] I want to discuss the labeling item [20:16:18] !log aude Started scap: Update Wikidata, for enabling property suggester on testwikidata [20:16:22] Logged the message, Master [20:16:29] _joe_: '$type' isn't reserved in puppet right? 
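Pulling together the advice mutante and andrewbogott give dogeydogey above, the whole conflict-resolution loop for a failed git review rebase looks like this (filename illustrative):

```sh
git status                        # lists the files marked "both modified"
$EDITOR manifests/site.pp         # resolve the <<<<<<< HEAD ... >>>>>>> blocks
git add manifests/site.pp
git rebase --continue             # repeat for any further conflicts
git review                        # re-upload the now-rebased change
```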
[20:16:48] <_joe_> ori: why should it? [20:17:03] i didn't think it was but it tickled me somehow, thanks for confirming [20:17:05] <_joe_> they have that horrible dollar sign in front of variables [20:19:40] I never got the point of the punctuation before variables in PHP; it looks like they copied scalar syntax from perl without actually understanding that it's meaningless unless you have the matching semantics. :-) [20:19:53] andrewbogott ah I see what happened I was amending a previous commit and I think it got mixed in with this one, what do I do? [20:20:10] http://pastie.org/9327912 [20:21:05] dogeydogey: probably start over :( [20:21:10] NOOOOO [20:21:14] But first, let's preserve your current work by moving you to a new branch. [20:21:26] I can't just undo it [20:21:32] dogeydogey: are you working on a single patch or a series? [20:21:40] And, what state is your git in currently? [20:21:51] current state is that pastie link [20:21:58] I worked on https://gerrit.wikimedia.org/r/142186 yesterday which is in revision 3 [20:22:08] Are both those patches yours? [20:22:09] and I started a separate branch for what I'm trying to commit right now [20:22:15] andrewbogott yes, they're both mine [20:22:18] ok. [20:22:29] Do they generally apply to the same files, or to different ones? [20:22:48] generally different ones [20:23:06] ok. So how would you feel about losing all the patch information but preserving the actual changes to the files? [20:23:20] Then you can add the files and create new separate patches. And cut/paste the commits and IDs into the new patches. [20:23:28] Would that be OK? That's a pretty easy thing to accomplish. [20:24:00] Or do you want to the two patches to be on separate branches? [20:24:10] I'm lost [20:24:14] OK :) [20:24:27] So, start by telling me the problem again [20:24:29] it's important to me that I preserve the changes from today [20:24:37] cause it's like hours of work [20:24:53] there is "git stash" where you can temp. "park" some changes and later get them back [20:25:01] yes, I'm not threatening to lose your changes, only to lose the exact history of which change is in which patch. [20:25:02] wonder if we might want to backporot https://gerrit.wikimedia.org/r/#/c/141056/ to wmf10 [20:25:09] greg-g: ^ [20:25:12] andrewbogott yeeah that's fine [20:25:17] quite a bit in the logs [20:25:44] unrelated to our stuff, except that many people transclude the special page on their wikidata user pages [20:26:30] aude: just pinged Nikerabbit re testing results on translatewiki [20:26:38] ok [20:27:04] j.mp/wmfatal [20:29:43] also see a bit of Fatal error: Call to a member function getId() on a non-object from Cirrus [20:29:54] ^d: manybubbles assume you are aware [20:30:09] <^d> Should be fixed in master... [20:30:11] Updater.php [20:30:21] aude: locally or on ze cluster? [20:30:26] on cluster [20:30:41] bleh - ^d man, I'm not having a good week [20:30:50] <^d> We aren't :\ [20:30:51] it's the predominant thing in exception log [20:31:03] err fatal [20:31:44] logstash isn't showing it to me.... [20:32:12] greg-g: responded [20:32:39] manybubbles: and yeah the fatals disappeared on twn after applying your patch [20:34:11] ^d: Its this one: https://gerrit.wikimedia.org/r/#/c/141691/ [20:34:18] aude: that'd be a good SWAT then [20:34:29] <^d> Yep. 
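"Losing the patch information but preserving the actual changes", as andrewbogott puts it above, can be done with a throwaway backup branch (or git stash, as mutante notes). A hedged sequence with invented branch names; keeping the old Change-Id in the commit message makes gerrit update the existing review rather than open a new one:

```sh
git branch backup/mixed-work               # bookmark the tangled commits first
git checkout -b clean-patch origin/master  # fresh branch off latest upstream
git checkout backup/mixed-work -- file1 file2  # copy over one patch's files
git commit                                 # reuse old message and Change-Id
git review
```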
[20:34:34] +1 [20:34:34] <^d> That one [20:35:12] i recommend also https://gerrit.wikimedia.org/r/#/c/141056/ [20:35:29] will let more experienced folks decide [20:35:55] it's icky if you can't load your user page [20:36:43] ^d: ok, almost got another one - fixes some weird errors we're seeing in Cirrus-failed.log and fixes a problem where I can't reindex meta [20:37:02] (03CR) 10BryanDavis: Enable ContentTranslation extension on beta labs (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [20:40:27] (03CR) 10BryanDavis: role::deployment: port apache::vhost to apache::site (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142205 (owner: 10Ori.livneh) [20:42:44] http://en.wikipedia.org/w/api.php?action=opensearch&format=json&search=Special:Log/move/Kr [20:44:40] (03CR) 10Nikerabbit: Enable ContentTranslation extension on beta labs (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140723 (owner: 10Nikerabbit) [20:48:14] !log aude Finished scap: Update Wikidata, for enabling property suggester on testwikidata (duration: 31m 57s) [20:48:19] Logged the message, Master [20:50:57] aude: Is it working? [20:51:01] (03PS1) 10Ori.livneh: Add apache::conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 [20:51:20] _joe_: i know it's late, but i'm eager to see if you like ^^ :) [20:51:32] RobH [20:51:40] no need to +/- but have a look if you're around [20:51:44] Hey, i was about to answer the power plug question [20:51:44] populated [20:51:49] now to enable it [20:51:49] papaul: I can answer here isntead =] [20:52:31] yes [20:52:33] i am here [20:52:55] my question is do you want me to move all servers in ps2 [20:53:06] the 40servers [20:53:14] So on PS1 for the bottom server, it gets the bottom plug. Leave the ps2 bottom plug empty. Then the server in the second U get the second plug 'pair'. It should plug into ps2, port2. Then you just go up that way, balancing the # of servers in each phase as well. So if there are 40 servers in a rack, 40/3=13.3, So put in 14 servers in bottom phase, 13 in the other two phase groups. [20:53:34] papaul: right now you have half the servers in ps1 [20:53:38] and half in ps2 correct? [20:53:59] (03PS2) 10Ottomata: Support logging via GELF, for sending to Logstash [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/140676 (owner: 10Gage) [20:54:05] yes [20:54:08] Ok, thats fine [20:54:11] (03PS2) 10Ori.livneh: Add apache::conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/142400 [20:54:18] normally each server has two power suplies, but on these they dont [20:54:24] so we want half on ps1, and half on ps2 [20:54:25] (03PS1) 10Aude: Enable property suggester on testwikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142402 [20:54:27] (03CR) 10Ottomata: [C: 032 V: 032] Support logging via GELF, for sending to Logstash [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/140676 (owner: 10Gage) [20:54:33] that way if we lose a pdu, we dont lose the entire rack of servers [20:54:44] yes that is the reason i have half on ps1 and halp on ps2 [20:54:51] yep, so you did that correctly, my correction is this [20:54:58] each server gets a 'pair' [20:55:12] so if server A has a pair, and B has the pair below [20:55:20] just shift them so the power plugs are staggered [20:55:32] it shows each pair in use better for when we replace those with dual psu systems [20:55:35] make sense? 
[20:55:47] no [20:55:55] (03CR) 10Aude: [C: 032] Enable property suggester on testwikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142402 (owner: 10Aude) [20:56:00] ok, on the photo you sent [20:56:08] the two top servers are in the two top power plugs [20:56:10] right? [20:56:11] i am now working only with servers with 1 ps [20:56:15] i know [20:56:16] (03CR) 10Ori.livneh: role::deployment: port apache::vhost to apache::site (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142205 (owner: 10Ori.livneh) [20:56:21] (03PS1) 10BryanDavis: labs: role::deployment - port apache::vhost to apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/142407 [20:56:23] (03Merged) 10jenkins-bot: Enable property suggester on testwikidata [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142402 (owner: 10Aude) [20:56:36] yes [20:56:48] so take the server plugged into ps2's top power plug [20:56:52] and move it to the second plug down [20:56:53] on ps2. [20:57:04] ok [20:57:09] you need to think of each set of plugs as a pair [20:57:16] ps1 port 1 and ps2 port 1 are a pair [20:57:19] bd808: thanks! [20:57:22] make sense? [20:57:42] so if one side of the pair is used, then you dont plug something else into the other side (unless its the same device's second power supply) [20:57:58] !log reedy Synchronized php-1.24wmf11/extensions/OAuth/: (no message) (duration: 00m 15s) [20:58:03] Logged the message, Master [20:58:06] So when servers swap out later on, and the new ones have dual power supplies, you dont have to move an existing server to add it [20:58:23] so when the top server swaps out for dual psu system, it can just plug in, and the second server down is not in its way [20:58:23] !log reedy Synchronized php-1.24wmf11/extensions/WikimediaMessages/: (no message) (duration: 00m 15s) [20:58:25] got you [20:58:28] Logged the message, Master [20:58:36] its amazingly hard to articulate the simple idea, heh [20:59:03] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable property suggester on testwikidata (duration: 00m 07s) [20:59:06] and its important that servers with dual powersupplies always plug into the same # port on both pdu strips [20:59:08] Logged the message, Master [20:59:14] like you did with the switch in the photo already =] [20:59:32] done :) [21:00:17] understood let me work on it and send you a pic in a minute [21:00:25] cool [21:00:47] (03PS1) 10Ottomata: Add CDH5 support, drop CDH4 support [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/142411 [21:00:52] (03CR) 10BryanDavis: role::deployment: port apache::vhost to apache::site (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142205 (owner: 10Ori.livneh) [21:01:05] (03CR) 10jenkins-bot: [V: 04-1] Add CDH5 support, drop CDH4 support [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/142411 (owner: 10Ottomata) [21:01:20] (03Abandoned) 10Ottomata: Add CDH5 support, drop CDH4 support [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/142411 (owner: 10Ottomata) [21:03:22] hey RobH i just sent you the pic [21:04:32] papaul: i see it, you're showing the ones with metal clips as done right?
[21:04:38] cuz the ones with metal look right to me [21:05:06] (the ones below it where there are three in a row on ps2 look not right, but no clips so i assume its not done ;) [21:05:07] yes i just worked on those [21:05:09] cool [21:05:15] i have to change the others [21:05:15] yea, thats exactly what i meant! [21:05:18] =] [21:05:19] cool [21:05:30] so now when we replace one server, its power can go in a matched pair [21:05:35] and we dont have to move another server's connection [21:05:39] future planning =] [21:05:44] what about the switches [21:05:54] the switches will eventually have dual psus [21:05:54] how do you want to plug in the switches [21:05:57] those i just need to order and add [21:06:07] so just put both switches on ps1 [21:06:10] ok [21:06:18] mgmt will only ever have a single power connection [21:06:27] but the access switches will have two power supplies [21:06:33] ok [21:06:35] the fact they weren't ordered with it in place was a mistake =[ [21:06:41] (03PS7) 10Ottomata: Add CDH5 support, drop CDH4 support [operations/puppet/cdh4] (cdh5) - 10https://gerrit.wikimedia.org/r/135494 [21:06:46] but, unlike the servers, can be fixed without downtime or replacement =] [21:07:05] papaul: you could even run the second power cable now if you want [21:07:05] (03CR) 10jenkins-bot: [V: 04-1] Add CDH5 support, drop CDH4 support [operations/puppet/cdh4] (cdh5) - 10https://gerrit.wikimedia.org/r/135494 (owner: 10Ottomata) [21:07:12] so when it does arrive in a while (i have to make an order for it) your work is 90% done [21:07:17] ok i will [21:07:25] (03CR) 10BryanDavis: "Cherry-picked to deployment-salt and applied on deployment-bastion. Vhost looks right and `git deploy` works as expected." [operations/puppet] - 10https://gerrit.wikimedia.org/r/142407 (owner: 10BryanDavis) [21:07:33] save yourself a lot of time later =] [21:07:33] one on ps1 and the other in ps2 [21:07:39] for the access switches [21:07:44] yes [21:07:50] all mgmt switches on ps1 [21:07:53] ok [21:07:56] got it [21:07:57] RobH: just out of curiosity: what do you guys use as access switches? [21:07:58] cool [21:08:24] Trminator: We're using a mixture of QFX5100 and EX4300 [21:08:31] nice [21:08:39] we have the QFX5100's as uplink switches from each row stack to routers [21:08:47] and we tie each row into a single stack [21:09:04] (03PS8) 10Ottomata: Add CDH5 support, drop CDH4 support [operations/puppet/cdh4] (cdh5) - 10https://gerrit.wikimedia.org/r/135494 [21:09:17] those models are new for codfw, we use ex4200s and ex4550s in older deployments [21:09:46] and core routers are MX480s [21:09:55] (or mx80s in caching centers) [21:10:33] but the new stuff with fancy 40Gb has been a learning experience in differing standards, heh. [21:12:02] looks nice :) [21:13:00] (03CR) 10Ottomata: [C: 032 V: 032] Add CDH5 support, drop CDH4 support [operations/puppet/cdh4] (cdh5) - 10https://gerrit.wikimedia.org/r/135494 (owner: 10Ottomata) [21:13:02] (03PS1) 10Scottlee: Fixed spacing and puppet-lint issues. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142417 [21:13:53] yo ^d, yt? [21:14:02] <^d> sup? [21:14:09] i want to rename a gerrit repository, i forget...possible? [21:14:47] <^d> Have history? Or just created and made a typo type situation? [21:15:39] yeah history [21:15:49] i want operations/puppet/cdh4 -> operations/puppet/cdh [21:16:06] <^d> Not easily, requires manual database wrangling. [21:16:23] (03PS4) 10Ottomata: Fixed text formatting and grammar.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/142186 (owner: 10Scottlee) [21:16:27] https://groups.google.com/forum/#!topic/repo-discuss/ltIxBipUPKI [21:16:32] there are no outstanding reviews [21:16:39]   UPDATE changes SET dest_project_name = 'new_name' WHERE [21:16:39] dest_project_name = 'old_name'; [21:16:39] ? [21:16:41] that it? [21:16:43] or [21:16:45] (03CR) 10jenkins-bot: [V: 04-1] Fixed spacing and puppet-lint issues. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142417 (owner: 10Scottlee) [21:16:50] should I just create a new one and push everything to it? [21:17:16] (03CR) 10Ottomata: [C: 032 V: 032] "Thanks!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142186 (owner: 10Scottlee) [21:17:23] what!? [21:17:26] <^d> ottomata: That's the big field. Also want to update people's watched projects. [21:17:47] <^d> Might be something else. [21:17:48] oh that's not the patchset i just merged, phew [21:18:20] so, ^d, are you recommending I don't try? :p [21:18:28] especially if I want to close my laptop in ~30 mins? [21:18:43] <^d> Do it and flee? Yeah I wouldn't recommend that ;-) [21:18:47] haha [21:18:52] ok, i think I will just rename and push then [21:20:22] <^d> account_project_watches, changes, submodule_subscriptions. [21:20:52] <^d> Yeah, that's it. [21:24:21] ottomata https://gerrit.wikimedia.org/r/#/c/142186/ -- what's wrong? [21:24:54] (03PS1) 10Ottomata: Pointing master at cdh5 branch [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/142424 [21:25:10] dogeydogey: nothing! its merged, eh? [21:25:25] got a failure message [21:26:07] (03Abandoned) 10Ottomata: Pointing master at cdh5 branch [operations/puppet/cdh4] - 10https://gerrit.wikimedia.org/r/142424 (owner: 10Ottomata) [21:26:13] i think on a different patch, dogeydogey [21:26:21] this one [21:26:21] https://gerrit.wikimedia.org/r/#/c/142417/ [21:26:37] you can see some FAILURE links from jenkins jobs at the last comment [21:26:40] !log Zuul/Jenkins stalled apparently [21:26:41] click on them to see what's wrong [21:26:45] Logged the message, Master [21:27:13] ottomata https://gerrit.wikimedia.org/r/#/c/142186/ says merge failed as well [21:27:24] oh hm [21:27:25] you are right [21:27:39] i think its lying though [21:27:54] !log restarting Jenkins [21:27:58] Logged the message, Master [21:28:04] git log origin/production [21:28:04] commit 0caade6fffbf700be5984223d0213a0586ee788a [21:28:04] Author: Scott Lee [21:28:04] Date: Wed Jun 25 22:06:59 2014 -0400 [21:28:10] Fixed text formatting and grammar. [21:28:24] https://github.com/wikimedia/operations-puppet/commit/0caade6fffbf700be5984223d0213a0586ee788a [21:28:25] looks good [21:28:43] okay cool [21:28:46] :) [21:32:17] !log enabled cp301[78] frontends in pybal [21:32:22] Logged the message, Master [21:32:35] thanks bblack [21:35:58] ^d, hm [21:36:12] how can I push my current master to this new remote? [21:36:21] i'm getting committer email address christian@quelltextlich.at [21:36:21] does not match your user account. [21:36:44] ottomata: pong :-) [21:36:48] haha [21:36:48] <^d> Need to grant yourself Forge Committer on the destination repo. [21:36:51] ah [21:36:51] k [21:36:54] <^d> Happens when you're pushing old history. [21:37:04] <^d> (Not a permission we grant by default for obvs reasons) [21:37:51] aye [21:38:02] ottomata so is this really bad or is it still going to be reviewed? https://gerrit.wikimedia.org/r/#/c/142417/ [21:38:04] aude: still online? will entity suggester need anything more from you/us on tuesday?
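Assembling the rename fragments above into one place: the "manual database wrangling" ^d describes would mean stopping Gerrit, renaming the git directory on disk, and updating the three tables he lists. The sketch below is hypothetical; only changes.dest_project_name is quoted in the log, so the other column names are assumptions about the Gerrit 2.x-era schema and should be checked against an actual dump before running anything.

```sql
-- Hypothetical sketch of the manual Gerrit repo rename discussed above.
-- Run with Gerrit stopped; column names other than dest_project_name
-- are assumptions about the 2.x schema, so verify first.
UPDATE changes
   SET dest_project_name = 'operations/puppet/cdh'
 WHERE dest_project_name = 'operations/puppet/cdh4';

UPDATE account_project_watches
   SET project_name = 'operations/puppet/cdh'
 WHERE project_name = 'operations/puppet/cdh4';

UPDATE submodule_subscriptions
   SET submodule_project_name = 'operations/puppet/cdh'
 WHERE submodule_project_name = 'operations/puppet/cdh4';
```

Ottomata's "rename and push" route, creating the new project and pushing the full history into it, avoids all of this at the cost of abandoning open reviews and watches, which only works here because there are none outstanding.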
[21:38:11] aude: config change, i assume, right? [21:38:18] dogeydogey: you should probably fix the jenkins error before you get a review [21:38:19] greg-g: add the table on wikidata [21:38:24] populate and enable [21:38:45] !log restarting Zuul it has a bunch of stalled changes [21:38:49] Logged the message, Master [21:38:49] particularly: https://integration.wikimedia.org/ci/job/operations-puppet-pplint-HEAD/40/console [21:38:51] andrewbogott: thoughts re: https://bugzilla.wikimedia.org/show_bug.cgi?id=66751#c12 ? [21:38:52] won't take much time [21:39:51] aude: kk, I'll have it just be a part of the train window, and you can go after reedy's done (which is quick) [21:40:00] ok [21:40:16] no more than 5-10 min, based on testwikidata [21:40:30] andrewbogott: that Puppet snippet will provision an additional directory, the contents of which Puppet will _not_ manage, but which Apache will glob upon initialization, much as it does with sites-enabled. [21:40:41] scap was the long part and not needed on tuesday [21:43:01] (03PS1) 10Ottomata: Replicate operations/puppet/cdh to puppet-cdh [operations/puppet] - 10https://gerrit.wikimedia.org/r/142428 [21:43:32] !log hardkilled Zuul :-( 6 events lost. [21:43:38] Logged the message, Master [21:44:00] aude: cool [21:45:11] andrewbogott: there's no easy way to do what you're asking (make puppet fail upon encountering unmanaged files). and note that the decision to manage sites-enabled recursively was made upstream; it was always the behavior of the apache module to do this. it had just never been applied to labs. [21:45:56] yep, I understand that it's upstream. But… isn't there code someplace that actually wipes out the dir? Or does it happen somehow more indirectly? [21:46:14] ensure => directory, recurse => true, purge => true [21:46:21] nothing more than that [21:46:23] (03PS2) 10Ottomata: Replicate operations/puppet/cdh to puppet-cdh [operations/puppet] - 10https://gerrit.wikimedia.org/r/142428 [21:46:49] Ah, so purge => true means it… wipes the directory and refills it on every puppet run [21:46:50] ? [21:47:38] I guess we can't fail on nonempty if part of why it's nonempty is due to things that it's about to recreate [21:47:45] sort of, except it doesn't actually touch the disk; it computes what it ought to do by compiling together the recursively-managed directory and any file declared to be contained within it [21:48:11] if local state matches what puppet expects, it doesn't delete files just to recreate them [21:48:35] i can totally see how this behavior may seem aggressive or surprising, but it's actually coherent if you think about it [21:48:49] (03CR) 10Ottomata: [C: 032 V: 032] Replicate operations/puppet/cdh to puppet-cdh [operations/puppet] - 10https://gerrit.wikimedia.org/r/142428 (owner: 10Ottomata) [21:49:21] if you declare file { '/etc/myapp/myconfig.conf': } and its contents differ from what puppet agent encounters on the target node, puppet just clobbers it (well, filebuckets it) and replaces it with the new file, right? [21:49:22] it's reasonable in most cases. Certainly in prod having /exactly/ what puppet defines seems correct. [21:49:38] It's just the transition that was awkward :) As I said, probably moot now. [21:49:50] Offering a specific checkbox on labs that allows for local config seems good.
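To make the purge behavior ori is explaining concrete, here is a minimal sketch of the pattern he quotes. The paths and template name are illustrative, not the actual apache module code.

```puppet
# Recursively managed, purged directory: any file found here that Puppet
# does not know about is removed on the next run (filebucketed first).
file { '/etc/apache2/sites-enabled':
    ensure  => directory,
    recurse => true,
    purge   => true,
}

# A file declared inside the purged directory survives, because Puppet
# compiles it together with the directory resource before deciding
# what to delete.
file { '/etc/apache2/sites-enabled/050-mysite.conf':
    ensure  => file,
    content => template('mysite/vhost.erb'),
}
```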
[21:50:06] recursively-managed config.d/ style directories extend that same approach to configuration "files" that are laid out on disk in fragments [21:50:25] <_joe_> andrewbogott: I explicitly stop puppet on labs hosts where I do things manually [21:50:52] andrewbogott: well, the transition wasn't awkward; i'd go further and describe it as an actual screw-up, and one that's entirely my own [21:50:53] _joe_: Sure, but I need volunteer-managed VMs to be mostly puppetized. In order to roll out upgrades and such. [21:51:13] <_joe_> eh, true as well. [21:51:14] !log Zuul/Jenkins back up and operational. [21:51:15] _joe_: so, allowing volunteers to manage bits of their systems while having them still be mostly puppetized is useful. [21:51:17] Logged the message, Master [21:51:18] so i totally accept blame for that, didn't anticipate that particular consequence [21:51:39] (03PS1) 10RobH: addming francium to dns [operations/dns] - 10https://gerrit.wikimedia.org/r/142430 [21:52:37] ori: It's ok -- seems to not have burned too many users. Go ahead and write a patch with the local-config option and I'll add an entry to the puppet checklist [21:52:57] andrewbogott: btw, i did let a few affected users know that the purged config files were filebucketed and hence recoverable [21:53:12] Hi. :) [21:53:14] but i neglected to document that anywhere. do you know if there are any users that are still affected (i.e., who have lost config files)? [21:53:23] So I'm trying to access stat1003 and I basically don't know what I'm doing. [21:54:19] ori: I haven't heard from any outside of that bug. [21:54:38] Are there any instructions I can follow? [21:55:07] I found this but it's not very helpful: https://wikitech.wikimedia.org/wiki/Stat1003 [21:55:34] (03CR) 10RobH: [C: 032] addming francium to dns [operations/dns] - 10https://gerrit.wikimedia.org/r/142430 (owner: 10RobH) [21:55:50] Deskana: do you know if you have an account or not? [21:55:51] * andrewbogott checks [21:55:58] Pretty sure I do. [21:56:04] As Deskana. [21:56:26] uppercase? [21:56:50] greg-g: Oh, no. It'd be deskana. [21:57:01] Instance shell account name: deskana [21:57:07] * greg-g nods [21:57:13] Deskana: that's your labs account though, right? [21:57:14] i don't think you do [21:57:17] yes [21:57:22] there's no corresponding account for prod [21:57:24] Oh, well, that'd explain it. I wasn't aware they were different. [21:57:29] Deskana: I don't see any evidence that you have a prod account at all. [21:57:34] wmflabs != production :) [21:57:36] Well, that'll be that then. [21:57:48] Can I have one? :) [21:57:50] Deskana: you will need to file an RT ticket and get a +1 from your manager. [21:58:07] Howie's on holiday. [21:58:14] I guess I'll get James_F to +1 it since he's his delegate. [21:58:27] https://wikitech.wikimedia.org/wiki/Requesting_shell_access [21:58:35] Deskana: ^ [21:58:46] Deskana: or Toby since that box is kind of in his domain. [21:58:56] So I guess I don't have access to RT either. [21:59:07] you can email ops-requests@rt.wikimedia.org [21:59:08] Deskana: just send it an email [21:59:12] what ori said [21:59:19] Thank you guys. :) [21:59:22] "If you do not have access to this system, you should discuss your access request in freenode IRC in #wikimedia-tech or #wikimedia-operations, or send an e-mail to ops-requests. " [21:59:26] ;) [21:59:28] andrewbogott: where do generic bits of labs puppet code go? [21:59:43] andrewbogott: the module itself should not know about labs/production [21:59:51] ori: role::labs::whatever?
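For anyone who did lose a file to the purge, the recovery path ori alludes to above looks roughly like this. A hedged sketch only: the clientbucket location and available subcommands vary by Puppet version, and the path and checksum here are made up for illustration.

```bash
# Find the backed-up copy in the agent's local filebucket; each bucketed
# file is stored under its md5 alongside a 'paths' file naming its origin.
sudo grep -r 'sites-enabled' /var/lib/puppet/clientbucket/ --include=paths

# Restore the file by its md5 sum (the directory name the grep hit is in):
sudo puppet filebucket --local restore \
    /etc/apache2/sites-enabled/mysite.conf d41d8cd98f00b204e9800998ecf8427e
```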
[21:59:55] ori: If you want it applied to all labs boxes, then in manifests/roles/labs.pp [22:00:01] or, yeah, as bd808 said [22:00:22] but if it's a role::labs::whatever there's a chance it won't be included by a user who includes the apache module [22:00:27] no? [22:00:43] yes, it'll have to be specifically selected. [22:01:29] I'm not sure there's a better way. It can be in the apache module with a $::realm check. Or it can be applied on all labs hosts with or without apache... [22:01:32] neither of those are great [22:01:37] Request sent. Thanks everyone. :) [22:02:29] Deskana: there's a traditional 3-day wait for shell access. So you will want to check in on Monday or Tuesday and nag whoever is marked in the topic as being on RT duty at the time. [22:03:00] Bugger. I guess I should've got around to requesting this sooner. [22:04:16] Thanks for letting me know. [22:04:49] Yeah, there's a process and a delay, to avoid social hacking and to verify you-are-who-you-say-you-are, etc. [22:04:58] But once the three days are up you can nag with abandon :) [22:05:08] says the dude who wont be on rt duty [22:05:09] ;p [22:05:38] * andrewbogott steeples fingers contentedly [22:05:58] well, it has to sit in access-requests for 3 days [22:06:04] however long it sits in ops-requests doesnt count [22:06:39] Deskana: so im on rt duty next week, but what i said applies, so you could instead email access-requests@rt.wikimedia.org directly [22:06:47] Does it need a managerial +1 before it gets moved? Otherwise I'll just move it right now [22:06:50] then the 3-day wait starts right away [22:07:00] It sounds like he already has an ops-requests ticket [22:07:03] just move it right away [22:07:06] ok, lets see [22:07:08] So we can just move it [22:07:46] It's already had a +1 from my manager. [22:07:59] bah, I can't remember how to merge two tickets [22:08:02] Well, his current delegate, but that's effectively the same. [22:08:04] * andrewbogott flounders with RT [22:08:45] Deskana: do you have the ticket #? [22:08:51] andrewbogott: merges are under 'link [22:08:53] RobH: 7759 I think. [22:08:53] ' [22:09:03] ok, then im looking at it now [22:09:04] yeah, found it but I think you got there first [22:09:12] and i moved to access requests [22:09:13] heh, yep [22:09:20] Ah, you need my public key. Let me email that. [22:09:31] Deskana: if you dont have any kind of shell access, correct [22:09:38] and it has to be different than your labs key [22:09:47] best place is to put it on your userpage on officewiki [22:09:50] then just link it in the ticket [22:10:03] email spoofing is easy, hence not ideal for key transfer [22:10:14] someone can say they are you and provide their pub key =P [22:10:43] alternatively, if you have gerrit access you can simply commit your own patchset for us to review, whichever is easier for you [22:12:30] I'll generate a new key and put it on my officewiki page. [22:12:33] Deskana: also, if you dont have a labs user yet, you should get one [22:12:40] cuz we use your userid from labs to make your production id [22:12:54] since 99.99% of use cases need labs first anyhow [22:13:04] but you can get that signing up on wikitech. [22:13:12] (so anyone can do that anytime, staff or not) [22:13:18] Pretty sure I'm on labs as deskana already. [22:13:35] cool, then yea upload the pub key to officewiki and link and yer good [22:13:49] already has your mgmt approval.
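One plausible shape for the labs-only unmanaged directory being debated above — not the actual contents of ori's eventual change, just a sketch of the $::realm approach mentioned:

```puppet
# Hypothetical labs-only escape hatch: a directory Apache globs but
# Puppet neither recurses into nor purges, so hand-placed config survives.
if $::realm == 'labs' {
    file { '/etc/apache2/conf-local':
        ensure => directory,
        owner  => 'root',
        group  => 'root',
        mode   => '0755',
        # deliberately no recurse/purge here
    }
}
```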
so i'd handle merging it next week if no one objects [22:14:01] you can email your updates directly into the ticket [22:14:16] so email the link to 7759@rt.wikimedia.org [22:14:21] Well I have to confess I don't really know how to have multiple key pairs at once. [22:14:26] I generated a new one and it gave me the same thing. [22:14:35] ahh, i was about to link you to some docs [22:14:48] you'll need to understand and run two different keys, but i just generated two [22:14:52] and changed the default names when prompted [22:15:08] so i have wmf_prod and wmf_lab [22:15:13] Yeah, for some reason when I generated the second the public key was exactly the same. [22:15:16] wmf_prod.pub, etc... [22:15:25] also make sure your keys have passphrases [22:15:36] and i'll have it on the ticket later but you'll want to review https://wikitech.wikimedia.org/wiki/Server_access_responsibilities [22:16:00] there is a link on that page on managing multiple agents, but if you are using os x i can help ya [22:16:01] (03PS1) 10Ori.livneh: On Labs, provision an Apache config dir that is not managed by Puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/142439 (https://bugzilla.wikimedia.org/66751) [22:16:19] andrewbogott: ^ [22:16:42] Deskana: I promise I'm not trying to overwhelm you, even if it seems like it =] [22:17:12] Wait, I'm stupid. I checked only the first few characters of the public key; they were the same, but they are actually different keys. [22:17:25] It changes further down. [22:17:39] Oops. [22:18:46] (03CR) 10Andrew Bogott: "This seems good, although I'm a bit unsure about order of operations. Isn't it possible that the 'if defined' bit will get traversed befo" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142439 (https://bugzilla.wikimedia.org/66751) (owner: 10Ori.livneh) [22:21:33] RobH: Alright, I sent an email linking to my officewiki userpage which has my public keys on. [22:22:27] RobH: Thanks for your help. :) [22:22:54] Yes, I was helping you, and not selfishly avoiding having to do these steps next week! [22:22:56] * RobH is nice [22:22:59] ;] [22:23:25] Deskana: I'm pretty sure every rsa pub key starts with 4 A's ;) [22:23:29] but yea, if you have issues with the ssh key stuff i can help out if you cannot get it working [22:24:00] also you must do a tweak to your ssh terminal to make it work with different keys and not combine them all [22:24:11] or else you can end up loading both labs and prod key in the same ssh-agent session [22:24:14] which is not good. [22:24:15] More stuff in .ssh/config ? [22:24:22] using os x? [22:24:25] Yeah [22:24:27] if so it's a preference in Terminal [22:24:39] https://wikitech.wikimedia.org/wiki/Managing_Multiple_SSH_Agents#OS_X_Solution [22:24:46] i wrote those, so its right! =] [22:24:54] RobH: Cool, I'll follow those and if it doesn't work I'll poke you again. [22:25:02] if you are using default terminal, though i give credit to ryan lane for the solution [22:25:17] without that tweak, it just loads all the keys into one session (even on multiple tabs) [22:25:26] where that fix makes each tab or window its own ssh-agent session [22:25:54] and its really stupid its not the default for os x [22:26:25] * RobH is also amused the os x fix is one line of preferences where the linux fix is a ton of stuff [22:26:32] RobH: on osx, convenience >> * [22:26:37] or something [22:26:40] heh [22:27:00] yes, someone has paid for that simplicity [22:27:08] both in software freedoms and in money.
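The two-key setup RobH walks Deskana through, as a sketch. The wmf_prod/wmf_lab filenames come from the log; the Host patterns in the config stanza are illustrative guesses, not official values.

```bash
# Generate two separate keypairs, accepting a different filename (and
# setting a passphrase) at each prompt, as described above:
ssh-keygen -t rsa -b 4096 -f ~/.ssh/wmf_prod
ssh-keygen -t rsa -b 4096 -f ~/.ssh/wmf_lab

# Point each environment at its own key (Host patterns are assumptions):
cat >> ~/.ssh/config <<'EOF'
Host *.wmnet
    IdentityFile ~/.ssh/wmf_prod
    IdentitiesOnly yes
Host *.wmflabs.org
    IdentityFile ~/.ssh/wmf_lab
    IdentitiesOnly yes
EOF
```

IdentitiesOnly keeps ssh from offering every loaded agent key to every host, which complements the per-tab agent isolation discussed next.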
[22:27:15] * RobH still has os x desktop [22:27:17] Cool. Well, my labs shell still works with that change, so I guess it worked. [22:27:29] Deskana: if you want to test if it's working [22:27:32] load your key in one tab [22:27:39] then open another tab and type in ssh-add -L [22:27:51] if the other tab says it has no identities [22:27:58] when the first has your key [22:27:58] then its good [22:28:28] every time you start a window or tab it'll also show that eval `ssh-agent` command being run [22:28:40] beta cluster is 503ing [22:29:01] and then you get the warm fuzzy feeling of having a secured ssh-agent session for whatever you are doing, with no accidental leaking of unrelated keys [22:30:13] jackmcbarn: hmm, I don't get 503s, but I do get no css/js [22:30:55] they're sporadic. here's what i got exactly: Request: GET http://en.wikipedia.beta.wmflabs.org/wiki/Help:Table, from [my IP] via deployment-cache-text02 frontend ([10.68.16.16]:80), Varnish XID 1620898939 Forwarded for: [my IP] Error: 503, Service Unavailable at Thu, 26 Jun 2014 22:29:19 GMT [22:31:51] jackmcbarn: it's back for me [22:31:59] might have just been a big job running [22:32:08] why do big jobs cause 503s? [22:32:27] virtualized hardware and database upgrades etc [22:32:49] beta cluster runs on the same hardware as wmflabs [22:32:59] oh, i thought you meant job queue jobs. because there was just a big job queue job [22:33:21] ah [22:39:45] (03PS1) 10RobH: francium as new blog server [operations/puppet] - 10https://gerrit.wikimedia.org/r/142447 [22:40:40] (03PS2) 10RobH: francium install params as new blog server [operations/puppet] - 10https://gerrit.wikimedia.org/r/142447 [22:42:21] (03CR) 10RobH: [C: 032] francium install params as new blog server [operations/puppet] - 10https://gerrit.wikimedia.org/r/142447 (owner: 10RobH) [22:47:56] as i sit here, installing another host for the blog [22:48:04] i think this will be the 5th or 6th now over time. [22:48:25] _joe_'s point of this being prime virtualization territory is very on point [22:48:25] heh [22:49:05] the fact im not scrambling after its been really borked or messed up is odd. [22:49:14] very relaxing install. [22:52:27] (03PS2) 10Ori.livneh: On Labs, provision an Apache config dir that is not managed by Puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/142439 (https://bugzilla.wikimedia.org/66751) [22:53:33] (03CR) 10Ori.livneh: "Yes, it's possible. Amended to work around that. It's not what I'd call an elegant solution, but I'll make it neater once https://gerrit.w" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142439 (https://bugzilla.wikimedia.org/66751) (owner: 10Ori.livneh) [22:56:01] I'm sure this is documented somewhere and I've just not found the page, but what's the plan re. PHP 5.4 release to the cluster? IIRC it was waiting for Trusty – does that mean it's a few weeks off, or a few months? [22:57:33] matanya: btw, have you seen https://wikitech.wikimedia.org/wiki/Volunteer_NDA ? Quim created that last week during our internal review of how to do things. It's not long term official official yet, but it's something.
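RobH's two-tab isolation test above, spelled out as a sketch; it assumes the per-tab agent tweak from the wikitech page he links is already in place.

```bash
# Tab 1: the per-tab tweak starts a fresh agent; load only one key.
ssh-add ~/.ssh/wmf_lab
ssh-add -L    # should list exactly that one key

# Tab 2: a separate agent session in a new tab or window.
ssh-add -L    # "The agent has no identities." means the fix is working
```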
[22:57:37] (03PS4) 10Ori.livneh: On Labs, provision an Apache config dir that is not managed by Puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/142439 (https://bugzilla.wikimedia.org/66751) [22:58:19] James_F: As far as I know, the plan for trusty is to get everything running under HHVM [22:58:33] Which is making progress [22:59:04] James_F: yes, we're going to leapfrog over 5.4 [22:59:28] ori: Does HHVM have traits support? (Is there somewhere I can self-serve so I don't waste your time?) [22:59:48] James_F: HHVM targets parity with 5.6 [22:59:52] so yes [22:59:55] Aha. Awesome. [23:00:04] mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140626T2300) [23:00:18] i'll do it [23:00:22] hah [23:00:23] i haven't in ages [23:00:26] Only 11 patches today :P [23:00:35] fun times [23:00:36] :) [23:00:41] OK really 7 [23:00:43] * James_F coughs about 8. [23:00:43] "soft limit" we'll call it [23:00:53] It's 11 if you double-count cherry-picks to both branches [23:00:57] Is there a decision about MW-core support target? (If we keep requiring 5.3 support, we can't use all those new features, but…) [23:01:05] * James_F grumbles. [23:01:48] RoanKattouw: sorry - yeah, lots of stuff [23:01:55] It's not been a good week for me [23:01:58] Hey it's not me today :) [23:02:05] James_F, will happen with trusty [23:02:10] greg-g: for deployment of odder's change, can i ask you to update https://bugzilla.wikimedia.org/show_bug.cgi?id=67120 to ask Mr. Muller to err on the side of over-communicating before large uploads? [23:02:24] * greg-g looks [23:02:28] MaxSem: So MW decisions are made by WMF's Operations availability of tools? Eww. [23:02:37] (03PS3) 10Ori.livneh: Add an Erasmus University domain to whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142309 (https://bugzilla.wikimedia.org/67120) (owner: 10Odder) [23:02:41] (03CR) 10Ori.livneh: [C: 032] Add an Erasmus University domain to whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142309 (https://bugzilla.wikimedia.org/67120) (owner: 10Odder) [23:02:48] ori: yep [23:02:51] (03Merged) 10jenkins-bot: Add an Erasmus University domain to whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142309 (https://bugzilla.wikimedia.org/67120) (owner: 10Odder) [23:03:11] James_F, [Tim Starling] Let's drop support for PHP 5.3 after we switch the WMF servers to 5.4 or later. I think it would be inconvenient to be unable to run our software on our own servers. [23:03:15] James_F: Well if you made 5.5 required today it would be tricky to deploy your code to enwiki :) [23:03:55] bd808: Oh, absolutely, I agree that we can argue about not allowing 5.4 before WMF can run it. [23:04:08] IIRC the two questions (which version the WMF runs in prod and which version is the minimum requirement) are separate [23:04:12] bd808: My concern is the other way around – if we switch to HHVM, is it OK to use 5.6-only things? [23:04:14] and IIRC 5.4 was already agreed upon [23:04:21] * James_F nods. [23:04:22] * aude cringes that we're still using http://www.php.net/manual/en/function.mysql-fetch-object.php in core [23:04:25] Seems reasonable. [23:04:26] deprecated in 5.5 [23:04:31] so once we switch to HHVM, 5.4 would make sense as the minimum standard [23:04:32] to be removed soon [23:05:00] what's in 5.5 and 5.6 that you're lusting after? [23:05:20] * ori re-focuses on deployment [23:05:23] BLEEDINGEDGENESS!!
[23:05:50] ori: Nothing, I was worried about 3rd parties. We're after traits from 5.4. :-) [23:06:25] i never liked parties [23:06:52] are traits that useful? i learned php at the wmf so i never paid much attention to anything that isn't in 5.3 [23:06:59] !log ori updated /a/common to {{Gerrit|Ie96265c4f}}: Add an Erasmus University domain to whitelist [23:07:03] Logged the message, Master [23:07:11] ori: commented [23:07:25] ori: Traits are Ruby's mixins basically [23:07:30] !log ori Synchronized wmf-config/InitialiseSettings.php: Ie96265c4f: Add an Erasmus University domain to whitelist (duration: 00m 05s) [23:07:35] Logged the message, Master [23:07:38] greg-g: thanks [23:07:59] np [23:08:14] ori: i made up those numbers, the first is reasonable, the second is totally made up [23:09:20] twkozlowski: btw, hey look! I updated the deploy calendar on a thursday! [23:09:25] ori: I code php like ^d because I learned it while working with him. I code java totally differently.... [23:09:51] twkozlowski: thanks for the whitespace fixes :) [23:14:16] andrewbogott: sorry, i missed your update to https://gerrit.wikimedia.org/r/#/c/140828/ (wikitech role). it LGTM; any reason not to merge it? [23:14:57] no, please do. I may have a few ongoing changes but it'll be easier if the base is merged. [23:15:10] nod [23:15:16] feel free to nag me in the future if i overlook a patch [23:16:52] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [23:18:22] there was a blip in 5xx: http://gdash.wikimedia.org/dashboards/reqerror/ [23:18:24] !log ori Started scap: CirrusSearch updates: Iefe340729, Ie12418e54, Ie21fb352 [23:18:30] Logged the message, Master [23:18:54] not mw errors [23:18:56] hm [23:19:49] lots of errors in log - some APC-looking stuff and some database-looking stuff [23:20:56] PHP Warning: Recursion detected in RequestContext::getLanguage in /usr/local/apache/common-local/php-1.24wmf10/includes/context/RequestContext.php on line 320 [23:21:10] thats forever [23:21:26] oh, that's yeah [23:23:23] !log ori Finished scap: CirrusSearch updates: Iefe340729, Ie12418e54, Ie21fb352 (duration: 04m 59s) [23:23:29] Logged the message, Master [23:23:35] manybubbles: ^ [23:23:46] ori: I see - [23:24:13] we'll know if it worked the next time we see an update error [23:26:23] ori: did scap pick up https://gerrit.wikimedia.org/r/#/c/142418/ as well?
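For anyone reading along who, like ottomata, stopped at 5.3: a minimal illustration of the PHP 5.4 traits feature under discussion. The names here are made up for the example, not MediaWiki code.

```php
<?php
// Traits are the "mixins" Roan mentions: reusable method bundles that
// classes pull in without inheritance.
trait Logging {
    public function log( $msg ) {
        echo get_class( $this ) . ": $msg\n";
    }
}

class PageRenderer {
    use Logging;  // mixes log() into this class
}

$r = new PageRenderer();
$r->log( 'rendered' );  // prints "PageRenderer: rendered"
```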
[23:27:20] manybubbles: it should have; let me verify [23:27:40] ori: thanks - everything looks right and proper from my side [23:27:41] manybubbles: yes, it did [23:27:45] cool [23:27:59] !log Previous scap included I2cfcfaf06 as well [23:28:04] Logged the message, Master [23:29:50] manybubbles: :) (re notdone to done on outage report) [23:29:52] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [23:29:57] ori: thanks for the sync [23:30:03] (03CR) 10Andrew Bogott: [C: 032] On Labs, provision an Apache config dir that is not managed by Puppet [operations/puppet] - 10https://gerrit.wikimedia.org/r/142439 (https://bugzilla.wikimedia.org/66751) (owner: 10Ori.livneh) [23:30:04] greg-g: just synced it [23:30:04] +r, unless you're also up here in the north bay [23:30:11] yeah [23:30:29] was looking at warmers when more stuff blew up this afternoon [23:30:41] now I'm going to go hang out with my family for a few hours while wikis rebuild [23:30:56] manybubbles: enjoy sir [23:31:20] !log Cirrus rebuild progress - big wikis in group1 are finished with the in-place reindex and well into the from-MediaWiki rebuild. [23:31:24] Logged the message, Master [23:31:54] !log Cirrus rebuild progress - alphabetical wikis in group2 are 2/3 of the way done with reindex - the from-MediaWiki rebuild is maybe 20% done there [23:31:58] Logged the message, Master [23:32:28] !log Cirrus rebuild progress - started large/high cirrus visibility wikis in group2 - enwiki, cawiki, and itwiki. [23:32:34] Logged the message, Master [23:55:44] huh, puppet starting disabled by default [23:55:55] thats a sensible change, i like. [23:56:02] (new installs)