[00:13:37] akosiaris: ping?
[00:15:20] thanks MaxSem seems to be working now :)
[00:19:34] !log springle synchronized wmf-config/db-eqiad.php 's4 depool db1042 while replag catches up'
[00:19:39] Logged the message, Master
[00:25:09] Notice: Undefined variable: wmgMediaViewerLoggedIn in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 1775
[00:25:24] any ideas?
[00:27:06] apergos: Sure :P
[00:27:27] It either needs to be set in InitialiseSettings or a isset has to be used
[00:27:29] want a patch?
[00:27:45] I want it not to be broken like it just became...
[00:28:18] (I'm not going to +2 r deploy anything in my current state, I am waay too sleepy to be trusted)
[00:29:10] I guess ori's push maybe
[00:29:17] oh, wow
[00:29:24] typo! :D
[00:29:46] fun fun. I'll debug. rdwrer, hi.
[00:29:53] thank you
[00:30:02] ori: Howdy
[00:30:08] see above
[00:30:12] * rdwrer looks
[00:30:15] Ah christ.
[00:30:30] * greg-g removes "SUCCESS" from today's SWAT window
[00:30:34] awwww
[00:30:36] I'll write the followup
[00:30:38] :)
[00:30:45] to be re-added after this is fixed ;)
[00:30:50] * rdwrer hugs greg-g
[00:31:08] I might comein to my virrtual office late tomorrow... = today. at the rate my ay is going
[00:31:27] (PS1) Ori.livneh: Correct typo in $wmgMediaViewerLoggedIn var name [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119425
[00:31:34] (PS1) Hoo man: Fix a typo in the MediaViewer config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119426
[00:31:38] damn, org
[00:31:39] * ori
[00:31:42] heh
[00:31:51] (CR) Ori.livneh: [C: 2 V: 2] Correct typo in $wmgMediaViewerLoggedIn var name [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119425 (owner: Ori.livneh)
[00:32:00] ...
[00:32:03] !log ori updated /a/common to {{Gerrit|Ib83df6e31}}: Correct typo in $wmgMediaViewerLoggedIn var name
[00:32:06] test. *everything*. :-P
[00:32:08] Logged the message, Master
[00:32:18] Oh, typo.
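(Editorial aside: the two remedies suggested at 00:27:27 for the undefined-variable notice, defining the setting or guarding the read, can be sketched roughly as below. The real code is PHP; this is only a Python analogue, and the `settings` dict, the misspelled key, and both helper functions are hypothetical illustrations, not the actual config or the actual typo.)

```python
# Hypothetical stand-in for values defined in InitialiseSettings.php.
# The key is deliberately misspelled to mimic a typo in the variable name;
# the real typo is not shown in the log, so this spelling is an assumption.
settings = {"wmgMediaViewerLogedIn": True}


def read_unguarded(name):
    """Read a setting directly; a missing name raises KeyError
    (loosely comparable to PHP's 'Undefined variable' notice)."""
    return settings[name]


def read_guarded(name, default=False):
    """Read a setting with a fallback, similar in spirit to an isset() check."""
    return settings.get(name, default)


if __name__ == "__main__":
    try:
        read_unguarded("wmgMediaViewerLoggedIn")
    except KeyError as err:
        print("unguarded read failed for", err)
    print("guarded read:", read_guarded("wmgMediaViewerLoggedIn"))
```

(The guard hides the symptom but not the typo itself, which is presumably why the fix that was merged corrected the variable name instead.)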
[00:32:21] Oh, ori beat me.
[00:32:27] (Abandoned) Hoo man: Fix a typo in the MediaViewer config [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119426 (owner: Hoo man)
[00:32:37] :)
[00:32:46] thanks again, ori
[00:32:56] !log ori synchronized wmf-config/InitialiseSettings.php 'Ib83df6e31: Correct typo in $wmgMediaViewerLoggedIn var name'
[00:33:01] Logged the message, Master
[00:33:14] let's see how that is now
[00:33:58] Warning: file(/usr/local/apache/common-local/php-1.23wmf18/../mediaviewer.dblist): failed to open stream: No such file or directory in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 196
[00:33:58] Warning: array_map(): Argument #2 should be an array in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 196
[00:33:58] Warning: in_array() expects parameter 2 to be array, null given in /usr/local/apache/common-local/wmf-config/CommonSettings.php on line 197
[00:34:13] ouch
[00:34:15] ya mind stomping on those too?
[00:34:55] I'll fix.
[00:35:01] thanks much
[00:35:03] then, I don't this time
[00:35:57] $ time echo ':q' | vim InitialiseSettings.php
[00:35:57] real 0m7.884s
[00:36:09] That file's just to big... :/
[00:36:51] !log ori synchronized mediaviewer.dblist
[00:36:56] Logged the message, Master
[00:37:30] should I check again or is ther more to come yet?
[00:37:37] could you check again?
[00:37:41] yup sec
[00:38:11] running
[00:38:15] thank you for the fixes
[00:38:34] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[00:39:01] thanks for reporting
[00:39:46] real 0m3.499s
[00:39:59] on a server box of mine... I wonder why vim's so slow locally
[00:40:33] try vim -u NONE -N LocalSettings.php
[00:42:21] that makes it *much* faster
[00:42:35] even with just the -N it's "only" 3.5s
[00:44:54] rm -rf .vim/ .viminfo did the trick...
that's weird
[00:48:46] (PS2) Tim Landscheidt: WIP: labsdeprepo: Allow more than one local repository [operations/puppet] - https://gerrit.wikimedia.org/r/118796
[00:53:20] andrewbogott: no you can kill it
[00:53:31] or it can be left until deletion
[00:59:16] (PS1) Tim Landscheidt: WIP: Tools: Use labsdeprepo [operations/puppet] - https://gerrit.wikimedia.org/r/119428
[01:00:25] (CR) jenkins-bot: [V: -1] WIP: Tools: Use labsdeprepo [operations/puppet] - https://gerrit.wikimedia.org/r/119428 (owner: Tim Landscheidt)
[01:30:22] (CR) Faidon Liambotis: "The confusion has been there for some time. This sounded familiar so I went digging :)" [operations/puppet] - https://gerrit.wikimedia.org/r/119339 (owner: BryanDavis)
[01:30:36] bd808|BUFFER: ^^
[01:32:01] cmjohnson1: ping?
[01:33:31] nevermind
[01:33:49] what is dbstore1001?
[01:35:09] springle: hi?
[01:35:24] I can't find an RT or anything on manifests/site.pp and the switch port even had no description
[01:35:50] we had an unnamed switch port saturate a gigabit, so I went a little digging :)
[02:02:37] (CR) Aaron Schulz: [C: 1] Beta cluster MemcachedPeclBagOStuff: use PHP serialization [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119411 (owner: Ori.livneh)
[02:03:55] argh
[02:04:08] I spent something like 2-3 days to make igbinary build
[02:04:11] and properly work
[02:04:26] well igbinary + libmemcached + pecl memcached, but still
[02:05:19] I wonder if it was for nothing...
[02:07:15] !log LocalisationUpdate completed (1.23wmf17) at 2014-03-19 02:07:15+00:00
[02:07:21] (CR) Faidon Liambotis: "Is there anywhere I can read more about this casual testing/igbinary savings comparison?
I'm just curious :)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119411 (owner: Ori.livneh)
[02:07:22] Logged the message, Master
[02:08:20] * AaronSchulz looks at https://bugzilla.mozilla.org/show_bug.cgi?id=842623
[02:09:42] * paravoid looks at http://vanilla-js.com/
[02:12:42] !log LocalisationUpdate completed (1.23wmf18) at 2014-03-19 02:12:42+00:00
[02:12:47] Logged the message, Master
[02:23:27] paravoid: hi
[02:26:00] springle: hi, see above
[02:26:13] regarding dbstore1001
[02:28:17] paravoid: dbstore100[12] are new. still doing the setup
[02:29:41] paravoid: where are switch ports named or unnamed?
[02:41:14] on the switches themselves
[02:41:28] that's something that typically chris does when cabling, but it can obviously be easily missed
[02:41:30] is that something i should have done?
[02:41:33] oh ok
[02:42:31] we need to add dbstore to at least https://wikitech.wikimedia.org/wiki/Server_naming_conventions & puppet though
[02:42:42] role class, site.pp etc.
[02:42:51] yep
[02:42:53] not urgent, but since you asked :)
[02:43:31] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Mar 19 02:43:28 UTC 2014 (duration 43m 27s)
[02:43:37] Logged the message, Master
[02:43:51] will do :)
[02:44:21] * springle does the wiki bit now
[02:46:10] puppet will have to wait until i actually figure out how things will work
[02:48:52] (PS1) BryanDavis: Fix order of graphs for dashboards [operations/puppet] - https://gerrit.wikimedia.org/r/119437
[03:06:21] (PS1) Tim Landscheidt: Labs: Provide symbolic links to dumps for compatibility [operations/puppet] - https://gerrit.wikimedia.org/r/119438
[03:29:04] (PS2) BryanDavis: Ensure that status is always defined in deploy.checkout [operations/puppet] - https://gerrit.wikimedia.org/r/119232
[03:39:34] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[03:44:33] greg-g: bd808: can someone confirm lots of 503s at https://gdash.wikimedia.org/dashboards/reqerror/deploys ?
[03:44:47] (i.e. the graph images themselves are 503)
[03:45:18] jeremyb: Yes. Uncheck the "show deploys" box
[03:45:30] right, but i checked it for a reason :)
[03:45:42] jeremyb: https://bugzilla.wikimedia.org/show_bug.cgi?id=62667
[03:46:26] well then i'm confused about how RT 6970 is closed
[03:47:06] bd808
[03:47:12] When we closed it we thought the only problem was the port configuration
[03:47:27] https://gerrit.wikimedia.org/r/#/c/119339/
[03:47:35] Since then I've found other issues
[03:48:08] I'm not sure of the source of the wild list of ampersands yet
[03:48:12] so, where do you want to track it? is it now mostly an ops thing to fix or not ops?
[03:48:32] hah, wild list!
[03:48:35] i see list now
[03:48:58] 4188 in the url I counted them for
[03:49:27] I'm currently doing some timeout handling testing in parsoid and am getting quite a lot of timeouts and connection failures from the API, which is handy for testing but maybe not ideal for the API
[03:50:10] is there any data on used sockets, connection errors etc for the text varnishes and the api cluster?
[03:50:11] jeremyb: both,link them,don't care it's duplicate,wait for general tool review discussion.. /me hides again
[03:50:33] * jeremyb gets out his fishing pole
[03:50:38] * jeremyb hooks mutante|away
[03:50:44] paravoid: the switch ports may still be labeled db1064/65...looking now
[03:50:57] bd808: but it turned into 200 after a few tries (apparently the page autorefreshes)
[03:51:12] jeremyb: It may be useful for debugging to know what the contents of /etc/gdash/gdash.yaml on tungsten are
[03:51:30] what's the trick to get on new eqiad labs instance
[03:51:35] bd808: 301 FOUND mutante|away
[03:51:42] :)
[03:52:16] 503 is back!
[03:57:45] {"graphite": "https://graphite.wikimedia.org", "options": {"deploy_addon": "alias(color(dashed(drawAsInfinite(deploy.sync-common-file)),\"gold\"),\"sync-common-file\")&alias(color(lineWidth(drawAsInfinite(deploy.sync-common-all),2),\"gold\"),\"sync-common-all\")&alias(color(lineWidth(drawAsInfinite(deploy.scap),2),\"black\"),\"scap deploy\")\n", "graph_columns": 1, "graph_height": 500, "graph_width": 1024, "hide_legend": false, "title
[03:57:52] there
[03:58:03] you asked for the file contents, but i have no idea besides that:)
[03:58:19] mutante|away: thanks
[03:59:10] mutante|away: that's truncated
[03:59:19] bd808: look like enough or you want the whole thing?
[04:00:16] jeremyb: I think it's got the bit I was interested it. The "deploy_addon" clause was what I was after.
[04:00:36] http://paste.debian.net/plain/88493
[04:01:50] bd808: i'm starting to think that file's not actually yaml?
[04:02:26] Actually it is :) JSON is a propper subset of YAML
[04:07:08] well it's not yamlithic? as in python could be pythonic.
[04:08:25] True enough.
[04:10:24] * bd808 has bleeding eyes from reading ruby code
[04:24:21] http://en.wikipedia.org/wiki/HTTP_451
[04:45:55] (PS1) BryanDavis: Trim newline from gdash config [operations/puppet] - https://gerrit.wikimedia.org/r/119440
[04:47:27] (CR) Ori.livneh: [C: 2] "Bless your heart" [operations/puppet] - https://gerrit.wikimedia.org/r/119437 (owner: BryanDavis)
[04:48:03] (CR) Ori.livneh: [C: 2] Trim newline from gdash config [operations/puppet] - https://gerrit.wikimedia.org/r/119440 (owner: BryanDavis)
[04:51:09] bd808: merged & forced a puppet run on tungsten; could you verify?
[04:52:03] ori: Order on https://gdash.wikimedia.org/dashboards/reqerror/ looks right now
[04:52:37] thanks very much for doing that; i had a python script for renumbering graphs at one point but i lost it
[04:52:50] i really dislike gdash
[04:53:10] ori: If you get bored tonight and want a mystery to trace: https://bugzilla.wikimedia.org/show_bug.cgi?id=62667 . See if you can figure out why turing on the deploy markers adds a ton of ampersands to the graph url. https://gdash.wikimedia.org/dashboards/reqerror/deploys does it the worst from what I've seen.
[04:54:00] The really weird thing is that the number of ampersands seems almost random from page load to page load.
[04:54:11] i note in passing that you used to gently mock me for working at this hour and warned me of burnout :P
[04:54:25] Or… maybe changes based on time?
[04:54:42] It's only 11PM here :) and I'm a burnout
[04:54:45] * ori takes a moment to read the bug to understand what you're talking about
[04:56:16] ori: Here's the url I got a few minutes ago: http://p.defau.lt/?YnKTOSQ94WEy4qTaQQBwEw
[04:56:50] Then it changed to http://p.defau.lt/?b2KGiWU1tROzBp_J1bYe9g
[04:56:54] it's valley girl talk
[04:56:57] and, and, and, and, and
[04:58:02] Something is going goofy with a join somewhere in a big pile of ruby
[04:58:09] yeah, i think i know where
[05:00:33] I want to setup https://github.com/kenhub/giraffe somewhere to see if it is as nice to use as kibana is for elasticsearch dashboards
[05:02:56] Hello Mr/Ms. I have a problem that I don't know where to ask for help and a steward direct me here. My problem is that I have just changed my password yesterday and now I can't remember the new password.
[05:03:08] I also have not added an valid email for my account (my bad) so I am hopeless now. Could you help me or direct me to a appropriate place to ask for assistance?
[05:11:14] PROBLEM - SSH on lvs1001 is CRITICAL: Server answer:
[05:12:14] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[05:25:50] (PS1) Ori.livneh: Fix graph URL generation when deploy addon is enabled [operations/software/gdash] - https://gerrit.wikimedia.org/r/119444
[05:25:52] (PS1) Spage: Enable Popups (Hovercards) on mediawiki.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119445
[05:26:14] bd808|BUFFER: faidon had giraffe set up on noc.wikimedia.org at one point
[06:24:03] (CR) Ori.livneh: [C: 2 V: 2] "Tested in production."
[operations/software/gdash] - https://gerrit.wikimedia.org/r/119444 (owner: Ori.livneh)
[06:40:34] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[07:58:31] (PS1) Ori.livneh: Use libvmod-header to set GeoIP cookie [operations/puppet] - https://gerrit.wikimedia.org/r/119448
[07:59:39] (PS2) Ori.livneh: Use libvmod-header to set GeoIP cookie [operations/puppet] - https://gerrit.wikimedia.org/r/119448
[08:37:26] (PS1) ArielGlenn: space out page titles and media titles generation on snapshots [operations/puppet] - https://gerrit.wikimedia.org/r/119449
[08:39:05] (CR) ArielGlenn: [C: 2] space out page titles and media titles generation on snapshots [operations/puppet] - https://gerrit.wikimedia.org/r/119449 (owner: ArielGlenn)
[08:54:29] !log springle synchronized wmf-config/db-eqiad.php 's4 repool db1042'
[08:54:34] Logged the message, Master
[09:04:22] morning hashar! https://gerrit.wikimedia.org/r/#/c/119411/ is for you :D
[09:05:00] ori: good morning
[09:05:20] hello hello
[09:07:48] I am still warming up sorry :(
[09:08:19] i know, i feel a little guilty for jumping on you like that
[09:08:36] no rush, do other things, have coffee, read the news! :P
[09:08:56] ori: well you can probably merge it in since it only impacts beta
[09:09:20] during our monday calls I was talking about the multiwrite setup causing slowness
[09:09:40] unrelated to that serialization stuff
[09:10:02] yeah. is the slowdown significant?
[09:10:12] with multi-write? Yeah
[09:10:18] aka writing to both pmtpa and eqiad
[09:10:25] * ori will merge, by the way, if you don't object
[09:10:27] it is not stalling the cluster, but definitely very slow
[09:10:33] note that beta does not have twemproxy
[09:11:00] also parser cache is in memcached iirc
[09:11:11] yep
[09:11:21] (CR) Hashar: "I guess you can merge it in since it only impact beta."
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119411 (owner: Ori.livneh)
[09:11:38] (CR) Ori.livneh: [C: 2] Beta cluster MemcachedPeclBagOStuff: use PHP serialization [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119411 (owner: Ori.livneh)
[09:13:08] !log ori synchronized wmf-config/mc-labs.php 'I46a9d180b: Beta cluster MemcachedPeclBagOStuff: use PHP serialization'
[09:13:13] Logged the message, Master
[09:13:55] seems to work
[09:14:00] hashar: why no twemproxy?
[09:14:12] cause nobody has set it up yet ? :-]
[09:14:19] there is no swift either
[09:14:23] nor a database parser cache
[09:14:30] maybe I should fill bugs for them hehe
[09:15:36] class { "mediawiki": twemproxy => false }
[09:15:47] do you remember what was the reason?
[09:16:56] the commit message might have details
[09:18:16] heh: https://gerrit.wikimedia.org/r/#/c/65604/
[09:18:21] "twemproxy proc monitor, don't run in labs "
[09:18:34] very helpful commit message :D
[09:18:57] I looked at that last week
[09:19:03] it should be safe to set that to true; it will install the package and service
[09:19:03] since we use a bunch of "include mediawiki"
[09:19:12] the twemproxy ends up being installed on the beta instances
[09:19:13] it won't do anything until you change the mediawiki configs to match
[09:19:26] AND
[09:19:30] generic_upstart is bugged
[09:19:39] i hate that thing
[09:19:42] and never install the conf nor starts the service hehe
[09:19:57] some lint change modified the generic_upstart() parameters to be boolean
[09:20:03] when its called using strings :]
[09:20:07] so it is always false
[09:20:21] https://gerrit.wikimedia.org/r/#/c/118714/ :]
[09:20:36] * ori debugs
[09:20:45] oh, you already have
[09:20:48] look at https://gerrit.wikimedia.org/r/#/c/118714/ and the follow up changes
[09:21:11] yeah all patches are pending and I have split the changes by module (lvs, openstack, twemproxy)
[09:21:18] so ops can safely review / apply / deploy
each changes
[09:21:24] gotta wait now :]
[09:21:37] (PS2) Ori.livneh: Restore generic::upstart_job parameters [operations/puppet] - https://gerrit.wikimedia.org/r/118714 (owner: Hashar)
[09:23:24] (PS2) Hashar: salt: qualify site and realm [operations/puppet] - https://gerrit.wikimedia.org/r/119403 (owner: Matanya)
[09:23:41] hashar: hrm, is the first patch in the series really worth it, then?
[09:23:56] (CR) Hashar: [C: 1] salt: qualify site and realm [operations/puppet] - https://gerrit.wikimedia.org/r/119403 (owner: Matanya)
[09:23:58] it's all broken now, so merging the module patches should only improve things
[09:24:09] ori: yup it adds back compatibility with boolean/strings
[09:24:16] to avoid having something broken later on
[09:24:26] thanks hashar
[09:24:49] matanya: :-]
[09:24:56] i guess you're right, being pedantic just sets us up for pain later on
[09:25:13] (CR) Ori.livneh: [C: 2] Restore generic::upstart_job parameters [operations/puppet] - https://gerrit.wikimedia.org/r/118714 (owner: Hashar)
[09:25:43] ori: that comes from my experience of running decade long projects hehe :]
[09:27:08] in one occasion a soft had the french holidays hardcoded for the next 4 or 5 years
[09:27:20] including easter (and you dont want to compute the date of easter)
[09:27:53] (CR) Nikerabbit: Beta cluster MemcachedPeclBagOStuff: use PHP serialization (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119411 (owner: Ori.livneh)
[09:27:55] and I have been very firm: lets keep the 4 years hardcoded this way the app will crash in 4 years and we can use that excuse to leverage its replacement
[09:28:14] that happened exactly like that, on easter 4 years later, the service died, we got budget and replaced it
[09:28:21] but I am distressing with non sense.
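(Editorial aside: the generic_upstart problem described between 09:19:57 and 09:24:09, a parameter checked as a boolean while callers pass strings, can be illustrated roughly as follows. This is a Python sketch of the general idea only; the actual change is Puppet code in https://gerrit.wikimedia.org/r/118714 and may work differently. The function names `strict_is_enabled` and `to_bool` are made up for the example.)

```python
def strict_is_enabled(value):
    """Strict check: only the real boolean True counts.
    A caller passing the string "true" is treated as disabled."""
    return value is True


def to_bool(value):
    """Compatibility shim: accept a real boolean or the strings
    'true' / 'false' (any case, surrounding whitespace ignored)."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered == "true":
            return True
        if lowered == "false":
            return False
    raise ValueError(f"expected a boolean or 'true'/'false', got {value!r}")


if __name__ == "__main__":
    for candidate in (True, "true", "False", False):
        print(repr(candidate), strict_is_enabled(candidate), to_bool(candidate))
```

(Rejecting anything that is neither a boolean nor one of the two strings keeps a mistyped value from being silently read as "disabled", which is the failure mode described in the log.)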
[09:28:50] (CR) Ori.livneh: Beta cluster MemcachedPeclBagOStuff: use PHP serialization (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119411 (owner: Ori.livneh)
[09:29:11] hashar: haha, that's brilliant
[09:29:17] ori: plus you can add that igbinary serialization stuff on wikitech.
[09:30:01] ori: in big corporation, it is all about political and advancing your project under cover. It is quite fun and rewarding
[09:30:01] though exhausting on the long term
[09:31:21] ori: and thanks for the generic_upstart() merge :]
[09:32:35] np, i ran puppet on an app server to confirm it was a no-op (given that the config file was already there from before the regression)
[09:34:07] cigarette time brbr
[09:35:20] hashar, did you know that a drop of nicotine can kill a horse?:P
[09:38:25] today I learned something new
[09:41:34] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[09:46:14] (PS1) Ori.livneh: Git::clone: correct onlyif checks [operations/puppet] - https://gerrit.wikimedia.org/r/119454
[09:47:48] hashar: deployment-bastion:~$ service twemproxy status
[09:47:48] twemproxy start/running, process 2674
[09:47:57] !!
[09:48:43] running puppet on deployment-apache32.pmtpa.wmflabs
[09:48:48] it has
[09:48:54] hashar@deployment-apache32:~$ service twemproxy status
[09:48:54] twemproxy stop/waiting
[09:48:55] waiting for puppet
[09:51:15] (PS1) Ori.livneh: applicationserver: don't set beta's twemproxy to false [operations/puppet] - https://gerrit.wikimedia.org/r/119455
[09:51:35] MaxSem: How do you liquefy nicotine? :-)
[09:51:57] twkozlowski: first, you must acquire a horse..
[09:52:40] Haha, true :-)
[09:53:33] googling around reveals that liquid nicotine is in fact a thing; it is actually sold as medication to help smokers quit
[09:54:48] Yes; I asked the question before consulting Wikipedia
[09:54:53] I apologize.
[09:55:01] http://www.myfreedomsmokes.com/unflavored-nic-liquid/unflavored-nicotine-smoke-juice-e-liquid/
[09:55:15] i'm amazed that there are people who find that product name enticing
[09:55:22] "UnFlavored Nicotine Smoke Juice E-Liquid"
[09:55:28] it sounds like a wet ashtray
[09:56:01] But, it's "diluted to a safer and more maneageable level"
[09:56:11] precisely what some people need
[09:56:13] tell that to the horses
[09:56:45] I wonder if there might a 'drop' of nicotine in the whole bottle as MaxSem said
[09:57:07] s/might/will be/
[09:57:40] "Very good nicotine,mixes well with flavores"
[09:57:46] :-D
[09:58:22] (CR) Ori.livneh: [C: 2] Git::clone: correct onlyif checks [operations/puppet] - https://gerrit.wikimedia.org/r/119454 (owner: Ori.livneh)
[10:02:44] akosiaris: how to fix the validation error found after I used the syntax you suggested in https://gerrit.wikimedia.org/r/#/c/117250/13/manifests/misc/maintenance.pp ?
[10:12:48] (PS1) Ori.livneh: Git::clone: set core.sharedRepository=group for shared repos [operations/puppet] - https://gerrit.wikimedia.org/r/119458
[10:14:24] Nemo_bis: enabled => $enabled
[10:14:41] ori, how much sleep do you usually have?
[10:14:55] and $enabled = abset
[10:15:04] (CR) Ori.livneh: [C: 2] Git::clone: set core.sharedRepository=group for shared repos [operations/puppet] - https://gerrit.wikimedia.org/r/119458 (owner: Ori.livneh)
[10:15:25] abset?
[10:15:27] absent
[10:15:29] ok
[10:15:40] MaxSem: "some"
[10:16:18] not very well defined type
[10:21:44] It's not yet 11 AM UTC
[10:22:02] so ori has to stay awake for a little while
[10:22:07] :-D
[10:24:30] ori: still have libmemcached conflicting :(
[10:24:59] ori: I guess I should remove hhvm from the beta application servers :]
[10:25:24] hashar: could i have one more day to try and fix it?
[10:25:32] yeah sure
[10:25:40] is hhvm used for anything right now ?
[10:25:55] yes, creating apt conflicts in beta
[10:26:14] :-]
[10:36:37] !log Jenkins attempting to bring up integration-slave1001.eqiad.wmflabs (aka migrating Jenkins slave nodes in labs from pmtpa to eqiad)
[10:36:42] Logged the message, Master
[10:41:43] stupid nodejs
[10:43:12] !log Jenkins pooling integration-slave1002.eqiad.wmflabs
[10:43:16] Logged the message, Master
[10:47:20] akosiaris: paravoid: any idea how I could get nodejs 10.x installed ?
[10:51:58] (PS1) Ori.livneh: Set serializer to 'php' for production memcache [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119461
[11:24:00] !log Jenkins: all npm based jobs are broken due to nodejs self signed certificate being outdated on contint labs instances (see {{bug|61508}} and mail to ops list)
[11:24:05] Logged the message, Master
[11:28:19] (CR) Hashar: [C: 1] "Change can be merged in :-] twemproxy is already installed anyway." [operations/puppet] - https://gerrit.wikimedia.org/r/119455 (owner: Ori.livneh)
[11:43:38] matanya: I tried applying your proposed fix to the puppet patch but it doesn't validate; maybe I misinterpreted your words
[11:43:50] I guess I have to study the puppet docs, hmpf
[11:44:58] Nemo_bis: what did you try to do?
[11:45:17] I tried half a dozen possible interpretations of your suggestion
[11:46:05] Nemo_bis: look in admins.pp as an example
[11:46:16] a bad file, but shows the point
[11:48:19] matanya: which part?
the first example I find is the syntax akosiaris told me to change
[11:49:12] unixaccount { $realname: username => $username, uid => $uid, gid => $gid, enabled => $enabled }
[11:49:26] $enabled = absent
[11:54:04] PROBLEM - MySQL InnoDB on db1038 is CRITICAL: CRIT longest blocking idle transaction sleeps for 603 seconds
[11:57:04] RECOVERY - MySQL InnoDB on db1038 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[11:57:32] (PS15) Nemo bis: Add cron job to run characterEditStats.php on multilingual wikis weekly [operations/puppet] - https://gerrit.wikimedia.org/r/117250
[12:00:15] (CR) Nemo bis: Add cron job to run characterEditStats.php on multilingual wikis weekly (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/117250 (owner: Nemo bis)
[12:00:35] (PS5) Hashar: Configuration for beta cluster caches in eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/115629
[12:00:37] (PS5) Nemo bis: Use $wgTranslatePageTranslationULS [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115879
[12:04:36] (CR) Hashar: "Added a few more reviewers, this is merely cache configuration for the beta cluster Varnishes." [operations/puppet] - https://gerrit.wikimedia.org/r/115629 (owner: Hashar)
[12:15:40] (PS2) Hashar: beta: sent HTCP purges to eqiad varnishes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/116788
[12:21:25] (PS3) Hashar: Convert syslog-ng conf to templates [operations/puppet] - https://gerrit.wikimedia.org/r/119255
[12:26:17] ori: still awake?
[12:29:54] apergos: it struck me that if you are still looking for information about search logs you should try ottomatta
[12:29:59] he has done some work on them
[12:33:41] (CR) Matanya: [C: 1] Tools: Install package joe [operations/puppet] - https://gerrit.wikimedia.org/r/118595 (owner: Tim Landscheidt)
[12:34:06] (PS1) Matanya: coredb_mysql: qualify vars [operations/puppet] - https://gerrit.wikimedia.org/r/119468
[12:35:07] (CR) Nikerabbit: [C: 1] Use $wgTranslatePageTranslationULS [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115879 (owner: Nemo bis)
[12:38:55] !log reindexed all group0 wikis now that we have the Cirrus SWAT deploy (from yesterday). Once the reindex is done we can deploy code to improve performance. Yummy. Starting on commons and eventually doing the rest of group1.
[12:39:00] Logged the message, Master
[12:42:34] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC
[12:45:35] manybubbles: cirrus is default on group0 and group1?
[12:46:06] matanya: group0: yes
[12:46:12] group1: only some of them
[12:46:26] we honestly should probably make it the default on all of them at this point
[12:47:10] i still get weird results from time to time on wikitech manybubbles
[12:47:21] matanya: then complain about them
[12:47:26] i hvae
[12:47:28] *have
[12:47:41] chad said he will look into it
[12:47:42] did I file a bug about it that I haven't fixed?
[12:47:48] ah
[12:47:53] not should what happened at the end
[12:47:56] *sure
[12:48:05] kind of stalled, i guess
[12:48:27] is it something with reproduction steps or is it a comes and goes sort of thing?
[12:48:33] either way, I'd like a bug for it
[12:48:42] if you could file a bug I'd love it
[12:48:43] come and goes
[12:48:52] if you just want to tell me about it here I'll turn it into a bug
[12:49:12] let me gre my logs to find the exact error message
[12:49:15] *grep
[12:49:30] wikitech is weird too because we (chad and I) don't maintain it
[12:49:42] so we _can't_ be careful with it
[12:51:15] we're in the stage where we have to know what is going with every release because releases need prior reindexes or enable later features that we have yet to merge, that sort of thing
[12:51:39] this can be trouble for wikitech
[12:51:49] so can upgrading Elasticsearch without upgrading cirrus to support it
[12:52:15] normally Elasticsearch is backwards compatible but we went from 0.X to 1.X a few weeks ago and that broke a few things
[12:52:50] new code which we deployed to the cluster a week before the Elasticsearch deploy smoothed everything over but wikitech didn't get it
[12:53:03] i see
[12:53:38] (PS1) Hashar: beta: jobQueueAggregator for eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119469
[12:53:48] (CR) Hashar: [C: 2] beta: jobQueueAggregator for eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119469 (owner: Hashar)
[12:53:56] (Merged) jenkins-bot: beta: jobQueueAggregator for eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/119469 (owner: Hashar)
[12:59:57] [17:52:44] manybubbles: An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later.
[13:00:25] [17:52:55] on wikitech
[13:00:25] [17:53:05] matanya: gar wikitech
[13:00:25] [17:53:35] manybubbles: is this useful error message for you?
[13:00:25] [17:53:43] matanya: can you send me a link?
[13:00:25] [17:53:51] https://wikitech.wikimedia.org/w/index.php?search=submodule&title=Special%3ASearch&go=Go
[13:00:27] [17:54:00] matanya: not really.
Its a catch all that says "go read the logs" but I don't have access to wikitech's logs
[13:00:28] matanya: have a date stamp? I might be able to attribute it
[13:00:53] yes, this is the ES0.X -> ES1.0 incident
[13:01:16] ok, so i can assume solved, right?
[13:01:56] that issue is solved
[13:02:02] that process issue around wikitech isn't
[13:02:06] I'm starting an email about that now
[13:02:32] ok, so i contribute my cent
[13:02:47] * matanya closes the case
[13:03:45] matanya: would add you do the email but don't know your email. can you put your irc name on https://office.wikimedia.org/wiki/Contact_list
[13:04:34] no manybubbles not employed by wmf
[13:04:48] matanya: ah, hard to tell sometimes:)
[13:04:55] ah?
[13:05:03] just for folks in general
[13:05:10] especially the super helpful ones
[13:05:15] oh, ok :)
[13:05:24] anyway, if you msg me your email I'll bcc you on my complaining
[13:05:26] but if you mail ops, i'll see it
[13:05:30] or you can trust me that I'm complaining
[13:16:32] (PS16) Alexandros Kosiaris: Add cron job to run characterEditStats.php on multilingual wikis weekly [operations/puppet] - https://gerrit.wikimedia.org/r/117250 (owner: Nemo bis)
[13:24:29] hi there mutante
[13:25:18] (CR) Alexandros Kosiaris: [C: 2] Add cron job to run characterEditStats.php on multilingual wikis weekly [operations/puppet] - https://gerrit.wikimedia.org/r/117250 (owner: Nemo bis)
[13:33:03] akosiaris: thanks! Was this a manual run you made? https://meta.wikimedia.org/w/index.php?title=Meta:Babylon/Translation_stats&action=history
[13:35:25] Nemo_bis: nope
[13:35:29] so automatic ?
[13:41:51] (PS11) Dzahn: lint admins.pp [operations/puppet] - https://gerrit.wikimedia.org/r/118794
[13:42:37] (CR) jenkins-bot: [V: -1] lint admins.pp [operations/puppet] - https://gerrit.wikimedia.org/r/118794 (owner: Dzahn)
[13:43:58] mutante: At best you just checkout master, do the changes from sratch and then copy in the change id...
[13:45:28] (PS12) Dzahn: lint admins.pp [operations/puppet] - https://gerrit.wikimedia.org/r/118794
[13:45:48] akosiaris: I'm not sure; if you want to verify that nothing unexpected happens you could run it manually, otherwise I can wait for cron :)
[13:46:12] (CR) jenkins-bot: [V: -1] lint admins.pp [operations/puppet] - https://gerrit.wikimedia.org/r/118794 (owner: Dzahn)
[13:46:55] Nemo_bis: grrr ma bad
[13:47:04] it broke puppet on terbium. fixing
[13:47:17] i should have paid more attention to the call
[13:48:45] (PS13) Dzahn: lint admins.pp [operations/puppet] - https://gerrit.wikimedia.org/r/118794
[13:49:39] (CR) jenkins-bot: [V: -1] lint admins.pp [operations/puppet] - https://gerrit.wikimedia.org/r/118794 (owner: Dzahn)
[13:51:43] !log Jenkins : applying crazy node/npm upgrade hack on Jenkins labs instances integration-slave1001 and integration-slave1002 ( ref: https://bugzilla.wikimedia.org/show_bug.cgi?id=61508#c2 and ops list)
[13:51:48] Logged the message, Master
[13:52:00] sorry :(
[13:53:10] (PS14) Dzahn: lint admins.pp [operations/puppet] - https://gerrit.wikimedia.org/r/118794
[13:53:52] (PS1) Alexandros Kosiaris: Fix bug introduced in 68e354d [operations/puppet] - https://gerrit.wikimedia.org/r/119477
[13:54:52] (CR) Alexandros Kosiaris: [C: 2] Fix bug introduced in 68e354d [operations/puppet] - https://gerrit.wikimedia.org/r/119477 (owner: Alexandros Kosiaris)
[13:58:28] (CR) Nemo bis: "Followup on I6b2ed821d090ef01ec50012567d4855e47cbd6d3 (sorry)" [operations/puppet] - https://gerrit.wikimedia.org/r/117250 (owner: Nemo bis)
[14:09:46] (PS1) Hashar: beta: fill in swift backend for upload varnish [operations/puppet] - https://gerrit.wikimedia.org/r/119478
[14:13:18] ottomata: we can't move the an boxes yet.
we need to setup an analytics vlan for row d [14:13:37] oh ha [14:13:39] haha, ok [14:13:44] RIiighhhhht [14:13:54] glad i checked first [14:13:56] and get the network ACLs all worked out [14:14:09] man I should learn how to do all that, eh? [14:14:30] add it to the list [14:14:47] paravoid: my networking mentor, would you be willing to show how to do that? [14:17:10] (03PS2) 10Hashar: beta: fill in swift backend for upload varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/119478 [14:18:51] (03PS15) 10Dzahn: lint admins.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/118794 [14:22:43] (03CR) 10Alexandros Kosiaris: [C: 032] Lint misc/logging.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/119251 (owner: 10Hashar) [14:24:15] akosiaris: hurrah :] [14:24:19] (03PS2) 10Dzahn: let yuvipanda and tfinc upload mobile tarballs [operations/puppet] - 10https://gerrit.wikimedia.org/r/119336 [14:24:32] (03CR) 10Alexandros Kosiaris: [C: 032] coredb_mysql: qualify vars [operations/puppet] - 10https://gerrit.wikimedia.org/r/119468 (owner: 10Matanya) [14:24:41] akosiaris: the class is applied on the two beta cluster bastions if you want to check deployment-bastion.pmtpa.wmflabs and deployment-bastion.eqiad.wmflabs :] [14:24:53] (03CR) 10Dzahn: [C: 031] "added Yuvi, please update ticket" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119336 (owner: 10Dzahn) [14:27:03] hashar: I had already compiled and diffed all catalogs before merging, so I was pretty sure it was a noop [14:28:39] :]]]]]]] [14:30:11] akosiaris: I have marked them all with Gerrit topic "syslog-ng" : https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:syslog-ng,n,z [14:34:15] andrewbogott: any news about https://gerrit.wikimedia.org/r/#/c/97007/ ? [14:35:43] how do i add a repo to the output of grrrit-wm? 
[14:35:51] matanya, hashar: hi, btw [14:35:53] matanya: I think we determined that it's not needed, that $::variables still work in puppet 3 [14:36:23] andrewbogott: puppet3 whine about it: [14:36:23] Dynamic lookup of $openstack_version at /etc/puppet/manifests/role/nova.pp:229 is deprecated. Support will be removed in Puppet 2.8. Use a fully-qualified variable name (e.g., $classname::variable) or parameterized classes. [14:36:24] (03PS1) 10Hashar: contint: use role::labs::lvm::mnt on eqiad slaves [operations/puppet] - 10https://gerrit.wikimedia.org/r/119483 [14:36:46] (03CR) 10Hashar: "Making use of this new role on the eqiad contint slaves https://gerrit.wikimedia.org/r/119483 :-] Thank you!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119398 (owner: 10Andrew Bogott) [14:36:51] where does that var come from? [14:37:11] oh... [14:37:13] from site.pp I think [14:37:51] andrewbogott: mind glance at https://etherpad.wikimedia.org/p/Puppet3 [14:38:00] and see the OSM related stuff? [14:38:08] mutante: for grrrit-wm there is a Gerrit project holding the code and the configuration : ssh://gerrit.wikimedia.org:29418/labs/tools/grrrit.git [14:38:21] hashar: thanks! [14:38:31] mutante: the conf is done in the json file project.config . Then add Azathot and YuviPanda as reviewers [14:38:49] cool! [14:39:16] any salt master execpt ryan lane around? 
[14:39:59] i believe palladium is the salt master [14:40:18] :) [14:40:52] (03CR) 10Alexandros Kosiaris: [C: 032] redis: hostname is a fact [operations/puppet] - 10https://gerrit.wikimedia.org/r/119259 (owner: 10Matanya) [14:42:57] (03CR) 10Andrew Bogott: [V: 04-1] "This doesn't work -- there's no such setting available in metadata" [operations/puppet] - 10https://gerrit.wikimedia.org/r/117823 (owner: 10Andrew Bogott) [14:44:14] matanya: on wikitech-l, subject "wmf getting ready for puppet3, advice please" -- that was regarding that patch I think [14:44:39] i remember [14:48:39] akosiaris: would you get the other syslog-ng patches in ? [14:49:20] hashar: yeah, checking the first dependent one now [14:49:48] so andrewbogott this openstack part is somewhat a blocker to me [14:49:59] if you can find a nice way to resolve it [14:50:51] matanya: the $cluster thing is in site.pp in several places (re; salt role) [14:50:54] site.pp: $cluster = 'virt' [14:51:08] matanya, which parts in particular? I'm not used to reading that output, kind of lost [14:51:30] if anyone wants to use the SWAT window today I advise adding your patch to wikitech/wiki/Deployments very soon [14:52:03] andrewbogott: see line 99 and below [14:52:29] "Dynamic lookup of $openstack_version" <- that? [14:52:35] Sorry, I thought you were talking about something new. [14:52:55] yes, all that part [14:52:55] I thought that that email thread determined that we could set $openstack_version in site.pp and that was…ok? What am I missing? [14:53:16] it will *work* [14:53:29] but will be hard to qualify [14:53:31] question about SWAT, since it says SWAT is not responsible for code reviews, does that mean only stuff is allowed that already had code review [14:54:42] I don't know what 'hard to qualify' means [14:55:12] I'm confused because I was worried about this problem at the time -- most of labs relies on being able to set vars in the node definition in order to configure roles. 
[14:55:32] And then for a minute I thought that that was going to break in 3, but then there was that email thread that left me with the impression that, no, it's still fine in 3. [14:55:35] puppet3 requires all var to be qualified: mean full ref to the var [14:55:38] So, which of the above is wrong? [14:55:54] that it's still fine , afaik [14:56:15] though puppet3 does complian [14:56:35] if we want to switch to puppet3, it needs to change [14:56:39] ack, matanya? [14:56:47] yes [14:56:48] mutante, which is it? Fine, or needs to change? [14:57:04] I feel like everytime this issue is raised everyone agrees that "yes" and "no" [14:57:19] you asked which of he above is wrong, and i answered that:) [14:57:20] * matanya points @ akosiaris  [14:57:28] ... [14:57:31] by saying it's wrong that it's still fine [14:57:45] andrewbogott: it won't work in puppet3 [14:57:55] here, i said it out loud [14:57:57] So the conclusion that email thread reached was wrong. But everyone just let that stand? [14:58:00] andrewbogott: it needs to change [14:58:21] that's what i keep being told per that etherpad matanya linked to [14:58:25] right? [14:58:32] yes [14:58:54] it should be quite simple to fix though [14:59:01] just call the full var path [14:59:10] not simple enough to get them merged,heh [14:59:36] and if it is in site.pp it "will be hard to qualify" [14:59:46] "In puppet3 variables assigned in the node are still global." [14:59:55] that is correct [15:00:14] RECOVERY - Host mw1163 is UP: PING OK - Packet loss = 0%, RTA = 0.97 ms [15:00:43] it is not failing on the scope issue [15:01:00] it is failing on not being fully qualified [15:01:19] so andrewbogott all i asked is if you can qualify it [15:01:25] please :) [15:01:30] So fully qualified is "$::foo" ? [15:01:39] for instance [15:01:40] akosiaris: we should get your catalog compilation script integrated in Jenkins. One can then run the job manually and specify the change/patchset they want to be tested. 
:D [15:02:06] it might be openstack::foo as well [15:02:18] if that is what needs to be [15:02:28] can you explain when it would be one vs. the other? [15:02:41] andrewbogott: matanya: in puppet3 unqualified variables are looked in 3 places: 1 local, 2 inherited (avoid that please), node level [15:03:00] hashar: that is the plan :-). Gonna take some time though [15:03:04] PROBLEM - Apache HTTP on mw1163 is CRITICAL: Connection refused [15:03:16] andrewbogott: matanya: so node level variables do not need to be qualified [15:03:44] so way puppet complians then? [15:03:49] akosiaris: Except that lint hates that, it seems :) [15:04:07] puppet-lint parser is not the best parser around and i am being polite [15:04:13] matanya: if this is just a question of s/foo/::foo/ then… that's totally fine :) [15:04:25] I'm happy to change that to make lint happier, I prefer it anyway. [15:04:29] that is exactly what it is [15:04:47] matanya: because the moment it does not find it in local scope, it complains [15:04:59] that is good then [15:05:16] matanya: I can't do this right this second, but I encourage you to either make a bug for me or submit a patch. That code is quite a bit more stable now than it was a few weeks ago. [15:05:21] i can just qualify all by prefix :: :P [15:06:32] not being able to rely on what puppet-lint says makes the whole linting business.. i don't know what, but [15:06:36] andrewbogott: i can submit a patch, though i will tell you in advance i'm not 100% knowledgeable about openstack, so i might submit a patch witch is just stack [15:07:34] matanya: Unfortunately I just broke the instance which ordinarily would be used to test this :( But, a simple search/replace patch should be pretty safe. [15:08:03] i will quote you on that [15:08:22] a simple search/replace patch should be pretty safe <-- this should go into bugzilla quips [15:08:30] well, not so safe that I don't want to test it :) [15:08:33] hey, i dont know this better, but .. ah.. nevermind.. 
[15:09:05] * matanya dives back into the code [15:09:08] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.124 second response time [15:09:15] andrewbogott: do we got trusty labs images ? [15:09:24] akosiaris: not yet [15:10:01] will we ? and by when ? [15:10:20] I mean, do we wait trusty release ? [15:10:43] for our puppet code to become trusty compatible ? something else entirely ? [15:10:48] that is in a month or so, isn't it? [15:10:49] akosiaris: we will as soon as a get a free minute [15:11:02] aaa so something else entirely. [15:11:07] happy to hear that :-) [15:11:11] um… as soon as *I get a free minute [15:11:22] that is never ? [15:11:39] possibly :( [15:12:19] i guess if we still have hardy boxes, trusty can wait [15:12:56] ahahah. yeah right [15:13:19] switches matanya to Debian [15:13:24] we try to have trusty before the release date in alpha/beta [15:13:31] I doubt it can wait [15:13:52] mutante: i have a debian server at home, if it counts [15:14:02] matanya: :) [15:14:19] (03CR) 10Alexandros Kosiaris: [C: 032] beta: fill in swift backend for upload varnish [operations/puppet] - 10https://gerrit.wikimedia.org/r/119478 (owner: 10Hashar) [15:18:30] (03PS1) 10Matanya: openstack: qualify var [operations/puppet] - 10https://gerrit.wikimedia.org/r/119488 [15:30:24] (03CR) 10Hashar: "I have filled https://bugzilla.wikimedia.org/show_bug.cgi?id=62836 to setup twemproxy on the beta cluster." [operations/puppet] - 10https://gerrit.wikimedia.org/r/119455 (owner: 10Ori.livneh) [15:36:12] mutante: Re SWAT: Yes. Generally it'll be deploying backports of stuff already merged to master. [15:37:24] anomie: thanks! [15:40:51] I like SWAT as an abbreviation [15:41:04] next: DEVGRU = DEVeloper GRUmble [15:41:22] what SWAT stands for? [15:41:31] small window at time? 
[15:42:19] Setting Wikis Ablaze Team [15:42:21] when i asked Reedy said "small wiki admins team" so now i always think that,hehe [15:42:38] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [15:44:26] (03PS1) 10Hashar: beta: vary logstash instance per $::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/119493 [15:51:55] (03PS2) 10Hashar: beta: vary logstash instance per $::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/119493 [15:52:46] " this is less automatic but much more effective at breaking things. " haha [15:52:49] (03CR) 10Hashar: "I was using the wrong hostname for the eqiad instance (missed deployment- prefix)." [operations/puppet] - 10https://gerrit.wikimedia.org/r/119493 (owner: 10Hashar) [15:57:29] (03CR) 10Dzahn: [C: 032] "looks reasonable, SWAT deploy?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119493 (owner: 10Hashar) [15:57:52] mutante thanks :] [15:58:18] (03CR) 10Dzahn: "no, really, it's ok, it should just be removed once beta moved to eqiad as hashar said:)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119493 (owner: 10Hashar) [15:59:42] hashar: there's always some karma points left somewhere on the stash [16:00:42] apparently [16:01:27] mutante: bd808 has setup a puppetmaster for beta cluster so we would be able to cherry pick our patches there [16:01:47] (03CR) 10Mark Bergsma: [C: 032] Configuration for beta cluster caches in eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/115629 (owner: 10Hashar) [16:02:13] hashar: is that good or bad for ops?:) [16:03:49] The point of the local puppetmaster in beta is to be able to test things! Write patch -> submit to gerrit -> cherry pick to beta -> test -> approve in gerrit -> profit! [16:04:03] wishful! [16:04:34] It's working for me so far ;) But yes long cultural change cycle [16:04:44] bd808: that's puppetmaster::self ? 
[16:04:49] can i call a var in erb using it's parent class? e.g. @role::fundraising::civicrm::exim_bounce_collector [16:05:33] akosiaris: I just asked Ryan to use some of his contractor hours to make a trusty image. Hopefully that'll speed things up a bit. [16:05:54] mutante: role::puppet::self + $::puppetmaster; see https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=configure&project=deployment-prep&instanceid=abb50762-93d0-4c9d-8853-adbbb6b56e00&region=eqiad [16:05:54] andrewbogott: cool. thanks :-) [16:07:01] Ryan setup a local salt master for us on that node too. I just got trebuchet working in beta last night [16:07:36] beta in eqiad is going to be the coolest. All the hip kids will want to use it to test things before going to prod [16:07:52] (03PS1) 10Hashar: beta: vary fluoride instance per $::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/119495 [16:08:33] mutante: if you still have some karma for me https://gerrit.wikimedia.org/r/119495 that vary deployment-fluoride per $::site much like the logstash instance [16:11:07] akosiaris: i guess you can answer my question [16:12:29] matanya: if the erb is referenced by a class that inherits another class, a variable in the partent class it does not really need a special lookup. $ should be enough. Otherwise scope.lookupvar [16:13:08] i hoped you would say something else [16:13:22] something else [16:13:25] there you go :-) [16:13:29] i;m going to uglify our erb's quite a lot now [16:13:40] :P thank you, that cheers me up :) [16:13:46] oh and I meant @ and not $ [16:13:50] sure [16:13:54] i understood that [16:15:14] or i can force ottomata to fix all the multiple/cross calling in the udp2log/filters/etc [16:16:33] PSSHH THOSE WILL GO AWAY SOMEDAY (SOON?) [16:17:01] i would love to see that [16:17:05] and get rid of Augeas!
[16:17:18] one by one friends [16:17:35] he's already doing that, hashar:) [16:17:42] awesome [16:17:45] mutante: if you still have some karma for me https://gerrit.wikimedia.org/r/119495 that vary deployment-fluoride per $::site much like the logstash instance [16:17:46] :-] [16:18:29] (03CR) 10Dzahn: [C: 032] beta: vary fluoride instance per $::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/119495 (owner: 10Hashar) [16:19:26] bd808: thanks, i gotta look later (gotta get phone for 2 factor auth, heh):) [16:19:50] hashar: yea, same thing, np [16:21:00] thanks [16:21:31] and one day we will have to rename the production branch 'master' [16:21:36] fatal: Needed a single revision [16:21:36] invalid upstream origin/master [16:21:36] :D [16:21:57] hashar: i asked, answer was "it's really complicated" [16:22:09] (03PS1) 10Matanya: exim: fix scoping [operations/puppet] - 10https://gerrit.wikimedia.org/r/119496 [16:22:10] hashar: branches are like wikis when it comes to renaming i think:) [16:22:22] and that too (renaming wikis) :D [16:22:27] hehe, yes [16:22:51] akosiaris: i hope this is what you meant ^^ [16:23:27] `git symbolic-ref refs/heads/master refs/heads/production` [16:24:26] hashar: ^ [16:24:54] * bd808 hasn't actually tried that but it should work in theory [16:25:15] hopefully :D [16:25:27] then git-review will push to refs/for/master :-] [16:30:18] enough craziness for today I am heading out [16:32:33] (03PS1) 10Andrew Bogott: Increase dnsmasq cache size again. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/119498 [16:32:38] mutante: ^ [16:34:42] (03CR) 10Dzahn: [C: 031] "looks good, we had some intermittent DNS issues in labs and this made it better, now there might be a few left but less than before" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119498 (owner: 10Andrew Bogott) [16:36:27] (03CR) 10MarkTraceur: [C: 031] Enable Popups (Hovercards) on mediawiki.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119445 (owner: 10Spage) [16:39:09] (03CR) 10Andrew Bogott: [C: 032] Increase dnsmasq cache size again. [operations/puppet] - 10https://gerrit.wikimedia.org/r/119498 (owner: 10Andrew Bogott) [16:53:47] (03CR) 10Ori.livneh: [C: 032] applicationserver: don't set beta's twemproxy to false [operations/puppet] - 10https://gerrit.wikimedia.org/r/119455 (owner: 10Ori.livneh) [16:54:41] (03CR) 10Aaron Schulz: "Can we also set the compression to zlib (see http://www.php.net/manual/en/memcached.configuration.php). That way the bloat will be less th" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119461 (owner: 10Ori.livneh) [16:54:58] greg-g, skiping [16:55:53] (03CR) 10Aaron Schulz: [C: 031] Set serializer to 'php' for production memcache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119461 (owner: 10Ori.livneh) [17:07:24] yurik: kk [17:10:05] (03CR) 10Alexandros Kosiaris: [C: 032] osm module [operations/puppet] - 10https://gerrit.wikimedia.org/r/119408 (owner: 10Alexandros Kosiaris) [17:11:07] (03PS4) 10BryanDavis: Make logstash and kibana roles work in labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/119099 [17:12:54] (03CR) 10BryanDavis: [C: 031] "This patch has been cherry-picked into the puppet repository on deployment-salt.eqiad.wmfnet and used to configure the logstash instance o" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119099 (owner: 10BryanDavis) [17:13:08] geez, 32G /home/wikipedia/syslog/swift [17:13:11] on nfs1 
[17:13:23] lots of room on the volume though [17:14:38] (03CR) 10Alexandros Kosiaris: [C: 032] Enable planet.osm population on labsdb1004 [operations/puppet] - 10https://gerrit.wikimedia.org/r/119419 (owner: 10Alexandros Kosiaris) [17:18:56] Does anyone know what I need to do to get `puppet:///volatile/GeoIP` populated on my self-hosted puppetmaster in labs? [17:19:17] This is for the deployment-prep (aka beta) project [17:24:28] bd808: want to CR https://gerrit.wikimedia.org/r/#/c/119241/1 ? [17:24:50] doesn't effect us now [17:25:51] AaronSchulz: Does that go along with one of the other reviews you threw at me yesterday? [17:26:14] * bd808 vaguely remembers 'latest' [17:26:31] I guess they are topically similar [17:28:00] (03PS1) 10Cmjohnson: Adding dhcp entries for ms-be1013-1015 [operations/puppet] - 10https://gerrit.wikimedia.org/r/119510 [17:30:50] (03CR) 10Dzahn: [C: 032] "nmap -sP 10.0.8.0/24 -> 256 IP addresses (0 hosts up)" [operations/dns] - 10https://gerrit.wikimedia.org/r/117210 (owner: 10Reedy) [17:32:11] !log DNS update - removing Tampa appservers [17:32:16] Logged the message, Master [17:32:19] (03PS1) 10Cmjohnson: Adding dns entries for ms-be1013-1015 [operations/dns] - 10https://gerrit.wikimedia.org/r/119511 [17:33:33] (03CR) 10Cmjohnson: [C: 032] Adding dhcp entries for ms-be1013-1015 [operations/puppet] - 10https://gerrit.wikimedia.org/r/119510 (owner: 10Cmjohnson) [17:33:58] (03CR) 10BryanDavis: [C: 031] "This should be merged during the 1.23wmf19 train deploy on 2014-03-20. 
Waiting until then to +2, but if someone wants to grab it in a SWAT" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/118778 (owner: 10Chad) [17:34:19] wheeeeeee [17:34:24] (03CR) 10Cmjohnson: [C: 032] Adding dns entries for ms-be1013-1015 [operations/dns] - 10https://gerrit.wikimedia.org/r/119511 (owner: 10Cmjohnson) [17:34:29] Reedy: :) [17:35:05] * bd808 tries to imagine Reedy yelling "wheeeeee" IRL [17:35:14] Rollercoaster [17:35:36] https://github.com/facebook/hhvm/blob/master/hphp/runtime/ext/fileinfo/libmagic/compat.h [17:35:38] #define emalloc HPHP::smart_malloc [17:35:40] (03CR) 10Dzahn: [C: 031] "i'm gonna leave this one for Mark or Robh, since it's esams and they were gone before i was here.. i think" [operations/dns] - 10https://gerrit.wikimedia.org/r/118480 (owner: 10Matanya) [17:35:42] * AaronSchulz chuckles [17:36:12] Reedy: do you play chess? [17:36:38] Not anytime recently.. [17:36:45] https://xkcd.com/chesscoaster/ [17:37:29] At least one of those is draughts ;) [17:39:05] Reedy: so yea, i actually did an nmap ping scan on those 3 networks, no replies.. we can start finding moar that uses power [17:39:17] looks at ganglia again [17:39:33] (03Abandoned) 10Reedy: Fix l10nupdate-1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118946 (owner: 10Reedy) [17:40:05] mutante: https://gerrit.wikimedia.org/r/118641 and https://gerrit.wikimedia.org/r/118643 [17:40:40] 8 more hosts [17:41:24] ah true, those seemingly random squids, thanks! [17:41:27] heh [17:41:28] for reminder [17:41:38] looks like labs and swift have the most hosts still [17:42:02] yep, those i ignored so far [17:42:24] 7 dbhosts, 4 snapshot, 4 ES [17:42:41] 2 labstore, 9 virt hosts [17:43:03] did we ask springle about the 7 db's yet on some ticket [17:43:12] looks [17:43:39] getting there [17:44:14] yes:) [17:44:22] are the lvs hosts in use still? [17:44:25] if there's no app servers... 
[17:44:53] agreed..likely not:) [17:45:48] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1712: active_shards: 5059: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0 [17:46:46] (03CR) 10RobH: [C: 031] "only one I'm not sure about is clematis, otherwise rest looks good." (032 comments) [operations/dns] - 10https://gerrit.wikimedia.org/r/118480 (owner: 10Matanya) [17:46:48] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1713: active_shards: 5062: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [17:47:17] PDF hosts are still busy :( [17:48:55] Reedy: hah, and i don't know how much longer:) [17:49:27] * Reedy greps for lvs [17:52:56] 216.152.80.208.in-addr.arpa name = osm-lb.pmtpa.wikimedia.org. [17:54:19] :p [17:55:54] <^d> manybubbles: elastic1003 ok? ^ [17:55:58] <^d> Or were you doing something? [17:56:11] ^d: reindexing all of group1. I check it [17:56:16] If it recovered..... [17:56:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [17:56:32] yeah, it was just complaining about the nodes being rebuilt [17:56:34] !log mw1177 powering down to reseat DIMM [17:56:39] Logged the message, Master [17:56:40] we need to write our own nagios plugin.... [17:56:51] <^d> Gotcha, mmk. 
[17:58:05] (03PS1) 10Reedy: Decomission lvs[1-6] [operations/puppet] - 10https://gerrit.wikimedia.org/r/119515 [17:58:14] I think that might be a bit greedy [17:58:48] PROBLEM - Host mw1177 is DOWN: PING CRITICAL - Packet loss = 100% [17:59:39] (03PS2) 10Reedy: Decomission lvs[1-6] [operations/puppet] - 10https://gerrit.wikimedia.org/r/119515 [18:01:30] I don't think lvs5 and lvs6 were even fully commissioned.. :/ [18:02:11] (03PS3) 10Reedy: Decomission lvs[1-6] [operations/puppet] - 10https://gerrit.wikimedia.org/r/119515 [18:03:46] files/dsh/group/check_nodegroups.sh looks deleteable [18:05:01] (03PS4) 10Reedy: Decomission lvs[1-6] [operations/puppet] - 10https://gerrit.wikimedia.org/r/119515 [18:08:08] RECOVERY - Host mw1177 is UP: PING OK - Packet loss = 0%, RTA = 1.17 ms [18:10:46] (03PS5) 10BryanDavis: Make trebuchet work in eqiad.wmflabs [operations/puppet] - 10https://gerrit.wikimedia.org/r/119221 [18:13:04] (03CR) 10BryanDavis: [C: 031] "A cherry-pick of this patch is being used in deployment-prep labs project. I worked with Ryan over irc to track down the tricky parts of f" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119221 (owner: 10BryanDavis) [18:18:23] (03PS2) 10Tim Landscheidt: Labs: Remove Bots project motd [operations/puppet] - 10https://gerrit.wikimedia.org/r/102335 [18:20:05] (03CR) 10Dzahn: "just curious where the bots are running nowadays, are they going to be part of tool labs..or?" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/102335 (owner: 10Tim Landscheidt) [18:23:59] (03CR) 10Ryan Lane: [C: 031] Make trebuchet work in eqiad.wmflabs [operations/puppet] - 10https://gerrit.wikimedia.org/r/119221 (owner: 10BryanDavis) [18:24:25] !log powering down mw1183 to reseat dimm [18:24:30] Logged the message, Master [18:25:53] (03CR) 10Dzahn: [C: 032] Make trebuchet work in eqiad.wmflabs [operations/puppet] - 10https://gerrit.wikimedia.org/r/119221 (owner: 10BryanDavis) [18:26:38] PROBLEM - Host mw1183 is DOWN: PING CRITICAL - Packet loss = 100% [18:26:52] mutante: Thanks! [18:27:07] (03PS1) 10BryanDavis: Allow user to specify mount point for role::labs::lvm::mnt [operations/puppet] - 10https://gerrit.wikimedia.org/r/119524 [18:30:23] (03PS3) 10Tim Landscheidt: Tools: Unify Tools and Toolsbeta configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/102385 [18:31:10] (03CR) 10Alexandros Kosiaris: [C: 032] salt: qualify site and realm [operations/puppet] - 10https://gerrit.wikimedia.org/r/119403 (owner: 10Matanya) [18:32:06] (03PS2) 10Dzahn: Decommission ssl[1-4] [operations/puppet] - 10https://gerrit.wikimedia.org/r/118643 (owner: 10Reedy) [18:35:04] (03CR) 10Dzahn: [C: 032] Decommission ssl[1-4] [operations/puppet] - 10https://gerrit.wikimedia.org/r/118643 (owner: 10Reedy) [18:36:29] (03CR) 10Tim Landscheidt: "@Dzahn: They are already running on Tools :-)." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/102335 (owner: 10Tim Landscheidt) [18:38:08] RECOVERY - Host mw1183 is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [18:43:38] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [18:44:44] (03PS1) 10BBlack: Fix patch offset line number in fallocate() patch [operations/debs/varnish] (3.0.5-plus-wm) - 10https://gerrit.wikimedia.org/r/119530 [18:44:46] (03PS1) 10BBlack: varnish (3.0.5plus~wmftest-wm5) unstable; urgency=low [operations/debs/varnish] (3.0.5-plus-wm) - 10https://gerrit.wikimedia.org/r/119531 [18:45:18] (03CR) 10BBlack: [C: 032 V: 032] Fix patch offset line number in fallocate() patch [operations/debs/varnish] (3.0.5-plus-wm) - 10https://gerrit.wikimedia.org/r/119530 (owner: 10BBlack) [18:45:29] (03CR) 10BBlack: [C: 032 V: 032] varnish (3.0.5plus~wmftest-wm5) unstable; urgency=low [operations/debs/varnish] (3.0.5-plus-wm) - 10https://gerrit.wikimedia.org/r/119531 (owner: 10BBlack) [18:46:45] !log ssl1-4: puppet agent --disable, puppetstoredconfigclean, revoking puppet certs [18:46:50] Logged the message, Master [18:51:36] (03PS1) 10BryanDavis: Support eqiad labs secondary disk [operations/puppet] - 10https://gerrit.wikimedia.org/r/119534 [18:53:57] (03PS3) 10Tim Landscheidt: Fix indentation in and lint role::labs::instance [operations/puppet] - 10https://gerrit.wikimedia.org/r/114734 [18:54:18] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [18:56:56] (03PS2) 10BryanDavis: Support eqiad labs secondary disk [operations/puppet] - 10https://gerrit.wikimedia.org/r/119534 [18:58:01] (03PS16) 10Dzahn: lint admins.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/118794 [18:59:19] !log all varnishes upgraded to 3.0.5plus-wmftest-wm5 (without restart, just adds header vmod) [18:59:25] Logged the message, Master [19:01:21] oh nice [19:01:23] still wmftest? 
[19:03:19] honestly, I don't completely understand debian version ~things anyways. if we're past the "just testing" stage, should I move to 3.0.5plus-wm6 or something? [19:03:32] or do we always ned the ~ for local? [19:03:36] *need [19:04:23] (the above log message should have read that version as "3.0.5plus~wmftest-wm5") [19:06:57] 3.0.5plus-wm6 is fine [19:07:06] ~ means "less than" basically [19:07:14] it's for 1.0.0~rc1-1 [19:07:21] so that 1.0.0-1 > 1.0.0~rc1-1 [19:08:02] ah that makes sense, and explains some behaviors I've seen before :) [19:14:07] (03CR) 10Alexandros Kosiaris: [C: 032] Convert syslog-ng conf to templates [operations/puppet] - 10https://gerrit.wikimedia.org/r/119255 (owner: 10Hashar) [19:15:25] !log shutting down ssl1-4 [19:15:30] Logged the message, Master [19:21:37] !log sq67-70: disable puppet,revoke puppet certs,delete salt keys and stored configs (delete from icinga) [19:21:42] Logged the message, Master [19:30:28] (03PS2) 10Dzahn: Decomission sq67, sq68, sq69, sq70 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118641 (owner: 10Reedy) [19:30:42] bye bye sq [19:31:52] death to the squid servers! [19:32:02] and their no longer accurate service names@! [19:32:17] service (based) hostnames that is. [19:32:34] those are actually running varnish [19:32:37] to make it worse [19:32:39] will these boxes be ever turned into varnishes? [19:32:56] *in eqiad [19:33:23] greg-g: i need to push firefox app bits out, when would be a good time? [19:33:41] yurikR: what does that mean? [19:33:41] should be a very quick depl :) [19:33:50] firefox app is hosted on our bits [19:33:54] oh, that [19:34:12] shouldn't affect anyone i would guess [19:34:16] (03CR) 10Dzahn: [C: 032] Decomission sq67, sq68, sq69, sq70 [operations/puppet] - 10https://gerrit.wikimedia.org/r/118641 (owner: 10Reedy) [19:34:18] some static resources [19:34:19] yurikR: now? 
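Note on the `~` ordering explained at 19:07 above: the rule that `1.0.0-1 > 1.0.0~rc1-1` can be sketched in a few lines of Python. This is only an illustrative sketch of the Debian comparison rule as I understand it, not dpkg's code: the function names (`compare_component`, `compare_versions`) are made up for this note, epochs are ignored, and ASCII version strings are assumed. On a Debian box the authoritative check is `dpkg --compare-versions`.

```python
def _order(c):
    """Sort weight of one character inside a non-digit run."""
    if c == "" or c.isdigit():
        return 0             # end of string (or start of a digit run)
    if c == "~":
        return -1            # tilde sorts before everything, even the end
    if c.isalpha():
        return ord(c)        # letters sort before other punctuation
    return ord(c) + 256


def _at(s, i):
    """Character at index i, or '' once we run off the end."""
    return s[i] if i < len(s) else ""


def compare_component(a, b):
    """Compare two upstream versions (or two revisions): -1, 0 or 1."""
    i = j = 0
    while i < len(a) or j < len(b):
        # 1. Compare the non-digit runs character by character.
        while (_at(a, i) and not _at(a, i).isdigit()) or \
              (_at(b, j) and not _at(b, j).isdigit()):
            ac, bc = _order(_at(a, i)), _order(_at(b, j))
            if ac != bc:
                return -1 if ac < bc else 1
            i += 1
            j += 1
        # 2. Compare the digit runs numerically (skip leading zeros first).
        while _at(a, i) == "0":
            i += 1
        while _at(b, j) == "0":
            j += 1
        first_diff = 0
        while _at(a, i).isdigit() and _at(b, j).isdigit():
            if not first_diff:
                first_diff = ord(a[i]) - ord(b[j])
            i += 1
            j += 1
        if _at(a, i).isdigit():
            return 1         # a has more digits, so it is the larger number
        if _at(b, j).isdigit():
            return -1
        if first_diff:
            return -1 if first_diff < 0 else 1
    return 0


def _split(version):
    """Split 'upstream-revision' at the last hyphen (no epoch handling)."""
    upstream, sep, revision = version.rpartition("-")
    return (upstream, revision) if sep else (version, "")


def compare_versions(a, b):
    """Compare two full version strings: upstream first, then revision."""
    (ua, ra), (ub, rb) = _split(a), _split(b)
    return compare_component(ua, ub) or compare_component(ra, rb)


if __name__ == "__main__":
    # The two pairs mentioned in the channel.
    print(compare_versions("1.0.0~rc1-1", "1.0.0-1"))
    print(compare_versions("3.0.5plus~wmftest-wm5", "3.0.5plus-wm6"))
```

A negative result means the first version is older, so a package versioned `3.0.5plus-wm6` would be treated as an upgrade over `3.0.5plus~wmftest-wm5`, which matches the advice given in the channel.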
[19:34:21] yep [19:34:26] greg-g: sure [19:34:32] here i go [19:34:40] dr0ptp4kt_: ^ [19:35:04] now i wanna now more about the firefox app [19:35:59] I think it's FirefoxOS related [19:36:46] https://github.com/wikimedia/apps-firefox-wikipedia [19:37:02] https://marketplace.firefox.com/app/wikipedia [19:38:25] thanks [19:40:08] i can click a button labeled "free" and get an install dialog in Iceweasel [19:40:15] Repeat request: Does anyone know what I need to do to get `puppet:///volatile/GeoIP` populated on the self-hosted puppetmaster for the deployment-prep labs project? [19:42:22] Asked paravoid? [19:42:45] I haven't yet. I've just been shouting down this well :) [19:43:34] hmm [19:43:44] do you need volatile/GeoIP? [19:43:52] can you just use GeoIP from maxmind? [19:44:37] ottomata: two questions: [19:44:50] 1) erbium for ever? [19:44:53] ottomata: I haven't tracked down the source, but several nodes want to apply the Geoip::Data::Puppet role. They are varnish boxes I think. [19:45:25] 2) need to qualify fundraising_log_directory any better way the scope_lookup? [19:46:14] ok, bd808, i'm looking [19:46:16] bd808: iirc it has something to do with the cache role [19:46:34] looks like probably the cache role shouldn't use geoip::data::puppet if in labs [19:46:38] OR [19:46:58] self_hosted puppet should do what puppetmaster::geoip does with compatibility symlinks [19:47:11] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [19:47:44] * bd808 will look at puppetmaster::geoip [19:48:43] ok bd808 yeah [19:48:54] probably, the proper thing to do would be to make cache.pp do the right thign in labs [19:49:02] there are several 'require geoip's in that file [19:49:15] which by default use geoip::data::puppet (:/) [19:49:33] ree [19:49:39] some case realm logic will most likely solve this [19:49:44] yeah, ergh [19:49:51] this is why I liked the geoip class parameterized! 
[19:49:56] faidon didn't like that so he changed it [19:50:04] now the main init class just does a default, you can choose [19:50:13] ergh, which means that anything that references it will have to be conditional too [19:50:31] yes [19:50:45] ok, in this case it isn't so bad though [19:50:51] i don't see anytihng referencing the geoip class [19:50:52] so its ok [19:50:53] so [19:51:00] bd808: i think you can probalbly do [19:51:25] yurikR, dialing you [19:51:31] if ($::realm == 'labs') { require geoip::data::package, geoip::bin } [19:51:31] else { require geoip } [19:51:33] or something like that [19:51:40] matanya: re erbium [19:51:41] wha? [19:51:50] Couldn't we just change the `if` in puppetmaster::geoip to `if $::realm == 'labs'`? [19:52:02] we spoke about killing it, didn't move forward [19:52:29] HMMMMMMMM, maybe bd808, but that sounds like it could be messy [19:53:26] you would affect a change on all existing self hosted puppetmasters out there [19:53:26] but, if it didin't break anything, that might be a good idea [19:53:37] bd808: doesn't if $is_labs_puppet_master solve your problem ? [19:54:06] matanya: Actually it probably does if I add a var for that in my project [19:54:19] * bd808 tries [19:54:56] OH [19:54:56] bd808: you'd need both [19:55:36] if $::realm == 'labs' or $is_labs_puppet_master [19:55:36] because for the default labs puppetmasters, $realm != 'labs' [19:55:36] they are prod boxes that run puppetmsater that labs VMs talk to [19:55:50] killing erbium? [19:55:50] erbium is in eqiad, right matanya? [19:56:25] ottomata: erbium is in eqiad. [19:56:33] all the elements are =] [19:56:40] just configure the node as node 'foo' { [19:56:40] $is_labs_puppet_master = true [19:56:40] } [19:56:54] bd808: so you managed to get /srv/vdb mounted :-] [19:57:22] hashar: Yeah. 
I'm got a cherry-picked patch doing that [19:58:17] (03PS1) 10Yurik: updated firefox os app to master [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119550 [19:58:30] bd808: so you push to gerrit then cherry pick locally ? [19:59:06] hashar: Yes. Then I can amend in gerrit as needed and re-pick until I get something that works. [19:59:16] Then ask for review! [20:00:00] (03CR) 10Yurik: [C: 032 V: 032] updated firefox os app to master [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119550 (owner: 10Yurik) [20:00:14] !log disabled puppet on labsdb1004 until osm2pgsql completes [20:00:20] Logged the message, Master [20:03:25] !log shut down sq67-70 [20:03:31] Logged the message, Master [20:03:47] Yuck. This rabbit hole goes too deep. puppet::self::master is in no way related to ::puppetmaster so setting anything there doesn't help at all. [20:05:04] But I can add something to puppet::self::master that does what puppetmaster::geoip does when $is_labs_puppet_master is true. [20:05:56] !log yurik synchronized docroot/bits/WikipediaMobileFirefoxOS/ [20:06:01] Logged the message, Master [20:06:57] or something similar, if understand your intention [20:06:59] you mean emery? [20:07:30] bd808: puppet::self [20:07:38] oh, ignore me [20:07:57] :) [20:08:26] (03CR) 10Hashar: "We ends up with a mount like:" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/119534 (owner: 10BryanDavis) [20:10:51] (03CR) 10Hashar: Support eqiad labs secondary disk (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/119534 (owner: 10BryanDavis) [20:11:16] (03PS3) 10Hashar: Support eqiad labs secondary disk [operations/puppet] - 10https://gerrit.wikimedia.org/r/119534 (owner: 10BryanDavis) [20:13:22] bd808: bah I have dropped all the patches from the beta cluster salt instance :( [20:13:42] oops. [20:13:54] We can put them back. reflog! 
[20:14:48] yeah did so [20:14:56] well reset + cherry picked the old ones [20:15:06] we need a script to track that [20:15:33] Yes. And if we find something cool it will help with the prod security patch issues [20:16:23] hashar: Maybe http://linux.die.net/man/7/guilt ? [20:16:26] I got a proposal for the prod sec patch [20:16:54] yeah guilt sounds about right [20:18:24] ah lvm [20:18:32] I havent played with it for like 5 years [20:21:10] \O/ [20:22:20] (03PS1) 10BryanDavis: Make a self-hosted puppetmaster workalike for puppet::geoip [operations/puppet] - 10https://gerrit.wikimedia.org/r/119555 [20:22:31] (03PS4) 10Hashar: Support eqiad labs secondary disk [operations/puppet] - 10https://gerrit.wikimedia.org/r/119534 (owner: 10BryanDavis) [20:24:05] greg-g: deployed [20:24:10] (03CR) 10Dzahn: [C: 032] Labs: Remove Bots project motd [operations/puppet] - 10https://gerrit.wikimedia.org/r/102335 (owner: 10Tim Landscheidt) [20:25:13] (03CR) 10Dzahn: "thanks" [operations/puppet] - 10https://gerrit.wikimedia.org/r/102335 (owner: 10Tim Landscheidt) [20:34:33] akosiaris: ping [20:35:16] (or any root) [20:35:52] we just tried to restart parsoid with the new sudo rights added in https://gerrit.wikimedia.org/r/#/c/119293/1/manifests/admins.pp, but didn't have much success: https://gist.github.com/subbuss/970251aeb0dc37b1efed [20:36:07] so we'll need a root to help us [20:36:51] either by helping us make the dsh restart work or (for now) by running dsh -g parsoid service parsoid restart as root [20:37:34] gwicke: need parsoid restart? 
[20:37:51] robh: yup; or a fix to the sudo setup [20:38:10] well, lemme run the restart for you rather than block that [20:38:20] cuz theres a chance i have no clue how to fix the sudo [20:38:20] in theory we should be able to run this now ourselves, but in practice we are getting upstart errors like stop: Rejected send message, 1 matched rules; type="method_call", sender=":1.10" (uid=648 pid=25696 comm="stop parsoid ") interface="com.ubuntu.Upstart0_6.Job" member="Stop" error name="(unset)" requested_reply="0" destination="com.ubuntu.Upstart" (uid=0 pid=1 comm="/sbin/init") [20:38:57] ok [20:39:06] can figure that out later [20:39:22] (03PS2) 10BryanDavis: Make a self-hosted puppetmaster workalike for puppet::geoip [operations/puppet] - 10https://gerrit.wikimedia.org/r/119555 [20:39:26] restarting parsoid service now [20:39:50] !log parsoid service successfully restarted across hosts via dsh [20:39:54] gwicke: ^ [20:39:56] Logged the message, Master [20:40:24] subbu, you can do a "!log deployed Parsoid ff8c49" now [20:40:29] robh, thanks! [20:40:35] !log Deployed parsoid/deploy 20923176 [20:40:38] welcome [20:40:40] Logged the message, Master [20:40:41] thanks [20:41:19] gwicke, ah, i used the deploy sha as you were typing that. do we want the parsoid sha there? [20:43:21] subbu, you can add it in https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:38] I often said something like "with deploy i cannot edit that page directly .. let me add a second log message. [20:45:46] !log deployed Parsoid ff8c49e9 with deploy 20923176 (a better log message then the previous one logged at 20:40) [20:45:48] did you try logging in? [20:45:51] Logged the message, Master [20:47:16] gwicke, either i dont have an account or i dont remember my password. i'll recover my password or create an account and edit that page in a bit. [20:47:38] subbu, try your labs user/pass [20:47:54] yup, that worked. 
[20:49:04] ahh, so many sudo alerts [20:49:09] subbu: ;p [20:49:22] (all the ops get emails for the failures ;) [20:49:58] subbu, robh sorry. I didn't remember the uid/pass for that wiki. [20:50:01] heh, I sense an evil way to get somebody to fix the sudo setup soon ;) [20:50:08] hahaha [20:50:27] gwicke: gaming the system is wrong, but more than likely effective ;p [20:50:27] subbu, the mails were caused by the failed dsh command [20:50:29] oh, sudo alerts from tin [20:50:31] got it. [20:50:51] what gwicke said yep [20:51:08] its funny since i intentionally dont filter those anypalce, heh [20:51:23] i think that's a feature:) [20:51:27] robh, could you check the sudoers file on one of the wtp boxes? [20:51:36] I wonder if the change is actually deployed [20:51:47] https://gerrit.wikimedia.org/r/#/c/119293/1/manifests/admins.pp [20:52:38] wtp1020 for example [20:53:17] (03PS1) 10Hoo man: toolabs: Add the sql command to the exec hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/119634 [20:53:27] Coren: ^ [20:53:56] ewk, typo... [20:54:37] (03PS2) 10Hoo man: ToolLabs: Add the sql command to the exec hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/119634 [20:55:10] gwicke: so those should append to /etc/sudoer.s/ [20:55:13] and in there i see wikidev [20:55:23] with those options listed inside the file [20:55:25] anyone? These were around on the tampa exec hosts but aren't on the new eqiad ones... [20:55:49] gwicke: looks live.... [20:57:11] robh, ok thanks [20:57:30] will reopen the ticket [20:57:34] cool [20:59:51] https://rt.wikimedia.org/Ticket/Display.html?id=6961 [21:10:00] greg-g: I'd like to enable Popups on testwiki as well as mw.org, I assume that's fine [21:10:14] spagewmf: yeppers [21:10:59] OK, starting extension Popups deploy [21:11:09] woot! [21:11:22] greg-g: I merged most of the code in popups, so I'll be around to provide any support if needed [21:11:30] extension:Popups, userfacing name:Hovercards, no one get confused please! 
;) [21:11:39] YuviPanda: thank you! [21:11:40] * YuviPanda makes greg-g call it 'PeekCards' [21:11:49] PeekUpCards [21:11:57] hahaha [21:13:18] <^d> Can we call them Popup Ads? [21:13:42] ^d: yes. and we shall make them undismissable [21:13:42] * greg-g kicks ^d in the shins [21:14:04] THE USER MUST BE INFORMED AT ALL COSTS! [21:14:12] <^d> YuviPanda: I don't care if you can't dismiss them, as long as I don't have to have them pop up to begin with! ;-) [21:14:26] ^d: heh :P [21:24:14] (03PS3) 10BryanDavis: Make a self-hosted puppetmaster workalike for puppet::geoip [operations/puppet] - 10https://gerrit.wikimedia.org/r/119555 [21:26:07] (03PS4) 10BryanDavis: Make a self-hosted puppetmaster workalike for puppet::geoip [operations/puppet] - 10https://gerrit.wikimedia.org/r/119555 [21:31:06] (03PS2) 10Spage: Enable Popups (Hovercards) on mediawiki.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119445 [21:34:03] (03CR) 10Ottomata: [C: 031] Make a self-hosted puppetmaster workalike for puppet::geoip [operations/puppet] - 10https://gerrit.wikimedia.org/r/119555 (owner: 10BryanDavis) [21:43:51] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 11 Mar 2014 08:47:37 PM UTC [21:44:08] ^d how hard is it to rename an extension in Gerrit? "Over my dead body!" or [21:46:20] Looks like yesterday we had quite a few database connection failures from resourceloader: https://logstash.wikimedia.org/#/dashboard/elasticsearch/rl-stuff [21:46:29] Were there known issues yesterday? [21:48:16] spagewmf: has it been deployed? (asking so I can figure out if i could sleep :P) [21:49:33] YuviPanda: no. Now I'll scap to put the code on servers, then I'll sync the config that enables it, then test on testwiki, then sync the config changes. I think [21:49:45] spagewmf: ah, ok :) I'll be around for another 30m I think [21:50:12] <^d> spagewmf: Somewhere in between? 
It needs *damn compelling* reason to do so [21:50:23] YuviPanda: msg me Prtxksxna's mobile phone # :) [21:50:33] spagewmf: moment :) [21:51:21] (03PS1) 10BBlack: Switch to vmod_header for the ZeroOpts=tls cookie [operations/puppet] - 10https://gerrit.wikimedia.org/r/119645 [21:51:21] ^d: thx, I'll live with it :) [21:52:31] actually, if I scap the code without Popups in any extension list, will the i18n message caches get the new extension's messages? [21:52:35] spagewmf, I sent greg and dan a summary of the situation. They might be able to help explain. [21:53:15] (03CR) 10BBlack: [C: 032 V: 032] "manually verified, slow bot :P" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119645 (owner: 10BBlack) [21:54:00] (03PS3) 10BBlack: Use libvmod-header to set GeoIP cookie [operations/puppet] - 10https://gerrit.wikimedia.org/r/119448 (owner: 10Ori.livneh) [21:55:10] (03CR) 10BBlack: [C: 032 V: 032] "Note that this only affects labs until 119014 is merged" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119448 (owner: 10Ori.livneh) [21:55:28] bblack: I love you, man. Thanks! [21:55:44] np [21:56:16] I actually sat there and figured out how to see the C-output from VCL compiles and make that patch myself, then when I was about to commit it I saw your pending one already. exact same code heh. [21:56:42] * bblack should check gerrit more often [21:58:03] ori: the zero-related usage is more complex (remove then add) but not inline-C, and it's already pushing to prod. I tested that one manually on the betalabs mobile test. So assuming nothing breaks horribly from that, we should be good to go later. 
[21:58:05] (03PS2) 10Ori.livneh: Set serializer to 'php' for production memcache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119461 [21:58:27] bblack: weeeeee :) awesome [21:59:49] dbbot-wm: info 10.64.16.10 [21:59:51] @info 10.64.16.10 [21:59:51] Krinkle: [10.64.16.10: s5] db1021 [22:00:06] @info dewiki [22:00:06] Krinkle: [dewiki: s5] db1058: 10.64.32.28, db1005: 10.64.0.9, db1026: 10.64.16.15, db1021: 10.64.16.10 [22:00:14] nice feature [22:00:28] ResourceLoaderFileModule::getStyles: failed to update DB: exception 'DBConnectionError' with message 'DB connection error: Too many connections (10.64.16.10)' in /usr/local/apache/common-local/php-1.23wmf17/includes/db/Database.php:966 [22:00:54] https://github.com/Krinkle/wmfdbbot [22:01:55] (03PS1) 10Spage: Add new Popups extension to list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119646 [22:09:51] (03CR) 10Spage: [C: 032] "Just the addition to extension-list from gerrit 119445." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119646 (owner: 10Spage) [22:09:53] (03Merged) 10jenkins-bot: Add new Popups extension to list [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119646 (owner: 10Spage) [22:10:58] !log spage updated /a/common to {{Gerrit|I3d59b8246}}: Add new Popups extension to list [22:11:04] Logged the message, Master [22:11:51] PROBLEM - Puppet freshness on labsdb1004 is CRITICAL: Last successful Puppet run was Wed 19 Mar 2014 07:10:56 PM UTC [22:12:11] spagewmf: w00t [22:12:30] gonna scap it [22:12:49] !log spage Started scap: Add Popups extension (Hovercards Beta Feature) to wmf17 and 18, not enabled yet [22:12:54] Logged the message, Master [22:13:00] good message [22:13:11] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [22:13:30] * bd808 points out that wmf17 dies tomorrow [22:14:32] bd808: my religious tracts spake thusly: "When adding a new extension to one branch, you also need to add the 
extension to any other branches in use on the cluster (typically the wmf{N-1} branch), even if the extension will not be enabled on any wikis running that branch. Otherwise the localization cache builder will complain." [22:14:38] :) [22:15:25] That is quite likely. The l10n builder is a fickle beast [22:16:04] I've been doing battle with it for over a month to fix a bug in the new branch deploys [22:16:15] bd808: I support a 2014 Papal Council of Trebuchet to rewrite Scripture [22:17:00] I got trebuchet working in beta last night. It was a banner moment [22:17:41] spagewmf: I'll be out in a few mins, hope that's ok :) [22:18:18] YuviPanda: no worries. I have Popups working locally and on beta labs, I can do some debugging [22:18:24] if needed [22:18:26] spagewmf: woot [22:22:51] exception.log spewing 2014-03-19 22:21:26 mw1163 enwiki: [f42cd34c] /wiki/Main_Page Exception from line 468 of /usr/local/apache/common-local/php-1.23wmf17/includes/cache/LocalisationCache.php: No localisation cache found for English. Please run maintenance/rebuildLocalisationCache.php [22:22:55] ?? [22:23:32] (03CR) 10BryanDavis: [C: 031] "Cherry-picked into deployment-prep project and used there to solve "puppet:///volatile/GeoIP not found" errors for varnish hosts." [operations/puppet] - 10https://gerrit.wikimedia.org/r/119555 (owner: 10BryanDavis) [22:24:25] spagewmf: just mw1163? [22:24:59] greg-g: good point, just mw1163 [22:25:04] whew [22:25:05] that one must have hardware issues [22:25:14] try a /last mw1163 here in channel [22:25:28] it showed up in monitoring a couple times ,then came back [22:25:35] i can remove it temp. 
from dsh [22:25:40] if that's where it showed up [22:25:55] mutante: looks like it didn't get the l10n cache rebuild [22:26:00] but got the code [22:26:13] 08:03 <+icinga-wm> PROBLEM - Apache HTTP on mw1163 is CRITICAL: Connection refused [22:26:16] 08:09 <+icinga-wm> RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.124 second response time [22:26:16] awesome [22:27:09] * greg-g makes note of /last [22:27:09] i take that back, already not in dsh [22:27:10] (03CR) 10Ori.livneh: [C: 032] Set serializer to 'php' for production memcache [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119461 (owner: 10Ori.livneh) [22:27:19] i wonder how you synced to it then [22:27:45] #6741: mw1163 dmesg output shows memory errors [22:27:49] there we go [22:27:56] fun [22:28:01] depool? [22:28:04] !log ori synchronized wmf-config/mc.php 'Set serializer to php for production memcache' [22:28:09] Logged the message, Master [22:28:14] Feb 13 dzahn: "removed from dsh groups because deployers reported errors and asked for it" [22:28:33] so, not in dsh, but not depool'd? [22:28:36] "https://gerrit.wikimedia.org/r/#/c/113287/ [22:28:37] please just re-add/revert when this is fixed " [22:28:57] it was "completely down" on Mar 10 [22:29:03] then fixed on Mar 19 [22:29:06] that's fine for a quick fix, but long term... need it depooled [22:30:05] /usr/local/apache/common-local on mw1163 is a little out of date. It has stuff that was deleted from cluster last week. 
[22:30:21] yeah, depool it: https://logstash.wikimedia.org/#dashboard/temp/8JOj_a5OTNKI-MjhU8h_rQ [22:30:26] * bd808 notes that eventual consistency in scap is forward only [22:31:41] greg-g: i think it must have been depooled for a while or you would have noticed this way earlier [22:31:52] you'd assume, but I wouldn't :) [22:31:55] people ignore errors like this [22:32:00] <^d> ori: Nothing's freaking out :) [22:32:01] i guess it got reactivated after the memory fix [22:33:00] greg-g: found it. Mar 19 " Added back to the pool. " [22:33:02] on that ticket [22:33:07] 6741 [22:33:42] !log depooling mw1163 [22:33:47] Logged the message, Master [22:33:50] greg-g: Can I try a manual sync-common on it? [22:34:11] mutante thanks. *I* was freaking out over http://tinyurl.com/n3twd8k , a bit :) [22:34:15] bd808: yeah [22:34:32] mutante: weird, it did just start today, if my Fx unfreezes I'll share the logstash query [22:34:47] greg-g: well, March 19 is today [22:34:53] Chris fixed the memory and repooled it [22:34:55] oh right, /me reads [22:35:01] without dsh [22:35:01] the part missing was re-adding to dsh as well [22:35:03] * greg-g nods [22:35:24] bd808: manual sync sounds useful [22:35:26] yes [22:35:56] Sync is done. I'll do the l10n rebuild now [22:37:00] my scap is 77% done :-/ [22:37:27] you could revert https://gerrit.wikimedia.org/r/#/c/113287/ [22:37:32] if you think it looks good again [22:37:43] but it needs both, dsh and pybal or none [22:37:58] spagewmf: Look on the bright side, you have a progress meter now! [22:38:17] !log Manually ran sync-common and scap-rebuild-cdbs on mw1163 [22:38:22] Logged the message, Master [22:39:59] bd808 new scap is more helpful. BTW it output 22:12:58 WARNING - Unexpected argument(s) ignored: ['--extended'] [22:40:27] spagewmf: Yeah that's a known one I need to fix. 
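Note on "it needs both, dsh and pybal or none" (22:37:43): the mw1163 problem was a host that had been repooled without being re-added to the dsh groups. A minimal sketch of that consistency check follows, assuming the two host lists have already been read from wherever they are kept. The function name and the hosts other than mw1163 are made up for illustration, and nothing here queries dsh or pybal.

```python
def find_inconsistent_hosts(dsh_hosts, pooled_hosts):
    """Report hosts that are in only one of the two lists.

    dsh_hosts:    hosts that deployment syncs are pushed to (assumed input)
    pooled_hosts: hosts currently serving traffic (assumed input)
    """
    dsh, pooled = set(dsh_hosts), set(pooled_hosts)
    return {
        # Serving traffic but not receiving syncs: the mw1163 situation.
        "pooled_but_not_in_dsh": sorted(pooled - dsh),
        # Receiving syncs but not serving traffic: harmless, but worth knowing.
        "in_dsh_but_not_pooled": sorted(dsh - pooled),
    }


# Hypothetical lists; only mw1163 comes from the log.
dsh_group = ["mw1161", "mw1162"]
pooled = ["mw1161", "mw1162", "mw1163"]
print(find_inconsistent_hosts(dsh_group, pooled))
```

With these inputs the first list contains only `mw1163`, the host that should either be depooled or re-added to dsh.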
[22:40:48] Glad you like the UX more than previous [22:41:36] (03PS1) 10Dzahn: Revert "remove mw1163 from dsh groups" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119650 [22:41:50] (03CR) 10jenkins-bot: [V: 04-1] Revert "remove mw1163 from dsh groups" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119650 (owner: 10Dzahn) [22:41:54] duh:) [22:42:09] well, you get the idea.. [22:42:40] bd808: would be nice to explain "sync-common to proxies". Also, "Finished rsync common (duration: 00m 09s)" sounds like the finished product (best present ever :) ). [22:42:41] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [22:43:21] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 36: active_shards: 95: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [22:43:39] I was wondering if it was having problems... [22:43:54] That's not good [22:44:02] Down an node [22:45:43] greg-g: "Caused by: java.lang.OutOfMemoryError: Java heap space" [22:45:46] boom [22:46:33] my scap is 78% done. greg-g at 7 minutes a machine (!) and 51 left, I'm going to go over the window [22:46:43] wth [22:46:47] bd808: advice? [22:47:06] Don't add a new l10n to 2 branches at once? [22:47:11] :/ [22:47:30] scap was chugging away with a few stalls, but now it's very slow. 
[22:47:32] (03PS1) 10Greg Grossmeier: Revert "remove mw1163 from dsh groups" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119653 [22:47:39] It takes 3.5m /host with new l10n on 1 branch [22:47:45] (03CR) 10jenkins-bot: [V: 04-1] Revert "remove mw1163 from dsh groups" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119653 (owner: 10Greg Grossmeier) [22:47:51] fine, jenkins [22:48:09] (03Abandoned) 10Greg Grossmeier: Revert "remove mw1163 from dsh groups" [operations/puppet] - 10https://gerrit.wikimedia.org/r/119653 (owner: 10Greg Grossmeier) [22:48:28] I wonder if one of the rsync slaves is overloaded [22:50:21] Might ori's sync-file or the sync-common to mw1163 have confused scap in the middle of running? [22:50:51] The sync-common wouldn't. [22:51:43] spagewmf: Is it still syncing or is it rebuilding the l10n cache now? [22:52:01] bd808: 22:19:01 DEBUG - Started update apaches [22:52:25] Network load on wm1070 wasn high but dropped down normal a few minutes ago [22:52:40] earlier was 22:17:32 INFO - Finished mw-update-l10n (duration: 04m 34s) [22:57:04] I don't see anything crazy in ganglia for mw1070 and mw1010 (rsync slaves) [22:57:54] now 50 machines left [22:59:04] I'm in no hurry, maybe Prtxsxnsna will wake before it completes :) [23:03:04] heh [23:03:21] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1712: active_shards: 5059: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 0 [23:03:42] :( [23:04:01] so, first logstash ES, now Cirrus ES? [23:04:11] manybubbles: ^ if you're still online [23:04:21] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1713: active_shards: 5062: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [23:04:28] manybubbles: nvm! :) [23:06:40] be bacvk shorly [23:08:29] bd808: `ps wlxU spage` shows 51 "/usr/bin/ssh -oBatchMode=yes -oSetupTimeout=10 mwXXXX /usr/local/bin/sync-common mw1010.eqiad.wmnet mw1070.eqiad.wmnet", I guess those are the remaining syncs [23:11:10] spagewmf: They are totally stuck? [23:11:20] You're still at 50 left? [23:11:43] bd808: yes, sync-common: 78% (ok: 181; fail: 0; left: 50) [23:11:47] Something went horribly wrong there [23:12:29] * bd808 goes to look at tin [23:16:53] spagewmf: I think something broke for those hosts. [23:17:12] I can se eyou are still connected but I don't see any cpu usage for your process [23:17:30] I would advise ^C to kill scap and then run again. [23:18:12] !log scap by spage looks hung with 50 hosts not finished rsyncing [23:18:18] Logged the message, Master [23:18:40] greg-g: that _should_ be ok [23:20:31] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 32: active_shards: 35: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 59 [23:20:31] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - elasticsearch (production-logstash-eqiad) is running. status: red: timed_out: false: number_of_nodes: 2: number_of_data_nodes: 2: active_primary_shards: 32: active_shards: 35: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 59 [23:20:47] !log restarted elasticsearch on logstash1002 [23:20:52] Logged the message, Master [23:21:09] bd808: thanks, anything I can do to help debug before I stop? 
They're all in BSD STAT "Ss" interruptible sleep (waiting for an event to complete) [23:21:58] spagewmf: I grabbed a dump to the ps list so we can look at the hosts. Other than that I can't think of anything [23:22:15] !log spage scap aborted: Add Popups extension (Hovercards Beta Feature) to wmf17 and 18, not enabled yet (duration: 69m 26s) [23:22:19] Logged the message, Master [23:22:40] !log spage Started scap: Retry Add Popups extension (Hovercards Beta Feature) to wmf17 and 18, not enabled yet [23:22:45] Logged the message, Master [23:22:53] * bd808 crosses fingers for spagewmf and scfc_de  [23:23:06] sorry scfc_de, bad tab [23:24:19] bd808: yes. I couldn't run strace -p on my tin processes, so I'm not sure what they were waiting for. I should have ssh'd to a target to see what they're doing. [23:25:09] spagewmf: On tin they would have just been waiting on network io [23:27:11] bd808: second attempt printed "23:24:00 INFO - Finished sync-common to apaches (duration: 01m 03s)", I think first one didn't print that. Now doing scap-rebuild-cdbs: 88% (ok: 203; fail: 0; left: 28) [23:27:30] +! [23:27:34] +1 even [23:28:02] Something got loopy in the rsync returns [23:29:56] !log spage Finished scap: Retry Add Popups extension (Hovercards Beta Feature) to wmf17 and 18, not enabled yet (duration: 07m 15s) [23:30:01] Logged the message, Master [23:30:08] That's better [23:30:13] I should have Ctrl+C'd much sooner! [23:30:23] Sorry I let you hang out so long :( [23:30:53] I would say that after 5m of no change it's time to shoot the patient [23:31:14] bd808: no worries. Now to merge the enable patch, test on testwiki, and sync that. [23:31:42] spagewmf: Is this a brand new extension? 
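Note on the "after 5m of no change" rule of thumb (23:30:53): the hung scap sat at `left: 50` from 22:57:54 until at least 23:11:43. The sketch below parses a progress line in the format quoted in this log and flags a run whose `left` count has not moved for five minutes. It is an illustration only: the function names are hypothetical, the samples are assumed to be collected by hand or by a wrapper, and scap itself is not known to expose anything like this.

```python
import re

# Matches lines such as "sync-common: 78% (ok: 181; fail: 0; left: 50)".
PROGRESS_RE = re.compile(
    r"(?P<step>[\w-]+): (?P<pct>\d+)% "
    r"\(ok: (?P<ok>\d+); fail: (?P<fail>\d+); left: (?P<left>\d+)\)"
)


def parse_progress(line):
    """Extract step name and counters from a progress line, or None if absent."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    return {
        "step": m["step"],
        "pct": int(m["pct"]),
        "ok": int(m["ok"]),
        "fail": int(m["fail"]),
        "left": int(m["left"]),
    }


def is_stalled(samples, threshold_seconds=300):
    """samples: chronological list of (seconds, hosts_left) readings.

    Returns True when hosts_left has held its latest value for at least
    threshold_seconds (300 s is the 5-minute rule from the channel).
    """
    if not samples:
        return False
    since, prev = samples[0]
    for t, left in samples[1:]:
        if left != prev:
            since, prev = t, left
    return samples[-1][0] - since >= threshold_seconds


def hms(text):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    h, m, s = (int(part) for part in text.split(":"))
    return h * 3600 + m * 60 + s


print(parse_progress("bd808: yes, sync-common: 78% (ok: 181; fail: 0; left: 50)"))
# Two readings taken from the log timestamps: 50 hosts left both times.
print(is_stalled([(hms("22:57:54"), 50), (hms("23:11:43"), 50)]))  # True
```

The two readings are 829 seconds apart with no change, well past the threshold, which is consistent with the advice to ^C and rerun.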
[23:32:01] bd808: yes, Popups was only on beta labs until today [23:32:52] if any SWAT team members are around, ebernhardson is OK delaying "Send Flow specific logs to fluorine" [23:33:17] ya i just moved it tomorrows deploy, its not a big deal at all [23:34:29] (03CR) 10Spage: [C: 032] "extension-list change went out in gerrit 119646, now to do the rest." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119445 (owner: 10Spage) [23:34:41] (03Merged) 10jenkins-bot: Enable Popups (Hovercards) on mediawiki.org [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/119445 (owner: 10Spage) [23:36:05] spagewmf: "Popups" right? I'll add it to the make-wmf-branch config [23:36:34] bd808: I have a patch for that already, hold on [23:36:45] Oh cool [23:38:13] bd808: https://gerrit.wikimedia.org/r/#/c/119447 , AIUI that's not needed until the new branch is cut? [23:39:02] spagewmf: Right. Which I'll be doing at about 9AM your time tomorrow morning. [23:45:54] testwiki has Hovercards! [23:51:36] bd808: so I have 5 config files in /a/common/wmf-config to sync, what script should I use? [23:51:48] ^ or anyone [23:53:12] I'll do sync-dir wmf-config [23:55:34] sync-dir [23:55:43] touch InitialiseSettings.php too if you haven't edited it [23:56:38] Reedy: ! thx! [23:56:59] * bd808 is reminded that there is bug to make the scripts do that [23:57:33] I really need to get done playing in labs and get back to porting the rest of the scap scripts [23:57:44] !log spage synchronized wmf-config 'Config change to enable extension Popups (Hovercards) on mediawikiwiki' [23:57:49] Logged the message, Master [23:58:24] testing...