[01:16:09] (PS1) TTO: Add suppressredirect user group for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142979 (https://bugzilla.wikimedia.org/67278)
[01:33:32] (CR) TTO: [C: 1] "Any problem here? The FeaturedFeeds patch has been merged." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015) (owner: Whym)
[01:40:54] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Sun 29 Jun 2014 23:40:27 UTC
[01:41:32] !log springle Synchronized wmf-config/db-eqiad.php: depool db1061 during schema changes (duration: 00m 07s)
[01:41:41] Logged the message, Master
[01:46:58] (PS1) TTO: Allow subpages in template namespace of frwikiversity [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142980 (https://bugzilla.wikimedia.org/57487)
[01:49:41] (PS2) Springle: Deploy db1072 and db1073 as future S1. [operations/puppet] - https://gerrit.wikimedia.org/r/142487
[01:50:06] (CR) Springle: [C: 2] Deploy db1072 and db1073 as future S1. [operations/puppet] - https://gerrit.wikimedia.org/r/142487 (owner: Springle)
[01:59:49] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Mon Jun 30 01:59:37 UTC 2014
[02:15:27] !log LocalisationUpdate completed (1.24wmf10) at 2014-06-30 02:14:23+00:00
[02:15:38] Logged the message, Master
[02:24:59] !log LocalisationUpdate completed (1.24wmf11) at 2014-06-30 02:23:56+00:00
[02:25:04] Logged the message, Master
[02:28:57] !log springle Synchronized wmf-config/db-eqiad.php: repool db1061, warm up (duration: 00m 07s)
[02:29:03] Logged the message, Master
[02:49:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jun 30 02:48:28 UTC 2014 (duration 48m 27s)
[02:49:39] Logged the message, Master
[03:32:22] (PS1) Springle: switch tendril backend to db1011 [operations/puppet] - https://gerrit.wikimedia.org/r/142982
[03:37:56] (CR) Springle: [C: 2] switch tendril backend to db1011 [operations/puppet] - https://gerrit.wikimedia.org/r/142982 (owner: Springle)
[03:39:42] (PS1) Ori.livneh: Apache config for Wikivoyage using mod_proxy_fcgi [operations/apache-config] - https://gerrit.wikimedia.org/r/142983
[03:40:02] TimStarling: a pache for your thoughts
[03:40:36] also, good afternoon
[03:54:09] (PS1) Springle: add db1044 to mariadb 10 s1 for HA tests [operations/puppet] - https://gerrit.wikimedia.org/r/142985
[03:57:41] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Mon 30 Jun 2014 01:56:49 UTC
[03:58:48] (CR) Springle: [C: 2] add db1044 to mariadb 10 s1 for HA tests [operations/puppet] - https://gerrit.wikimedia.org/r/142985 (owner: Springle)
[04:17:53] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Mon Jun 30 04:17:43 UTC 2014
[04:49:41] (CR) Calak: [C: 1] Add suppressredirect user group for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142979 (https://bugzilla.wikimedia.org/67278) (owner: TTO)
[05:45:18] (CR) Calak: "TTO, admins should be able to add/remove this group; you have forgot it!" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142979 (https://bugzilla.wikimedia.org/67278) (owner: TTO)
[05:54:03] (PS2) TTO: Add suppressredirect user group for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142979 (https://bugzilla.wikimedia.org/67278)
[05:55:33] (CR) TTO: "Sorry! It's been so long since I did one of these." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142979 (https://bugzilla.wikimedia.org/67278) (owner: TTO)
[06:27:08] <_joe_> ori: still around?
[06:32:02] akosiaris: morning, can we have a look at the patch again
[07:30:47] (CR) Giuseppe Lavagetto: [C: -1] "seems legit, but some food for thought." (1 comment) [operations/apache-config] - https://gerrit.wikimedia.org/r/142983 (owner: Ori.livneh)
[07:34:00] Hey anyone out there?
[07:34:35] Just looking for some further details about this: https://www.mediawiki.org/wiki/Manual:Database_access#Lock_contention
[07:53:22] !log upgrading ms-be300[2-4] to swift icehouse
[07:53:27] Logged the message, Master
[08:03:54] (PS1) Filippo Giunchedi: swift: add container-sync config section [operations/puppet] - https://gerrit.wikimedia.org/r/142989
[08:07:55] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: add container-sync config section [operations/puppet] - https://gerrit.wikimedia.org/r/142989 (owner: Filippo Giunchedi)
[08:23:10] PROBLEM - swift-container-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[08:23:10] PROBLEM - swift-account-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[08:23:19] PROBLEM - swift-container-updater on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[08:23:19] PROBLEM - swift-account-reaper on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[08:24:10] RECOVERY - swift-container-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[08:24:10] RECOVERY - swift-account-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[08:24:19] RECOVERY - swift-container-updater on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[08:24:19] RECOVERY - swift-account-reaper on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[08:24:39] RECOVERY - Disk space on ms-be3003 is OK: DISK OK
[08:27:04] <_joe_> .win 12
[08:34:03] _joe_: hey!
[08:34:23] <_joe_> YuviPanda: hey
[08:34:41] _joe_: switched over the labs proxis to use the nginx module
[08:35:05] so can test pfs either with that patch or with the nginx::ssl thing
[08:35:19] <_joe_> YuviPanda: oh, great
[08:35:38] <_joe_> the nginx::ssl thing has been a creation of mine last week to ease transitions
[08:35:51] yeah noticed :)
[08:36:05] is prod nginx using the module?
[08:36:24] also i got a few commits merged on the nginx module as well
[08:37:55] we don't have throughput monitoring for the labs nginx yet tho
[08:40:55] <_joe_> not really, no
[08:41:14] <_joe_> just the new service rcstream and a few other things do
[08:41:40] would prod switch evetually?
[08:48:56] <_joe_> not at present
[08:49:09] <_joe_> no idea if it ever will, it should probably
[09:25:20] (CR) Calak: [C: 1] "Thank you!" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142979 (https://bugzilla.wikimedia.org/67278) (owner: TTO)
[09:54:03] <_joe_> Nemo_bis: fancy a new edition of the PFS gazette? :)
[09:55:15] hashar: there? I'm going to merge 141488 141487 and 141924
[09:59:25] _joe_: yess
[09:59:58] * Nemo_bis is always too lazy for tab-complete, fingers refuse to type more than 2 letters but jo* doesn't suffice
[10:03:40] (CR) Filippo Giunchedi: Add Grafana module & role (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/133274 (owner: Ori.livneh)
[10:07:04] _joe_: morning, so PFS is a go? it was published on tech news, which means it must be delayed, tech news is never right about timing :)
[10:10:10] godog: hi. Sorry I wasn't around :-D
[10:10:47] matanya: [citation needed]
[10:10:55] hashar: np, I think we're good to merge
[10:11:07] +1 Nemo_bis
[10:11:10] godog: ah the zuul puppet patches. I completely forgot about them over the weekend
[10:11:17] godog: yeah those three are trivial enough
[10:11:42] hashar: no worries, I've been away too on fri/weekend
[10:12:01] godog: the rest of patches I would need some input / design idea / experience whatever :D
[10:15:12] <_joe_> matanya: did not know about that
[10:15:21] <_joe_> for me it's a go
[10:15:57] I have subscribed to TechNews RSS feed https://meta.wikimedia.org/w/api.php?action=featuredfeed&feed=technews&feedformat=rss
[10:16:05] that is handy sometime :)
[10:16:09] godog: merge time? :-D
[10:16:22] godog: I am on gallium ready to manually launch puppet
[10:16:33] hashar: yup
[10:16:55] (CR) Filippo Giunchedi: [C: 2 V: 2] zuul: migrate server definitions to server.pp [operations/puppet] - https://gerrit.wikimedia.org/r/141488 (owner: Hashar)
[10:17:08] (CR) Filippo Giunchedi: [C: 2 V: 2] zuul: migrate merger definitions to merger.pp [operations/puppet] - https://gerrit.wikimedia.org/r/141487 (owner: Hashar)
[10:17:21] (CR) Filippo Giunchedi: [C: 2 V: 2] zuul: get rid of git_dir and zuul_url in server conf [operations/puppet] - https://gerrit.wikimedia.org/r/141924 (owner: Hashar)
[10:17:27] (PS1) Aude: Set internalEntitySerializerClass Wikibase setting [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142998
[10:18:37] hashar: done
[10:18:49] running puppet
[10:20:25] !log restarting zuul after a puppet change for /etc/zuul/zuul.conf
[10:20:31] Logged the message, Master
[10:29:01] (PS3) Hashar: zuul: prefix server default template with 'zuul.' [operations/puppet] - https://gerrit.wikimedia.org/r/141501
[10:29:09] (CR) Hashar: "rebased" [operations/puppet] - https://gerrit.wikimedia.org/r/141501 (owner: Hashar)
[10:29:52] (PS4) Hashar: zuul: merger now has its own default file [operations/puppet] - https://gerrit.wikimedia.org/r/141502
[10:30:00] (CR) Hashar: "rebased" [operations/puppet] - https://gerrit.wikimedia.org/r/141502 (owner: Hashar)
[10:30:51] godog: the next two one split the init script default files for the two services running: the server and the merger.
[10:31:01] godog: https://gerrit.wikimedia.org/r/#/c/141501/ and https://gerrit.wikimedia.org/r/#/c/141502/ :-]
[10:41:03] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Mon 30 Jun 2014 08:40:26 UTC
[10:47:32] hashar: yep looks good
[10:47:43] (CR) Filippo Giunchedi: [C: 2 V: 2] zuul: prefix server default template with 'zuul.' [operations/puppet] - https://gerrit.wikimedia.org/r/141501 (owner: Hashar)
[10:47:54] (CR) Filippo Giunchedi: [C: 2 V: 2] zuul: merger now has its own default file [operations/puppet] - https://gerrit.wikimedia.org/r/141502 (owner: Hashar)
[10:48:23] hashar: merged!
[10:48:38] \O/
[10:50:59] <_joe_> ok, if someone knows how I can check if the submodule change I did in a patch has already been merged in prod
[10:51:09] <_joe_> and more importantly, how do I rebase?
[10:51:14] <_joe_> shit I'm hating this.
[10:51:45] git branch --all --contains ?
[10:51:59] would let you know whether some sha1 is in the local repo
[10:52:07] or drop --all for the current branch (i.e. HEAD)
[10:52:11] <_joe_> it's a submodule
[10:52:23] <_joe_> I really really hate this
[10:52:46] cd to the submodule ? :-D
[10:53:04] <_joe_> can't see how we're benefiting from this. Submodules may be nice in general (maybe), but having to work with gerrit it's a real mess
[10:53:12] <_joe_> hashar: do you get how worng that is?
[10:53:22] <_joe_> I'm sure you are
[10:54:08] <_joe_> and rebasing does strange things in this situation.
[11:00:40] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Mon Jun 30 11:00:31 UTC 2014
[11:01:27] (PS4) Giuseppe Lavagetto: rcstream: add SSL support [operations/puppet] - https://gerrit.wikimedia.org/r/141931
[11:06:53] (PS5) Giuseppe Lavagetto: rcstream: add SSL support [operations/puppet] - https://gerrit.wikimedia.org/r/141931
[11:07:13] (CR) Giuseppe Lavagetto: [C: 2] rcstream: add SSL support [operations/puppet] - https://gerrit.wikimedia.org/r/141931 (owner: Giuseppe Lavagetto)
[11:08:27] (CR) Physikerwelt: "Which WMF version is currently being used?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139421 (https://bugzilla.wikimedia.org/66587) (owner: Reedy)
[11:27:55] <_joe_> lunch, bbl
[12:09:17] _joe_: you can do git submodule in the ops repo to get the hash at which that submodule is pinned, and then in the submodule repo do a git log
[12:09:37] _joe_: and if this is about your nginx::ssl change then yes, it is in prod now
[12:19:47] <_joe_> YuviPanda: oh yes I long solved that :) thanks :)
[12:20:36] _joe_: :) do you want to test the PFS patches on the labs proxies today?
[12:20:59] <_joe_> YuviPanda: if you feel like it, yes
[12:21:10] <_joe_> but don't cut out ie6 :)
[12:21:14] _joe_: sure! can you merge? let me find patches
[12:21:39] (PS2) Yuvipanda: dynamicproxy: Tweak SSL config to match prod [operations/puppet] - https://gerrit.wikimedia.org/r/142207
[12:21:48] _joe_: ^
[12:21:57] <_joe_> YuviPanda: ok, will do
[12:22:29] _joe_: still supports sslv3 so ie6 is ok... for now :)
[12:23:05] (CR) Giuseppe Lavagetto: [C: 2] dynamicproxy: Tweak SSL config to match prod [operations/puppet] - https://gerrit.wikimedia.org/r/142207 (owner: Yuvipanda)
[12:23:24] <_joe_> 5 mins and I'll merge
[12:24:30] _joe_: ok! I can run puppet on the box when appropriate
[12:26:46] <_joe_> merged
[12:28:03] any paravoid seen?
[12:28:54] _joe_: done!
[12:29:18] _joe_: https://sugarfrosties.wmflabs.org/wiki/Main_Page is sample URL
[12:30:04] <_joe_> The connection is encrypted and authenticated using AES_128_GCM and uses ECDHE_RSA as the key exchange mechanism
[12:30:08] <_joe_> :)
[12:30:42] matanya: yes
[12:31:36] _joe_: I see a CPU spike http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1404131481.593&from=00%3A00_20140626&until=23%3A59_20140630&target=project-proxy.dynamicproxy-gateway.cpu.total.system.value&target=project-proxy.dynamicproxy-gateway.cpu.total.user.value
[12:32:00] let
[12:32:05] 's see if that is just momentary
[12:32:08] paravoid: have you seen OTRS admins complaint ?
[12:32:11] <_joe_> YuviPanda: which is expected, it's still very low
[12:32:28] <_joe_> YuviPanda: the point is not the cpu alone, but the throughput
[12:32:39] <_joe_> which is not greatly affected
[12:32:45] _joe_: right, which we don't track yet, sadly
[12:32:51] _joe_: yeah, 9% seems ok :)
[12:33:15] but it is almost a 2.5x spike, and this box has low load
[12:33:19] <_joe_> YuviPanda: also, that includes the puppet run I think
[12:33:32] _joe_: yeah, am waiting for it to stabilize.
[12:34:16] <_joe_> YuviPanda: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1404131481.593&from=09%3A00_20140630&until=23%3A59_20140630&target=project-proxy.dynamicproxy-gateway.cpu.total.system.value&target=project-proxy.dynamicproxy-gateway.cpu.total.user.value
[12:34:25] <_joe_> this shows the last few hours
[12:34:32] <_joe_> and the spikes are all puppet runs
[12:34:34] <_joe_> :)
[12:34:37] _joe_: ah, makes sense :)
[12:34:46] <_joe_> so the cpu increase is not really there
[12:34:53] (PS2) Yuvipanda: toollabs: Tweak SSL config to match prod [operations/puppet] - https://gerrit.wikimedia.org/r/142208
[12:35:00] _joe_: yeah, seems so.
[12:35:03] <_joe_> (I predict a 20%-30% increase)
[12:35:26] <_joe_> but, that server is literally doing nothing :)
[12:35:27] _joe_: ^ is toollabs, which has more traffic. we should have throughput measurements in about a week, so I'm ok if you want to wait for that
[12:35:34] _joe_: yeah
[12:35:43] _joe_: most of labs traffic is to toollabs :)
[12:35:59] paravoid: https://gerrit.wikimedia.org/r/#/c/141919/ for ref
[12:36:13] <_joe_> YuviPanda: we actually wanted to go live tomorrow :)
[12:36:19] _joe_: aah :)
[12:36:27] _joe_: feel free to merge then :)
[13:26:23] (CR) Filippo Giunchedi: [C: -1] initial commit (11 comments) [operations/debs/php-mailparse] (review) - https://gerrit.wikimedia.org/r/142751 (owner: 20after4)
[13:31:54] !log upgrade ms-fe300[12] to swift icehouse
[13:31:59] Logged the message, Master
[13:33:36] (CR) Anomie: Disable local uploads on Malay Wiktionary (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142318 (https://bugzilla.wikimedia.org/67152) (owner: Odder)
[13:33:49] (CR) Anomie: [C: -1] Disable local uploads on Malay Wiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142318 (https://bugzilla.wikimedia.org/67152) (owner: Odder)
[13:34:01] woohoo
[13:35:05] <_joe_> wow
[13:35:57] so far so good!
[14:00:42] (CR) Manybubbles: [C: 1] "Load tested this this morning and saw a 2-5% spike in load when I enabled it as the cache filled. After about two minutes the load died d" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142590 (owner: Chad)
[14:03:41] (PS2) Odder: Disable local uploads on Malay Wiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142318 (https://bugzilla.wikimedia.org/67152)
[14:05:58] godog: how's swift?
[14:07:50] paravoid: aside from that sdk problem in esams (that gets unmounted/remounted) I think it is fine, I see some bandwith going on now between the backends tho
[14:11:37] godog: you can just set the weight for that disk at 0
[14:11:44] on the rings
[14:11:51] we should anyway, it's a failed disk
[14:12:55] paravoid: yup, I'm not sure it can replicate it without filling up the disks though
[14:13:12] well, what's remaining, ~100G per disk
[14:13:42] more 180G
[14:14:53] yeah
[14:22:08] mhh ok I'll try to decrease the weight by 30% or so and see what swift thinks
[14:23:34] (CR) Manybubbles: [C: 1] Disable local uploads on Malay Wiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142318 (https://bugzilla.wikimedia.org/67152) (owner: Odder)
[14:24:19] (CR) Manybubbles: [C: 1] Remove completed surveys [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138634 (owner: MarkTraceur)
[14:27:16] (CR) Manybubbles: "Does this have to wait for the next commit in the chain (I26a5971a4ce388c81ead4f8a0af8d0754b3d945c) or is it safe?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/137804 (owner: Awight)
[14:29:16] anomie: I can swat today if you ant
[14:29:18] want
[14:30:33] manybubbles: I'd have no problem SWATting today either, but go ahead if you want
[14:30:52] anomie: I want to redeem myself after the last time I failed to deploy and broke cirrus
[14:31:01] I need to get my deploy legs back under me, I guess
[14:31:20] that, and we're deploying again at 1 any way
[14:41:20] !log Cleared out a watchlist with 126652 entries on warwiki to resolve https://bugzilla.wikimedia.org/show_bug.cgi?id=67123
[14:41:25] Logged the message, Master
[14:42:24] paravoid: did you see my messages ?
[14:42:43] I did
[14:42:48] Jeff_Green is working on it
[14:43:02] thanks.
[14:43:13] fwiw, the correct response is to file an RT ticket
[14:43:29] it should be fixed temporarily, I'm working on the permanent fix
[14:44:05] thank you both, i'll do so next time.
[14:44:26] :)
[14:44:38] thanks for relaying the message :)
[14:45:23] hoo is enjoying himself lately, I notice.
[14:45:48] twkozlowski: I love killing people's wathclists, yeah :P
[14:46:03] What was that in total? 200k+?
[14:46:29] That bug was about 3 watchlists... biggest one was 800k+ entries
[14:46:46] so 1M+
[14:46:52] JUST DON'T WATCH SO MANY PAGES PEEPS
[14:46:53] In total? Yeah
[14:47:33] <_joe_> 800k pages?
[14:47:36] <_joe_> wtf?
[14:47:52] yep, on shwiktionary, he watched 819k+ pages
[14:48:23] !log installed new swift ring on esams, decrease ms-be3003/sdk1 weight
[14:48:28] Logged the message, Master
[14:49:44] But why watching 800k pages? You can't even make so many edits in your whole life...
[14:51:08] twkozlowski: I've no idea how he managed to do that... he "only" got 450k edits over ther
[14:51:08] e
[14:52:32] 800k pages ?????
[14:52:42] you can do that with a bot
[14:52:46] Yep... biggest watchlist I've ever seen
[14:52:54] :o
[14:53:20] suppose it's watch everything i edit, then they go crazy with huggle or such
[14:53:29] could do on wikidata
[14:53:34] he got 450k edits on taht wiki
[14:53:43] ooo
[14:53:54] it's easy, when you operate a bot on the account
[14:55:41] * aude has almost 4000 pages on my wikidata watchlist
[14:56:19] oO
[14:57:10] paravoid: I think we have a permissions snafu on polonium as well
[14:57:19] could be
[14:57:38] polonium:/var/lib/spamassassin is debian-spamd.debian-spamd but the proc runs as spamd.spamd, there are no bayes* files
[14:57:38] aude: I got 65 pages... mostly because I watch virtually no items
[14:58:01] ah
[14:58:06] i have the setting enabled
[14:58:23] some point have to clear and/or disable it
[14:58:39] hoo see bug
[14:59:04] How is that possible with only 63,000 edits
[14:59:08] Oh
[14:59:12] I see.
[14:59:19] Kolega2357: ?
[14:59:37] hoo https://bugzilla.wikimedia.org/show_bug.cgi?id=67123
[14:59:37] x2 right?
[14:59:38] it would be a problem to have soooo many and then ask for maximum changes to show
[15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140630T1500)
[15:00:12] (PS1) Jgreen: make iodine (otrs) spamassassin daemon run as default user (spamd) [operations/puppet] - https://gerrit.wikimedia.org/r/143033
[15:00:48] twkozlowski: around to verify your SWAT patch?
[15:00:48] Kolega2357: What's with that? Should be fine now
[15:01:07] Jeff_Green: please fix polonium too :)
[15:01:10] hoo Yes all is a fine
[15:01:14] we should probably run it as debian-spamd everywhere
[15:01:15] marktraceur: same for you - around so you can verify your patch?
[15:01:25] Hi! Yeah, I am for a bit
[15:01:28] paravoid: yeah
[15:01:30] Should be pretty quick
[15:01:41] apergos: dataset alert? :)
[15:02:02] paravoid: i'll figure it out, and make it adjust the directory ownership accordingly
[15:02:14] ok
[15:02:16] thanks :)
[15:02:27] sorry for breaking it
[15:02:31] did the deb create that directory?
[15:02:36] marktraceur: ok -you'll go first then
[15:02:40] /var/lib/spamassassin? iirc yes
[15:02:46] Sweet
[15:02:52] paravoid: ok
[15:03:00] (CR) Jgreen: [C: 2 V: 1] make iodine (otrs) spamassassin daemon run as default user (spamd) [operations/puppet] - https://gerrit.wikimedia.org/r/143033 (owner: Jgreen)
[15:03:07] (CR) Manybubbles: [C: 2] Remove completed surveys [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138634 (owner: MarkTraceur)
[15:03:27] manybubbles: Also the ProofreadPage thing, right?
[15:03:34] * marktraceur being helpful to some community person
[15:03:36] marktraceur: yup - both
[15:03:41] ottomata/manybubbles: there's a "Slow CirrusSearch query rate" alert
[15:04:04] Coren: labstore1001 has /exp/dumps with 0% free space
[15:04:26] paravoid: thanks - I'll have a look at it in a bit - I imagine it is caused by the performance testing I was doing this morning - I wanted to push the system hard enough that it'd cause that
[15:04:35] manybubbles: 1d 13h old
[15:04:36] (Merged) jenkins-bot: Remove completed surveys [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/138634 (owner: MarkTraceur)
[15:04:43] paravoid: then not so good
[15:04:52] I'll investigate after swat
[15:05:11] k
[15:05:13] thx :)
[15:05:15] there are unstaged changes on tin!
[15:05:42] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:05:43] wmf-config/db-eqiad.php
[15:06:04] I'm pretty sure that means I need to not deploy until someone decides what to do with them
[15:06:14] committer?
[15:06:15] springle?
[15:06:22] unstaged - so no committer
[15:06:39] manually changed you mean?
[15:06:49] I thought merged but not deployed
[15:07:12] paravoid: nope - manually changed but not committed
[15:07:27] its in the middle springle's changes, but can't be sure who did it
[15:07:54] I should be able to work around it if I'm careful but it'll be a pain
[15:08:37] <_joe_> manybubbles: we should verify if that has been distributed
[15:08:49] _joe_: sure - let me go check
[15:08:50] <_joe_> if not, we should just revert the file IMO and save a stash
[15:09:24] (CR) BryanDavis: Add Grafana module & role (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/133274 (owner: Ori.livneh)
[15:10:00] _joe_: looks to have been synced
[15:10:29] <_joe_> ...
[15:10:41] yeah!
[15:10:43] (CR) Andrew Bogott: [C: -1] "Thanks for doing this!" (5 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/142479 (owner: Scottlee)
[15:10:59] marktraceur and _joe_: I'm going to sync the proofreadpage change because this one won't be in the way
[15:11:19] <_joe_> manybubbles: I don't know what to advise you to do
[15:11:44] <_joe_> surely do not stash that change; I'd probably commit it right away since it's already in prod
[15:11:57] <_joe_> but, you may have your ideas about that
[15:12:00] _joe_: me neither - I might just commit it with a snarky message. I'm not sure what the db-eqiad file configures because I haven't looked but I can guess.
[15:12:24] well, I know its synced to mw1001 - its probably on all the others, but I'm just assuming
[15:12:29] Sean uses it to pool/depool/other stuff EQIAD machines
[15:14:02] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: Fetching readonly
[15:15:16] !log manybubbles Synchronized php-1.24wmf10/extensions/ProofreadPage: SWAT - fix ProofreadPage number of pages (duration: 00m 09s)
[15:15:21] Logged the message, Master
[15:15:29] marktraceur: ^^^^ that was proofreadpage
[15:15:32] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.007 second response time
[15:15:49] manybubbles: ugh, sorry re uncommited but sync'd changes
[15:16:10] paravoid: is there any particular reason the spamassassin module only manages the user if the user is 'spamd' ?
[15:16:17] greg-g: advice on should commit it, work around it carefully, or stop?
[15:16:23] I imagine I should commit it
[15:16:28] lemme read SAL
[15:16:43] ah, yeah, that
[15:16:59] is the diff basically: " springle Synchronized wmf-config/db-eqiad.php: repool db1061, "
[15:17:03] ?
[15:17:43] greg-g: I can't tell if its to repool it or repool it with a lower weight
[15:17:49] I imagine its to lower the weight
[15:19:03] <_joe_> !log restarted profiler-to-carbon, stuck since _9_ days, will see that my patch gets deployed.
[15:19:08] Logged the message, Master
[15:20:29] (PS1) Manybubbles: Lower load factor on db-eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143036
[15:20:49] manybubbles: well, the previous one was a depool, so I bet we can assume we can commit it with a snarky message ;)
[15:22:33] (CR) Manybubbles: [C: 2] Lower load factor on db-eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143036 (owner: Manybubbles)
[15:22:39] (Merged) jenkins-bot: Lower load factor on db-eqiad [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/143036 (owner: Manybubbles)
[15:23:50] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: Fetching readonly
[15:23:55] !log manybubbles Synchronized wmf-config/: SWAT - remove completed mediaviewer surveys (duration: 00m 04s)
[15:23:57] marktraceur: ^^^^^
[15:24:00] Logged the message, Master
[15:24:07] now blame and shame :)
[15:24:12] (PS1) Jgreen: run spamd as user debian-spamd and manage that user+homedir [operations/puppet] - https://gerrit.wikimedia.org/r/143039
[15:24:32] I committed it and made sure that is was synced everywhere
[15:25:03] progress
[15:25:07] ok, moving on
[15:25:39] twkozlowski: want to do yours next?
[15:25:41] Yes
[15:25:41] thanks manybubbles
[15:25:42] (CR) Jgreen: [C: 2 V: 1] run spamd as user debian-spamd and manage that user+homedir [operations/puppet] - https://gerrit.wikimedia.org/r/143039 (owner: Jgreen)
[15:25:48] (CR) Manybubbles: [C: 2] Disable local uploads on Malay Wiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142318 (https://bugzilla.wikimedia.org/67152) (owner: Odder)
[15:26:00] wow jenkins review is speedy now!
[15:26:10] (Merged) jenkins-bot: Disable local uploads on Malay Wiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142318 (https://bugzilla.wikimedia.org/67152) (owner: Odder)
[15:26:28] hashar: ^^ re Jeff's comment
[15:26:44] !log manybubbles Synchronized wmf-config/: SWAT - disable local uploads on Malay Wiktionary (duration: 00m 04s)
[15:26:45] twkozlowski: ^^^^^^
[15:26:50] Logged the message, Master
[15:27:09] * twkozlowski nods
[15:27:17] James_F and aude: next?
[15:27:22] Sure.
[15:27:28] I don't see awight
[15:27:41] here
[15:28:10] also, I was a bit worried about awight's one because it only had Nemo's blessing (but I love Nemo) and wasn't sure if that was enough. also, it has a subsequent commit and I'm not sure if that needs to go too.
[15:28:20] James_F: then you can be next!
[15:29:01] manybubbles: Yeah, given the follow-up commit is -2'ed as not working let's leave it?
[15:29:08] the followup is a completely distinct thing
[15:29:43] <mathshowimage>
[15:29:51] Nemo_bis: It does look distinct - just accidentally a follow up. I'd like awight to be around though
[15:29:54] Looks like a cache issue or whatever?
[15:30:02] twkozlowski: math
[15:30:24] Yes, but why doesn't it appear correctly?
[15:30:27] I don't see anything on the great eye of logstash
[15:30:56] heya ori, lemme know when youa re around
[15:30:59] (PS2) BryanDavis: scap: /usr/local/bin/sync-common-file is unused [operations/puppet] - https://gerrit.wikimedia.org/r/135925
[15:31:00] nuria and i have a q for you
[15:32:27] (CR) BryanDavis: [C: 1] "It's been a couple of weeks since I448f574 was merged, so cleaning up the ensure=>absent should be safe." [operations/puppet] - https://gerrit.wikimedia.org/r/135925 (owner: BryanDavis)
[15:32:30] oh, sorry, nuria and oyu have already chatted abou tthis, maybe we can figure it out...
[15:35:27] !log manybubbles Synchronized php-1.24wmf11/extensions/VisualEditor/: SWAT Correctly VisualEditor - update full size in MediaSizeWidget (duration: 00m 07s)
[15:35:29] James_F: ^^^^^
[15:35:32] Logged the message, Master
[15:35:52] Thanks manybubbles.
[15:36:09] sure! is better?
[15:36:12] ready
[15:36:22] aude: merging
[15:36:29] manybubbles: Will take a while to get through the bits cache, IME.
[15:36:48] ok
[15:38:00] (PS4) BryanDavis: Set wgGitInfoCacheDirectory to point to scap managed location [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142320 (https://bugzilla.wikimedia.org/53972)
[15:39:29] opinions on the awight patch? Without him around I'm not supposed to deploy it. I imagine Nemo_bis could verify that it is working though. Is it something I can just revert if it is broken or will it leave pages in a busted state? I don't know that part of the code at all.
[15:40:18] no clue about that code
[15:40:55] it's only triggered on edit, no retroactive effects
[15:41:37] if you're unsure you can wait for Nikerabbit to +1 https://gerrit.wikimedia.org/r/#q,137804,n,z
[15:43:02] !log manybubbles Synchronized php-1.24wmf11/extensions/Wikidata/: (no message) (duration: 00m 09s)
[15:43:05] aude: ^^^^^
[15:43:07] Logged the message, Master
[15:43:31] Nemo_bis: I'll push it then - just to be safe - I broke things badly last week and don't want to push my luck this week
[15:43:34] ok
[15:44:04] it's certainly not urgent
[15:44:27] manybubbles: looks good
[15:44:42] aude: sweet - done SWATing for today
[15:44:56] :)
[15:45:26] manybubbles: Thanks!
[15:53:52] (CR) coren: [C: 2] "Yeah, that makes sense." [operations/puppet] - https://gerrit.wikimedia.org/r/142208 (owner: Yuvipanda)
[15:54:09] _joe_: ^ merged for toollabs :)
[15:54:26] <_joe_> ok
[15:54:36] <_joe_> great!
[15:57:03] <_joe_> I'll be back later
[16:02:50] Coren: ?
[16:03:53] paravoid: !
[16:03:58] 18:04 < paravoid> Coren: labstore1001 has /exp/dumps with 0% free space [16:04:08] * Coren facepalms. [16:04:55] :) [16:04:59] Yeah, the RT ticket we were hoping would reach consensus hasn't. I'll carve out some space out of the general redundant array for the read stats, that'll allow us more time. [16:05:11] also escalate this into our meeting today [16:05:17] Yeah, good idea. [16:05:32] :) [16:05:33] The fundamental problem remains that dumps growth is, by definition, unbounded. [16:08:09] anybody working on the elasticsearch cache warming? [16:20:41] (03CR) 10Rush: "re: Too lazy to setup a trebuchet target for the graphite project in labs, so I just cloned the appropriate repo into the appropriate loc" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/133274 (owner: 10Ori.livneh) [16:48:32] godog: oh btw, there's another sort-of related action item that I've forgot to tell you [16:48:47] esams ms-fes are loadbalanced using DNS roundrobin, which is very suboptimal [16:49:13] at the time we didn't have LVS servers that sat on the esams internal VLAN or even supported multiple VLANs [16:49:30] now we do, so this should probably be converted to use an LVS service IP, like eqiad [16:49:36] bblack can help you with that in my absence :) [16:49:36] (03PS1) 10Jgreen: modify train_spamassassin to operate as user debian-spamd [operations/puppet] - 10https://gerrit.wikimedia.org/r/143048 [16:51:39] ^ :P [16:51:40] paravoid: oh ok, didn't notice that detail, will add it to the TODO [16:52:06] (03CR) 10Jgreen: [C: 032 V: 031] modify train_spamassassin to operate as user debian-spamd [operations/puppet] - 10https://gerrit.wikimedia.org/r/143048 (owner: 10Jgreen) [16:52:10] dns roundrobin would be awesome if caches were better at it [16:52:21] nah, we don't use ms-fe esams for real traffic [16:52:36] but even for swiftrepl (a custom replicator that mark wrote), it's a problem [16:52:42] it opens multiple persistent connections [16:52:54] (I meant 
if dns caches were better at it) [16:53:04] but doesn't do separate DNS queries when it does that, iirc [16:53:06] bblack: real caches or stuff like os stub resolvers too? [16:53:34] depends. for a big client app, you probably want your own resolver anyways. [16:53:49] but mostly real caches are bad at it [16:54:01] (03PS5) 10Scottlee: Fixed spacing and lint rules for manifests files. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142479 [16:54:52] * bblack adds writing an awesome dns cache to the tail of his ever growing to-do list, in the section that will still be there long after he expires. [16:55:21] andrewbogott: could you review and ? [16:55:38] hehe I thought unbound might do bblack ? [16:55:40] bblack: is that where the geoip patch is? :D [16:55:48] ori: yes [16:55:54] heheh [16:56:19] ori: Yes, sorry -- I had already selected "+2" for that first patch on Friday when my wifi stopped working [16:56:34] bblack: amateurish C code at the front line of a top-10 website, what could go wrong?! 
[16:56:39] godog: well, anything's better than BIND [16:57:06] (PS10) Faidon Liambotis: Improve nginx TLS cipher list & session timeout [operations/puppet] - https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: JanZerebecki) [16:57:43] (CR) Andrew Bogott: [C: 2] mediawiki_singlenode: port apache::vhost to apache::site [operations/puppet] - https://gerrit.wikimedia.org/r/142206 (owner: Ori.livneh) [16:58:52] (CR) Faidon Liambotis: [C: 2] Improve nginx TLS cipher list & session timeout [operations/puppet] - https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: JanZerebecki) [16:59:14] (CR) Andrew Bogott: [C: 2] role::deployment: port apache::vhost to apache::site [operations/puppet] - https://gerrit.wikimedia.org/r/142205 (owner: Ori.livneh) [16:59:42] andrewbogott: much obliged [17:00:04] manybubbles, ^d: The time is nigh to deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140630T1700) [17:00:09] dogeydogey: I can deploy if you want [17:00:30] (CR) Manybubbles: [C: 2] Move about half of pool 4 lsearchd wikis to CirrusSearch [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142590 (owner: Chad) [17:00:50] manybubbles deploy what? [17:00:52] (Merged) jenkins-bot: Move about half of pool 4 lsearchd wikis to CirrusSearch [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/142590 (owner: Chad) [17:01:10] hey dogeydogey [17:01:18] heard you'd like to do something to help :) [17:01:24] dogeydogey: just a search config update - it's scheduled for right as the bot said [17:01:32] ori: what do you want to do with https://gerrit.wikimedia.org/r/#/c/137947/ ? [17:01:39] manybubbles: would be nice to help package https://github.com/etsy/logster [17:01:43] I don't know anything about packaging yet [17:01:51] YuviPanda I want to learn packaging [17:01:57] paravoid: i should update it.. 
i got feedback from tim [17:02:02] paravoid: give me half an hour [17:02:15] I'm probably not the right person to talk to about packaging [17:02:17] <^d> manybubbles: You doing the tin sync too? [17:02:23] manybubbles can i shadow the packaging? [17:02:23] ^d: sure [17:02:36] dogeydogey: ? I'm confused sorry [17:03:06] me too, I don't know what I have to do with the search config update [17:03:17] but YuviPanda asked you to do packaging I guess? [17:03:32] YuviPanda hi, yes, I'd love to help [17:03:55] Sorry manybubbles, I got dragged away [17:03:59] The PP thing looks good [17:04:08] marktraceur: wonderful [17:04:13] So does the config change [17:04:17] great [17:04:52] (03CR) 10Andrew Bogott: [C: 04-1] "Just a couple more whitespace niggles; looks good otherwise." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142479 (owner: 10Scottlee) [17:04:56] dogeydogey: k. I can do the deploy then talk to you about it a bit if you'd like - its pretty quick [17:05:38] !log manybubbles Synchronized wmf-config/: Enable CirrusSearch as the default search backend on 30 more wikis (duration: 00m 05s) [17:05:41] Logged the message, Master [17:07:07] !log manybubbles Synchronized wmf-config/: Enable CirrusSearch as the default search backend on 30 more wikis - for real (duration: 00m 04s) [17:07:13] Logged the message, Master [17:08:41] !log manybubbles Synchronized wmf-config/: Enable CirrusSearch as the default search backend on 30 more wikis - for real for real (duration: 00m 04s) [17:08:46] Logged the message, Master [17:08:55] !log manybubbles Synchronized wmf-config/: Enable CirrusSearch as the default search backend on 30 more wikis - take four (duration: 00m 04s) [17:08:59] Logged the message, Master [17:09:39] \o/ [17:09:44] matanya: Not working? [17:09:47] Ugh [17:09:50] manybubbles: Not working? [17:10:11] <^d> Hmm. 
[17:10:14] Reedy: synced wrong dir - forgot to touch initialize settings [17:10:16] checking again [17:10:24] <^d> I was gonna say, search looked pretty lsearchd on dawiki [17:10:44] ottomata: here now, missed your ping earlier [17:11:58] ^d: didn't sync cirrus.dblist? [17:12:00] trying again [17:12:06] <^d> cirrus.dblist isn't in wmf-config. [17:12:14] <^d> It's in the root of /a/common [17:12:27] manybubbles ping me when you're done :) [17:12:44] !log manybubbles Synchronized cirrus.dblist: Enabled CirrusSearch as the default search backend on 30 more wikis - take five (duration: 00m 04s) [17:12:49] Logged the message, Master [17:12:50] akosiaris: if you're around, i'd appreciate feedback on https://gerrit.wikimedia.org/r/#/c/142400/ [17:13:02] ^d: now down [17:13:11] ^d: does this mean i can shut down cirrus as beta feature? [17:13:21] on he.wiki? [17:13:52] matanya: yup - it's on hewiki [17:13:57] done [17:14:14] <^d> Config should disable beta feature on wikis where it's primary already? [17:14:31] maybe push a patch to remove from beta [17:14:59] it should [17:15:38] it isn't on en.wikisource.org where we haven't been a betafeature for a long time [17:15:58] maybe it takes time for betafeatures to age out - I remember there is some time-based thing in that extension [17:16:03] I think it is the counts though [17:16:08] <^d> Same with mw.org [17:16:41] ori, no worries, nuria and I found the problem [17:16:47] the eventlogging consumer wasn't running on hafnium [17:16:53] and puppet didn't bother trying to start it either [17:17:00] I have documented it on wikitech [17:17:14] we just have to do some work with EL and upstart [17:17:31] dogeydogey: I'm done with the deploy now - what is it you wanted to know and from what perspective are you coming at this? volunteer, employee, something else? 
It matters because the access is different [17:17:58] greg-g: we're done with our deploy [17:18:16] manybubbles I'm a volunteer, and I wanted to help with packaging tasks but I'm a newbie in that regard and would love to shadow you if you're doing a packaging task [17:18:18] ori, ottomata: i think we might need to take an outage to fix upstart issues (if you can suggest a setup where we can test upstart & EL that is not prod it will be great) [17:18:22] ottomata, nuria: thanks [17:18:28] <^d> manybubbles: Well that was much more boring than last week. [17:18:33] <^d> Which is good. [17:18:33] there's labs [17:18:39] ^d: I'm so happy about that [17:19:28] ori: ok, if we can deploy it to labs then we can troubleshoot there [17:22:39] YuviPanda: you asked me to package something a few minutes ago - did you mean to ask me to do it? I'm far and away not the best candidate to build debs [17:22:50] manybubbles: oh, no, I was asking dogeydogey [17:22:53] YuviPanda you around? [17:23:00] manybubbles: apologies if I pinged the wrong one [17:23:47] YuviPanda: what kind of keyboard layout do you use that you can ping me instead of dogeydogey? :) [17:24:18] manybubbles: :D I was expecting to complete to 'last person I talked to' and not 'last person who talked in the channel' [17:24:21] hey dogeydogey [17:24:49] YuviPanda: ah! just does nothing for me without some characters. for the best [17:25:11] YuviPanda what can I help with? 
andrew mentioned monitoring tasks and I would love to learn packaging [17:25:48] dogeydogey: yeah :) packaging https://github.com/etsy/logster would be nice [17:26:38] dogeydogey: note that I have no knowledge of packaging whatsoever :( [17:26:46] k [17:27:55] logster is already packaged and in our apt repository [17:28:38] http://git.wikimedia.org/summary/?r=operations/debs/logster.git [17:28:47] and we have puppet code for it too [17:28:55] sorry :) [17:31:21] we should run cutesy behind two load-balanced yakisoba frontends [17:31:36] paravoid: ow :) [17:31:39] paravoid: good to know :) [17:31:40] * ori is making up plausible-sounding etsy product names [17:31:43] paravoid: are we using it somewhere? [17:32:04] YuviPanda: ottomata is [17:32:07] ori: it's much easier for Netflix, just add "monkey" to almost any word [17:32:08] git grep logster is your friend :) [17:32:19] ori: searching for logster pollutes my search results with 'lobster' [17:33:06] paravoid: :) I was git grepping for logtail, didn't occur to me to git grep for logster itself. [17:33:07] will do [17:33:25] oh wow, we have a module! [17:33:26] \o/ [17:33:26] (CR) Alexandros Kosiaris: [C: -1] "A couple of minor stuff. Not sure yet on the conf-{available,enabled} thing < 14.04" (5 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/142400 (owner: Ori.livneh) [17:33:39] dogeydogey: looks like logster is already packaged :( [17:34:04] that's fine :) [17:34:18] ori: done. how strongly do you feel about the conf-{available,enabled} on <14.04 machines ? [17:36:18] akosiaris: i think it's appropriate to use puppet to create a uniform interface over a nonuniform reality, BUT [17:36:20] hey dogeydogey, i can help mentor a bit [17:36:24] there are a lot of things that I don't know [17:36:33] but there are plenty that I do, and can defer the things I don't to others [17:36:35] really! 
[17:36:39] akosiaris: my approach with this work is to regard folks like you as the "customer" if you will, and as the saying goes, the customer is always right [17:36:51] ottomata yeah, when can we get started? [17:36:52] akosiaris: so i'm fine with going with whatever seems cleaner / more intuitive to you [17:37:24] dogeydogey: i'd say, first step is to understand git buildpackage [17:37:43] http://honk.sigxcpu.org/projects/git-buildpackage/manual-html/gbp.html [17:37:56] k, will read this and get back to you [17:38:09] !log Cirrus reindex update! all wikipedias finished their in place reindex except ruwiki - that one is running now. all group1 wikis finished their from mediawiki reindex except commons and mgwiktionary which are running now. started from mediawiki reindex of all wikipedias except for enwiki, itwiki, and cawiki which are already long done. [17:38:15] Logged the message, Master [17:39:26] ori: cool. I think I'd like the rest of the team to weigh in on this too. It is going to affect us all anyway [17:40:34] akosiaris: once you're done with that mind reviewing my touchy lints? [17:41:50] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2943.31572852 [17:42:21] akosiaris: sounds good [17:42:47] matanya: yes, I haven't forgotten. It is just my puppet catalog differ broke with the puppet 3 migration and I want to fix it first. Add me to any lints where I am not already a reviewer [17:43:19] thanks a lot. will do [17:48:56] ottomata: hey! I was told you use logster :) are you using it for apache logs or nginx logs? [17:50:21] (PS6) Scottlee: Fixed spacing and lint rules for manifests files. 
[operations/puppet] - https://gerrit.wikimedia.org/r/142479 [17:51:25] (PS3) Ori.livneh: role::mediawiki::webserver: set maxclients dynamically [operations/puppet] - https://gerrit.wikimedia.org/r/137947 [17:51:34] YuviPanda: I had originally wanted to use it for varnishkafka stuff [17:51:54] but it was kinda complicated for that for reasons I can't remember.. i think because i needed to get metrics out more than once a minute [17:52:08] but, currently it's being used for cirrussearch slow query log [17:52:33] (CR) Ori.livneh: "ps3: rebased; took giuseppe's advice re: keeping the bits stuff outside of this patch; took tim's advice from irc: https://dpaste.de/q9aQ/" [operations/puppet] - https://gerrit.wikimedia.org/r/137947 (owner: Ori.livneh) [17:52:41] ^ paravoid, _joe_ [17:53:33] YuviPanda: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/logging.pp#L99 [17:54:23] ori: that won't work [17:54:43] you're including a class twice, once with a parameter and once without [17:54:54] ottomata: ah, right. [17:55:01] ottomata: i might use it in labs for nginx [17:55:01] that doesn't do what any sane person would guess it would do [17:57:30] YuviPanda: cool! what do you want it to do? [17:57:35] count errors or something? [17:58:06] ottomata: yeah [17:58:21] ottomata: count errors from the tools/general nginx proxy into labs graphite [17:58:33] ottomata: graphite.wmflabs.org is now fully functioning, bringing in metrics from all of labs via diamond [17:58:44] coool [17:59:21] cool [17:59:28] would this regex happen to match your lines? [17:59:29] https://github.com/wikimedia/operations-debs-logster/blob/master/logster/parsers/ErrorLogLogster.py#L28 [17:59:36] if so, this would be very easy [18:00:53] ottomata: hmm, unsure. I want to catch non-200 responses, don't actually think this does that [18:00:59] ottomata: so it'll have to tail access.log [18:01:10] ottomata: I also want to catch average response time. 
might have to write a plugin [18:02:57] ottomata: I wonder - if I write a plugin, should I submit it upstream or to our repo? [18:04:43] to ours I think, ours has diverged a bit (i've refactored some things to make it more usable via cli, etc.) [18:05:03] ottomata: ah, right. cool, does seem better. [18:05:10] ottomata: are we considering pushing our changes upstream? [18:05:24] probably should... [18:05:36] ottomata: so I suppose the way to go is 1. write code to record what I want, 2. poke someone to get it merged, 3. poke someone to update the package, 4. use it in labs [18:05:39] i didn't because when I made the changes (when i was going to use it for varnishkafka) i gave up on it for that use case [18:05:47] ottomata: right [18:05:47] so, i had done all the prep, but didn't end up using it [18:05:52] but, now we are using it so, probably should [18:05:54] (PS4) Ori.livneh: role::mediawiki::webserver: set maxclients dynamically [operations/puppet] - https://gerrit.wikimedia.org/r/137947 [18:06:01] ^ paravoid [18:06:05] yeah, YuviPanda, that'll do [18:06:09] not elegant, but will be further refactored [18:06:17] and, that will be a good thing for dogeydogey to help with too [18:06:18] ottomata: cool :) mind if I poke you for steps (2) and (3)? [18:06:21] testing your package changes, building, etc. [18:06:23] ori: (meeting) [18:06:25] yeah, poke me and dogeydogey [18:06:38] ottomata: cool, ty! [18:07:10] YuviPanda: i don't think websockets are working on novaproxy [18:07:26] cscott: oh? [18:07:54] cscott: it *should*, I remember we tested it waaay back... [18:08:13] i hacked on it for a couple hours on friday after you fell asleep ;) but i can't get a connection to the togetherjs-hub.wmflabs.org to give me anything except 502 [18:08:30] YuviPanda: we did, but things have changed. ;) [18:08:40] cscott: looks like :) I'm on the proxy box, tailing logs. [18:08:41] wait [18:09:20] cscott: wanna try now? 
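Side note on the access.log idea discussed with ottomata above (counting non-200 responses and average response time): a minimal standalone sketch in Python. It does not use logster's plugin API, and it assumes nginx's "combined" log format with `$request_time` appended as the last field, which the channel never confirmed. `LINE_RE` and `summarize` are hypothetical names made up for this illustration.

```python
import re
from collections import Counter

# ASSUMPTION: nginx "combined" format, optionally followed by $request_time.
LINE_RE = re.compile(
    r'^(?P<remote>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
    r'(?: (?P<request_time>\d+(?:\.\d+)?))?\s*$'
)


def summarize(lines):
    """Count responses per status code and average the request time.

    Lines that do not match the assumed format are skipped silently.
    """
    statuses = Counter()
    total_time = 0.0
    timed = 0
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        statuses[match.group("status")] += 1
        request_time = match.group("request_time")
        if request_time is not None:
            total_time += float(request_time)
            timed += 1
    non_200 = sum(n for status, n in statuses.items() if status != "200")
    return {
        "non_200": non_200,
        "by_status": dict(statuses),
        # None when no line carried a request time.
        "avg_request_time": total_time / timed if timed else None,
    }
```

Feeding it the lines read since the last run (the part logster's tailing would normally handle) gives the numbers one would then push to graphite.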
[18:09:42] cscott: '2014/06/30 18:09:22 [error] 22383#0: *633475 connect() failed (111: Connection refused) while connecting to upstream, client: 50.163.53.37, server: , request: "GET /hub/5IubRnPgQE HTTP/1.1", upstream: "http://10.68.17.25:8080/hub/5IubRnPgQE", host: "togetherjs-hub.wmflabs.org"' [18:09:48] cscott: gah, leaked your IP :| sorry [18:10:00] looked for it in the wrong place. [18:10:02] no worries [18:10:15] cscott: so, yeah, seems upstream isn't responding [18:10:25] http://togetherjs.wmflabs.org/wiki/Main_Page click on 'together' in the top toolbar with a js console open. [18:10:37] let me double check that it's still running on 8080 [18:11:16] cscott: nmap tells me there's something on 8000, nothing on 8080 [18:11:50] cscott@towtruck:/srv/mediawiki/extensions/TogetherJS/togetherjs/hub$ node server.js --port 8080 [18:11:50] INFO HUB Server listening on port 8080 interface: 127.0.0.1 [18:12:41] (03CR) 10Andrew Bogott: Fixed spacing and lint rules for manifests files. (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142479 (owner: 10Scottlee) [18:13:08] csteipp: nmap still doesn't show me anything [18:13:09] and https://wikitech.wikimedia.org/wiki/Special:NovaProxy says that togetherjs-hub.wmflabs.org should be connected to http://towtruck.eqiad.wmflabs:8080 [18:13:15] gah, I meant cscott [18:13:25] cscott: is your security group configured to leave 8080 open? [18:14:17] https://wikitech.wikimedia.org/wiki/Special:NovaSecurityGroup shows the 'default' group for the visual editor project to have everything open except for 5666, if i'm reading it correctly. [18:15:18] 'sudo aptitude install telnet nmap' fails because: [18:15:19] Err http://nova.clouds.archive.ubuntu.com/ubuntu/ precise/main telnet amd64 0.17-36build1 [18:15:19] Could not resolve 'brewster.wikimedia.org' [18:15:37] didn't we move packages from brewster to carbon ages ago? 
[18:16:14] i don't have either brewster or carbon in my sources.list, it seems to be some ubuntu configuration thing. [18:16:26] is http://nova.clouds.archive.ubuntu.com/ubuntu/ a redirection service? [18:16:29] maybe your base image was ancient? [18:16:44] YuviPanda: well, it was installed a few months ago. is that ancient? [18:18:39] in any case, i don't seem to have any good tools to confirm/deny that port 8080 is open on towtruck. [18:19:09] csteipp: hmm, right. [18:19:32] oh, wait, found it. [18:19:38] (i think you mean cscott, not csteipp, above) [18:19:45] yeah, stupid autocomplete [18:20:01] .describe("host", "The interface to serve on (default $HUB_SERVER_HOST, $HOST, $VCAP_APP_HOST, 127.0.0.1). Use 0.0.0.0 to make it public") [18:20:08] I just thought YuviPanda suddenly really wanted to talk to me :) [18:20:20] added --host 0.0.0.0 to the node command line and now things appear to be working correctly [18:20:30] so sorry for doubting your excellent code, YuviPanda ;) [18:20:30] (PS10) Withoutaname: Delete ve.wikimedia.org and leave redirect [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131907 (https://bugzilla.wikimedia.org/55737) [18:21:19] (PS6) Withoutaname: Initialize some settings for wikimania 2015 wiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/139279 (https://bugzilla.wikimedia.org/66370) [18:21:20] csteipp: :D sorry! [18:21:25] cscott: :D [18:30:44] (CR) 20after4: "Thanks for the detailed review Filippo," [operations/debs/php-mailparse] (review) - https://gerrit.wikimedia.org/r/142751 (owner: 20after4) [18:36:34] (PS7) Scottlee: Fixed spacing and lint rules for manifests files. [operations/puppet] - https://gerrit.wikimedia.org/r/142479 [18:37:09] jgage: the ms-be3003 / full is due to puppet constantly trying to re-mount swift disks, it shouldn't do that :( [18:39:19] Danny_B: http://www.wikimedia.cz/web/Hlavn%C3%AD_strana Y NO HTTPS? 
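Side note on the togetherjs-hub 502 resolved above: the hub was listening on 127.0.0.1, so the proxy's connection to the instance's 10.x address was refused until `--host 0.0.0.0` was added. A minimal sketch of the same behaviour, in Python rather than the hub's own Node code; `is_loopback_only` and `open_listener` are hypothetical helper names used only for illustration.

```python
import ipaddress
import socket


def is_loopback_only(bind_host):
    """Return True if a listener bound to bind_host accepts only local connections."""
    return ipaddress.ip_address(bind_host).is_loopback


def open_listener(bind_host, port=0):
    """Open a TCP listener on bind_host; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((bind_host, port))
    sock.listen(1)
    return sock


if __name__ == "__main__":
    for host in ("127.0.0.1", "0.0.0.0"):
        listener = open_listener(host)
        scope = "local only" if is_loopback_only(host) else "all interfaces"
        print(host, "->", listener.getsockname(), scope)
        listener.close()
```

A listener bound to 127.0.0.1 looks healthy from the instance itself, which is why the hub's "listening on port 8080" message did not contradict the proxy's "Connection refused".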
[18:43:55] (03PS1) 10Andrew Bogott: Intentionally break puppet compile for virt1008 [operations/puppet] - 10https://gerrit.wikimedia.org/r/143070 [18:45:19] (03CR) 10Andrew Bogott: [C: 032] Intentionally break puppet compile for virt1008 [operations/puppet] - 10https://gerrit.wikimedia.org/r/143070 (owner: 10Andrew Bogott) [18:45:43] andrewbogott: puppet agent -t on virt1008 and then /usr/lib/nagios/plugins/check_nrpe -H virt1008.eqiad.wmnet -c check_puppet_checkpuppetrun on neon [18:45:51] no need to wait 2 hours :-) [18:45:55] thanks [18:46:41] (03Abandoned) 10Awight: Ruthless kludge to provide a custom translation workflow for the Fundraising Thank-you letter [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140574 (owner: 10Awight) [18:47:39] (03CR) 10Awight: "@Manybubbles: thank you for noticing. No, the next patch turned out to not work, so I've abandoned it." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137804 (owner: 10Awight) [18:51:31] (03CR) 10Filippo Giunchedi: "no problem :)" [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 (owner: 1020after4) [18:52:26] manybubbles: i left tin db-eqiad.php dirty? sorry about that. thanks for the fix [18:52:48] springle: morning! I just hope I did the right thing - commit what was there [18:52:55] and keep deploying [18:53:36] that's fine. i meant to simply checkout the file afresh with db1061 at 400 [18:54:19] springle: cool [18:54:41] springle: do you review at 5 am ? [18:55:02] matanya: no :) i attend ops meetings at 4am [18:55:41] ok, some other time, sorry to hear you suffer on the wrong side of the globe :D [18:58:22] (03PS1) 10Ori.livneh: misc::statistics::sites::default: use apache::site, not apache::vhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/143075 [19:00:36] papaul: has cyrusone completed the enclosure buildout? [19:00:41] by putting on doors and such? 
[19:10:11] (03CR) 10Ottomata: [C: 031] misc::statistics::sites::default: use apache::site, not apache::vhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/143075 (owner: 10Ori.livneh) [19:10:36] (03CR) 10Ori.livneh: [C: 032] misc::statistics::sites::default: use apache::site, not apache::vhost [operations/puppet] - 10https://gerrit.wikimedia.org/r/143075 (owner: 10Ori.livneh) [19:14:17] RobH no [19:18:54] (03PS2) 10Ori.livneh: Apache config for Wikivoyage using mod_proxy_fcgi [operations/apache-config] - 10https://gerrit.wikimedia.org/r/142983 [19:19:04] ^ _joe_ [19:20:11] (03PS1) 10Krinkle: dynamicproxy: Let 403 and 404 responses pass through [operations/puppet] - 10https://gerrit.wikimedia.org/r/143081 [19:24:06] (03CR) 10Yuvipanda: [C: 031] dynamicproxy: Let 403 and 404 responses pass through [operations/puppet] - 10https://gerrit.wikimedia.org/r/143081 (owner: 10Krinkle) [19:25:38] <_joe_> ori: yes, I'll take a look :) [19:34:55] (03CR) 10Ottomata: [C: 04-1] "Comments inline." (0310 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/142483 (owner: 10Scottlee) [19:35:16] (03CR) 10Ottomata: "BTW, thanks!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142483 (owner: 10Scottlee) [19:38:42] (03CR) 10Faidon Liambotis: [C: 04-1] "If this is setting MaxClients, why is it being referred as a "worker limit"? This is not the same thing." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 (owner: 10Ori.livneh) [19:38:49] RobH: nothing has been done so far , nobody did show up . [19:39:45] (03CR) 10Ori.livneh: "@paravoid: http://httpd.apache.org/docs/2.4/upgrading.html "MaxClients has been renamed to MaxRequestWorkers, which describes more accurat" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 (owner: 10Ori.livneh) [19:40:58] ori: but you're using both names :) [19:41:12] setting $maxclients = $workers_limit [19:41:34] hm, fair. ok, so i'll update the var to be $max_workers = .. 
[19:41:45] but the actual config directive will still have to be MaxClients for back-compat [19:42:06] right, fair enough [19:42:15] do we set ServerLimit? [19:44:23] <_joe_> !log restarting pybal on lvs1005 [19:44:28] Logged the message, Master [19:49:32] (03PS8) 10Andrew Bogott: Fixed spacing and lint rules for manifests files. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142479 (owner: 10Scottlee) [19:52:06] (03CR) 10Andrew Bogott: [C: 032] Fixed spacing and lint rules for manifests files. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142479 (owner: 10Scottlee) [19:52:49] (03PS1) 10Reedy: Remove cz and cs cnames [operations/dns] - 10https://gerrit.wikimedia.org/r/143086 [19:53:13] (03CR) 10Reedy: [C: 04-1] Remove cz and cs cnames [operations/dns] - 10https://gerrit.wikimedia.org/r/143086 (owner: 10Reedy) [19:55:00] (03CR) 10Andrew Bogott: [C: 032] Tools: Fix pastebinit configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/135500 (owner: 10Tim Landscheidt) [19:55:31] (03CR) 10Reedy: "Redirects are going to be moved to the apache-config repo, so hence these CNAMEs aren't needed any more." [operations/dns] - 10https://gerrit.wikimedia.org/r/143086 (owner: 10Reedy) [19:55:42] (03PS1) 10Giuseppe Lavagetto: rcs: do not declare upstream twice [operations/puppet] - 10https://gerrit.wikimedia.org/r/143087 [19:56:10] (03PS1) 10Ori.livneh: add apache::param resource [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 [19:56:20] (03PS2) 10BryanDavis: labs: role::deployment - port apache::vhost to apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/142407 [19:56:47] (03PS2) 10Giuseppe Lavagetto: rcs: do not declare upstream twice [operations/puppet] - 10https://gerrit.wikimedia.org/r/143087 [19:57:13] andrewbogott: could you do bryan's patch as well, ? [19:57:25] bd808: what's an example of a VM that has the /bin/bash vs. /bin/false problem? 
[19:57:49] andrewbogott: deployment-apache02 was where I was checking [19:57:51] (03PS1) 10Ottomata: Now using CDH5 in labs for new Hadoop clusters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143091 [19:57:55] ok [19:58:17] (03CR) 10BryanDavis: [C: 04-1] "Reworking" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142407 (owner: 10BryanDavis) [19:58:47] (03PS3) 10Giuseppe Lavagetto: rcs: do not declare upstream twice [operations/puppet] - 10https://gerrit.wikimedia.org/r/143087 [19:59:11] ori: Looks like it's still in progress [19:59:32] (03PS4) 10Ori.livneh: rcs: do not declare upstream twice [operations/puppet] - 10https://gerrit.wikimedia.org/r/143087 (owner: 10Giuseppe Lavagetto) [19:59:55] _joe_: woops, didn't mean to clobber your change [20:00:01] had some formatting fixes and didn't want to bother you with those [20:00:04] gwicke, subbu: The time is nigh to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140630T2000) [20:00:09] _joe_: let me update again [20:00:14] <_joe_> ok :) [20:00:25] (03PS1) 10John F. Lewis: Remove two rights from editors on ruwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/143094 (https://bugzilla.wikimedia.org/67304) [20:00:34] <_joe_> ori: I thought I got mad :P [20:01:13] _joe_: what's your preference in general? 
i sometimes do that when i feel it would be petty to -1 the change for formatting [20:01:47] <_joe_> ori: I do get lint errors in my editor, I must've overlooked them [20:02:07] <_joe_> ori: please do that freely [20:02:12] (03PS5) 10Ori.livneh: rcs: do not declare upstream twice [operations/puppet] - 10https://gerrit.wikimedia.org/r/143087 (owner: 10Giuseppe Lavagetto) [20:02:17] <_joe_> it was just bad timing :) [20:02:19] (03PS3) 10BryanDavis: labs: role::deployment - port apache::vhost to apache::site [operations/puppet] - 10https://gerrit.wikimedia.org/r/142407 [20:02:28] ori: do you mind if I removed the nginx stuff from the grafana role when moving it to apache? [20:02:42] YuviPanda: wasn't it on apache originally? [20:02:45] bd808: where's does betalabs's puppetmaster live? [20:02:59] ori: yeah, but it had nginx role and config (unused) [20:02:59] hm, actually, nm, I'll do this on a test box [20:03:03] andrewbogott: deployment-salt [20:03:27] andrewbogott: And the puppet checkout there is a little "interesting" at the moment [20:03:35] YuviPanda: well, we've identified that we want to use nginx generally (or at least that is a sentiment that has been expressed repeatedly), and we've also agreed that we'd like to be able to open up access to these resources (though we're not sure we can) [20:03:43] YuviPanda: so i'd like to keep the nginx config, even if unused [20:03:55] YuviPanda: i did the same w/graphite [20:04:01] ori: alright! I shall do that. [20:04:13] (03CR) 10Giuseppe Lavagetto: [C: 032] rcs: do not declare upstream twice [operations/puppet] - 10https://gerrit.wikimedia.org/r/143087 (owner: 10Giuseppe Lavagetto) [20:04:15] ori: oooh, cool. maybe I can just use nginx for both graphite and grafana on labs, since that is open. 
[20:05:40] (03CR) 10Ori.livneh: "tested" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143090 (owner: 10Ori.livneh) [20:06:54] (03CR) 10Faidon Liambotis: [C: 04-1] "They shouldn't be removed, they should be replaced with a CNAME to wikimedia-lb, so that they can actually be handled by our Apaches :) Re" [operations/dns] - 10https://gerrit.wikimedia.org/r/143086 (owner: 10Reedy) [20:08:10] <_joe_> this is what I get for not re-testing in labs [20:08:11] (03Abandoned) 10Dzahn: move Cyprus from Asia to Europe [operations/dns] - 10https://gerrit.wikimedia.org/r/142622 (owner: 10Dzahn) [20:08:42] (03PS5) 10Ori.livneh: role::mediawiki::webserver: set maxclients dynamically [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 [20:09:25] (03PS1) 10Reedy: Redirect c[sz].wikimedia.org to http://www.wikimedia.cz [operations/apache-config] - 10https://gerrit.wikimedia.org/r/143095 [20:09:29] (03CR) 10jenkins-bot: [V: 04-1] Redirect c[sz].wikimedia.org to http://www.wikimedia.cz [operations/apache-config] - 10https://gerrit.wikimedia.org/r/143095 (owner: 10Reedy) [20:09:43] (03PS1) 10Giuseppe Lavagetto: rcstream: fix ssl config syntax [operations/puppet] - 10https://gerrit.wikimedia.org/r/143096 [20:11:02] bd808: Is mwdeploy a member of deployment-prep? And if so, do you know how? Did I add it directly in ldap? [20:11:08] (03CR) 10Dzahn: [C: 031] "the new "puppet run" check does really replace the "puppet disabled" check, right?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142560 (owner: 10Dzahn) [20:11:26] (03CR) 10Giuseppe Lavagetto: [C: 032] rcstream: fix ssl config syntax [operations/puppet] - 10https://gerrit.wikimedia.org/r/143096 (owner: 10Giuseppe Lavagetto) [20:11:37] i need some ops help with what looks like maybe git deploy breakage? 
[20:12:23] (03PS2) 10Reedy: Point c[sz].wikimedia.org at wikimedia-lb rather than external site [operations/dns] - 10https://gerrit.wikimedia.org/r/143086 [20:12:47] andrewbogott: I don't think mwdeploy is a project member for deployment-prep. The user doesn't show on https://wikitech.wikimedia.org/wiki/Special:NovaProject [20:13:19] (03PS2) 10Reedy: Redirect c[sz].wikimedia.org to http://www.wikimedia.cz [operations/apache-config] - 10https://gerrit.wikimedia.org/r/143095 [20:13:28] andrewbogott, can i bug you? :) [20:13:40] https://gist.github.com/subbuss/9b6efb874baeabd8e245 ... [20:13:41] (03CR) 10Reedy: "https://gerrit.wikimedia.org/r/143095 is the apache change" [operations/dns] - 10https://gerrit.wikimedia.org/r/143086 (owner: 10Reedy) [20:13:56] normally i never have to retry even once. [20:13:58] (03CR) 10Reedy: "https://gerrit.wikimedia.org/r/#/c/143086/ is the dns change" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/143095 (owner: 10Reedy) [20:14:28] subbu, I've never touched git-deploy so I may not be the right one to ask. Unless you know more specifically what's failing... [20:14:30] (03CR) 10Ori.livneh: "@paravoid: on the 160 app servers with 12gb ram, MaxClients will be set to 98; on the 60 with 64gb it'll be set to 539. Both quite sane." [operations/puppet] - 10https://gerrit.wikimedia.org/r/137947 (owner: 10Ori.livneh) [20:15:11] who might be able to help me with this? [20:15:24] greg-g, ^? [20:15:37] bd808: ok, but when you run puppet on deployment-prep it doesn't say 'user 'mwdeploy' does not exist in /etc/passwd' does it? So there must be something different about that user in that project... 
[20:15:47] hm, maybe I'm just screwing up my test [20:15:57] subbu: i usually just hit 'y' rather than 'r' [20:16:07] subbu: you may need to 'abort' now [20:16:11] andrewbogott: The nsswitch.conf should make it lookup from ldap I think [20:16:25] ori, yes, https://gist.github.com/subbuss/0dc479e60807f8d11cc1 that was the first deploy attempt. [20:16:40] and when i checked on wtp1020, it is still old code. [20:16:42] bd808: um.. I don't know what that is. It's not enough to just define the user in puppet, take it? [20:16:45] subbu: good grief, dunno then [20:17:16] subbu: when you ask for a concise report does it give you one? [20:17:20] subbu: It may help to debug if you can find a root to run `salt-call deploy.fetch 'parsoid/deploy'` on one of the parsoid minions directly. [20:17:54] See also -- https://wikitech.wikimedia.org/wiki/Trebuchet#Troubleshooting_the_deployment_from_multiple_locations [20:18:42] I've had the best luck in getting detailed error output in beta with the "direct commands on the minions" instructions [20:18:56] (03CR) 10Gage: "one issue, comment inline" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143091 (owner: 10Ottomata) [20:19:03] bd808: all of that seems to require root [20:19:06] i see ... i dont think gwicke has root .. [20:19:44] gwicke, subbu: Yes, you need a root. Looks like robh is on rt duty [20:22:00] hey [20:22:22] so run salt-call deploy.fetch 'parsoid/deploy' on which minion? any mw? [20:22:41] oh, sorry, parsoid minion [20:22:52] yes, parsoid minion. [20:23:19] which are? (im looking but feel free to just give me answers ;) [20:23:21] try wtp1020 which is what i checked for fresh code and which has code from last wed's deploy [20:23:24] k [20:23:38] RobH: -g parsoid [20:23:47] (03CR) 10Rush: [C: 032 V: 032] "daniel merging this as it is +'d all good already and I know we want it done." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/142601 (owner: 10Dzahn) [20:23:47] sorry for slow uptake, was on another gear entirely, heh [20:24:19] robh@wtp1020:~$ sudo salt-call deploy.fetch 'parsoid/deploy' [20:24:19] local: [20:24:19] ---------- [20:24:21] dependencies: [20:24:23] repo: [20:24:25] parsoid/deploy [20:24:27] status: [20:24:29] 10 [20:24:31] bd808: ^ [20:24:38] andrewbogott: As far as I understand the puppet user + ldap problem, if the unless checks for the puppet resource pass based on the info that pam can pull in from ldap things are fine, but if there is a disagreement then puppet will puke when it tires to modify a non-existent local user. [20:24:39] (03CR) 10Gage: [C: 031] "LGTM but not +2 yet while Otto is testing" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143091 (owner: 10Ottomata) [20:25:04] RobH: status 10. I'll look that up in the salt module... hang on [20:25:10] bd808: andrewbogott: based on what we saw on labstore it seems the puppet logic and ldap behavior [20:25:14] changed with Puppet3 [20:25:21] * RobH is clueless and just acting as a root puppet ;] [20:25:35] :) [20:25:39] root marionette even, less confusing with us using puppet [20:26:59] (03PS2) 10Rush: switch legalpad.wm over to misc varnish [operations/dns] - 10https://gerrit.wikimedia.org/r/142597 (owner: 10Dzahn) [20:27:12] subbu: The exit code of 10 supposedly means that the tag being fetched isn't present in the repo on tin -- https://github.com/wikimedia/operations-puppet/blob/production/modules/deployment/files/modules/deploy.py#L470 [20:27:20] (03CR) 10Rush: [C: 032 V: 032] "makes sense should mimic logstash, merging to get this propagating" [operations/dns] - 10https://gerrit.wikimedia.org/r/142597 (owner: 10Dzahn) [20:27:32] bd808: Ah, so I see. Hm. [20:28:10] bd808: you're some kinda salt wizard arent you? 
[20:28:13] ;] [20:28:18] saltmaster [20:28:30] (03CR) 10Dzahn: "all fine with me, thank you Rush" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142601 (owner: 10Dzahn) [20:28:32] (03PS1) 10Yuvipanda: toollabs: Increase space on /var for all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/143098 [20:28:35] Coren: ^ [20:28:46] Coren: not sure if this is the best way to do it, since it means older logs will be hidden foreve. [20:28:50] 4eva [20:28:53] RobH: :) I just read things on wikitech over and over [20:29:10] Coren: alternative is to do it in wikitech one by one while stopping appropriate services and copy old logs over [20:29:20] so if the repo isnt on tin [20:29:24] i guess i have to do the salt-call deploy.deployment_server_init as root on tin? [20:29:27] bd808, i don't think anything changed on our end .. so, not sure what that means. gwicke? [20:29:29] according to wikitech [20:29:29] YuviPanda: No, it's a really bad way to do it; at least not without having done the switch by hand on extant instances first. [20:29:46] Coren: right. I'll -1 until we can do that by hand. [20:29:55] Coren: but after that has been done this seems ok [20:29:55] YuviPanda: What we'll end up with if we do that is logs that fill "behind" the mount point and can't be truncated. :-) [20:29:57] RobH, subbu: Let me login to tin and see if I can see anything obviously wrong [20:30:05] Coren: aaah, hadn't thought of that, of course [20:30:06] ok, i wont run anything [20:30:06] k. [20:30:16] * RobH stands by to stand by [20:30:16] subbu: no idea either -- just double-checked the checkout on tin, but that all looks fine [20:30:39] https://wikitech.wikimedia.org/wiki/Trebuchet#Repo_doesn.27t_exist_on_tin (is my reference for previously mentioned command) [20:31:20] YuviPanda: But yeah, that's the correct longer-term solution. We'll just have to play with bind mounts and log rotation to do it by hand on the live boxen. [20:31:28] Coren: right. 
[20:31:46] Coren: uh, is there a way to do this without actually stopping the services on the box that are loging? [20:32:18] akosiaris: https://gerrit.wikimedia.org/r/#/c/143090/ for your consideration too :) [20:32:46] (03CR) 10Ori.livneh: "I plan to use https://gerrit.wikimedia.org/r/#/c/143090/ for the define" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/142983 (owner: 10Ori.livneh) [20:32:56] subbu, RobH: I can see the tag in the repo /srv/deployment/parsoid/deploy -- `git tag | grep parsoid/deploy-sync-20140630-201035` [20:33:09] * bd808 checks to see if salt can sync the test repo [20:33:39] YuviPanda: The webservices log to /data/project, not in /var/log. pretty much everything else can be made to rotate logs. [20:33:49] Coren: ah, right. [20:34:37] subbu, RobH: `git deplloy sync` fails for test/testrepo too. Looking at apache vhost next... [20:35:03] subbu: you all good? [20:35:10] subbu, RobH: apache vhost is the problem. [20:35:11] oh, looks like in progress. [20:35:13] greg-g, yes .. bd808 is investigating. [20:35:20] (03CR) 10Giuseppe Lavagetto: "@Krinkle: I discussed at length this - managing the localhost entry via puppet is quite a nightmare and we wanted to avoid that. We can't " [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 (owner: 10Giuseppe Lavagetto) [20:35:37] ori: The puppet change didn't quite work -- allow from 208.80.152.0/2210.0.0.0/1610.64.0.0/2210.64.16.0/2210.64.32.0/2210.64.48.0/2210.64.5.0/2410.64.21.0/2410.64.36.0/2410.64.53.0/24 [20:35:54] (03PS2) 10Dzahn: lint backups.pp - part 1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140942 [20:35:58] ah! [20:35:59] fixing [20:36:31] (03PS1) 10Yuvipanda: toollabs: Send exim queue length to graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/143111 [20:36:33] keep making this rt triage week easy on me by finding your own solutions ^_^ [20:36:33] Coren: ^ exim queue length monitoring [20:36:36] heh [20:36:44] bd808: thx! 
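The run-together `allow from` value pasted above is what you get when a list of networks is joined with no separator. A minimal sketch of that failure mode and the fix follows, assuming the template builds the directive from a list. The actual template is not shown in this log, so `networks`, `broken`, `fixed`, and `parse_allow` are hypothetical names, and the list is only a subset of the ranges visible in the pasted line.

```python
import ipaddress

# Illustrative subset of the networks visible in the pasted directive.
networks = ["208.80.152.0/22", "10.0.0.0/16", "10.64.0.0/22"]

# Joining with an empty separator reproduces the run-together value.
broken = "allow from " + "".join(networks)

# Joining with a single space yields one token per network.
fixed = "allow from " + " ".join(networks)


def parse_allow(directive):
    """Return the networks named in an 'allow from' line.

    Raises ValueError if any token is not a valid network.
    """
    tokens = directive.split()[2:]  # drop the words "allow" and "from"
    return [ipaddress.ip_network(tok) for tok in tokens]


print(broken)
print(fixed)
print(parse_allow(fixed))
```

Calling `parse_allow(broken)` raises `ValueError`, which is the kind of check that could catch a malformed directive before it reaches the vhost.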
[20:37:00] RobH: We just need your magic root fingers to type things :) [20:37:04] (03Abandoned) 10Dzahn: lint backups.pp - part 1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140942 (owner: 10Dzahn) [20:37:39] (03PS1) 10Ori.livneh: Fix-up for I12c6a042d [operations/puppet] - 10https://gerrit.wikimedia.org/r/143123 [20:38:04] (03CR) 10BryanDavis: [C: 031] Fix-up for I12c6a042d [operations/puppet] - 10https://gerrit.wikimedia.org/r/143123 (owner: 10Ori.livneh) [20:38:34] I'm looking at them, they look pretty magical. [20:39:09] jouncebot: next [20:39:09] In 2 hour(s) and 20 minute(s): SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140630T2300) [20:39:34] ottomata in regard to this, https://gerrit.wikimedia.org/r/#/c/142483/3/manifests/misc/deployment.pp, so just add the quotes back in? [20:39:48] bd808: is it a bad time to delete sync-common-file? [20:39:59] (03PS2) 10Yuvipanda: toollabs: Send exim queue length to graphite [operations/puppet] - 10https://gerrit.wikimedia.org/r/143111 (https://bugzilla.wikimedia.org/58871) [20:40:27] (03CR) 10Dzahn: [C: 031] scap: /usr/local/bin/sync-common-file is unused [operations/puppet] - 10https://gerrit.wikimedia.org/r/135925 (owner: 10BryanDavis) [20:40:28] yeah, dogeydogey, i'd leave that as is, just to be safe [20:40:33] we can do that as a separate commit [20:40:41] as this one is supposed to be a no-op for sure, and that isn't guarunteed to be [20:40:50] mutante: SHould be safe anytime. That change is a no-op esentially [20:40:51] so we can try it and see if it actually changes the a template in a commit on its own [20:41:01] (03CR) 10Dzahn: [C: 032] scap: /usr/local/bin/sync-common-file is unused [operations/puppet] - 10https://gerrit.wikimedia.org/r/135925 (owner: 10BryanDavis) [20:41:20] ottomata k [20:42:03] ori: Do we need RobH to merge and apply https://gerrit.wikimedia.org/r/#/c/143123/ on tin or can you do that? 
[20:42:37] well, if its his patchset [20:42:46] i should +1 it for him so he doesnt get shit for it =] [20:43:07] (03CR) 10RobH: [C: 031] "fixing the issue so deployments continue, woooooo" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143123 (owner: 10Ori.livneh) [20:43:22] ori: now if you wanna merge no one can say it wasn't reviewed [20:43:25] =] [20:44:08] now if integration would test it we'd be awesome [20:44:49] mutante: sorry for duplication of backup.pp [20:45:21] matanya: no need to be sorry, i was happy to get it out of the queue actually [20:45:25] (03PS1) 10Ottomata: Remove $ha_enabled parameter, infer this from $journalnode_hosts [operations/puppet/cdh] - 10https://gerrit.wikimedia.org/r/143147 [20:45:41] matanya: it just looked surprising after rebase :p [20:45:48] :) [20:46:07] I noticed your work only after mine was merged [20:46:24] same here other way around [20:47:03] (03PS2) 10Ottomata: Remove $ha_enabled parameter, infer this from $journalnode_hosts [operations/puppet/cdh] - 10https://gerrit.wikimedia.org/r/143147 [20:47:26] ori: did you want to merge that or should I? [20:47:39] (deployment.erb change) [20:48:28] (03CR) 10Gage: [C: 032] "Discussed on IRC" [operations/puppet/cdh] - 10https://gerrit.wikimedia.org/r/143147 (owner: 10Ottomata) [20:50:07] (03PS2) 10Ottomata: Now using CDH5 in labs for new Hadoop clusters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143091 [20:52:28] bd808: i dunno what happened [20:52:35] i guess i can merge ori's change? [20:52:44] (this is blocking you guys finishing parsoid deployement right?) [20:52:55] RobH: Yeah, let's try it. 
It's blocking subbu [20:53:13] (03CR) 10RobH: [C: 032] Fix-up for I12c6a042d [operations/puppet] - 10https://gerrit.wikimedia.org/r/143123 (owner: 10Ori.livneh) [20:54:07] (03PS1) 10Ottomata: Remove $ha_enabled parameter [operations/puppet/cdh] - 10https://gerrit.wikimedia.org/r/143151 [20:54:09] merged and running puppet agent test on tin [20:54:19] (03CR) 10Ottomata: [C: 032 V: 032] Remove $ha_enabled parameter [operations/puppet/cdh] - 10https://gerrit.wikimedia.org/r/143151 (owner: 10Ottomata) [20:54:45] akosiaris: I merged my 'break puppet intentionally' patch > 2 hours ago, and icinga is still showing that box as green [20:54:45] (03PS3) 10Ottomata: Now using CDH5 in labs for new Hadoop clusters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143091 [20:54:58] and says that puppet ran successfully 17 minutes ago. Seems wrong [20:55:06] (03PS4) 10Scottlee: Fixed spacing and lint rules for manifests/misc files. [operations/puppet] - 10https://gerrit.wikimedia.org/r/142483 [20:55:18] andrewbogott: or puppet is now self aware and self healing... which means we're proper fucked [20:55:43] I for one welcome our new robot overlords. [20:55:55] the best kind of new overlords [20:56:03] RobH: We'd live for kings for a few weeks, after everything works but before the machines have armed themselves... [20:56:12] * bd808 joins the resistance [20:56:30] * YuviPanda is inducted into the capacitance [20:56:39] subbu: So tin has run its puppet update. Perhaps try your stuff again? [20:56:47] lets see if the new allow range works [20:56:50] ok, will do. thanks. [20:56:53] * RobH sees it go live on tin [20:57:13] bd808: I have not learned much, except that ~ every user is set to 'sillyshell' in ldap, yet actually gets bash. So clearly there's something else intervening. [20:57:30] RobH, subbu: The vhost allow looks better now [20:57:43] andrewbogott: Ah. Well that's something. 
[20:57:47] 24/24 minions completed fetch [20:57:49] indeed, hopefully this fixes the block [20:57:51] yay [20:57:53] \o/ [20:57:55] \o/ [20:58:05] 24/24 minions completed checkout [20:58:12] so, looks good on that front. [20:58:15] bd808 / ori thx for help and fix patchset [20:58:27] RobH: yw [20:58:39] And next time you'll know another thing to check :) [20:58:42] Hm, now it says " Puppet is currently enabled, last run 86 seconds ago with 99 failures " but still shows as green. [20:58:49] So maybe that's somehow on purpose? [20:58:49] indeed [20:59:02] ok, im going to relocate from office to home to wait for UPS delivery. i'll be offline for the next 30 minutes or so then back [20:59:14] so if there are deployment train issues... uhhh... it'll suck [21:00:03] well, mutante is about and will handle deployment pings for the next 20 or so till im back [21:00:15] * RobH didnt wanna abandon without notice ;] back shortly [21:00:45] !log deployed parsoid 0b365d516 [21:00:49] Logged the message, Master [21:01:13] robh_away, bd808 thanks for the help [21:02:53] hm, ok, '99' is the number of errors that means 'total fail' [21:03:03] (03CR) 10BryanDavis: "Cherry-picked to deployment-salt for testing, but can't verify because puppet runs on deployment-bastion are failing at the moment for unr" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142407 (owner: 10BryanDavis) [21:03:41] chasemp: phab module question: in init.pp file you have set $timezone = 'America/Los_Angeles', why ? won't UTC be a better choice ? [21:04:29] that particular line is whatever php.ini will accept, I'm not opposed to changing it, but in this case the default for our install is that timezone so it's sort of an argument of what default should be otherwise :) [21:05:11] i.e. 
what's the point of a default that everyone has to set, so in this case the default matches our default install [21:05:52] they changed their timezone stuff now so I think that will change anyway [21:06:00] I see [21:06:28] thanks for clarifying this. mind if i lint the file chasemp? [21:06:57] please do, but I'm about to add a bunch of stuff for trusty happiness so may be worth waiting a day or two? [21:07:24] your call. any packaged stuff or more git-foo ? [21:07:37] (03PS4) 10Ottomata: Now using CDH5 in labs for new Hadoop clusters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143091 [21:07:42] (03CR) 10Ottomata: [C: 032 V: 032] Now using CDH5 in labs for new Hadoop clusters. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143091 (owner: 10Ottomata) [21:08:01] matanya: mostly apache version change consequences, and a few lessons learned [21:08:11] no git::install coming tho that I know of [21:08:20] changes I mean for trusty [21:08:53] btw, any reason you use git::install and not packaged version, unless something obvious like "no packaged version" ? [21:09:19] (03CR) 10Dzahn: [C: 032] torrus: lint [operations/puppet] - 10https://gerrit.wikimedia.org/r/139786 (owner: 10Matanya) [21:09:24] there is no packaged version, so we could create one but in this case seems overblown [21:09:42] there isn't even a 'stable release' it's all rolling [21:09:52] ottomata: merge conflict? [21:09:57] so we are inferring stable checkpoints with tags that we'll have to vett [21:09:59] but it's not so bad [21:10:36] oh? [21:10:43] oh i didn't puppet-merge [21:10:43] sorry [21:10:46] go ahead [21:10:56] until we add custom stuff, then it will hurt [21:11:08] mutante: ^^ [21:11:09] ottomata: ok, yea, just the warning about multiple committers.. merging the CDH5 stuff [21:11:15] yeah, its fine [21:11:27] done [21:11:34] matanya: nah won't be so bad I think, but time will tell. 
number one thought is, we gotta start with something and see how we like it [21:11:50] No packages found matching hadoop. [21:11:50] No packages found matching hadoop. [21:11:50] No packages found matching hive. [21:11:52] ottomata: ? [21:11:52] Is http://releases.wikimedia.org/ known to be down? [21:12:26] chasemp: i really hope so :) ask andre__ about his journey with bugzilla customization... [21:12:34] yeah, i dunno, I saw that on labs instances too...but i don't see how that commit could make happen [21:12:40] mutante: where are you running puppet? [21:12:48] ottomata: No packages found matching hadoop. [21:12:48] No packages found matching oozie. [21:13:06] ottomata: on netmon1001, so i did not expect that there [21:13:11] i wanted to confirm the other change [21:13:17] yeah its strange, i don't see how this commit could make that happen [21:13:22] and i'm not sure where that error message is coming from [21:13:32] as it doesn't actually cause any problems in my labs tests [21:13:40] yea, it shows up in the puppet run somehow, but differently [21:13:41] it sounds like an apt error message [21:14:19] dunno though [21:14:23] hmm, not sure if it's APT.. [21:14:37] dont see it in /var/log/apt/history.log [21:15:17] it's after "Info: Loading facts in /var/lib/puppet/lib/facter/versions.rb" [21:15:31] yeha [21:15:45] i'm not sure, its really weird [21:15:58] hm, i mean, i included the cdh submodule for the first time in that change [21:16:03] (different than cdh4, cdh4 is being deprecated) [21:16:12] but, nothing should be installed unless you actually include classes from that node [21:16:17] so i'm not ure why it would even try [21:17:44] James_F: i dont think so.. ugh? [21:18:55] (03PS1) 10Ottomata: Fix undefined variable reference in hadoop.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/143157 [21:19:03] James_F: eh.. 
something (puppet) killed the Apache config [21:19:16] (03CR) 10Ottomata: [C: 032 V: 032] Fix undefined variable reference in hadoop.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/143157 (owner: 10Ottomata) [21:19:17] mutante: Fun. [21:19:19] ori: eh.. [21:19:28] something happened here, look at this [21:19:33] # This file is managed by Puppet! [21:19:37] .. [21:19:41] # vim: filetype=apache [21:19:43] but it' [21:19:50] it's empty besides that [21:21:34] mutante: aside from those error messages, did anything weird related to hadoop stuff happen? [21:22:32] ottomata: not that i know of, but i just ran puppet on netmon1001 [21:22:51] ottomata: this issue on caesium though.. apache config wiped... [21:22:52] yeay [21:24:17] mutante: i doubt that is related [21:24:18] James_F: reload.. manual quick fix [21:24:33] dpkg-query: no packages found matching hadoop [21:24:33] ottomata: no, it's not, just saying i was on the other issue already [21:24:39] mutante: I see the same stuff during puppet runs ^ [21:24:46] unrelated host [21:25:25] !log fixing releases.wikimedia.org Apache site, delete sites-enabled file broken by puppet, add symlink, graceful [21:25:30] Logged the message, Master [21:25:33] chasemp: yes, unrelated host, confirmed [21:25:58] puppet put an actual file into sites-enabled instead of a symlink [21:26:16] and the contents of that file consisted of only the "managed by puppet" line [21:26:35] ok cool [21:26:35] chasemp: the oozie etc. package stuff/ [21:26:35] ? [21:26:45] (03CR) 10coren: [C: 032] "Moar data!!1!one!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143111 (https://bugzilla.wikimedia.org/58871) (owner: 10Yuvipanda) [21:26:48] mutante: Yeah, works for me now. 
[21:27:16] remembers a patch that wanted to directly put files into sites-available and go around the usual mechanism [21:27:17] ottomata: just chiming in that I was seeing teh same notices on an unrelated host 'dpkg-query: no packages found matching oozie' [21:27:21] blames that patch [21:27:42] 4 or 5 times for hadoop, 2 x hive, 2x oozie [21:27:56] James_F: lemme see if it stays that way after puppet runs again [21:28:02] mutante: :-D [21:30:06] (03PS1) 10Andrew Bogott: Check for puppet failures as well as for puppet staleness. [operations/puppet] - 10https://gerrit.wikimedia.org/r/143161 [21:30:40] James_F: no :/ ehm.. [21:32:29] so weird, mutante! those error messages don't get logged to syslog [21:32:41] its like puppet is running dpkg or apt commands and that stuff is getting spit out on stdout [21:33:00] so, no way to know if my change caused this problem...without reverting [21:33:17] hmm, i see it in labs, i could just run puppet on a non updated host there [21:33:19] testing... [21:35:05] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Mon 30 Jun 2014 19:34:17 UTC [21:35:05] PROBLEM - Puppet freshness on fluorine is CRITICAL: Last successful Puppet run was Mon 30 Jun 2014 19:34:07 UTC [21:35:55] !log running mwscript updateSpecialPages.php --wiki=enwiki --only=Mostlinkedtemplates --override on terbium [21:35:59] Logged the message, Master [21:38:50] mutante: ah, sorry, thanks for the fix [21:39:05] PROBLEM - Puppet freshness on erbium is CRITICAL: Last successful Puppet run was Mon 30 Jun 2014 19:38:01 UTC [21:39:14] ori: well, i dont know the puppet fix yet... [21:39:21] ori: i was just staring at things.. [21:39:31] puppet will re-kill the config currently [21:39:50] hmph, just checked on a labs instance with an older puppet commit checked out, i did not see those errors [21:39:52] growl. [21:39:56] how'd that happen? 
[21:39:56] hmmm [21:40:40] mutante: pretty sure there error message is coming from dpkg-query [21:40:44] $ dpkg-query --show oozie [21:40:44] No packages found matching oozie. [21:40:46] but whyyyyy [21:41:40] why did this start happening now..? that is the question. [21:41:41] gr [21:42:54] (03PS1) 10Yuvipanda: toollabs: Allow diamond to sudo to reach out to exim [operations/puppet] - 10https://gerrit.wikimedia.org/r/143165 [21:43:15] i don't know how it would be related to submodules.. i'm trying to understand the other issue [21:43:34] !log disabling puppet on caesium [21:43:39] Logged the message, Master [21:44:13] haha, ok [21:44:34] sorry, but that is actually the mw download page being down [21:44:38] i just know if I leave this for the day someone is going to be all like "WTF IS THIS ERROR MESSAGE ROOOAR" [21:44:42] yeah np [21:44:46] (03CR) 10coren: toollabs: Allow diamond to sudo to reach out to exim (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143165 (owner: 10Yuvipanda) [21:44:56] lemme know if I can help mutante [21:45:40] (03PS1) 10Yuvipanda: toollabs: Remove broken ganglia monitoring of exim messages [operations/puppet] - 10https://gerrit.wikimedia.org/r/143167 [21:46:16] (03PS2) 1020after4: Packaging of php-mailparse from the pecl [operations/debs/php-mailparse] (review) - 10https://gerrit.wikimedia.org/r/142751 [21:48:18] PROBLEM - Puppet freshness on gadolinium is CRITICAL: Last successful Puppet run was Mon 30 Jun 2014 19:48:07 UTC [21:48:18] PROBLEM - Puppet freshness on silver is CRITICAL: Last successful Puppet run was Mon 30 Jun 2014 19:47:10 UTC [21:49:18] PROBLEM - Puppet freshness on oxygen is CRITICAL: Last successful Puppet run was Mon 30 Jun 2014 19:48:48 UTC [21:49:40] what the .. [21:49:46] Could not find dependency Augeas [21:49:48] what's going on [21:51:56] mutante: what node was that on? 
[21:52:24] ottomata: silver, but see icinga-wm above [21:52:51] weiiird, i see that too [21:52:51] its on oxygen [21:53:13] huh why is apt.wikimedia.org lighttpd? [21:53:21] just because. [21:53:22] haha [21:53:24] i think is the answer [21:53:26] no good reason [21:53:27] afaik [21:53:28] historical? [21:53:34] ottomata: https://gerrit.wikimedia.org/r/#/c/142479/ maybe [21:53:36] ok. just wondering because i haven't seen lighty elsewhere [21:53:53] jgage: https://rt.wikimedia.org/Ticket/Display.html?id=7743 [21:54:18] (03CR) 10Dzahn: "Error: Failed to apply catalog: Could not find dependency Augeas[$hostname iptables tables] for Augeas[silver iptables filter chains] at /" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142479 (owner: 10Scottlee) [21:54:22] cool [21:54:24] mutante: that was a while ago though, no? [21:55:14] mutante: where are you seeing that? [21:55:23] AH [21:55:26] yes, i see that one mutante [21:55:32] bad single quote replacement [21:55:32] that is wrong quoting [21:55:44] you fixing or shall I? [21:55:59] can do [21:56:02] ok, I guess you're on top of it [21:56:02] ok [21:56:11] ok I think the same thing guys just to confirm more eyes :) [21:56:14] andrewbogott: on anything that uses iptables.pp.. f.e. silver [21:56:18] yea, wait [21:56:27] wow, thanks for the puppet checks andrewbogott [21:56:28] needs curly brackets [21:56:40] mutante: if possible please link to the fix so that Scott can see what was wrong [21:56:43] i think that this has been bad since that commit went in last week [21:56:54] so anything that uses these older iptables classes [21:57:00] probably havn't had puppet running [21:57:06] ottomata: so now we still want to know why puppet actively deletes the contents of the apache config in role "releases" [21:57:09] brb [21:57:12] Wait, are we talking about the same thing? 
[21:57:19] we have several issues [21:57:19] :( [21:57:24] ottomata: was merged a few hours ago [21:57:24] andrewbogott: there are 3 weird things happening right now [21:57:28] oh [21:57:37] oh i see that, thanks matanya_ [21:58:06] since that commit is cleanup is it not maybe a good idea to roll it back and see, since we know it has at least one strangely manifested syntax error? [21:58:35] andrewbogott: [21:58:35] 1. puppet syntax error on an iptables class (mutante is fixing) [21:58:35] 2. apache configs borked on release server (not sure why yet, maybe an apache module change problem) [21:58:35] 3. I just merged a change that added the cdh (cdh5) submodule puppet module, but didn't include it anywhere [21:58:48] And mutante has a patch on the way for 1. right? [21:58:59] (right) [21:59:09] 3...which is causes strange (dpkg-query) output on puppet runs, saying can't find hadoop related packages [21:59:15] 2. sounds familiar, like a thing that's been happening in labs for a couple of weeks. Is it a brand new issue? [21:59:19] nothing is harmed by this, as the classes aren't included [21:59:25] no idea why my change would cause that to happen though [21:59:47] I have an idea for apache [21:59:56] https://gerrit.wikimedia.org/r/#/c/142479/8/manifests/webserver.pp [21:59:57] 1. yes [21:59:58] chasemp: doesn't sound like these other issues are related to that refactor, so I'd like to keep it in. 
I read it very closely (althoug apparently not close enough :( ) [22:00:24] oo [22:00:28] does string comparison, iirc it changed to Boolean and now broken [22:00:31] yeah let's fix those boolean quotes too [22:00:39] shoudla caught that one [22:00:49] i caught that on a different lint fix by scott [22:00:53] (03PS1) 10Dzahn: fix quoting error in iptables.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/143172 [22:00:55] even though puppet booleans are probably correct [22:01:03] behavior is different in erb templates if they aren't done right [22:01:14] need to track down there source [22:01:18] matanya_: https://gerrit.wikimedia.org/r/#/c/143172/1/manifests/iptables.pp [22:01:30] ottomata: [22:01:35] (03CR) 10Ottomata: [C: 031] fix quoting error in iptables.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/143172 (owner: 10Dzahn) [22:02:02] comma mutante [22:02:17] *g* ...ok [22:02:17] if you are already there ... [22:02:57] (03CR) 10Andrew Bogott: [C: 032] fix quoting error in iptables.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/143172 (owner: 10Dzahn) [22:03:01] haha, ooook, i'd just fix it asap but you got matanya the style guard on ya! :) [22:03:29] nvm ... andrewbogott merged [22:03:36] heh [22:03:44] Were there problems with unquoting booleans in that patch as well? [22:03:53] i think so [22:03:56] tracking it down [22:03:58] i've got a patch for that [22:03:59] thanks [22:04:12] checks "silver" [22:04:16] ja totally [22:04:18] will be different [22:04:19] <% if ["true", "false"].include?(ssl) -%> [22:04:19] (03CR) 10Yuvipanda: toollabs: Allow diamond to sudo to reach out to exim (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143165 (owner: 10Yuvipanda) [22:04:25] I encouraged Scott to grep and make sure that none of the booleans were getting compared to 'true' or 'false' but… I didn't grep for all of them myself. 
[22:04:35] should've, I guess [22:04:44] (03PS1) 10Ottomata: Revert un-quoting of booleans in webserver.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/143174 [22:05:00] RECOVERY - Puppet freshness on silver is OK: puppet ran at Mon Jun 30 22:04:53 UTC 2014 [22:05:05] that looks good so far [22:05:14] andrewbogott: i think its safer to do that in separate commits [22:05:17] so things like this don't get lost [22:05:25] lint commits should be things that are 100% safe no-ops [22:05:39] boolean / string stuff changes the semantics sometimes [22:05:49] ok, so now, why does puppet delete the apache config [22:05:55] on the release server [22:05:58] mutante: https://gerrit.wikimedia.org/r/#/c/143174/ [22:06:01] i think that is why [22:06:23] can you also link it to where that was introduced? [22:06:27] oh, hmm [22:06:47] done [22:06:47] (03PS2) 10Ottomata: Revert un-quoting of booleans in webserver.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/143174 [22:06:54] * matanya waits for a cookie [22:06:57] haha [22:06:57] :) [22:07:00] thanks matanya! :) [22:07:32] shame on me for not reviewing [22:07:39] ottomata: I would rather than you fix whatever place it is that actually requires those booleans to be quoted… since that's terrible [22:08:02] Although I guess we can revert and then fix wholesale afterwards [22:08:03] andrewbogott: agreed, in general, but i think that should be done in a different commit more carefullyu [22:08:04] yeah [22:08:06] i'm about to do that [22:08:09] 'k [22:08:15] but not today [22:08:19] but, the thing is down so uhhh [22:08:27] +1 or +2 please? :) [22:08:35] andrewbogott: you merged also : https://gerrit.wikimedia.org/r/#/c/142479/8/manifests/sudo.pp in that patch [22:08:39] https://gerrit.wikimedia.org/r/#/c/143174/ [22:08:40] RECOVERY - Puppet freshness on gadolinium is OK: puppet ran at Mon Jun 30 22:08:36 UTC 2014 [22:08:48] has trailing spaces [22:09:04] DU DUHHHH caught by the style guard! 
[22:09:20] RECOVERY - Puppet freshness on oxygen is OK: puppet ran at Mon Jun 30 22:09:12 UTC 2014 [22:09:31] matanya: you would be proud of me for all my comments about trailing spaces in: https://gerrit.wikimedia.org/r/#/c/142483/ [22:10:13] mutante: merge this and I think it will fix your woes [22:10:14] https://gerrit.wikimedia.org/r/#/c/143174/ [22:10:26] (03CR) 10Andrew Bogott: [C: 032] Revert un-quoting of booleans in webserver.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/143174 (owner: 10Ottomata) [22:10:32] * matanya hands a cookie to ottomata [22:11:17] ok, thx, re-enabling puppet [22:11:18] andrewbogott: virt puppet3 might break too [22:11:47] $is_puppet_master = 'true' changed to -> $is_puppet_master = true [22:12:02] Oh, I think I checked that one, but… will verify [22:12:08] and $is_labs_puppet_master = 'true' changed to -> $is_labs_puppet_master = true [22:13:07] ottomata: andrewbogott that looks good, it fixed it [22:13:18] cool [22:13:25] andrewbogott: mailman as well [22:13:30] mutante: just curious, still seeing those dpkg-query messages on puppet runs ? [22:13:40] RECOVERY - Puppet freshness on analytics1003 is OK: puppet ran at Mon Jun 30 22:13:35 UTC 2014 [22:13:50] RECOVERY - Puppet freshness on fluorine is OK: puppet ran at Mon Jun 30 22:13:40 UTC 2014 [22:13:52] yup, i just tested and answered my own question [22:13:53] ottomata: thx for the fix, and yes, i still see them [22:13:54] hmmmmPH [22:13:58] https://gerrit.wikimedia.org/r/#/c/142479/8/manifests/mail.pp generic::debconf::set { 'mailman/gate_news': [22:14:07] value => 'false', [22:14:09] !log re-enabled puppet on caesium [22:14:13] Logged the message, Master [22:14:27] matanya: can you help me by grepping and seeing that boolean is compared to a 'quoted' value anyplace? 
[22:14:28] (03CR) 10Ori.livneh: [C: 031] mediawiki: collect apc variables via diamond [operations/puppet] - 10https://gerrit.wikimedia.org/r/142250 (owner: 10Giuseppe Lavagetto) [22:14:47] yes, but i found a definite error before this [22:14:53] https://gerrit.wikimedia.org/r/#/c/142479/8/manifests/mail.pp [22:15:02] revert the last fix [22:15:08] it will fail for sure [22:15:37] $install == 'true' ? [22:15:48] yes [22:16:01] I'm slow… it seems like it was broken before and is broken now, but in exactly the same way? [22:16:09] (03CR) 10coren: [C: 032] "Despite the lameness of needing root." [operations/puppet] - 10https://gerrit.wikimedia.org/r/143165 (owner: 10Yuvipanda) [22:16:16] unless "true" != 'true' [22:16:34] iirc "true" != 'true' [22:16:45] in some puppet cases [22:16:51] checks [22:17:04] I'm going to throw a fit if that turns out to be right :( [22:17:09] (03CR) 10coren: [C: 032] "Out with ye!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143167 (owner: 10Yuvipanda) [22:18:31] RECOVERY - Puppet freshness on erbium is OK: puppet ran at Mon Jun 30 22:18:25 UTC 2014 [22:18:57] yes, andrewbogott see http://docs.puppetlabs.com/puppet/latest/reference/lang_datatypes.html [22:19:04] dang, matanya, I don't see is_puppet_master getting checked anywhere. [22:19:12] single quotes vs double quotes [22:19:33] andrewbogott: in site.pp [22:19:56] matanya: it's set but not tested? [22:20:09] modules/puppetmaster/manifests/init.pp [22:20:44] I see is_labs_puppet_master there [22:21:19] yeah, sorry [22:21:34] doesn't seem to exist [22:21:39] Anyway… the is_labs_puppet_master checks are properly unquoted, so that seems good. [22:21:44] is_puppet_master can probably be purged. [22:21:49] As for 'true' vs. "true"... [22:21:59] should ask _joe_ about this [22:22:05] ottomata: is it possible even the "no packages" lines are from some lint change? 
[22:22:06] It looks to me like those docs are talking about the difference in how they're parsed, not how they're stored [22:22:35] i don't take any chances when it comes to puppet [22:22:36] i suppose [22:22:44] i gotta run though, so can't look into it until tomorrow [22:22:48] if you find anything do let me know [22:22:50] laters all! [22:22:55] bye [22:23:05] what is the error ,mutante? [22:23:15] No packages found matching hadoop. [22:23:20] No packages found matching oozie. [22:23:25] No packages found matching hive. [22:23:35] on ..random nodes. [22:24:06] can you give me an example of a random node? [22:24:13] caesium [22:24:34] and what happens when you do apt-get install hadoop there ? [22:24:52] does it find it in apt.wm.o ? [22:24:58] Huh, it's not even highlighted as a warning. [22:25:01] It's like a debug line or something [22:25:46] ListShellHook: grep-dctrl -e -S '^zookeeper$|^hadoop$|^hadoop-0.20-mapreduce$|^bigtop-jsvc$|^bigtop-utils$|^sqoop$|^hbase$|^pig$|^pig-udf-datafu$|^hive$|^oozie$|^hue$|^bigtop-tomcat$|^spark$|^avro-libs$|^parquet$|^parquet-format$' [22:26:35] matanya: yes, it _would_ find it, but it's not from apt [22:26:44] it's from dpkg-query somehow [22:26:55] it's also not in apt/history.log as a normal install attempt [22:27:16] weird [22:27:39] note how those packages are related to cdh4 (submodule) [22:27:48] and ottomata started using that [22:30:41] too late for thinking. one bug spotting is enough for the night :) [22:31:13] the module isn't applied anywhere, but the errors seem to appear just as it loads. [22:31:16] Presuming it's caused by that... [22:31:30] Anyway, it seems harmless, yes? So we can wait for _joe_ to contribute his wisdom. [22:31:37] (03CR) 10Matanya: "Also: https://gerrit.wikimedia.org/r/143174" [operations/puppet] - 10https://gerrit.wikimedia.org/r/142479 (owner: 10Scottlee) [22:31:57] https://gerrit.wikimedia.org/r/#/c/142565/3/modules/install-server/files/reprepro/updates ? 
[22:32:06] he touched that "ListShellHook" [22:32:30] i think so, yea [22:32:44] Hm, but not the actual packages it's complaining about [22:34:51] did he reprepro them before the patch ? [22:35:12] it can't update missing packages ... [22:36:14] dunno ? http://apt.wikimedia.org/wikimedia/pool/main/o/oozie/ [22:36:24] but there is an oozie package, f.e. [22:36:34] there was also the cdh4 vs. cdh5 change.. [22:37:05] mutante: reprepro ls oozie ? [22:38:00] mutante: Are you done with me for the moment? I'm expected elsewhere shortly... [22:38:10] oozie | 4.0.0+cdh5.0.2+180-1.cdh5.0.2.p0.14~precise-cdh5.0.2 | precise-wikimedia | amd64 [22:38:13] oozie | 4.0.0+cdh5.0.2+180-1.cdh5.0.2.p0.14 | precise-wikimedia | source [22:38:25] so yes [22:38:29] andrewbogott: i agree it doesn't seem to break things right now.. so yea [22:38:33] you are missing cdh4 [22:38:36] mutante: by the way, don't remove the puppet staleness check yet -- the new puppet tests are behaving very badly for me, they need work. [22:38:47] andrewbogott: good to know, ok [22:38:51] For a while virt1008 was green with '99 errors', now it shows 0 errors [22:38:58] but of course nothing has changed and it's still failing... [22:39:16] I added a new variation of the test which will work better but even with that things will still be semi broken [22:39:32] ..ok [22:39:35] *shrug* will take it up with alex tomorrow. [22:40:04] so mutante you can't find missing packages [22:40:05] alright, thx.. "99 errors" :p [22:40:08] (It's doing something clever right now, like, 'Warn if the number of failures exceeds the number of seconds in the freshness period') [22:40:14] and cdh4 is missing [22:40:19] 99 errors is actually kind of valid, it's a magic number that == failed compile [22:40:48] mutante: anyway, https://gerrit.wikimedia.org/r/#/c/143161/ if you're interested.
[22:42:08] (03CR) 10Tim Landscheidt: "I think funny things can happen (logrotate, etc.), so to keep it simple, I would disable Puppet on all Tools hosts, merge this, and then o" [operations/puppet] - 10https://gerrit.wikimedia.org/r/143098 (owner: 10Yuvipanda) [22:42:08] matanya: ok, so i _think_ otto wants cdh5 instead [22:42:09] (03CR) 10Tim Starling: "I'm not sure how the HHVM configuration interacts with the non-HHVM configuration. Will the variant aliases still work?" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/142983 (owner: 10Ori.livneh) [22:42:20] mutante: he wants that in labs [22:42:25] not in prod yet [22:43:03] see commit message: https://gerrit.wikimedia.org/r/#/c/143091/ [22:45:25] ok, good night all. Thanks for help with the premature merge... [22:45:50] andrewbogott: good night [22:45:57] night andrewbogott [22:51:21] (03PS1) 10Reedy: Add robots.txt rewrite rule where wiki is public [operations/apache-config] - 10https://gerrit.wikimedia.org/r/143184 [22:53:04] (03CR) 10Aaron Schulz: "Maybe, I'm not sure how useful it is for most sites. This is mostly for edge cases of edit history." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141892 (owner: 10Aaron Schulz) [22:54:18] (03PS2) 10Reedy: Add robots.txt rewrite rule where wiki is public [operations/apache-config] - 10https://gerrit.wikimedia.org/r/143184 [22:58:52] (03PS1) 10Reedy: Make apple-touch-icon.png configurable via touch.php [operations/apache-config] - 10https://gerrit.wikimedia.org/r/143188 [22:59:53] (03CR) 10Reedy: "I do wonder if this should just be done everywhere for standardisation" [operations/apache-config] - 10https://gerrit.wikimedia.org/r/143184 (owner: 10Reedy) [23:00:04] mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140630T2300) [23:00:21] i'll do it [23:00:45] (03CR) 10Ori.livneh: [C: 032] Keep thumbnail guessing enabled [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142476 (https://bugzilla.wikimedia.org/64554) (owner: 10Gergő Tisza) [23:01:03] (03Merged) 10jenkins-bot: Keep thumbnail guessing enabled [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/142476 (https://bugzilla.wikimedia.org/64554) (owner: 10Gergő Tisza) [23:02:03] !log ori Synchronized wmf-config: Iba41a37a1: Keep thumbnail guessing enabled (duration: 00m 05s) [23:02:07] Logged the message, Master [23:02:11] tgr: ^ [23:02:50] ori: thanks! [23:04:23] np. [23:04:41] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.006 second response time [23:05:25] (03PS4) 10Dzahn: bugzilla apache: Enable required modules for caching [operations/puppet] - 10https://gerrit.wikimedia.org/r/127254 (https://bugzilla.wikimedia.org/49720) (owner: 10JanZerebecki) [23:13:00] (03CR) 10Dzahn: Check for puppet failures as well as for puppet staleness. 
(031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/143161 (owner: 10Andrew Bogott) [23:13:41] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.003 second response time [23:14:42] (03PS1) 10Yuvipanda: labs: Enable diamond PuppetAgent collector on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/143193 [23:15:26] (03CR) 10Reedy: [C: 04-1] "Apache change needs deploying first" [operations/dns] - 10https://gerrit.wikimedia.org/r/143086 (owner: 10Reedy) [23:16:08] (03CR) 10Dzahn: [C: 031] Redirect c[sz].wikimedia.org to http://www.wikimedia.cz [operations/apache-config] - 10https://gerrit.wikimedia.org/r/143095 (owner: 10Reedy) [23:17:26] (03CR) 10Dzahn: [C: 031] Point c[sz].wikimedia.org at wikimedia-lb rather than external site [operations/dns] - 10https://gerrit.wikimedia.org/r/143086 (owner: 10Reedy) [23:19:05] ori: chasemp bd808 I'm considering setting up an instance of http://cabotapp.com/ on labs for monitoring/alerts, based off graphite. Thoughts? Since icinga on labs seems to be rather complicated... [23:20:37] we'll never be able to test icinga changes /me hides [23:21:02] mutante: heh :) I did look through prod icinga, it seems... very prod specific :) [23:21:09] mutante: and collecting resources also is a problem. [23:21:39] YuviPanda: you're trying to solve the same issue petanb had over 2 years ago [23:22:30] just saying.. it ended with something unrelated to prod.. hence..now it is prod specific :) [23:22:34] mutante: piping the (now growing!) set of diamond/graphite metrics for it seems like a good idea. Specifically for tools - in the last two days we've been notified of 1. disk space running out on one server, 2. one server being fully unresponsive, 3. puppet being out of date by days/hours on a couple of servers, all by people on IRC. [23:22:46] mutante: heh, true. but I don't know how to solve the resource collection problem.
[23:23:23] mutante: another option was to setup icinga, but have it use check_graphite for checks. [23:23:52] (03PS5) 10Ori.livneh: bugzilla apache: Enable required modules for caching [operations/puppet] - 10https://gerrit.wikimedia.org/r/127254 (https://bugzilla.wikimedia.org/49720) (owner: 10JanZerebecki) [23:23:55] afaik we had labs icinga just minus the puppet freshness check at one point [23:23:58] mutante: I'm not going to do anything without some consensus. [23:24:11] and it got the node names from wikitech api somehow [23:24:19] mutante: do you know if it's on puppet? [23:24:40] mutante: any icinga based solution I could think of seems rather hacky, plus we *already* are running diamond everywhere. why not use it for alerts? [23:25:03] YuviPanda: i don't have all the answers, you just asked for thoughts. the part about diamond is a good point [23:25:27] mutante: indeed :D I want a non-hacky solution that'll scale decently well. [23:25:51] mutante: I just submitted a patch that'll have diamond collecting last-puppet-run stats as well, so we can actually have a puppet freshness check via that as well [23:26:09] YuviPanda: no, i think you just have the prod icinga role. that being said.. ideally that is the same thing, right, applying the same role in labs would just work... [23:26:41] mutante: right, but in this case I highly doubt it :) plus resource collection is the primary means of collecting what checks to run... [23:28:02] hmm. maybe..now you make me curious what is the actual problem with doing that in labs [23:28:45] mutante: resource collection? yeah, I would like to know more in detail too :) From what I'm told, it's an issue because of the fact it is trivial to dump a ton of resources into puppetDB and make everything crawl everywhere. 
[23:41:16] (03PS2) 10Yuvipanda: labs: Enable diamond PuppetAgent collector on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/143193 [23:41:44] (03PS3) 10Yuvipanda: labs: Enable diamond PuppetAgent collector on all nodes [operations/puppet] - 10https://gerrit.wikimedia.org/r/143193 [23:41:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data exceeded the critical threshold [500.0] [23:45:30] what was that? [23:47:01] (crickets) [23:47:07] ori: btw redis-jobqueue.log is spammed...osmium again [23:47:20] was that running unpatched hhvm again? [23:47:57] not by me [23:56:54] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [23:57:44] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Mon 30 Jun 2014 21:57:29 UTC [23:58:22] (03PS1) 10Dzahn: remove pmtpa section from Round Robin LVS records [operations/dns] - 10https://gerrit.wikimedia.org/r/143201 [23:59:45] (03PS1) 10Dzahn: remove pmtpa access switches [operations/dns] - 10https://gerrit.wikimedia.org/r/143202