[00:14:48] (PS1) Aaron Schulz: Disable job restarter if run_jobs_enabled is false [operations/puppet] - https://gerrit.wikimedia.org/r/137229
[00:17:18] (PS1) BryanDavis: beta: NFS no longer used for mediawiki deployment [operations/puppet] - https://gerrit.wikimedia.org/r/137232
[00:18:01] (CR) Ori.livneh: [C: -1] "This won't work. If run_jobs_enabled is ever true, the cron job will be installed. If run_jobs_enabled is then set to false, Puppet won't " [operations/puppet] - https://gerrit.wikimedia.org/r/137229 (owner: Aaron Schulz)
[00:22:04] (PS2) Aaron Schulz: Disable job restarter if run_jobs_enabled is false [operations/puppet] - https://gerrit.wikimedia.org/r/137229
[00:26:18] ori: I wonder if the error spam is https://bugzilla.wikimedia.org/show_bug.cgi?id=65466
[00:26:22] though that should be fixed already
[00:30:39] (PS1) Gergő Tisza: Enable MediaViewer survey on enwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/137237
[00:32:30] (CR) Ori.livneh: [C: 1] Disable job restarter if run_jobs_enabled is false [operations/puppet] - https://gerrit.wikimedia.org/r/137229 (owner: Aaron Schulz)
[00:35:18] (CR) BryanDavis: "Cherry-picked to deployment-salt." [operations/puppet] - https://gerrit.wikimedia.org/r/137232 (owner: BryanDavis)
[00:35:39] (PS1) BryanDavis: beta: add deployment-parsoid04 as a scap target [operations/puppet] - https://gerrit.wikimedia.org/r/137238
[00:41:15] (PS1) Krinkle: asset-check: Use query hack to neither hardcode pagename nor redirect [operations/puppet] - https://gerrit.wikimedia.org/r/137239
[00:41:17] (PS1) Krinkle: asset-check: Minor code clean up [operations/puppet] - https://gerrit.wikimedia.org/r/137240
[00:41:19] (PS1) Krinkle: asset-check: Implement --debug [operations/puppet] - https://gerrit.wikimedia.org/r/137241
[00:41:21] (PS1) Krinkle: asset-check: Use "response.stage" property to filter out duplicates [operations/puppet] - https://gerrit.wikimedia.org/r/137242
[00:42:15] (PS1) Tim Landscheidt: Revert "toollabs: Remove unused and empty webproxy role" [operations/puppet] - https://gerrit.wikimedia.org/r/137243
[00:48:27] What is the mediaWiki "node" in Ganglia under misc-eqiad?
[00:48:27] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Miscellaneous+eqiad&h=mediaWiki
[00:50:52] (CR) BryanDavis: [C: -1] "Don't apply until Gabriel says it's needed." [operations/puppet] - https://gerrit.wikimedia.org/r/137238 (owner: BryanDavis)
[00:59:57] (PS1) Mwalker: Adding Ferm rule for OCG HTTP [operations/puppet] - https://gerrit.wikimedia.org/r/137247
[01:00:05] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 21:59:16 UTC
[01:01:03] (CR) jenkins-bot: [V: -1] Adding Ferm rule for OCG HTTP [operations/puppet] - https://gerrit.wikimedia.org/r/137247 (owner: Mwalker)
[01:04:32] (PS2) Mwalker: Adding Ferm rule for OCG HTTP [operations/puppet] - https://gerrit.wikimedia.org/r/137247
[01:22:05] PROBLEM - Puppet freshness on tungsten is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 13:19:16 UTC
[01:23:32] (PS1) Krinkle: asset-check: Track POST requests, redirects, http4xx, and http5xx [operations/puppet] - https://gerrit.wikimedia.org/r/137248
[01:24:05] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[01:24:11] (Abandoned) BryanDavis: beta: add deployment-parsoid04 as a scap target [operations/puppet] - https://gerrit.wikimedia.org/r/137238 (owner: BryanDavis)
[01:29:26] (PS2) Krinkle: asset-check: Track POST requests, redirects, http4xx, and http5xx [operations/puppet] - https://gerrit.wikimedia.org/r/137248
[01:30:16] (PS2) Giuseppe Lavagetto: mediawiki::monitor::graphite: monitor thresholds over 1hr interval [operations/puppet] - https://gerrit.wikimedia.org/r/137043 (owner: Ori.livneh)
[01:30:31] (PS2) BryanDavis: beta: NFS no longer used for mediawiki deployment [operations/puppet] - https://gerrit.wikimedia.org/r/137232
[01:31:53] (CR) BryanDavis: "Cherry-picked to deployment-salt." [operations/puppet] - https://gerrit.wikimedia.org/r/137232 (owner: BryanDavis)
[01:34:30] (CR) Giuseppe Lavagetto: [C: 2] mediawiki::monitor::graphite: monitor thresholds over 1hr interval [operations/puppet] - https://gerrit.wikimedia.org/r/137043 (owner: Ori.livneh)
[01:37:15] RECOVERY - Puppet freshness on tungsten is OK: puppet ran at Wed Jun 4 01:37:13 UTC 2014
[01:38:35] (PS1) Krinkle: asset-check: Fix broken bodySize [operations/puppet] - https://gerrit.wikimedia.org/r/137252
[01:40:15] (PS2) Krinkle: asset-check: Use content-length header when response.bodySize is missing [operations/puppet] - https://gerrit.wikimedia.org/r/137252
[01:52:26] (PS3) Krinkle: asset-check: Track POST requests, redirects, http4xx, and http5xx [operations/puppet] - https://gerrit.wikimedia.org/r/137248
[01:52:28] (PS3) Krinkle: asset-check: Use content-length header when response.bodySize is missing [operations/puppet] - https://gerrit.wikimedia.org/r/137252
[01:52:30] (PS1) Krinkle: asset-check: Track whether requests are compressed with gzip [operations/puppet] - https://gerrit.wikimedia.org/r/137253
[02:02:09] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 33 data above and 0 below the confidence bounds
[02:02:09] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 5 below the confidence bounds
[02:03:19] (PS4) Krinkle: asset-check: Track POST requests, redirects, http4xx, and http5xx [operations/puppet] - https://gerrit.wikimedia.org/r/137248
[02:03:21] (PS2) Krinkle: asset-check: Track whether requests are compressed with gzip [operations/puppet] - https://gerrit.wikimedia.org/r/137253
[02:03:48] (CR) Krinkle: "Removed duplicate resource.bytes assignment resulting from rebase." [operations/puppet] - https://gerrit.wikimedia.org/r/137248 (owner: Krinkle)
[02:03:53] (CR) Krinkle: "Rebased." [operations/puppet] - https://gerrit.wikimedia.org/r/137253 (owner: Krinkle)
[02:47:09] !log LocalisationUpdate completed (1.24wmf6) at 2014-06-04 02:46:06+00:00
[02:47:16] Logged the message, Master
[02:52:30] (PS1) Krinkle: asset-check: Track uncaught exceptions in javascript [operations/puppet] - https://gerrit.wikimedia.org/r/137257
[02:52:32] (PS1) Krinkle: asset-check: Track number of registered modules and their state [operations/puppet] - https://gerrit.wikimedia.org/r/137258
[03:16:38] !log LocalisationUpdate completed (1.24wmf7) at 2014-06-04 03:15:34+00:00
[03:16:42] Logged the message, Master
[03:22:30] (PS1) Ori.livneh: mediawiki::monitor -> mediawiki::monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/137263
[03:35:32] !log Deploy I882e3fa57b2e5e3de in Zuul and reload config
[03:35:38] Logged the message, Master
[03:44:02] (CR) Krinkle: "@Jforrester: This will also catch when a module is e.g. executed in try/catch (this no uncaught error, as tracked by I3180f9747c1257a). So" [operations/puppet] - https://gerrit.wikimedia.org/r/137258 (owner: Krinkle)
[03:46:19] (CR) Ori.livneh: [C: -1] asset-check: Track uncaught exceptions in javascript (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/137257 (owner: Krinkle)
[03:47:51] (CR) Ori.livneh: "Looks good; have you tested it?" [operations/puppet] - https://gerrit.wikimedia.org/r/137253 (owner: Krinkle)
[03:49:19] (PS2) Krinkle: asset-check: Track uncaught exceptions in javascript [operations/puppet] - https://gerrit.wikimedia.org/r/137257
[03:49:23] (CR) Ori.livneh: [C: 1] "This is an excellent idea" [operations/puppet] - https://gerrit.wikimedia.org/r/137258 (owner: Krinkle)
[03:49:47] (CR) Krinkle: "Fixed." (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/137257 (owner: Krinkle)
[03:49:51] (CR) Jforrester: "Really lovely work, Timo, thank you. :-)" [operations/puppet] - https://gerrit.wikimedia.org/r/137258 (owner: Krinkle)
[03:49:55] (PS2) Krinkle: asset-check: Track number of registered modules and their state [operations/puppet] - https://gerrit.wikimedia.org/r/137258
[03:50:29] (CR) Krinkle: "Yep, ran it on 'http://en.wikipedia.org/wiki/Main_Page' which produced output as embedded in the commit message." [operations/puppet] - https://gerrit.wikimedia.org/r/137253 (owner: Krinkle)
[03:53:44] (PS1) Ori.livneh: text-frontend VCL: grep Orig-Cookie for GeoIP, not (just) Cookie [operations/puppet] - https://gerrit.wikimedia.org/r/137265
[03:59:34] (CR) Krinkle: [C: 1] text-frontend VCL: grep Orig-Cookie for GeoIP, not (just) Cookie [operations/puppet] - https://gerrit.wikimedia.org/r/137265 (owner: Ori.livneh)
[04:00:59] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 21:59:16 UTC
[04:07:34] (PS2) Ori.livneh: text-frontend VCL: grep Orig-Cookie for GeoIP, not (just) Cookie [operations/puppet] - https://gerrit.wikimedia.org/r/137265
[04:24:59] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[04:27:38] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jun 4 04:26:32 UTC 2014 (duration 26m 31s)
[04:27:43] Logged the message, Master
[04:54:24] (CR) Ori.livneh: "This is live on beta; you can verify it by curling http://en.wikipedia.beta.wmflabs.org/wiki/Main_page with and without the GeoIP cookie." [operations/puppet] - https://gerrit.wikimedia.org/r/137265 (owner: Ori.livneh)
[05:16:54] springle, yt?
[05:19:42] MaxSem: yep
[05:21:09] springle, due to GeoData updates, we don't need geo_updates and geo_killlist tables anymore - asking for permission to drop:)
[05:22:27] they aren't that large are they
[05:22:30] * springle checks
[05:22:42] updates should be 1 row
[05:22:57] killlist prolly empty by now
[05:23:58] I would need to re-check that no code attempts to use them
[05:24:02] killlist is ~1M on enwiki
[05:24:10] ewwww
[05:24:27] can be truncated at any moment:)
[05:24:32] way too many L's
[05:24:49] that's silly Sphinx search term
[05:25:15] let's be paranoid. check code, etc
[05:25:19] (yes, it was initially planned tto use sphinx!)
[05:25:58] then do a rename _old, wait a bit, then drop
[05:26:11] ok
[05:26:42] just to clarify, the data in them is already useless
[05:27:03] and can be ditched right now
[05:27:09] useless to everyone? like analytics etc
[05:27:37] nah, it's for feeding Solr. and we ain't got solr anymore:)
[05:28:39] ok
[05:30:59] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Wed 04 Jun 2014 02:29:51 UTC
[05:34:28] MaxSem: how does geo_tags relate?
[05:34:42] geo_tags remains unchanged
[05:37:57] lots of recent INSERT GeoDataHooks::doSmartUpdate traffic for geo_killlist in sampled traffic logs
[05:38:19] i guess kill that off then we're ready to kill the tables
[05:38:32] meh,artefacts of migration
[05:39:03] :)
[05:39:19] (PS2) Springle: Make dbstore1002 handle s2 analytics queries [operations/dns] - https://gerrit.wikimedia.org/r/137174 (https://bugzilla.wikimedia.org/66068) (owner: QChris)
[05:39:48] (CR) Springle: [C: 2] Make dbstore1002 handle s2 analytics queries [operations/dns] - https://gerrit.wikimedia.org/r/137174 (https://bugzilla.wikimedia.org/66068) (owner: QChris)
[05:40:38] https://gerrit.wikimedia.org/r/#/c/134851/ should take care of it
[05:41:16] however, I might write a stopgap that simply stops killist updates
[05:41:42] otherwise it will grow hugge now that the purging cronjob is gonee
[05:42:04] what's stopping that change set?
[05:43:57] the HW was gone only today, so until then we theoretically had the ability to switch back in case of emergency
[05:44:22] after that, still would love to have this change on labs for a week
[05:44:31] fair enough
[05:46:21] the tables could simply be engine=blackhole until the change set is ready
[05:46:30] but killing the updates separately is fine too
[05:46:51] yeah, that they're still happening is a separate bu
[05:46:53] g
[05:51:00] https://gerrit.wikimedia.org/r/137272 - will be deployeed during tomorrow's training
[05:51:36] then I'll truncate that table
[05:53:06] sounds good
[06:14:50] !log starting online schema change, bug 66089 gerrit 137149
[06:14:55] Logged the message, Master
[06:46:19] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0]
[06:46:51] <_joe_> oh well
[06:47:07] <_joe_> good morning to you, icinga-wm_
[06:47:26] <_joe_> just a spike
[06:57:42] <_joe_> springle: are you familiar with DNS changes?
[06:57:56] <_joe_> I did some changes and I need to push them to production
[06:58:13] rubidium authdns-update
[06:58:20] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[06:59:00] as root. it should be roughly like doing a puppet-merge, verify, yes/no etc
[07:01:59] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 21:59:16 UTC
[07:02:21] <_joe_> ok
[07:02:23] <_joe_> thanks man
[07:02:38] np
[07:04:42] _joe_: any idea why db1007 puppet is critical in icinga. it runs fine on the box itself
[07:05:41] <_joe_> no
[07:05:55] <_joe_> let me brush my teeth and I'll give a look
[07:06:00] thanks
[07:07:06] puppet freshness seems to do this from time to time. something odd with passive checks
[07:08:00] <_joe_> springle: it is an snmp trap
[07:08:09] <_joe_> which is rather lame IMO
[07:08:44] ah
[07:09:01] flaky
[07:09:22] <_joe_> so, it gets triggered when puppet runs
[07:09:40] <_joe_> if it's not caught
[07:09:43] <_joe_> a critical gets issued
[07:13:55] (PS1) BryanDavis: beta: fix scap for videoscalers [operations/puppet] - https://gerrit.wikimedia.org/r/137274
[07:14:59] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Wed Jun 4 07:14:49 UTC 2014
[07:15:20] there we go
[07:15:31] just keep hitting it until it gives in
[07:23:06] (CR) BryanDavis: "Cherry-picked to deployment-salt and applied on deployment-videoscaler01. Fixed scap failure due to missing ssh authentication grant." [operations/puppet] - https://gerrit.wikimedia.org/r/137274 (owner: BryanDavis)
[07:25:59] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[07:28:33] (PS2) Giuseppe Lavagetto: puppet3: make puppet::self::master work in puppet 3 [operations/puppet] - https://gerrit.wikimedia.org/r/137025
[07:51:51] <_joe_> !log rebooted ms-be1001, host unresponsive to ping, blank console
[07:51:55] Logged the message, Master
[07:54:59] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[08:02:12] (PS4) Giuseppe Lavagetto: rcstream: add DNS records for stream.w.o [operations/dns] - https://gerrit.wikimedia.org/r/136983
[08:04:35] (PS5) Giuseppe Lavagetto: rcstream: add DNS records for stream.w.o [operations/dns] - https://gerrit.wikimedia.org/r/136983
[08:16:25] (PS1) Gilles: Reduce MediaViewer EventLogging sampling factor [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/137276
[08:16:31] (CR) jenkins-bot: [V: -1] Reduce MediaViewer EventLogging sampling factor [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/137276 (owner: Gilles)
[08:17:40] (CR) Giuseppe Lavagetto: [C: 2] rcstream: add DNS records for stream.w.o [operations/dns] - https://gerrit.wikimedia.org/r/136983 (owner: Giuseppe Lavagetto)
[08:18:28] (PS2) Gilles: Reduce MediaViewer EventLogging sampling factor [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/137276
[08:40:26] (CR) Alexandros Kosiaris: [C: -1] Adding Ferm rule for OCG HTTP (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/137247 (owner: Mwalker)
[08:45:53] (PS3) Mwalker: Adding Ferm rule for OCG HTTP [operations/puppet] - https://gerrit.wikimedia.org/r/137247
[08:46:23] I wish we had a way to validate ferm rules :(
[08:47:21] (PS4) Mwalker: Adding Ferm rule for OCG HTTP [operations/puppet] - https://gerrit.wikimedia.org/r/137247
[08:47:49] hashar, instead of just applying them and hoping things dont break? :D
[08:47:56] <_joe_> hashar: well, I think theoretically that could be
[08:47:59] yeah :D
[08:48:21] mwalker: also I would prefix $service_port with 'ocg'. Ie: $ocg_service_port . That makes grep easier, but I am nitpicking
[08:48:45] mwalker: also adding a ferm::rule also makes iptables to default to DROP. Which might cause some issues
[08:49:00] hashar, I already have that problem in labs
[08:49:07] so I might as well just start solving it in production too
[08:49:14] (PS3) Giuseppe Lavagetto: rcstream: add lvs, modify mountpoint [operations/puppet] - https://gerrit.wikimedia.org/r/136990
[08:49:36] mwalker: by labs do you mean the beta cluster?
[08:49:57] *nods* I have to apply the natfix; which enabled ferm; which means I need to specify my ferm rules
[08:50:20] it might have also broken trebuchet; but I dont know that for sure yet
[08:50:23] agh yeah make sense
[08:50:49] you can get your ferm patch applied on beta cluster puppetmaster which is deployment-salt.eqiad.wmflabs
[08:51:09] cherry pick the patch under /var/lib/git/operations/puppet then run puppetd -tv on whatever host you are interested in
[08:51:47] (PS4) Whym: FeaturedFeeds for Wiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/136316 (https://bugzilla.wikimedia.org/66015)
[08:52:37] (CR) Hashar: [C: 1] "_joe_ godog : you can get this merged. It is already on the labs puppetmaster and fixed the issue I had on the Jenkins slaves in labs :]" [operations/puppet] - https://gerrit.wikimedia.org/r/136310 (owner: Hashar)
[08:52:49] (CR) Giuseppe Lavagetto: [C: 2] rcstream: add lvs, modify mountpoint [operations/puppet] - https://gerrit.wikimedia.org/r/136990 (owner: Giuseppe Lavagetto)
[08:53:29] (PS9) Hashar: contint: split Zuul server and merger (DO NOT SUBMIT) [operations/puppet] - https://gerrit.wikimedia.org/r/129292
[09:01:04] can someone help me with an urgent wikidata issue?
[09:01:09] changes are not being dispatched
[09:01:13] to the wikipedias
[09:01:30] can someone please check what is wrong with the cron job for that?
[09:03:30] (CR) Filippo Giunchedi: "that might work too, not sure if there are other dependencies/scripts that are calling that and expect things on stdout though? another vi" [operations/puppet] - https://gerrit.wikimedia.org/r/135133 (owner: Aaron Schulz)
[09:04:17] (PS10) Hashar: contint: split Zuul server and merger [operations/puppet] - https://gerrit.wikimedia.org/r/129292
[09:07:25] (PS3) Filippo Giunchedi: contint: fix resource ordering for labs slave [operations/puppet] - https://gerrit.wikimedia.org/r/136310 (owner: Hashar)
[09:07:45] (PS1) Giuseppe Lavagetto: stream.w.o: move to high-traffic2 pool [operations/puppet] - https://gerrit.wikimedia.org/r/137277
[09:07:47] (CR) Filippo Giunchedi: [C: 2] contint: fix resource ordering for labs slave [operations/puppet] - https://gerrit.wikimedia.org/r/136310 (owner: Hashar)
[09:07:53] (CR) Filippo Giunchedi: [V: 2] contint: fix resource ordering for labs slave [operations/puppet] - https://gerrit.wikimedia.org/r/136310 (owner: Hashar)
[09:07:56] \O/
[09:08:22] (PS2) Giuseppe Lavagetto: stream.w.o: move to high-traffic2 pool [operations/puppet] - https://gerrit.wikimedia.org/r/137277
[09:08:24] hashar: more like /O\
[09:08:27] ;-)
[09:08:34] <_joe_> ach
[09:08:42] hashar: done (https://gerrit.wikimedia.org/r/#/c/136310/)
[09:10:13] (CR) Giuseppe Lavagetto: [C: 2] "trivial change." [operations/puppet] - https://gerrit.wikimedia.org/r/137277 (owner: Giuseppe Lavagetto)
[09:11:47] (CR) Hashar: "Hey ops, I could use a review for the cron part being added there. The rest is working on labs :-)" (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/129292 (owner: Hashar)
[09:12:11] _joe_: if you have any cron knowledge I could use a review of a cron entry at https://gerrit.wikimedia.org/r/#/c/129292/10/modules/zuul/manifests/merger.pp
[09:12:38] <_joe_> well, I do have some
[09:13:02] <_joe_> hashar: but, I'm in the middle of a potentially fatal change (fiddling with LVS and pybal)
[09:13:51] _joe_: give it a chance once you recovered from the potential outage :d
[09:17:25] (PS1) Mwalker: Allowing a configurable StatsD server [operations/puppet] - https://gerrit.wikimedia.org/r/137278
[09:18:55] (PS2) Mwalker: Allowing a configurable StatsD server [operations/puppet] - https://gerrit.wikimedia.org/r/137278
[09:22:36] <_joe_> hashar: will do
[09:26:30] (PS1) Giuseppe Lavagetto: stream: fix copy/paste of IPs [operations/puppet] - https://gerrit.wikimedia.org/r/137279
[09:27:08] <_joe_> I'd kick myself for this
[09:27:52] (CR) Giuseppe Lavagetto: [C: 2 V: 2] stream: fix copy/paste of IPs [operations/puppet] - https://gerrit.wikimedia.org/r/137279 (owner: Giuseppe Lavagetto)
[09:28:43] (CR) Christopher Johnson (WMDE): "The intention of the script is to notify people about a problem and enable the recording of performance data about the lag. The idea of m" [operations/puppet] - https://gerrit.wikimedia.org/r/136095 (owner: Christopher Johnson (WMDE))
[09:29:01] _joe_: do you have time for 1 more graphite-puppet question?
[09:29:14] <_joe_> nuria: not now, give me ~ 20 mins
[09:29:22] k
[09:35:08] (PS1) Nuria: [WIP] monitoring: monitor eventlogging thresholds [operations/puppet] - https://gerrit.wikimedia.org/r/137280 (https://bugzilla.wikimedia.org/65482)
[09:35:47] (CR) Hashar: [V: 2] beta: bring in mediawiki/skins.git [operations/puppet] - https://gerrit.wikimedia.org/r/136325 (https://bugzilla.wikimedia.org/65868) (owner: Hashar)
[09:41:29] (CR) Filippo Giunchedi: [C: 1] initial debianization (1 comment) [operations/debs/python-statsd] - https://gerrit.wikimedia.org/r/131449 (owner: Gage)
[09:44:56] (CR) Hashar: "Thx for the rename. Maybe get rid of operations/debs/python-statsd and create operations/debs/pystatsd instead?" [operations/debs/python-statsd] - https://gerrit.wikimedia.org/r/131449 (owner: Gage)
[09:45:38] springle: still here?
[09:45:43] hello
[09:46:04] (PS5) Christopher Johnson (WMDE): Icinga: new command "check_dispatch" for Wikidata [operations/puppet] - https://gerrit.wikimedia.org/r/136095
[09:46:29] Am here to deliver a message
[09:46:56] anyone here?
[09:47:07] yes
[09:47:08] what's up?
[09:48:22] the second coming of The Lord Jesus Christ will be out of Ghana
[09:49:02] paravoid: yep
[09:49:29] is anyone aware of this?
[09:49:46] springle: https://rt.wikimedia.org/Ticket/Display.html?id=5797, frimpressions, still pending?
[09:50:10] Isaiah 9:6-7
[09:51:01] paravoid: noted. emails in transit on that one
[09:51:08] 2014 marks another 14th generation Matthew 1:1-17
[09:51:31] springle where are you from?
[09:52:18] Mannamission: the real world
[09:52:43] ok thanks
[09:53:01] i will like you to share this message
[09:53:03] Mannamission: this is a strictly technical channel, we're not interested
[09:53:27] (PS1) Giuseppe Lavagetto: stream: fix ip allocation, nginx configuration [operations/puppet] - https://gerrit.wikimedia.org/r/137282
[09:56:06] springle, paravoid ... as much as I'd like frimpressions to happen, I think I'm resigned to accepting that it was a pipe dream :(
[09:56:35] Mannamission: you want to use #freenode
[09:57:32] paravoid this is a message you should all listen to. Jesus Christ will be revealed to the world this year
[09:59:33] mwalker: reading back over our email thread; which bit made it a pipe dream?
[09:59:41] or have more things happened since
[10:00:05] nothing new has changed -- it just seemed like a lot of work with unresolved questions
[10:01:14] (CR) Giuseppe Lavagetto: [C: 2] stream: fix ip allocation, nginx configuration [operations/puppet] - https://gerrit.wikimedia.org/r/137282 (owner: Giuseppe Lavagetto)
[10:01:36] though; looking back at this; it seemed you didn't have a problem in theory with federation so long as our queries were sane?
[10:02:59] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 21:59:16 UTC
[10:03:58] mwalker: yes, either replication or federation was possible; i think we just kept forgetting the thread :)
[10:04:42] (PS1) Aude: Log wikidata cron jobs to /var/log/wikidata [operations/puppet] - https://gerrit.wikimedia.org/r/137284
[10:05:07] paravoid: could you please review https://gerrit.wikimedia.org/r/#/c/137284/
[10:05:32] to have the logs go some place that exists
[10:08:06] /var/log/mediawiki/wikidata/ still exists
[10:08:12] hmm
[10:08:16] permission denied
[10:08:21] ah right
[10:08:38] * hashar out for lunch
[10:08:50] should be apache:mwdeploy ?
[10:08:58] /var/log/wikidata has logs with those names
[10:09:07] right
[10:09:14] i am a bit confused about this
[10:09:21] (PS2) Faidon Liambotis: Log wikidata cron jobs to /var/log/wikidata [operations/puppet] - https://gerrit.wikimedia.org/r/137284 (owner: Aude)
[10:09:29] (CR) Faidon Liambotis: [C: 2 V: 2] Log wikidata cron jobs to /var/log/wikidata [operations/puppet] - https://gerrit.wikimedia.org/r/137284 (owner: Aude)
[10:09:38] I am as well, but meh
[10:09:52] let's see if this works
[10:10:15] when running mwscript ... i still get "var/log/wikidata/dispatcher4.log: Permission denied"
[10:11:10] springle, if the two options are a) writing data to a misc server and replicating that into the frack; or b) writing data to a misc server and federating that into frack -- I would prefer (a) as possibly naively being being less brittle
[10:12:43] but we are running the cronjobs as apache so should work
[10:13:03] springle, but (b) is conceptually simpler (and thus easier to verify it wont break anything in the frack)
[10:13:21] mwalker: i'm still ok with a), but it's jeff's call
[10:13:32] kk -- I'll poke him tomorrow morning
[10:13:37] what TZ are you currently in?
[10:13:55] (and by tomorrow morning I mean in about 5 hours)
[10:14:10] mwalker: how about i just create the database and leave it in your court; you can replicate using the normal repl credentials, or federate with a new user
[10:14:22] that works too :)
[10:14:23] i'll help out if jeff asks
[10:14:32] because we'll just forget the thread again
[10:14:42] yes; yes we will
[10:14:51] too many other high priority things to deal with
[10:15:05] ok. doing that
[10:15:48] *thumbs up* thanks!
[10:15:56] * mwalker heads off to sleep land for a time
[10:16:03] <_joe_> nuria: shoot
[10:16:03] <_joe_> :)
[10:16:26] ok, please take a look at this changeset
[10:16:44] https://gerrit.wikimedia.org/r/#/c/137280/
[10:16:59] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time
[10:17:12] _joe_: i tried to do the same thing than you did for mediawiki but for eventlogging
[10:17:39] looks like a bunch of logs were moved aruond
[10:18:09] now, EL publishes to graphite from vanadium ( there is a listener on hafnium)
[10:19:00] _joe_: so should the call to EL monitoring be on "manifests/role/graphite.pp"?
[10:19:13] if so... any advise on how to test this?
[10:20:16] <_joe_> nuria: do you have a change for this?
[10:20:38] <_joe_> nuria: also, what do you want to test?
[10:20:39] yes, sorry, pasted it above: https://gerrit.wikimedia.org/r/#/c/137280/
[10:21:12] _joe_: that alarms are triggered when thresholds are surpased
[10:21:22] <_joe_> if you want to check the result of your check, just use check_graphite which is somewhere at files/icinga/ I think
[10:21:26] <_joe_> and run it from your computer
[10:21:46] <_joe_> nuria: as I just said :)
[10:22:03] but in order tu run it it must be able to connect to graphite right?
[10:23:59] <_joe_> yeah it must
[10:25:32] <_joe_> graphite seems down atm
[10:25:59] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time
[10:26:09] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds
[10:26:59] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC
[10:30:09] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 53 data above and 9 below the confidence bounds
[10:30:29] <_joe_> ok this may be interesting
[10:37:36] hi
[10:37:54] (PS1) Ori.livneh: rcstream: listen on ipv6 too; apply lvs::realserver [operations/puppet] - https://gerrit.wikimedia.org/r/137289
[10:38:09] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:38:59] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical
[10:41:46] (CR) Aude: "appears the log location got changed I2cfbb34" [operations/puppet] - https://gerrit.wikimedia.org/r/137284 (owner: Aude)
[10:43:49] ori: help!
[10:44:05] wikidata change dispatcher is broken (cron job)
[10:44:34] aude: that didn't change the location
[10:44:36] can't log to /var/log/wikidata or appears now we should log to /var/log/mediawiki/wikidata (permission denied both cases)
[10:44:46] reedy changed it
[10:44:58] we tried to change it back but still permission denied now
[10:45:01] i'll debug in a moment
[10:45:07] https://gerrit.wikimedia.org/r/#/c/83574/15/manifests/misc/maintenance.pp
[10:45:13] then treid https://gerrit.wikimedia.org/r/#/c/137284/
[10:45:17] tried*
[10:46:02] (CR) Giuseppe Lavagetto: [C: -1] Disable job restarter if run_jobs_enabled is false (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/137229 (owner: Aaron Schulz)
[10:46:03] i also see https://gerrit.wikimedia.org/r/#/c/137165/
[10:47:17] which is in https://gerrit.wikimedia.org/r/#/c/83574/15/modules/mediawiki/manifests/mwlogdir.pp
[10:47:34] (CR) Giuseppe Lavagetto: contint: split Zuul server and merger (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/129292 (owner: Hashar)
[10:48:45] https://git.wikimedia.org/commitdiff/operations%2Fpuppet.git/1da25e73dc39fc1bc675818a949ed001ac66cc99 might be relevant
[10:50:09] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 52 data above and 9 below the confidence bounds
[10:56:43] (PS2) Ori.livneh: rcstream: listen on ipv6 too; apply lvs::realserver [operations/puppet] - https://gerrit.wikimedia.org/r/137289
[10:57:09] ok, looking
[10:58:49] (CR) Giuseppe Lavagetto: [C: 2] rcstream: listen on ipv6 too; apply lvs::realserver [operations/puppet] - https://gerrit.wikimedia.org/r/137289 (owner: Ori.livneh)
[10:59:02] looks like things are owned by apache and apache user is doign the cronjob
[10:59:06] not quite clear what is wrong
[10:59:23] except /var/log/mediawiki/wikidata is owned by root:wikidev
[10:59:49] probably should be apache:mwdeploy like the other dirs
[11:00:36] with write permissions [11:09:58] aude: the crontab for apache shows things piping to /var/log/wikidata [11:10:00] which has the right perms [11:10:13] huh [11:10:17] we just changed it back [11:10:51] (03PS2) 10Nuria: [WIP] monitoring: monitor eventlogging thresholds [operations/puppet] - 10https://gerrit.wikimedia.org/r/137280 (https://bugzilla.wikimedia.org/65482) [11:11:25] but still get permission denied now [11:11:45] nothing writing to logs etc [11:11:53] oh, needs a puppet run probably [11:12:01] maybe [11:12:17] and if it's the right thing to do to change back the log location [11:12:18] ? [11:12:50] all of that stuff needs so much cleanup [11:12:58] let's just get it working again and worry about aesthetics later [11:14:47] ok [11:16:58] ori: looks like its running now [11:21:13] aude: great! i'm off to sleep [11:21:25] ok [11:21:29] thanks :) [12:12:36] ree [12:13:11] dy [12:29:32] (03CR) 10Hashar: contint: split Zuul server and merger (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/129292 (owner: 10Hashar) [12:29:37] (03PS11) 10Hashar: contint: split Zuul server and merger [operations/puppet] - 10https://gerrit.wikimedia.org/r/129292 [12:29:55] _joe_: in the mood to help me break Jenkins/Zuul? 
:D [12:30:05] (03PS1) 10Alexandros Kosiaris: akosiaris dotfiles added [operations/puppet] - 10https://gerrit.wikimedia.org/r/137295 [12:30:08] <_joe_> hashar: not really :) [12:30:22] oh i "just" need a merge and handle all the rest myself [12:30:23] <_joe_> hashar: I do have to finish some work on rcstream [12:30:37] though I might poke after one hour to get a revert change merged hehe [12:30:41] <_joe_> and I do have to leave early [12:32:23] (03PS3) 10Nuria: Monitoring: monitor eventlogging thresholds [operations/puppet] - 10https://gerrit.wikimedia.org/r/137280 (https://bugzilla.wikimedia.org/65482) [12:33:02] _joe_: understood :) [12:33:07] (03CR) 10Filippo Giunchedi: [C: 031] contint: split Zuul server and merger [operations/puppet] - 10https://gerrit.wikimedia.org/r/129292 (owner: 10Hashar) [12:33:33] _joe_: and thanks for the crontab MAILTO tweak [12:34:07] <_joe_> so, seems like godog is available :P [12:34:40] godog: be brave and merge! though I might poke for a follow up revert or tweak once I applied that in prod hehe [12:34:46] low impact, it is just Zuul / CI [12:35:17] I could just use that change to land in so I can push the puppet change on the server running zuul [12:35:30] hashar: need to step out for lunch now, will take a closer look/merge when I'm back [12:36:21] godog: bon apetit! [12:38:31] hashar: false alarm, looking/merging now :)) [12:38:53] dont starve! [12:41:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] contint: split Zuul server and merger [operations/puppet] - 10https://gerrit.wikimedia.org/r/129292 (owner: 10Hashar) [12:41:57] haha no, merged btw [12:43:03] godog: thanks! [12:43:24] !log upgrading Zuul to split the merger part to an independent process. 
Short unscheduled downtime starting in a few minutes [12:43:29] Logged the message, Master [12:52:32] (03CR) 10Hashar: "recheck" [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 (owner: 10Hashar) [12:52:55] * hashar flex [12:53:48] !log Zuul upgraded (git tag wmf-deploy-20140604 ). Merges are now done by an independent process zuul-merger [12:53:53] Logged the message, Master [12:54:02] (03Abandoned) 10Hashar: Jenkins job validation (DO NOT SUBMIT) [operations/puppet] - 10https://gerrit.wikimedia.org/r/89002 (owner: 10Hashar) [12:54:24] godog: all went fine apparently. Thanks! [12:55:07] yay for rcstream work too :) [12:57:59] <_joe_> YuviPanda: http://stream.wikimedia.org/rc [12:58:23] <_joe_> it is live - ssl is coming when we have a cert [12:59:32] _joe_: hmm, still giving me a 404. also interesting that if I try https, it resolves to an ipv6 address and fails to connect [12:59:33] curl: (7) Failed to connect to 2620:0:861:ed1a::3:15: No route to host [12:59:37] _joe_: but still YAY! :D [13:00:04] The time is nigh to deploy Fundraising (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140604T1300) [13:00:14] <_joe_> YuviPanda: the ipv6 address is the one of rcstream [13:00:27] <_joe_> YuviPanda: you need to connect with websockets :) [13:00:56] _joe_: aaah, right. nevermind me :D [13:01:15] * YuviPanda should setup a Redis pubsub rebroadcaster on tools at some point [13:02:03] <_joe_> YuviPanda: http://stream.wikimedia.org/rc/rcstream_status [13:02:29] wooo! 
[13:03:59] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 21:59:16 UTC [13:20:29] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 303 seconds [13:20:49] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 315 seconds [13:23:04] <_joe_> mmmh that does not really make sense [13:23:29] <_joe_> the replication lag is growing and I don't see replication statements flowing [13:24:26] _joe_: it's a schema change problem [13:24:50] <_joe_> springle: ok that was my next option [13:25:44] <_joe_> so what did you do to solve that? [13:25:58] <_joe_> it was just a transient lag due to an alter table? [13:27:11] (03PS4) 10Giuseppe Lavagetto: Monitoring: monitor eventlogging thresholds [operations/puppet] - 10https://gerrit.wikimedia.org/r/137280 (https://bugzilla.wikimedia.org/65482) (owner: 10Nuria) [13:27:29] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 94 seconds [13:27:49] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 48 seconds [13:27:59] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [13:28:35] (03PS5) 10Nemo bis: Monitoring: monitor eventlogging thresholds [operations/puppet] - 10https://gerrit.wikimedia.org/r/137280 (https://bugzilla.wikimedia.org/65482) (owner: 10Nuria) [13:30:12] <_joe_> Nemo_bis: grazie! I did not get that typo [13:30:27] :) [13:30:30] (03CR) 10Giuseppe Lavagetto: [C: 032] Monitoring: monitor eventlogging thresholds [operations/puppet] - 10https://gerrit.wikimedia.org/r/137280 (https://bugzilla.wikimedia.org/65482) (owner: 10Nuria) [13:31:02] _joe_: actually, i'm wrong. 
db1007 crashed [13:31:13] the schema change finished an hour ago [13:31:28] uptime 945s [13:31:44] mysqld uptime [13:31:46] <_joe_> and there was another one [13:32:09] <_joe_> springle: I've seen some bad dmesg things, but no one related to mysql [13:32:32] _joe_: another one what? [13:34:17] <_joe_> springle: I was still speaking with Nemo_bis sorry [13:34:21] <_joe_> :P [13:34:31] <_joe_> I'm on my mobile and I do have some lag [13:35:32] :) [13:38:42] (03PS1) 10Giuseppe Lavagetto: eventlogging: moving monitor graphite class in autoload layout [operations/puppet] - 10https://gerrit.wikimedia.org/r/137303 [13:41:11] (03CR) 10Giuseppe Lavagetto: [C: 032] eventlogging: moving monitor graphite class in autoload layout [operations/puppet] - 10https://gerrit.wikimedia.org/r/137303 (owner: 10Giuseppe Lavagetto) [13:47:59] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Wed 04 Jun 2014 10:47:34 UTC [13:49:12] (03Abandoned) 10coren: Tools: Alias tools.wmflabs.org to internal webproxy [operations/puppet] - 10https://gerrit.wikimedia.org/r/123149 (https://bugzilla.wikimedia.org/54052) (owner: 10Tim Landscheidt) [14:07:49] coren: labstore1001: LACP is not working for the link aggregation. The issue is definitely in the switch config [14:08:48] cmjohnson1: Yeah, but I don't know *what* is broken in the link config. [14:09:29] the ports are not partnering up [14:10:07] http://pastebin.com/tKL5es8v [14:21:20] ottomata: can you unlock archiva-deploy? [14:22:16] ja sorry, in meeting was about to do that, will do that soon [14:33:16] I got some high latency on fenari /home/ which is the pmtpa NFS nas [14:33:40] nas1-a.pmtpa.wmnet [14:38:02] (03CR) 10Tim Landscheidt: "No, it doesn't looking at gi11es: ping, 10 minutes until SWAT [14:50:44] manybubbles: Which of us wants to do SWAT today? 
[14:51:00] anomie: I'm not doing anything important right now so I'll do it [14:51:05] manybubbles: ok [14:51:06] I've been emailing all morning [14:51:31] I've been code-reviewing all morning [14:51:42] anomie: that is more productive, probably [14:51:49] anomie: manybubbles I'm going to be adding another couple of MobileApp patches as well. super-minor, just LESS changes for the app (fetched only by the app, which is going to go into beta later today) [14:51:52] manybubbles: unlocked [14:51:52] s/app/android app/ [14:52:19] oo, btw manybubbles, there is a newer version of Archiva out that supposedly works better with ldap [14:52:30] gi11es: I see some TODO disable by in that commit [14:52:32] would be really nice to be able to log in with our ldap creds and just set people's roles properly [14:52:34] ottomata: you wanna upgrade? I have no problem with it [14:52:36] one day... [14:52:46] YuviPanda: sure - just add it to the list and I'll have a look at it [14:54:18] manybubbles: hmm, looks like nobody in the mobile apps team has +2 perms on the wmfNN branches. think you can +2 them so I can submit submodule bump patches? [14:55:21] anomie: ^ [14:55:39] (03CR) 10Manybubbles: "These arrays look copy and paste-y. I'm happy to deploy them if we're sure they're right and its a temporary explosion. We should come u" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137276 (owner: 10Gilles) [14:56:04] YuviPanda: you want me to +2 the submodule update? [14:56:07] that's easy [14:56:25] YuviPanda: I don't see the patches I'm supposed to +2 anywhere? [14:57:09] anomie: manybubbles https://gerrit.wikimedia.org/r/#/q/project:mediawiki/extensions/MobileApp+status:open,n,z [14:57:33] !log cleaning up duplicate cronjobs on terbium - all log to /var/log/mediawiki now [14:57:38] Logged the message, Master [14:57:58] anomie: manybubbles no, these are the cherrypicks to 1.24wmf6 and 1.24wmf7 on the extension's repository. 
these are commits that are already merged on master, but not on the wmf branches since none of the 3 android engineers have +2 rights on wmf branches. [14:58:45] manybubbles: yes, they weren't updated, we decided to keep the older surveys running and remove them all at once [14:58:49] anomie: pong [14:59:37] (03CR) 10Dzahn: [C: 031] rename role::mediawiki::job_runner -> role::mediawiki::jobrunner [operations/puppet] - 10https://gerrit.wikimedia.org/r/137193 (owner: 10Ori.livneh) [15:00:04] manybubbles, anomie, gi11es: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140604T1500) [15:00:18] (03CR) 10Gilles: "They aren't copy pasted, the values are tailored for each metric on each wiki. I actually spent a couple of hours putting that together by" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137276 (owner: 10Gilles) [15:00:45] manybubbles: thanks. I'll submit the submodule bump patches and put them in the calendar right after the cherry-picks get merged. apologies for the last minute job [15:01:08] YuviPanda: does https://gerrit.wikimedia.org/r/#/c/137320/1/less/enwiki.less,cm need to be loaded only for enwiki? I'm not super conversant on android but it looks htmlish enough for me to have some opinions [15:01:23] (03CR) 10Gilles: "There is no way to make it shorter, as a blanket rate for all metrics, even for a given wiki, would mean that some metrics wouldn't have e" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137276 (owner: 10Gilles) [15:01:57] manybubbles: no, it's loaded on all wikis. It's named enwiki.less since that's where the files come from. 
Temporarily in place sinde the app does not load Mediawiki:Mobile.css (similar to Common.css) for each target wiki yet, and since enwiki's has a lot of styles that are re-used everywhere (.hlist for one), we decided to just bundle that in for now [15:02:28] YuviPanda: k [15:03:38] (03CR) 10Dzahn: [C: 032] beta: NFS no longer used for mediawiki deployment [operations/puppet] - 10https://gerrit.wikimedia.org/r/137232 (owner: 10BryanDavis) [15:03:41] YuviPanda: one last check - do I have to do anything special to get the less files deployed or will they deploy like normal, get rendered to css by mediawiki on the fly, and get cached? [15:03:55] manybubbles: nothing special, no. just normal deploys. [15:03:56] +2 on NFS no longer used! [15:04:48] cool. I'll wait on the submodule updates. in the mean time, gi11es, I'll do your changes. any objection to me doing them in one push? [15:04:57] manybubbles: sounds good to me [15:05:00] (03CR) 10Manybubbles: [C: 032] Reduce MediaViewer EventLogging sampling factor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137276 (owner: 10Gilles) [15:05:11] (03Merged) 10jenkins-bot: Reduce MediaViewer EventLogging sampling factor [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137276 (owner: 10Gilles) [15:05:13] (03CR) 10Manybubbles: [C: 032] Enable MediaViewer survey on enwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137237 (owner: 10Gergő Tisza) [15:05:19] (03Merged) 10jenkins-bot: Enable MediaViewer survey on enwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137237 (owner: 10Gergő Tisza) [15:05:50] (03CR) 10Dzahn: [C: 032] rename role::mediawiki::job_runner -> role::mediawiki::jobrunner [operations/puppet] - 10https://gerrit.wikimedia.org/r/137193 (owner: 10Ori.livneh) [15:07:56] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT deploy for media viewer (duration: 00m 13s) [15:08:00] Logged the message, Master [15:08:09] gi11es: synced - 
please verify everything looks as you expect [15:08:24] manybubbles: added to calendar, https://gerrit.wikimedia.org/r/#/c/137330/ and https://gerrit.wikimedia.org/r/#/c/137331/ [15:08:33] YuviPanda: thanks [15:09:14] manybubbles: is there a delay usually? in yesterday's SWAT I didn't see the change immediately, same thing here [15:09:18] YuviPanda: would you mind adding what patches are contained in the bump? [15:09:28] manybubbles: they are in the commit message [15:09:35] gi11es: not really [15:09:48] YuviPanda: kk. I usually try to put them on the page [15:10:17] manybubbles: ok, let me add them. [15:10:21] gi11es: normally it all gets synced immediately. a delay would mean I made a mistake. or something is broken [15:10:39] manybubbles: ok, I'll look into it further [15:10:52] gi11es: you said the other one had a delay? yesterday? [15:11:04] yes, another config change in the same area [15:11:18] gi11es: normally changes to the file a pretty immediate [15:11:45] gi11es: bd808|BUFFER is who i'd ping if everything looks like it worked but there was some kind of delay with the changes taking effect [15:12:15] there is some deep magic that I haven't dug into around how initializesettings is executed [15:12:15] ah, it just appeared [15:12:17] (03CR) 10Dzahn: [C: 04-2] "the module doesn't exist anymore under this name. Ori consolidated applicationserver and mediawiki modules in 98f3808af6dfc38c" [operations/puppet] - 10https://gerrit.wikimedia.org/r/122269 (owner: 10Matanya) [15:12:25] Krinkle, ori: can't say I'm thrilled to review frontend javascript code inside ops/puppet; can we move that piece of code somewhere else? [15:12:29] so yes, same thing as yesterday, there's just a delay between when you ask me to check and when it appears [15:12:40] gi11es: well, weird. maybe bd808|BUFFER will know more. 
[15:12:53] survey works on enwiki, now I'll check EventLogging [15:13:01] paravoid: It's backend javascript actually (it's the code that operates a phantomjs instance using nodejs) [15:13:56] (03Abandoned) 10Yuvipanda: mongo: Support newer yaml style configuration [operations/puppet] - 10https://gerrit.wikimedia.org/r/135499 (owner: 10Yuvipanda) [15:14:24] by frontend I meant that it's "browser" javascript [15:14:45] but in any case, it doesn't matter [15:14:54] manybubbles: I'm seeing the first EventLogging hits with the new values, lgtm [15:14:55] it's not like we have the expertise to review this, or particularly care about it [15:15:05] (03Restored) 10Odder: Close wikimania2013 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104726 (https://bugzilla.wikimedia.org/59157) (owner: 10Odder) [15:15:13] we could say "yeah whatever" and merge everything you submit, but that's kinda wrong in principle [15:15:25] (03PS2) 10Odder: Close wikimania2013 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104726 (https://bugzilla.wikimedia.org/59157) [15:15:31] !log manybubbles Synchronized php-1.24wmf7/extensions/MobileApp/: (no message) (duration: 00m 08s) [15:15:33] gi11es: cool! And the survery is there too? [15:15:36] Logged the message, Master [15:15:44] YuviPanda: that was wmf7 for your change - look good? [15:15:45] manybubbles: yes, survey works on enwiki [15:15:52] manybubbles: verifying now [15:15:54] cool! [15:16:01] gi11es: consider yourself SWATed [15:16:04] (03Abandoned) 10Dzahn: [WIP] Adding research posix group [operations/puppet] - 10https://gerrit.wikimedia.org/r/122401 (owner: 10Ottomata) [15:16:05] manybubbles: thanks a lot! 
[15:16:22] np, all part of the job [15:16:37] manybubbles: yes, testwiki looks good [15:16:55] greg-g: https://gerrit.wikimedia.org/r/#/c/104726/ can be SWATed, I think [15:17:08] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Wed Jun 4 15:17:02 UTC 2014 [15:19:08] twkozlowski: looks like there is consensus on that [15:19:32] * twkozlowski nods [15:19:56] I wonder if it'll let me merge it over reedy' ancient objection [15:20:59] twkozlowski: yep [15:21:44] !log manybubbles Synchronized php-1.24wmf6/extensions/MobileApp/: (no message) (duration: 00m 10s) [15:21:48] Logged the message, Master [15:21:48] YuviPanda: wmf6 is done for you [15:22:14] manybubbles: all good! :D thanks a lot! [15:22:24] * YuviPanda considers himself swatted [15:22:34] (03CR) 10Manybubbles: [C: 032] Close wikimania2013 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104726 (https://bugzilla.wikimedia.org/59157) (owner: 10Odder) [15:22:40] (03Merged) 10jenkins-bot: Close wikimania2013 wiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/104726 (https://bugzilla.wikimedia.org/59157) (owner: 10Odder) [15:23:52] twkozlowski: syncing [15:23:53] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: close wikimania2013wiki (duration: 00m 10s) [15:23:57] done [15:23:58] Logged the message, Master [15:24:00] please verify [15:24:35] https://wikimania2013.wikimedia.org/w/index.php?title=Main_page&diff=27221&oldid=27136 [15:25:42] twkozlowski: so you are saying it isn't closed yet [15:26:08] Yes, I can still edit it. 
[15:26:21] Changes to InitializeSettings.php seem to be taking a few minutes to take effect this morning [15:26:24] which is odd [15:27:22] (03PS1) 10Dzahn: add simple .vimrc to my home dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/137337 [15:28:24] !log manybubbles Synchronized closed.dblist: close wikimania2013wiki (duration: 00m 09s) [15:28:29] Logged the message, Master [15:28:47] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: close wikimania2013wiki for real (duration: 00m 10s) [15:28:52] Logged the message, Master [15:29:19] \o/ [15:29:19] twkozlowski: I made a mistake. I believe that should have fixed it [15:29:24] all closed? [15:29:25] It did, thanks! [15:29:33] wonderful. SWAT over for the day! [15:31:20] (03CR) 10Alexandros Kosiaris: [C: 032] akosiaris dotfiles added [operations/puppet] - 10https://gerrit.wikimedia.org/r/137295 (owner: 10Alexandros Kosiaris) [15:32:03] thanks manybubbles ! :) [15:32:26] greg-g: was an easy one today [15:32:34] glad to hear it [15:32:37] I was mostly ignoring [15:32:48] (03CR) 10Ottomata: [C: 031] "Qchris, is this ready to go? I just saw it sitting in my gerrit queue and remembered about it." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/133089 (https://bugzilla.wikimedia.org/64276) (owner: 10QChris) [15:34:03] (03CR) 10Dzahn: [C: 032] add simple .vimrc to my home dir [operations/puppet] - 10https://gerrit.wikimedia.org/r/137337 (owner: 10Dzahn) [15:36:27] (03CR) 10Faidon Liambotis: [C: 032] "This looks good to me, I'll leave you to deploy/babysit :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136317 (owner: 10Filippo Giunchedi) [15:38:36] (03CR) 10Faidon Liambotis: [C: 032] Revert "removing mw1151 from dsh groups to replace hard drive and reinstall" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136238 (owner: 10Cmjohnson) [15:38:45] (03PS2) 10Faidon Liambotis: Revert "removing mw1151 from dsh groups to replace hard drive and reinstall" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136238 (owner: 10Cmjohnson) [15:38:51] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Revert "removing mw1151 from dsh groups to replace hard drive and reinstall" [operations/puppet] - 10https://gerrit.wikimedia.org/r/136238 (owner: 10Cmjohnson) [15:44:19] (03PS1) 10Alexandros Kosiaris: Purge vimrc.local [operations/puppet] - 10https://gerrit.wikimedia.org/r/137340 [15:50:08] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [15:53:41] (03CR) 10Dzahn: [C: 031] service accounts get 'systemuser' group by default [operations/puppet] - 10https://gerrit.wikimedia.org/r/137204 (owner: 10Rush) [15:54:20] (03PS4) 10Rush: service accounts get 'systemuser' group by default [operations/puppet] - 10https://gerrit.wikimedia.org/r/137204 [15:54:33] (03CR) 10Rush: [C: 032 V: 032] service accounts get 'systemuser' group by default [operations/puppet] - 10https://gerrit.wikimedia.org/r/137204 (owner: 10Rush) [15:55:08] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [15:57:26] (03PS6) 10Christopher Johnson (WMDE): This command is in Perl and has 
several external Perl module dependencies. Most are available as debian packages, but JSON::Path is not. [operations/puppet] - 10https://gerrit.wikimedia.org/r/136095 [15:58:52] (03PS1) 10RobH: dns setup for server lead for mail use [operations/dns] - 10https://gerrit.wikimedia.org/r/137345 [15:59:29] !log killing puppet certs,salt keys for solr100[13].eqiad - decom [15:59:33] argh [15:59:34] Logged the message, Master [15:59:42] i hate how our dns templates are now a horrible mix of spaces and tabs intermingled [15:59:56] * robh realizes the answer is to just fix it, but isn't prepared to do that just yet ;) [16:00:08] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [16:00:43] (03CR) 10Alexandros Kosiaris: [C: 032] Adding Ferm rule for OCG HTTP [operations/puppet] - 10https://gerrit.wikimedia.org/r/137247 (owner: 10Mwalker) [16:01:24] (03CR) 10RobH: [C: 032] dns setup for server lead for mail use [operations/dns] - 10https://gerrit.wikimedia.org/r/137345 (owner: 10RobH) [16:01:49] akosiaris: ..just while i was reading it :) [16:02:06] it got moved to 8000 the other day [16:02:26] yes I noticed, thanks to you :-) [16:04:38] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 21:59:16 UTC [16:05:08] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [16:05:27] re: lvs3002 - puppet disabled (human or bug) [16:10:08] PROBLEM - check_mysql on lutetium is CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [16:11:05] !log installing package upgrades on iron [16:11:10] Logged the message, Master [16:12:41] (03PS7) 10Christopher Johnson (WMDE): Icinga: new command "check_dispatch" for Wikidata [operations/puppet] - 10https://gerrit.wikimedia.org/r/136095 [16:13:03] !log installing package upgrades on bast1001 [16:13:08] Logged the message, Master [16:15:08] PROBLEM - check_mysql on lutetium is 
CRITICAL: Slave IO: Yes Slave SQL: No Seconds Behind Master: (null) [16:16:18] PROBLEM - DPKG on bast1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:17:18] RECOVERY - DPKG on bast1001 is OK: All packages OK [16:20:08] RECOVERY - check_mysql on lutetium is OK: Uptime: 4951935 Threads: 2 Questions: 29128309 Slow queries: 9648 Opens: 16486 Flush tables: 2 Open tables: 64 Queries per second avg: 5.882 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [16:20:59] chasemp: neon doesnt like the systemusers change [16:21:12] yeah saw it on aluminum checking it out now [16:21:16] cool [16:21:24] I think anything w/ two it's failing on dupe for that group [16:21:34] trying to figure out who we should handle it [16:21:47] * mutante nods [16:22:38] interestingly the dupe is supposed to be in the same place.. [16:23:10] i mean .. systemuser.pp at line 28; .. cannot redefine at systemuser.pp:28 [16:25:39] (03Abandoned) 10Tim Landscheidt: Revert "toollabs: Remove unused and empty webproxy role" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137243 (owner: 10Tim Landscheidt) [16:28:38] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [16:29:40] Hm.. 
I'm working on the incident report for the may 16/17 bits issue [16:29:56] I'm trying to figure out why this function returns false if memcached is unable to set/get any value [16:29:56] https://github.com/wikimedia/mediawiki-extensions-Gadgets/blob/c8cfeae0da860d3d7e1192121d1697c38b14c95f/Gadgets_body.php#L351-L414 [16:30:08] (03PS1) 10Rush: being systemuser group into class [operations/puppet] - 10https://gerrit.wikimedia.org/r/137348 [16:30:19] it looks like it will "just" be computational and calculate it each time [16:30:21] it shouldn't return false [16:30:31] (03PS8) 10BryanDavis: Labs: Fix beta to work with role::mediawiki [operations/puppet] - 10https://gerrit.wikimedia.org/r/134519 [16:31:38] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Wed 04 Jun 2014 13:30:33 UTC [16:32:57] (03PS2) 10Rush: bring systemuser group into class systemuser::groups [operations/puppet] - 10https://gerrit.wikimedia.org/r/137348 [16:33:34] (03CR) 10Rush: [C: 032 V: 032] "needed to fix puppet" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137348 (owner: 10Rush) [16:36:38] !log blog.wikimedia.org updated to latest wp version [16:36:43] Logged the message, Master [16:38:42] (03CR) 10Jgreen: [C: 031] Allowing a configurable StatsD server [operations/puppet] - 10https://gerrit.wikimedia.org/r/137278 (owner: 10Mwalker) [16:41:51] (03CR) 10BryanDavis: "Removed l10nupdate_gid now that gid in labs is 10002" [operations/puppet] - 10https://gerrit.wikimedia.org/r/134519 (owner: 10BryanDavis) [16:53:59] (03PS3) 10Aaron Schulz: Disable job restarter if run_jobs_enabled is false [operations/puppet] - 10https://gerrit.wikimedia.org/r/137229 [16:58:46] bd808: MaxSem btw, I'm going to go camp out next to our Deploy Training room as soon as I can (aka: I'm going to try to get out of the SoS quickly). 
[16:59:35] greg-g: coolio [17:00:04] yurik: The time is nigh to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140604T1700) [17:13:34] (03CR) 10Chad: [C: 031] "Can we get this live? This has been busted for the last ~24 hours." [operations/puppet] - 10https://gerrit.wikimedia.org/r/137229 (owner: 10Aaron Schulz) [17:13:59] (03PS1) 10John F. Lewis: Add Portal to scowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137353 (https://bugzilla.wikimedia.org/66107) [17:15:40] ^d: heh, redis-jobqueue.log will probably crash vim if you open it [17:16:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0] [17:16:38] (03CR) 10Chad: "Another hour, another spike." [operations/puppet] - 10https://gerrit.wikimedia.org/r/137229 (owner: 10Aaron Schulz) [17:17:28] <^d> paravoid: Can we get a merge on this? ^ [17:18:28] ^d: I'm about to run into a meeting, then off for the day [17:18:33] so, I can merge, but I can't babysit it [17:18:40] chasemp maybe? [17:18:42] or ottomata? [17:18:55] <^d> Anyone who's willing and able :) [17:19:04] this https://gerrit.wikimedia.org/r/#/c/137229/ [17:19:05] ? [17:19:08] yeah [17:19:10] <^d> Yep. [17:19:30] looks correct to me from a quick glance but if it's not, it'll be high impact [17:19:44] I don't really understand the change [17:19:57] as I don't know what job restarter is :) [17:20:09] <^d> run_jobs_enabled => true is true on all job runners, except osmium (test host for hhvm) [17:20:25] <^d> osmium is still running this cron so we want it disabled as it's pulling jobs and failing. [17:20:26] so this should stop it on the one host [17:20:29] <^d> Yep. 
[17:20:30] got it [17:20:32] cool w/ that [17:20:42] (03PS4) 10Rush: Disable job restarter if run_jobs_enabled is false [operations/puppet] - 10https://gerrit.wikimedia.org/r/137229 (owner: 10Aaron Schulz) [17:21:01] (03CR) 10Rush: [C: 032] Disable job restarter if run_jobs_enabled is false [operations/puppet] - 10https://gerrit.wikimedia.org/r/137229 (owner: 10Aaron Schulz) [17:21:03] jouncebot: next [17:21:04] In 2 hour(s) and 38 minute(s): Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140604T2000) [17:21:13] jouncebot: refresh [17:21:14] I refreshed my knowledge about deployments. [17:21:21] jouncebot: next [17:21:21] In 0 hour(s) and 38 minute(s): Deployment training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140604T1800) [17:21:26] (03CR) 10Rush: [V: 032] Disable job restarter if run_jobs_enabled is false [operations/puppet] - 10https://gerrit.wikimedia.org/r/137229 (owner: 10Aaron Schulz) [17:21:41] bd808: good job/thinking there [17:22:01] jouncebot: pick deployer [17:22:07] jouncebot: random [17:22:47] looks ok on osmium notice: /Stage[main]/Mediawiki::Jobrunner/Cron[mw-job-restarter]/ensure: removed [17:23:17] mw1001-1016 are jobrunners [17:24:56] <^d> We don't need a force run there, should be zero change on mw1001-16. [17:24:58] looks ok on mw1001 [17:25:03] <^d> Good. [17:25:04] <^d> :) [17:25:22] ^d it seems you did not explode everything, well played [17:25:34] <^d> Yay no explosions! [17:25:37] <^d> Thanks for the merge. [17:25:55] (03CR) 10Dzahn: [C: 031] Purge vimrc.local [operations/puppet] - 10https://gerrit.wikimedia.org/r/137340 (owner: 10Alexandros Kosiaris) [17:26:26] ^d: now set a timer to see what happens next time ;) [17:27:56] !log stopping puppet/salt on solr100[13], removed from icinga [17:28:01] Logged the message, Master [17:28:54] will the job_runner change eventually fix "Number of mediawiki jobs queued" anomaly et al? [17:29:21] <^d> Yes. 
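The "Cron[mw-job-restarter]/ensure: removed" notice above reflects a general property of declarative configuration management: a resource that is simply no longer declared is left alone, while one declared as absent is actively removed. The toy model below is a sketch of that idea only, not Puppet's implementation, and the helper names `apply_catalog` and `catalog_for` are hypothetical; how the merged change actually expresses this is not shown in the log.

```python
def apply_catalog(existing, catalog):
    """Apply a toy 'catalog' to a set of cron job names found on a host.

    catalog maps a job name to "present" or "absent". Names that are not
    in the catalog are unmanaged and therefore left untouched.
    """
    result = set(existing)
    for name, ensure in catalog.items():
        if ensure == "present":
            result.add(name)
        elif ensure == "absent":
            result.discard(name)
        else:
            raise ValueError("unknown ensure value: %r" % (ensure,))
    return result


def catalog_for(run_jobs_enabled, explicit_absent):
    """Build the toy catalog for one host (hypothetical helper)."""
    if run_jobs_enabled:
        return {"mw-job-restarter": "present"}
    # Either declare the job as absent, or do not declare it at all.
    return {"mw-job-restarter": "absent"} if explicit_absent else {}


host_crons = {"mw-job-restarter"}  # a host that already has the cron job

# Not declaring the job leaves the stale cron entry in place.
print(apply_catalog(host_crons, catalog_for(False, explicit_absent=False)))
# Declaring it absent removes it.
print(apply_catalog(host_crons, catalog_for(False, explicit_absent=True)))
```

In this model the first call returns the set still containing mw-job-restarter and the second returns an empty set, which matches the behaviour seen on osmium once the job was explicitly ensured absent.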
[17:29:24] :) [17:29:30] <^d> Well, not queued. Popped. [17:29:46] CRITICAL: Anomaly detected: 34 data above and 0 below the confidence bounds [17:29:57] 34 data = 34 jobs? [17:30:00] <^d> Oh, maybe not that. [17:30:15] ah,ok [17:30:20] hey i'm around [17:31:31] (03PS1) 10John F. Lewis: Add otherProjects for kowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137356 (https://bugzilla.wikimedia.org/66128) [17:33:43] ottomata: hey, i took the liberty to abandon your change about the research group, just because we did it in yaml meanwhile.. ok? [17:34:20] yup, danke [17:34:38] <^d> mutante: http://gdash.wikimedia.org/dashboards/jobq/ is what we're fixing here. See last ~24h with the crazy job pop rate. [17:34:42] (03CR) 10Revi: [C: 031] Add otherProjects for kowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137356 (https://bugzilla.wikimedia.org/66128) (owner: 10John F. Lewis) [17:34:54] <^d> *rate spikes [17:35:59] ^d: wow ... the 1 week graph [17:39:49] akosiaris: "Now that anyone can have his own personal account", what do you mean by that? [17:42:24] Krinkle: ssh akosiaris@whatever.cluster.wmnet|wikimedia.org [17:42:51] s/cluster/site but anyway [17:42:54] deploying zero ext... [17:44:20] !log yurik Synchronized php-1.24wmf6/extensions/ZeroRatedMobileAccess/: (no message) (duration: 01m 06s) [17:44:25] Logged the message, Master [17:44:54] mw1151 had perm error [17:45:31] akosiaris: Right, that's what I thought. But is that new? [17:46:34] PROBLEM - Puppet freshness on tmh1001 is CRITICAL: Last successful Puppet run was Wed 04 Jun 2014 14:46:03 UTC [17:46:46] Krinkle: on a variety of boxes yes [17:46:50] cool [17:47:15] akosiaris: btw, is there a recommended way of linking dotfiles within the cluster? [17:47:20] e.g. 
PS1 and various aliases [17:47:33] !log yurik Synchronized php-1.24wmf7/extensions/ZeroRatedMobileAccess/: (no message) (duration: 01m 07s) [17:47:38] Logged the message, Master [17:48:03] Krinkle: https://gerrit.wikimedia.org/r/#/c/137295/ [17:48:23] I frequently work on about 5-6 different boxes (mostly bastion, tin, gallium, and lanthanum). Woudl be nice to keep them in sync somehow [17:48:33] Hm.. I see [17:48:41] don't abuse though [17:49:28] I have a very modular dotfiles directory. Powered by git to push/pull changes [17:49:36] https://github.com/Krinkle/dotfiles [17:50:06] when the hostname is my own box, it even asks for root and goes into puppet-like provision (so that I can document and reprovision my local dev env when I loose my laptop) [17:50:13] anyhow.. [17:50:44] That doesn't work anywhere other than tin due to firewall (which makes sense) [17:50:55] and even there I don't use it for safety reasons [17:51:04] (manually scp'ed from local) [17:51:23] Alrightly, thx. I might compile it into a smaller set of files as minimal version and push that into puppet. [17:51:23] Thx [17:52:09] you are welcome [17:53:27] greg-g: bd808 yurik: re mw1151 /hw repair workflow .. https://gerrit.wikimedia.org/r/#/c/136238/ [17:53:39] akosiaris: when I don't have to worry abotu wmf-production this is pretty sweet. I use it in labs everywhere, toolserver, and random other boxes on the net. Just dotfiles-pull and it does a "safe" git fetch, shows a diff, asks for confirmation, and applies locally. [17:54:58] !log shutting down solr1001-1003 [17:55:03] Logged the message, Master [17:59:04] search people, you can have that hardware ^^ [17:59:10] quote "The search team might want to use them for Elasticsearch, in which case they will shortly submit another ticket for repartitioning with a dependency on this one." 
[17:59:24] ah, i see the ticket [17:59:35] (03CR) 10Dereckson: [C: 031] Add Portal to scowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137353 (https://bugzilla.wikimedia.org/66107) (owner: 10John F. Lewis) [18:00:05] MaxSem, bd808: The time is nigh to deploy Deployment training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140604T1800) [18:00:05] Krinkle: I like the quotes around the safe word :-). Honestly I think it is pretty cool. I stopped messing too much with my environment at some point. Probably around all the KDE3=>KDE4 (aaah falling back to gnome 2) situation. I probably should have persisted a bit more on the shell customizations. [18:01:48] PROBLEM - Host solr1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:01:48] PROBLEM - Host solr1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:03:20] wth. not authorized anymore for icinga commands but i'm logged in [18:03:24] any changes there? [18:03:53] (03PS1) 10Ori.livneh: migrate ::imagescaler -> ::mediawiki::multimedia [operations/puppet] - 10https://gerrit.wikimedia.org/r/137363 [18:03:54] maybe caps ? [18:03:55] (03Restored) 10Ottomata: Rsyncing slow-parse logs from fluorine to dumps.wikimedia.org. [operations/puppet] - 10https://gerrit.wikimedia.org/r/49678 (owner: 10Ottomata) [18:03:57] mutante: ^ [18:04:20] (03CR) 10John F. Lewis: [C: 031] "Seems good." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136284 (https://bugzilla.wikimedia.org/65938) (owner: 10Reza) [18:04:25] like DZahn ? or something like that [18:04:39] ACKNOWLEDGEMENT - Host solr1003 is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris dzahn said so [18:04:57] mutante: is that what you wanted ? [18:05:05] (03CR) 10jenkins-bot: [V: 04-1] migrate ::imagescaler -> ::mediawiki::multimedia [operations/puppet] - 10https://gerrit.wikimedia.org/r/137363 (owner: 10Ori.livneh) [18:05:18] i think that is what he watned! 
[18:05:22] solr hosts down [18:05:54] ACKNOWLEDGEMENT - Host solr1002 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn repurposed [18:06:03] (03PS2) 10Ori.livneh: migrate ::imagescaler -> ::mediawiki::multimedia [operations/puppet] - 10https://gerrit.wikimedia.org/r/137363 [18:06:13] akosiaris: yes, caps , thanks [18:09:47] ottomata: did you want to reinstall those? ( because i saw you comment on ticket) [18:10:11] <^d> I saw the comment earlier about them needing approval first? [18:10:39] yup, can do mutante [18:10:42] not going to get to it today [18:10:46] but its on my todo list anyway [18:11:02] they are already existing hardware, not sure about approval part honestly [18:11:12] but yea, paravoid moved them to procurement queue [18:11:16] ottomata: cool [18:12:33] (03CR) 10Ori.livneh: [C: 04-2] "Please see to it that bug 65591 is fixed rather than add a kludge to the mediawiki module. I will -2 any realm-branching in modules." (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/134519 (owner: 10BryanDavis) [18:13:30] mutante: yeah, i guess, who approves that? [18:14:30] ottomata: i think Mark [18:14:59] well, if it's buying new stuff [18:15:19] let's ask paravoid ? 
[18:18:19] mark [18:22:18] fyi, do not reuse hostnames in labs if you can help it [18:22:26] things get wackyyyy [18:28:34] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Last successful Puppet run was Wed 04 Jun 2014 15:28:19 UTC [18:33:34] PROBLEM - Puppet freshness on stat1003 is CRITICAL: Last successful Puppet run was Wed 04 Jun 2014 15:33:07 UTC [18:35:00] (03CR) 10MaxSem: [C: 032] Removed obsolete $wmgZeroDisableImages and $wgZeroDisableImages [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136487 (owner: 10Yurik) [18:35:01] we're about to deploy some things along with the deploy training [18:35:04] that ^ [18:35:08] (03Merged) 10jenkins-bot: Removed obsolete $wmgZeroDisableImages and $wgZeroDisableImages [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136487 (owner: 10Yurik) [18:40:32] !log maxsem Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/136487/ (duration: 00m 04s) [18:40:44] Logged the message, Master [18:40:46] chasemp: err: Could not retrieve catalog from remote server: Error 400 on SERVER: trebuchet systemuser must specify an array for groups at /etc/puppet/modules/generic/manifests/systemuser.pp:26 on node tin.eqiad.wmnet [18:41:08] I assumed that is you [18:42:11] akosiaris: yes thanks! on it [18:43:01] !log mw1151 gave an ssh denied error for MaxSem during sync-dir [18:43:05] Logged the message, Master [18:43:37] ssh mw1151 -- Permission denied (publickey). 
for me as well [18:44:01] see also: https://bugzilla.wikimedia.org/show_bug.cgi?id=66050 [18:44:18] see also: https://bugzilla.wikimedia.org/show_bug.cgi?id=65424 [18:45:20] see also: https://gerrit.wikimedia.org/r/#/c/136238/ (put mw1151 back into dsh) [18:53:25] (03PS1) 10Rush: fix trebuchet groups to be an array [operations/puppet] - 10https://gerrit.wikimedia.org/r/137374 [18:53:49] (03PS2) 10Rush: fix trebuchet groups to be an array [operations/puppet] - 10https://gerrit.wikimedia.org/r/137374 [18:53:53] (03CR) 10Rush: [C: 032 V: 032] fix trebuchet groups to be an array [operations/puppet] - 10https://gerrit.wikimedia.org/r/137374 (owner: 10Rush) [18:54:13] here we go again, another deploy [18:58:55] !log maxsem Synchronized php-1.24wmf6/extensions/GeoData/: (no message) (duration: 00m 03s) [18:59:00] Logged the message, Master [19:00:04] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [19:00:15] bd808|deploy: ^^ [19:00:37] (03PS1) 10Manybubbles: Update plugins for Elasticsearch 1.2.1 [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/137376 [19:00:58] !log maxsem Synchronized php-1.24wmf7/extensions/GeoData/: (no message) (duration: 00m 04s) [19:01:02] Logged the message, Master [19:01:05] (03CR) 10Manybubbles: [C: 04-1] "Deployment must be synchronized with Elasticsearch 1.2.1. Will deploy to beta now." [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/137376 (owner: 10Manybubbles) [19:01:16] blerg. The elasticsearch nodes for logstash are gc thrashing [19:04:39] (03CR) 10Lydia Pintscher: [C: 031] "Good to go from Wikidata product management perspective" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137356 (https://bugzilla.wikimedia.org/66128) (owner: 10John F. 
Lewis) [19:04:40] !log Restarted elasticsearch on logstash1001; JVM OOM [19:04:44] Logged the message, Master [19:05:34] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 21:59:16 UTC [19:11:48] (03PS1) 10Rush: webperf use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137377 [19:11:50] (03PS1) 10Rush: txstatsd use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137378 [19:11:52] (03PS1) 10Rush: tcpircbot use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137379 [19:11:54] (03PS1) 10Rush: spamassassin use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137380 [19:11:56] (03PS1) 10Rush: fundraiding user add to systemusers [operations/puppet] - 10https://gerrit.wikimedia.org/r/137381 [19:11:58] (03PS1) 10Rush: rcstream use generic::systemusers [operations/puppet] - 10https://gerrit.wikimedia.org/r/137382 [19:12:00] (03PS1) 10Rush: puppetmaster use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137383 [19:12:02] (03PS1) 10Rush: ocg use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137384 [19:12:04] (03PS1) 10Rush: mysql_wmf use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137385 [19:12:06] (03PS1) 10Rush: mwprof use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137386 [19:12:08] (03PS1) 10Rush: mediawiki use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137387 [19:12:10] (03PS1) 10Rush: limn use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137388 [19:12:12] (03PS1) 10Rush: librenms use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137389 [19:12:14] (03PS1) 10Rush: jenkins user add to systemusers [operations/puppet] - 10https://gerrit.wikimedia.org/r/137390 [19:12:16] (03PS1) 10Rush: ipythong use 
generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137391 [19:12:18] (03PS1) 10Rush: gitblit use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137392 [19:12:20] (03PS1) 10Rush: eventlogging use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137393 [19:12:22] (03PS1) 10Rush: kiwix use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137394 [19:12:24] (03PS1) 10Rush: authdns use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137395 [19:12:26] (03PS1) 10Rush: nova use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137396 [19:12:28] (03PS1) 10Rush: icinga use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137397 [19:12:45] new type of spam :D [19:13:08] stay tuned for the 7 oclock show [19:18:22] is something up with bits? [19:18:31] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=true&lang=en&modules=mobile.app.pagestyles.android&only=styles&skin=vector is getting me different results now and then [19:18:33] not reedy spam [19:18:43] chasemp: why systemuser and not group / user ? [19:19:12] sometimes is returning me a version that should've been cleared out weeks ago [19:19:14] two reasons one ideological and one practical [19:19:17] kinda. 
/me is confused [19:19:50] ideology is we should manage system / service users through the intended defined type so we can make changes and / or deal with them en masse; the "there are no special snowflakes" argument, I guess [19:20:13] practical is the new user / group cleanup logic assumes all users are in a supplementary group that implies use [19:20:21] even our defined system users [19:20:34] PROBLEM - Puppet freshness on stat1002 is CRITICAL: Last successful Puppet run was Wed 04 Jun 2014 16:19:33 UTC [19:20:39] this prevents them from getting messed with and it's done in one place [19:21:24] ^stat1002 I'm looking into [19:21:28] chasemp: hm, kk [19:21:44] (03PS2) 10Ori.livneh: ipython use generic::systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/137391 (owner: 10Rush) [19:21:49] (fixed typo) [19:21:52] thanks [19:23:55] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 103: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [19:24:23] \o/ Stupid elasticsearch host is less sad [19:24:33] bd808|deploy: what happened to it? [19:24:51] manybubbles: heap thrash up to oom [19:25:04] ah - well, at least the check isn't full of shit this time [19:25:08] it's a real problem [19:25:12] Not sure why yet. I haven't looked at the logs deeply [19:25:34] PROBLEM - Puppet freshness on stat1001 is CRITICAL: Last successful Puppet run was Wed 04 Jun 2014 16:24:26 UTC [19:25:44] I think that many people were asking logstash questions and it filled the caches [19:26:36] bd808|deploy: the new version of Elasticsearch (1.2.1) should help with that somewhat - they are adding things like circuit breakers for queries that take up too much ram [19:27:49] Nice.
I have a "purge the cache" script that can help if I get to it in time but I waited too long today :( [19:29:34] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [19:29:53] (03PS1) 10Rush: manifests/misc/statistics.pp specify group as array [operations/puppet] - 10https://gerrit.wikimedia.org/r/137400 [19:30:13] (03CR) 10Rush: [C: 032 V: 032] manifests/misc/statistics.pp specify group as array [operations/puppet] - 10https://gerrit.wikimedia.org/r/137400 (owner: 10Rush) [19:30:24] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Wed Jun 4 19:30:22 UTC 2014 [19:31:44] RECOVERY - Puppet freshness on stat1002 is OK: puppet ran at Wed Jun 4 19:31:43 UTC 2014 [19:32:17] (03PS1) 10Manybubbles: Dynamic scripting for Elasticsearch 1.2.X [operations/puppet] - 10https://gerrit.wikimedia.org/r/137404 [19:32:50] (03CR) 10Manybubbles: "When we upgrade to 1.3.X we'll need to deploy the mvel plugin or else our scripts will break...." [operations/puppet] - 10https://gerrit.wikimedia.org/r/137404 (owner: 10Manybubbles) [19:33:14] RECOVERY - Puppet freshness on stat1003 is OK: puppet ran at Wed Jun 4 19:33:09 UTC 2014 [19:34:45] (03CR) 10Manybubbles: "Correction, mvel will remain available in Elasticsearch until 2.0 but we'll need to keep this setting to use it. After we upgrade to 1.3." [operations/puppet] - 10https://gerrit.wikimedia.org/r/137404 (owner: 10Manybubbles) [19:35:20] (03CR) 10Jgreen: [C: 032] Allowing a configurable StatsD server [operations/puppet] - 10https://gerrit.wikimedia.org/r/137278 (owner: 10Mwalker) [19:35:41] ottomata: that puppet change that I just added you to is required if I'm going to deploy elasticsearch 1.2.1 to beta which I'd _like_ to do today. I can wait if it needs it. [19:35:48] but if you don't have anything better to do.... [19:36:23] manybubbles: the dynamic scripting one? 
[19:36:48] yeah [19:41:37] csteipp: regarding the tls change https://gerrit.wikimedia.org/r/#/c/132393/ are you too busy with more important stuff to answer? [19:43:36] jzerebecki: Yes, I've been working on mediawiki security fixes most of this week. I'm right now in the middle of looking through our event log and getting stats for https. I'm going to have analytics verify I'm doing it right, but for example, mobile devices sometimes have 10x connection overhead for ssl (just to put things in perspective :) [19:54:34] RECOVERY - Puppet freshness on stat1001 is OK: puppet ran at Wed Jun 4 19:54:29 UTC 2014 [19:55:59] greg-g, since max couldn't deploy my new extensions to the beta during tutorial, what would be a good time for it? [20:00:04] gwicke, subbu: The time is nigh to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140604T2000) [20:04:53] gwicke, could you let me know when you are done? There seems to be an hour before SWAT, maybe greg-g will let me have it ) [20:08:37] csteipp: a fresh session where the certificate and chain are much bigger than the actual content transferred? [20:17:18] oops, sorry manybubbles, i zoned out, i can merge that for ya [20:17:20] you ready? [20:17:34] ottomata: if you are ready, it is ready - it won't hurt anything to add it [20:17:40] sorry for the trouble [20:18:18] (03PS2) 10Krinkle: Remaining wikis other than enwiki and commonswiki to Cirrus as primary [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136338 (owner: 10Chad) [20:18:22] (03PS2) 10Ottomata: Dynamic scripting for Elasticsearch 1.2.X [operations/puppet] - 10https://gerrit.wikimedia.org/r/137404 (owner: 10Manybubbles) [20:18:25] naw no trouble [20:18:30] (03CR) 10Ottomata: [C: 032 V: 032] Dynamic scripting for Elasticsearch 1.2.X [operations/puppet] - 10https://gerrit.wikimedia.org/r/137404 (owner: 10Manybubbles) [20:18:43] (03CR) 10Krinkle: "Is this scheduled?"
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136338 (owner: 10Chad) [20:19:18] (03CR) 10Chad: [C: 04-2] "Not yet, working on it. -2'd so no one gets trigger happy." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136338 (owner: 10Chad) [20:20:00] yurikR, will let you know [20:33:40] !log deployed Parsoid 165a2042 (deploy sha fc1b1ed4) [20:33:45] Logged the message, Master [20:38:19] yurikR, we're done [20:38:33] greg-g, ping ^ [20:38:43] gwicke, thx! [20:44:09] (03PS1) 10Dzahn: mediawiki module - lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/137452 [20:45:04] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [20:46:59] !log Truncating geo_killlist everywhere [20:47:04] Logged the message, Master [20:47:34] PROBLEM - Puppet freshness on tmh1001 is CRITICAL: Last successful Puppet run was Wed 04 Jun 2014 14:46:03 UTC [20:52:06] (03PS3) 10Yurik: Replace ZRMA with ZeroBanner+JsonConfig on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136503 [20:53:39] (03CR) 10MaxSem: [C: 031] Replace ZRMA with ZeroBanner+JsonConfig on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136503 (owner: 10Yurik) [20:57:24] (03PS1) 10BBlack: bump esams lvs txqlen to 20k [operations/puppet] - 10https://gerrit.wikimedia.org/r/137456 [20:58:03] (03CR) 10BBlack: [C: 032 V: 032] bump esams lvs txqlen to 20k [operations/puppet] - 10https://gerrit.wikimedia.org/r/137456 (owner: 10BBlack) [21:04:55] (03PS1) 10Yurik: * INCOMPLETE * Enable ZeroBanner & ZeroPortal in production [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137457 [21:05:31] (03CR) 10Yurik: [C: 04-2] * INCOMPLETE * Enable ZeroBanner & ZeroPortal in production [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137457 (owner: 10Yurik) [21:07:54] (03CR) 10Rush: [C: 031] "cool" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137452 (owner: 10Dzahn) 
[21:25:03] (03CR) 10Calak: [C: 031] Add Portal to scowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137353 (https://bugzilla.wikimedia.org/66107) (owner: 10John F. Lewis) [21:28:40] (03PS1) 10BryanDavis: beta: Add mediawiki/core/vendor to beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/137463 [21:28:52] (03CR) 10Ori.livneh: [C: 031] "Awesome, thanks" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137452 (owner: 10Dzahn) [21:29:34] PROBLEM - Puppet freshness on tmh1002 is CRITICAL: Last successful Puppet run was Wed 04 Jun 2014 15:28:19 UTC [21:30:14] * ori looks into tmh1002 [21:30:38] as well [21:30:53] ori: related to jobrunner rename? [21:30:59] it looks like it [21:31:07] so annoying! argh. [21:31:58] well, i'm proposing we rename 'imagescaler' to 'multimedia' in https://gerrit.wikimedia.org/r/#/c/137363/ , and my thinking there is that 'imagescaler' is a role [21:32:24] by the same token jobrunner is a role so maybe it's ok to rename the pieces in mediawiki:: to 'jobqueue' [21:32:31] yeah, i think i like that [21:32:59] (03CR) 10Calak: [C: 031] Add otherProjects for kowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137356 (https://bugzilla.wikimedia.org/66128) (owner: 10John F. Lewis) [21:33:56] mutante: thoughts? [21:35:22] it's interesting that it's fine on jobrunners, and that it's tmh1001 (videoscaler) that is complaining [21:35:45] ori: imagescaler -> multimedia sounds good. since it's not just images anymore [21:36:28] ori: not sure about the jobqueue and role part right now [21:36:51] yes, why is it even duplicate there [21:38:02] fine on jobrunners, fine on image scalers, not fine on video scalers [21:39:37] mutante: if you +1 the multimedia change (https://gerrit.wikimedia.org/r/#/c/137363/) we can try applying that, which might resolve it since it changes the ::imagescaler includes for the videoscalers [21:39:44] ori: btw roles.. 
would it be nicer to have role/mediawiki/videoscaler.pp and just one file per actual role? [21:39:54] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.001 second response time [21:39:56] like puppet style guide says for modules [21:39:59] one class per file [21:40:49] mutante: could be; there's still cruft to torch so i'm trying to defer designing the final puppet layout until i'm persuaded that all the crufty stuff is gone [21:40:55] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.010 second response time [21:40:55] ori, mutante, i think that rule isn't as strict for roles...but someone else might not agree :/ [21:41:04] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds [21:41:14] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 35 data above and 0 below the confidence bounds [21:42:04] RECOVERY - Puppet freshness on lvs3002 is OK: puppet ran at Wed Jun 4 21:41:59 UTC 2014 [21:42:05] (03PS4) 10Yurik: Replace ZRMA with ZeroBanner+JsonConfig on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136503 [21:43:24] MaxSem, greg-g, the new version ^ only uses labs files, plus one minor change to mobilelanding.php [21:44:36] (03CR) 10MaxSem: [C: 031] Replace ZRMA with ZeroBanner+JsonConfig on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136503 (owner: 10Yurik) [21:47:25] (03PS7) 10BryanDavis: beta: bring in mediawiki/skins.git [operations/puppet] - 10https://gerrit.wikimedia.org/r/136325 (https://bugzilla.wikimedia.org/65868) (owner: 10Hashar) [21:47:28] MaxSem, but i am still unsure of how to deploy - do i need to add it as a submodule to core? [21:47:59] add what? 
[21:48:06] (03PS2) 10BryanDavis: beta: Add mediawiki/core/vendor to beta [operations/puppet] - 10https://gerrit.wikimedia.org/r/137463 [21:48:18] MaxSem, the two new extensions [21:48:33] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1d:_new_extension [21:48:45] (03PS1) 10Ori.livneh: mediawiki::jobrunner -> mediawiki::jobqueue [operations/puppet] - 10https://gerrit.wikimedia.org/r/137464 [21:48:49] on labs, it will happen automatically [21:49:13] MaxSem, ok, so what are my next steps??? +2 the patch? [21:49:38] yes, and deploy in prod [21:49:45] to ensure consistency [21:50:05] ok, deploying... [21:50:49] (03CR) 10Yurik: [C: 032] Replace ZRMA with ZeroBanner+JsonConfig on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136503 (owner: 10Yurik) [21:50:55] (03Merged) 10jenkins-bot: Replace ZRMA with ZeroBanner+JsonConfig on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136503 (owner: 10Yurik) [21:51:09] (03PS5) 10Ottomata: Add CDH5 support, drop CDH4 support [operations/puppet/cdh4] (cdh5) - 10https://gerrit.wikimedia.org/r/135494 [21:51:46] (03CR) 10Dzahn: [C: 031] "+1 to fix the puppet run on tmh hosts due to the puppet bug, simple rename" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137464 (owner: 10Ori.livneh) [21:51:56] (03PS2) 10Ori.livneh: mediawiki::jobrunner -> mediawiki::jobqueue [operations/puppet] - 10https://gerrit.wikimedia.org/r/137464 [21:52:00] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki::jobrunner -> mediawiki::jobqueue [operations/puppet] - 10https://gerrit.wikimedia.org/r/137464 (owner: 10Ori.livneh) [21:53:24] RECOVERY - Puppet freshness on tmh1001 is OK: puppet ran at Wed Jun 4 21:53:16 UTC 2014 [21:53:24] RECOVERY - Puppet freshness on tmh1002 is OK: puppet ran at Wed Jun 4 21:53:16 UTC 2014 [21:53:31] mutante: ^ :) [21:53:33] nice [21:54:41] !log yurik Synchronized mobilelanding.php: (no message) (duration: 01m 07s) [21:54:45] Logged the message, Master [21:56:12] (03PS2) 
10Dzahn: mediawiki module - lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/137452 [21:56:49] !log yurik Synchronized wmf-config/: https://gerrit.wikimedia.org/r/#/c/136503/ (duration: 01m 07s) [21:56:54] Logged the message, Master [21:57:03] (03PS1) 10BBlack: disable ondemand service w/ cpufrequtils [operations/puppet] - 10https://gerrit.wikimedia.org/r/137465 [21:57:16] (03CR) 10BBlack: [C: 032 V: 032] disable ondemand service w/ cpufrequtils [operations/puppet] - 10https://gerrit.wikimedia.org/r/137465 (owner: 10BBlack) [21:57:20] (03CR) 10Dzahn: [C: 032] mediawiki module - lint fixes [operations/puppet] - 10https://gerrit.wikimedia.org/r/137452 (owner: 10Dzahn) [22:17:10] (03PS1) 10Ori.livneh: simplify mediawiki::php [operations/puppet] - 10https://gerrit.wikimedia.org/r/137470 [22:19:06] (03PS2) 10Ori.livneh: simplify mediawiki::php [operations/puppet] - 10https://gerrit.wikimedia.org/r/137470 [22:19:35] ori: is this still relevant? https://gerrit.wikimedia.org/r/#/c/137263 [22:19:48] I thought I saw _joe_ handle that somewhere [22:19:49] chasemp: totally [22:20:07] chasemp: no, _joe_ had a typo (he referred to it as 'monitoring' somewhere when it should have been 'monitor') [22:20:34] but he made the mistake because that's the local convention, so it's natural to expect it [22:20:39] doesn't yours also use the 'ing ? [22:20:54] or is this two simultaneous things, one 'ing and one 'ingless [22:21:01] _joe_ invoked it as 'monitoring' when it was still 'monitor' [22:21:16] when he realized his mistake, he did the smallest thing to correct the typo, which is to refer to it as 'monitor' [22:21:25] which was the appropriate thing to do [22:21:57] (03CR) 10Rush: [C: 031] "ok" [operations/puppet] - 10https://gerrit.wikimedia.org/r/137263 (owner: 10Ori.livneh) [22:22:03] I think I get it, sounds good [22:22:08] excellent, thanks :) [22:23:54] Hi guys. I'm trying to access analytics1010.eqiad.wmnet, but I get an error.
I wonder if I have access rights to analytics1010 anyway. how can I check on that? [22:25:27] well...what's the error? [22:25:29] HaithamS: you don't appear to have a home dir there [22:26:08] mutante : oh, I see. what should i do in this case? [22:26:22] the group memberships are in puppet/modules/admin/data/data.yaml , you are in "statistics-users" [22:26:52] HaithamS: please send a quick access request to the email address in topic [22:27:18] ok, will do. [22:27:21] thanks [22:28:19] (03PS2) 10Ori.livneh: mediawiki::monitor -> mediawiki::monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/137263 [22:28:22] HaithamS: it's a request to be "analytics-users" [22:28:29] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki::monitor -> mediawiki::monitoring [operations/puppet] - 10https://gerrit.wikimedia.org/r/137263 (owner: 10Ori.livneh) [22:28:32] that group is on 1010 [22:30:36] PROBLEM - Puppet freshness on labstore1001 is CRITICAL: Last successful Puppet run was Tue 03 Jun 2014 16:21:49 UTC [22:30:41] mutante: sent! [22:32:34] (03PS1) 10BBlack: switch RPS to using up_command [operations/puppet] - 10https://gerrit.wikimedia.org/r/137472 [22:35:48] (03CR) 10BBlack: [C: 032 V: 032] switch RPS to using up_command [operations/puppet] - 10https://gerrit.wikimedia.org/r/137472 (owner: 10BBlack) [22:37:04] SWAT question: the doc suggest I should make a submodule update patch if I request a SWAT [22:37:25] HaithamS: received. your ticket number is 7625, the person listed as "on duty" here will handle it [22:37:27] extension SWAT I mean [22:37:59] (03PS1) 10MaxSem: Redirect language variant URLs to mobile [operations/puppet] - 10https://gerrit.wikimedia.org/r/137476 (https://bugzilla.wikimedia.org/51753) [22:38:15] does this mean I should actually merge the patches for the extension's wmfx branch on my own? [22:39:47] Thanks, mutante! 
[22:39:48] tgr: Yes, if you have the rights to do this [22:39:59] The instructions were written not considering the fact that some people may not have those rights [22:40:27] AIUI the SWAT team won't hate you forever if you don't build a submodule update patch, it's just that you'll save them work if you do it [22:40:58] so what happens if the patch does not get SWATted for some reason? [22:41:19] (I am on the SWAT team and that's my view: I'll build submodule update commits for you if needed, and if you're able to build them yourself that's a nice extra and saves me some time. But I don't know what other SWATters do.) [22:41:25] Yeah that's a good point against this practice [22:41:43] Ideally the wmfN branch is minimally out of sync with production [22:55:22] (03PS1) 10Krinkle: webperf/deprecate: Log jqmigrate to statsd under mw.js.deprecate [operations/puppet] - 10https://gerrit.wikimedia.org/r/137484 [22:56:22] (03PS3) 10BBlack: Switch LVS servers to include standard [operations/puppet] - 10https://gerrit.wikimedia.org/r/20681 (owner: 10Faidon Liambotis) [22:57:14] (03CR) 10BBlack: [C: 04-2] "Nothing has really changed here, I just wanted to rebase since it was so old." [operations/puppet] - 10https://gerrit.wikimedia.org/r/20681 (owner: 10Faidon Liambotis) [23:00:04] RoanKattouw, mwalker, ori, MaxSem, JohnLewis: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140604T2300) [23:00:24] Oh sweet, it pings those who have patches now :p [23:00:33] i'll do it [23:00:43] ori, you did it yesterday? [23:00:44] whoa, that's cool [23:00:45] I can do it [23:01:03] or I can do it if someone you know, :p [23:01:25] greg-g, JohnLewis; it'll notify anyone identified with an {{ircnick}} template [23:01:40] mwalker: sure, that'd be awesome [23:01:41] * greg-g highfives mwalker [23:01:46] * tacotuesday gets no notifications [23:01:51] mwalker: Nice [23:01:51] /ignore all the bots! 
[23:02:27] mwalker: didn't work for me though [23:02:35] hmm... that's interesting [23:03:00] (03CR) 10Mwalker: [C: 032] Add otherProjects for kowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137356 (https://bugzilla.wikimedia.org/66128) (owner: 10John F. Lewis) [23:03:07] (03CR) 10Mwalker: [C: 032] Add Portal to scowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137353 (https://bugzilla.wikimedia.org/66107) (owner: 10John F. Lewis) [23:03:09] (03Merged) 10jenkins-bot: Add otherProjects for kowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137356 (https://bugzilla.wikimedia.org/66128) (owner: 10John F. Lewis) [23:03:19] (03Merged) 10jenkins-bot: Add Portal to scowiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/137353 (https://bugzilla.wikimedia.org/66107) (owner: 10John F. Lewis) [23:03:23] {{ircnick|tgr|Gergő Tisza}} - maybe the utf-8 threw it off during parsing? [23:03:37] (03CR) 10Mwalker: [C: 032] Enable MergeHistory for Persian Wikipedia admins [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136284 (https://bugzilla.wikimedia.org/65938) (owner: 10Reza) [23:03:37] EMAILS FFS [23:03:44] (03Merged) 10jenkins-bot: Enable MergeHistory for Persian Wikipedia admins [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/136284 (https://bugzilla.wikimedia.org/65938) (owner: 10Reza) [23:04:29] tacotuesday: you don't get notifications because it's not tuesday :) [23:04:41] And I'm not using my normal nick :p [23:05:55] MergeHistory is being enabled somewhere? [23:05:56] Scary. [23:07:16] Krenair, why scary? [23:07:25] I wasn't aware we'd turned that on anywhere. [23:07:37] * greg-g sighs [23:07:49] Guess we had. [23:07:50] It already on ckbwiki and eswiki something. wikivoyage I think [23:08:00] Can we make a list of "extensions WMF doesn't want to install, or has historical reservations about"? 
:P [23:08:12] greg-g: Special:Oversight :p [23:08:32] greg-g, sure: ^.*$ [23:08:36] lol [23:08:42] greg-g: It's not an extension, it's core :p [23:08:53] I don't think we even enable it by default in master. [23:08:56] greg-g, this had a bug report with some semblance of community support -- but I'm happy to revert if we dont think its a good idea [23:09:13] greg-g, this is actually an obscure core feature. [23:09:33] mwalker: I have no reason to recommend that right now unless someone else in here makes a case for it [23:09:33] nano ~/.bashrc [23:09:38] mwalker: If it wasn't enabled on a few wikis already; Would I be mean to put it on SWAT? :) (yes but that's not the point :p) [23:09:39] grrr [23:09:57] mwalker: for removing it, that is. [23:10:10] greenhac, kk [23:10:16] *greg-g [23:10:17] :) [23:10:24] It seems to be fine on eswikivoyage and ckbwiki anyway :) [23:10:58] brb, going to go try to solve this issue [23:11:12] greg-g: Which issue :o [23:13:06] I was unaware it was enabled anywhere [23:13:56] Krenair: I think it's been on eswikivoyage was a yearish now. Let me check [23:14:56] 8 months actually. 
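Note on the tungsten job-queue alerts in this log ("Anomaly detected: N data above and 0 below the confidence bounds") and the earlier question of whether "34 data" means 34 jobs: the wording suggests the check counts data points in the graphed series that fall outside a confidence band, not jobs. The log does not show the check's actual logic, so the following is only a minimal sketch under that assumption. The function names and the `critical` threshold are hypothetical and are not taken from the real check.

```python
def count_outside_bounds(values, lower, upper):
    """Return (above, below): how many data points fall outside the band."""
    above = below = 0
    for value, lo, hi in zip(values, lower, upper):
        if value is None:  # skip gaps in the series
            continue
        if value > hi:
            above += 1
        elif value < lo:
            below += 1
    return above, below


def classify(values, lower, upper, critical):
    """Hypothetical rule: CRITICAL once enough points leave the band."""
    above, below = count_outside_bounds(values, lower, upper)
    status = "CRITICAL" if above + below >= critical else "OK"
    return status, above, below


# Toy series: 10 normal samples, then 34 samples far above the upper bound.
series = [100.0] * 10 + [900.0] * 34
lower = [50.0] * len(series)
upper = [200.0] * len(series)

status, above, below = classify(series, lower, upper, critical=30)
print(f"{status}: Anomaly detected: {above} data above and {below} "
      "below the confidence bounds")
# prints: CRITICAL: Anomaly detected: 34 data above and 0 below the confidence bounds
```

Read this way, "34 data above" means 34 samples of the pop-rate metric exceeded the expected range, which matches the rate spikes ^d points to on the jobq dashboard.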
[23:15:54] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.002 second response time
[23:18:23] !log mwalker Started scap: Scapping for SWAT; MultiMedia viewer and config changes
[23:18:28] Logged the message, Master
[23:20:03] !log on searchidx1001: as a temporary hack to work around scap disk full errors, set up a bind mount at /usr/local/apache/common-local linking to a directory in /a, by local modification of /etc/fstab
[23:20:08] Logged the message, Master
[23:22:15] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 34 data above and 0 below the confidence bounds
[23:22:54] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.010 second response time
[23:23:04] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 34 data above and 0 below the confidence bounds
[23:23:52] JohnLewis, https://es.wikivoyage.org/w/index.php?title=Especial%3ARegistro&type=merge
[23:24:19] Krenair: Hey, I never said they used it.
[23:25:10] https://ckb.wikipedia.org/w/index.php?title=%D8%AA%D8%A7%DB%8C%D8%A8%DB%95%D8%AA%3A%D9%84%DB%86%DA%AF&type=merge
[23:25:33] Heh, if nobody uses it can we go back to leaving it off? :p
[23:25:59] tacotuesday: eswikivoyage don't use it shall we say :p
[23:26:13] We could just remove it from MW :p
[23:26:17] ... it could be a bug that it's not logging...
[23:26:30] Less likely.
[23:27:36] TimStarling, I just scapped out a change that enabled mergehistory on fawiki -- it seems like that's going to be OK; but I was wondering if you know any reason to roll that change back?
[23:27:59] *I ask because I know only what the feature does, not how it works / any of the history around it
[23:28:30] is it enabled on other wikis?
[23:28:55] It's on eswikivoyage & unused, ckbwiki with 3 log entries.
[23:30:10] that doesn't sound like much of an endorsement...
[23:31:16] heh; not really no -- that's why I'm trying to find people who know why we might not have turned it on for everything -- it's been in core since 1.12
[23:31:48] Because it was scary, irreversible and very untested.
[23:31:59] Well, (2) and (3) lead to (1)
[23:35:31] it's also an O(N) write query, just UPDATE revision SET rev_page=X WHERE rev_page=Y, no limit
[23:36:14] Probably need to touch the MultimediaViewer files, seeing caching weirdness on mw.org
[23:36:29] marktraceur, I suspect that is because scap is not yet done
[23:36:30] it doesn't count the rows first
[23:36:38] Ah, that would do it
[23:36:55] * marktraceur gets out of mwalker's groove
[23:37:13] marktraceur, such a good movie
[23:37:30] And SMASH IT WITH A HAMMER!
[23:37:37] ^ our new deploy strategy
[23:39:03] I'm all for history merges that are less retarded than the delete/undelete trick
[23:39:14] but this seems to be roughly equally retarded
[23:40:40] !log mwalker Finished scap: Scapping for SWAT; MultiMedia viewer and config changes (duration: 22m 16s)
[23:40:45] Logged the message, Master
[23:41:03] marktraceur, tacotuesday -- try it now
[23:41:25] Huh try what?
[23:41:59] ah; sorry tacotuesday; I didn't look and assumed you were tgr
[23:42:10] I am not :)
[23:42:33] mwalker: works, thanks!
[23:42:33] now; if you were ^tacotuesday; that would've made much more sense
[23:43:13] !log on searchidx1001: restarting lsearchd and indexer
[23:43:18] Logged the message, Master
[23:45:28] TimStarling: Thanks for giving a hand here. I'll respond on the VPT thread.
[23:45:50] I'm not done yet
[23:47:41] Whoops too late, I saw the !log about restarting.
[23:47:48] Sorry.
[23:50:55] I'm not sure if it is fixed
[23:51:30] I've got a patch that should help I think
[23:51:44] PROBLEM - Disk space on gallium is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 16 MB (3% inode=99%):
[23:51:46] there is this log message:
[23:51:48] Error reading InitialiseSettings.php from url file:///a/search/conf/InitialiseSettings.php : For input string: "NS_MAIN"
[23:51:57] That's been like that...for a long time
[23:52:05] not a problem then?
[23:52:17] I remember seeing that error when we were futzing with things with Ram.
[23:52:26] I guess I'll just run it and watch the timestamps
[23:52:28] Never ran it down, but probably not our immediate worry.
[23:52:53] there wasn't much else in the log besides that
[23:53:01] (PS1) Chad: Fixup lsearchd config slightly [operations/puppet] - https://gerrit.wikimedia.org/r/137498
[23:53:18] That should make the mount a little less necessary.
[23:53:55] right, it is updating the status files, so I guess it is actually fixed
[23:54:04] (PS2) Chad: Fixup lsearchd config slightly [operations/puppet] - https://gerrit.wikimedia.org/r/137498
[23:54:47] it will take a while to catch up, of course
[23:55:01] TimStarling: does it deserve a revert/disable?
[23:55:11] TimStarling: gah, sorry, I was scrolled up /me reads
[23:55:21] greg-g: MergeHistory?
[23:55:46] TimStarling: Yeah understandable. Again, thanks though. Mostly my fault here.
[23:55:50] TimStarling: yeah, mergehistory
[23:57:29] yeah, I think it should probably be disabled everywhere, including the wikis where it was previously enabled
[23:57:47] TimStarling: gotcha.
[23:57:50] it's not terrible, it's just not a feature I'd approve if I saw it in gerrit today
[23:58:05] I'll open a bug for the problems with it
[23:58:09] thanks
[23:58:17] I'll follow up as needed