[00:01:07] (CR) Dzahn: [C: -1] "user needs to be created in LDAP as Coren said already" [puppet] - https://gerrit.wikimedia.org/r/152062 (owner: Yuvipanda)
[00:02:34] (CR) Dzahn: "ori,_joe_,reedy: are any of the changes with "using mod_proxy_fcgi" still current?" [puppet] - https://gerrit.wikimedia.org/r/147484 (owner: Reedy)
[00:05:02] yurikR1: Not going to lie to you, I think whatever patch you have left might not go out today
[00:05:17] marktraceur, that one is sooo tiny :(
[00:05:20] oh well :)
[00:05:45] will push it myself tmrw morning
[00:08:11] Herp derp
[00:08:20] Takes forever for scap to start a round of rsyncing
[00:08:35] (CR) Dzahn: [C: -1] "snmtrapd and related is not only used for puppet_freshness, it is also used for the checks on the power strips" [puppet] - https://gerrit.wikimedia.org/r/143304 (owner: Alexandros Kosiaris)
[00:09:20] sorry, it's after 5, gotta ctrl+c it, marktraceur
[00:09:24] :)
[00:09:42] Don't apologize to me, apologize to RoanKattouw
[00:11:48] (CR) Dzahn: "this caused monitoring of power strips to be broken" [puppet] - https://gerrit.wikimedia.org/r/143306 (owner: Alexandros Kosiaris)
[00:12:11] (CR) Dzahn: "this caused monitoring of power strips to be broken" [puppet] - https://gerrit.wikimedia.org/r/143305 (owner: Alexandros Kosiaris)
[00:15:05] This is sort of a long time for new-scap, innit?
[00:16:03] Ugh, one left. Come on, little server.
[00:17:01] marktraceur: yeah, what's it doing right now?
[00:17:23] "it's thinking."
[00:18:07] * greg-g nods
[00:18:20] It is.
[00:18:23] happens to the best of us, just gently nudge him and he'll come to
[00:18:40] It's on sync-apaches, one server remaining
[00:18:45] * greg-g nods
[00:18:53] I think there was grumbling about fenari recentnly.
[00:20:15] bd808: just happened to find this on old RT https://gerrit.wikimedia.org/r/#/c/125888/2/modules/beta/files/scripts/beta-apaches
[00:21:22] greg-g: Was there any actual meaning to "nudge him" or am I stuck waiting for scap to finish
[00:21:51] marktraceur: it was a bad joke.
[00:21:57] 's fine
[00:22:23] it's probably fenari again if it's down to just one
[00:22:32] Probably yes.
[00:23:08] (CR) Dzahn: "mark, is this good to go meanwhile?" [puppet] - https://gerrit.wikimedia.org/r/143887 (owner: Faidon Liambotis)
[00:23:08] "fenari INFO - Finished rsync common (duration: 13m 45s)"
[00:23:20] Oh, it just finished. :)
[00:23:46] most other servers too ~2:30 vs 13:45 :(
[00:23:54] *took
[00:24:43] (CR) Dzahn: "now it would hardcore a specific python version in the past, but it doesn't add a symlink to that or so.. is that really better?" [puppet] - https://gerrit.wikimedia.org/r/144848 (owner: Tim Landscheidt)
[00:25:01] I don't know hwat happened to fenari or the route to pmtpa but that is rediculous
[00:25:25] And how
[00:25:34] I guess in October everything will be more better
[00:25:41] About 70% through cdbs now.
[00:25:44] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=fenari.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1410999905&g=load_report&z=large&c=Miscellaneous%20pmtpa
[00:26:34] lots and lots and lots of iowait -- https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=fenari.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2&st=1410999905&g=cpu_report&z=large&c=Miscellaneous%20pmtpa
[00:27:55] Ugh now it's last again.
[00:27:57] (CR) Dzahn: [C: 1] "it seemed to me on my project that while it worked before, recently labsdebrepo stopped working. as in, once setup and copying newer versi" [puppet] - https://gerrit.wikimedia.org/r/145573 (owner: Tim Landscheidt)
[00:32:38] !log marktraceur Finished scap: [SWAT] Move things out of assets/ and into resources/assets/ (duration: 35m 28s)
[00:32:43] (CR) Dzahn: "matanya, do you still want to keep it after reading the comment above? it's a couple months ago and it just might be needed later, but may" [puppet] - https://gerrit.wikimedia.org/r/127909 (owner: Matanya)
[00:32:43] Logged the message, Master
[00:32:46] marktraceur: fenari is sadly underpowered. 2 cpu cores. 120+ runing processes
[00:32:48] Finallyyyyyy
[00:32:57] bd808: Well geez.
[00:33:08] OK, I'm going to stick around to make sure things are cool
[00:34:10] (CR) Dzahn: [C: -2] "Diederik, hi. I suggest to abandon this and replace it with asking how you get the same thing out of phabricator. That is going to be acti" [puppet] - https://gerrit.wikimedia.org/r/111152 (owner: Diederik)
[00:36:51] (CR) Dzahn: [C: -1] "this whole thing _might_ be outdated since we removed the entire snmp setup from neon, but also see my mail to ops list today, PDU monitor" [puppet] - https://gerrit.wikimedia.org/r/127246 (owner: ArielGlenn)
[00:37:17] RoanKattouw: If you wanted to test your change
[00:37:20] Now's the time
[00:37:33] Not totally sure if it worked...I don't see any file type icons anywhere.
[00:37:34] thanks marktraceur
[00:38:24] (CR) BryanDavis: "Better link to beta version: Ia351ef7e997c3acc4a4d44b1c5e757bfc838a2cb and followup at I4af2c99af2a255c8f14763e701c2ab79a6fb8da6" [puppet] - https://gerrit.wikimedia.org/r/160953 (owner: Alexandros Kosiaris)
[00:41:21] marktraceur: OK I'll test it
[00:42:09] Hmm it's still hitting the wrong URLs for some reason
[00:42:46] http://bits.wikimedia.org/static-1.24wmf21/resources/assets/file-type-icons/fileicon-ogg.png is present but the URLs are still wrong
[00:42:53] I scapped for nothing!? Arghh
[00:44:18] No you scapped for something
[00:44:24] It may be caching
[00:46:48] (CR) Dzahn: "being bold and abandoning per "Delivery to the following recipient failed permanently"" [puppet] - https://gerrit.wikimedia.org/r/111152 (owner: Diederik)
[00:50:44] Oh you know what
[00:50:49] It's #$@@#$ing TimedMediaHandler, I bet
[00:51:18] mmm, so many rays of love towards TMH
[00:51:24] RoanKattouw: we updated a bunch of extensions for /assets
[00:53:11] legoktm: Ugh. Well they're all going to have to be updated again
[00:53:15] (CR) Dzahn: "list of services still using the webserver::apache class:" [puppet] - https://gerrit.wikimedia.org/r/153971 (owner: Dzahn)
[00:53:19] Like https://gerrit.wikimedia.org/r/161147
[00:55:48] mutante: i don't think the etherpad class actually needs to include webserver::apache any longer
[00:55:56] it uses apache::site to actually declare the site
[00:58:09] (PS1) Dzahn: remove webserver::apache from etherpad role [puppet] - https://gerrit.wikimedia.org/r/161149
[00:58:11] ori: thanks ^
[00:59:51] (CR) Ori.livneh: [C: 1] remove webserver::apache from etherpad role [puppet] - https://gerrit.wikimedia.org/r/161149 (owner: Dzahn)
[01:31:20] (CR) Dzahn: "also see https://bugzilla.wikimedia.org/show_bug.cgi?id=70981 currently there is another issue with it" [puppet] - https://gerrit.wikimedia.org/r/145573 (owner: Tim Landscheidt)
[02:07:40] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3616 MB (3% inode=99%):
[02:36:47] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-18 02:36:46+00:00
[02:36:54] Logged the message, Master
[02:39:30] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[02:43:36] (PS1) Yurik: Fix namespace casing for the graph ext [mediawiki-config] - https://gerrit.wikimedia.org/r/161155
[02:44:00] greg-g, around? seems like i forgot to do this ^, can i do a quick depl of that
[02:44:12] otherwise graph ext is a bit broken :)
[02:45:11] yurikR: Probably too late
[02:45:18] Tomorrow first thing though
[02:45:39] marktraceur, graph ns is broken on all enabled wikis otherwise :(
[02:46:34] according to the rules, if its out of the regular workhours, use best judgement... and something clearly easy to fix and broken... i would risk it if there are multiple experts around
[02:46:58] I'll watch, at least
[02:47:07] You have deploy rights, yeah?
[02:47:11] yep
[02:47:23] here we go...
[02:47:23] * marktraceur is +0.5
[02:48:03] (CR) Yurik: [C: 2] Fix namespace casing for the graph ext [mediawiki-config] - https://gerrit.wikimedia.org/r/161155 (owner: Yurik)
[02:48:08] (Merged) jenkins-bot: Fix namespace casing for the graph ext [mediawiki-config] - https://gerrit.wikimedia.org/r/161155 (owner: Yurik)
[02:50:31] MaxSem, in case you are around
[02:50:32] ^
[02:50:40] patching commonsettings :(
[02:52:30] !log yurik Fixing graph ext namespace name - otherwise get screen of WMF death on graph: ns visits
[02:52:36] Logged the message, Master
[02:53:09] !log yurik Synchronized wmf-config/CommonSettings.php: (no message) (duration: 01m 53s)
[02:53:15] Logged the message, Master
[02:53:27] marktraceur, done, any issues from where you sit? :)
[02:56:18] Nothing obvious
[02:56:28] gj, thanks
[02:58:31] thx for keeping an eye out )
[02:59:26] My pleasure!
[02:59:51] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:00:26] hmm.. hope its not me ...
[03:00:51] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:01:42] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:01:58] I doubt puppet errors would happen because of a deploy
[03:05:26] (PS1) Springle: Default to Aria engine for Tendril's uses many explicit temp tables, which InnoDB isn't terribly enthusiastic about. [puppet] - https://gerrit.wikimedia.org/r/161157
[03:06:50] (PS2) Springle: Default to Aria engine for Tendril's many explicit temp tables, which InnoDB isn't terribly enthusiastic about. [puppet] - https://gerrit.wikimedia.org/r/161157
[03:07:16] (CR) Springle: [C: 2] Default to Aria engine for Tendril's many explicit temp tables, which InnoDB isn't terribly enthusiastic about. [puppet] - https://gerrit.wikimedia.org/r/161157 (owner: Springle)
[03:09:44] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-18 03:09:44+00:00
[03:09:51] Logged the message, Master
[03:16:50] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[03:19:44] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[03:45:43] (PS1) Ori.livneh: mediawiki::cgroup: add docs [puppet] - https://gerrit.wikimedia.org/r/161162
[03:47:15] (PS2) Ori.livneh: mediawiki::cgroup: add docs [puppet] - https://gerrit.wikimedia.org/r/161162
[03:49:46] (PS3) Ori.livneh: mediawiki::cgroup: add docs [puppet] - https://gerrit.wikimedia.org/r/161162
[04:08:48] (PS1) Springle: raise extra_max_connections to 10 for extra_port 3307 on MariaDB 10 [puppet] - https://gerrit.wikimedia.org/r/161168
[04:10:36] (CR) Springle: [C: 2] raise extra_max_connections to 10 for extra_port 3307 on MariaDB 10 [puppet] - https://gerrit.wikimedia.org/r/161168 (owner: Springle)
[04:17:11] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[04:20:10] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[04:20:56] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Sep 18 04:20:56 UTC 2014 (duration 20m 55s)
[04:21:02] Logged the message, Master
[04:51:25] (PS1) Jforrester: Enable Flow on [[mw:User talk:Jdforrester (WMF)]] [mediawiki-config] - https://gerrit.wikimedia.org/r/161172
[04:52:13] James_F: :)
[04:52:19] ebernhardson: :-)
[04:52:25] * James_F wants to play.
[05:35:25] (CR) BBlack: "I think we need something a little more complex like bd808's, although maybe not identical. The problem with the "mw*" glob is that somet" [puppet] - https://gerrit.wikimedia.org/r/160953 (owner: Alexandros Kosiaris)
[05:39:01] (CR) BBlack: [C: 1] "+1 for this in general, although yeah we need to make sure we don't have tools relying on it first. Maybe grep logs for root ssh and ferr" [puppet] - https://gerrit.wikimedia.org/r/160628 (owner: Matanya)
[05:39:47] (Abandoned) Matanya: purge_slow_digest: adding cron to terbium [puppet] - https://gerrit.wikimedia.org/r/127909 (owner: Matanya)
[05:41:28] (CR) BBlack: "Check this w/ compiler: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/353/" [puppet] - https://gerrit.wikimedia.org/r/160959 (owner: Mark Bergsma)
[05:41:34] (CR) BBlack: [C: 2] "Check this w/ compiler: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/353/" [puppet] - https://gerrit.wikimedia.org/r/160959 (owner: Mark Bergsma)
[05:42:51] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[05:53:13] (PS1) Chmarkine: lists.wm.org - raise HSTS max-age to 1 year [puppet] - https://gerrit.wikimedia.org/r/161177 (https://bugzilla.wikimedia.org/38516)
[06:21:49] (PS1) BBlack: cp300[12] do not exist [puppet] - https://gerrit.wikimedia.org/r/161178
[06:28:31] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Epic puppet fail
[06:30:00] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:01] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:10] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:20] PROBLEM - puppet last run on db1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:40] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:41] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:42] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:42] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:50] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:52] PROBLEM - puppet last run on analytics1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:05] (CR) BBlack: [C: 2] cp300[12] do not exist [puppet] - https://gerrit.wikimedia.org/r/161178 (owner: BBlack)
[06:31:11] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:11] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:21] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:24] (PS1) BBlack: Unified nginx ssl on varnish at all sites [puppet] - https://gerrit.wikimedia.org/r/161180
[06:31:30] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:42] ^ seems to be puppetmaster issues, not host issues
[06:42:31] (CR) BBlack: [C: 1] "Works in puppet compiler: http://puppet-compiler.wmflabs.org/355/change/161180/html/" [puppet] - https://gerrit.wikimedia.org/r/161180 (owner: BBlack)
[06:45:20] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:45:21] RECOVERY - puppet last run on db1002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:45:25] !log removing pybal symlink "$site/text-varnish", seems to be a remnant no longer in use
[06:45:30] Logged the message, Master
[06:45:50] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:10] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:46:12] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[06:46:13] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:46:20] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:20] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:46:20] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:46:21] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:46:40] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:46:40] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:46:41] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:46:50] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:46:51] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:51] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:47:00] RECOVERY - puppet last run on analytics1010 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[06:47:36] !log removing pybal symlink "$site/ipv6", also unused (old ipv6 protoproxying)
[06:47:41] Logged the message, Master
[06:50:25] (CR) Alexandros Kosiaris: [C: -2] "Niah, it was not using the snmptraps but rather snmpgets. I think this can be safely abandoned" [puppet] - https://gerrit.wikimedia.org/r/127246 (owner: ArielGlenn)
[06:53:29] !log removing pybal cfg "esams/wikimedialbsecure" (unused, points at maerlant)
[06:53:35] Logged the message, Master
[06:55:44] (PS1) Alexandros Kosiaris: Ensure snmp package on icinga for PDU monitoring [puppet] - https://gerrit.wikimedia.org/r/161182
[06:57:44] (PS1) BBlack: rename misc_web_https to misc_web-https for consistency [puppet] - https://gerrit.wikimedia.org/r/161183
[06:58:11] (CR) BBlack: [C: 2 V: 2] rename misc_web_https to misc_web-https for consistency [puppet] - https://gerrit.wikimedia.org/r/161183 (owner: BBlack)
[07:09:50] (PS1) Matanya: varnish:qualify vars [puppet] - https://gerrit.wikimedia.org/r/161184
[07:09:56] !log removing pybal cfg "eqiad/misc_web_https" (unused now, https://gerrit.wikimedia.org/r/161183)
[07:10:01] Logged the message, Master
[07:11:30] _joe_: hello there :) did you really get hiera() deployed on beta already ? :)
[07:11:39] we were wondering about it with greg yesterday during our 1/1
[07:16:03] anyone awake knowledge about tmh100[12] and "videos.wikimedia.org"?
[07:16:16] s/knowledge/knowledgeable/
[07:20:23] (because it seems kinda half-baked as a public-facing service, maybe not in production use?)
[07:21:15] <_joe_> about tmh, I know something
[07:21:17] (or at least, not in production use directly facing the internet, but used internally?)
[07:21:33] <_joe_> hashar: yes
[07:21:43] <_joe_> hashar: but it's "inert"
[07:22:05] <_joe_> meaning hiera.yaml is in place, but no data in /etc/puppet/hiera, so...
[07:22:19] now that I know it's coming soon, I find new reasons to wish we already had heira going every day
[07:22:41] <_joe_> bblack: It's coming today I guess
[07:22:54] <_joe_> that is, if this stomach bug gives me a pause :/
[07:23:00] I want to refactor so much data into it
[07:23:09] <_joe_> me too :P
[07:23:16] _joe_: at least that means we are close to be using hiera :-]
[07:23:34] _joe_: we were in shadow and had now clue what was the progress. That sounds exciting to me
[07:23:42] <_joe_> https://gerrit.wikimedia.org/r/#/c/160924/
[07:27:49] _joe_: whenever you have the bandwidth for it, we are highly interested in migrating all the beta cluster conf variances to be in hiera
[07:27:55] that has a bunch of impacts on prod though
[07:28:09] _joe_: so re: tmh100[12], they seem pretty idle, do you know if it's actually in-use yet?
[07:28:27] <_joe_> bblack: they act as videoscalers AFAIK
[07:28:45] <_joe_> doing the same work as the imagescalers do for images
[07:28:58] <_joe_> so if you upload a video to commons they do the re-encoding work
[07:30:12] the bit that's puzzling me right now is we have a "videos.wikimedia.org" with different default pages on http vs https. For http it's just an alias for our normal text varnish service. for https it's the same in esams/ulsfo, but eqiad goes directly to tmh1002 for https (bypassing varnish).
[07:30:39] I'm not sure if videos.wikimedia.org is really supposed to exist or work right publicly, or over https, and why the hell eqiad skips varnish and uses 1/2 servers directly.
[07:30:42] <_joe_> bblack: wat?
[07:31:28] <_joe_> http://videos.wikimedia.org/ just shows the default page with links to all our projects to me (esams)
[07:32:10] yeah but https://videos, if it hits eqiad, gives one of those "Hey your webserver is installed" default pages :)
[07:32:28] (because it's hitting tmh1002 directly)
[07:32:33] https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/protoproxy.pp#L138
[07:32:56] geodns->eqiad for https is the lone exception, due to 10.16.64.146 above ^
[07:33:11] all sites for http, and ulsfo/esams for https, go via text varnish
[07:33:45] (and don't even ask me what [2620:0:862:3::80:2] is in there, it doesn't exist anywhere else in config/dns)
[07:34:09] (and it's an esams address specified on eqiad :p)
[07:35:38] anyways, it doesn't seem like videoscalers are something we're supposed to expose directly in the first place. it's probably just used internally for redis, and videos.wikimedia.org is handled by regular appservers with internal magic that may use videoscalers
[07:35:57] <_joe_> no they don't
[07:36:01] I suspect that the videos ssl thing in protoproxy should be removed completely.
[07:36:03] <_joe_> lemme check the vhosts
[07:37:33] <_joe_> bblack: videos.wikimedia.org is not defined in any vhost on the appservers, so it gets incercepted by nonexistent.conf
[07:37:39] <_joe_> which shows the default page
[07:37:42] (CR) Mark Bergsma: "Let's add SNI and the individual star certs etc first before we do this." [puppet] - https://gerrit.wikimedia.org/r/161180 (owner: BBlack)
[07:39:57] heh
[07:39:57] so probably the DNS for it should be killed too if we're not even using it
[07:39:57] but regardless, I may double-check with a sniffer and then kill the SSL protoproxy part
[07:40:48] I did some git history checking, and the videos thing seems to be prehistoric. it first appeared back in 2011 in some import of unpuppetized stuff or something
[07:41:13] and then just got mangled and/or stripped down a bit during various refactors and cleanups
[07:41:31] 2011 is prehistoric now? oy oy
[07:41:42] yes!
[07:41:44] apparently got removed from protoproxy by mark with https://gerrit.wikimedia.org/r/#/c/102120/
[07:41:55] yeah kill it
[07:42:30] maybe that was related to timedmediahandler / video scaling
[07:42:36] to be used in place of upload.wikimedia.org
[07:44:49] (PS1) BBlack: kill dead/strange videos protoproxy [puppet] - https://gerrit.wikimedia.org/r/161186
[07:44:53] <_joe_> https://trac.torproject.org/projects/tor/wiki/doc/meek pretty neat
[07:45:43] bblack: the dns entry got added back in March 2012 with commit message: "Adding a CNAME for videos to proxy to transcoded wikimania videos" ( 564fd8daa1e in operations/dns.git )
[07:46:41] well we could leave the DNS for a while, it doesn't hurt anything for it to hit default nonexistent.conf
[07:46:44] and apparently ended up being hosted on tool server under wmvids.toolserver.org ...
[07:47:14] (CR) BBlack: [C: 2 V: 2] "Doesn't seem to be used or usable, also:" [puppet] - https://gerrit.wikimedia.org/r/161186 (owner: BBlack)
[07:48:54] !log re-enabled icinga notifications for ms-be1001
[07:48:59] Logged the message, Master
[07:55:11] mark: was there any other holdup on switching from unified to SNI? something about some class of mobile devices not supporting it or something?
[07:56:36] <_joe_> bblack: Androing <=2.3
[07:56:45] <_joe_> Android
[07:56:48] ah
[07:57:06] I'm not an SSL negotiation expert. is it possible for us to go star certs and SNI and still fallback to unified for them?
[07:57:07] <_joe_> which is still like, a lot of devices
[07:57:23] <_joe_> mmmh not sure about the answer
[07:57:28] yeah me either
[07:57:44] <_joe_> but that's something we could study
[07:58:08] <_joe_> I don't think it is
[07:58:43] <_joe_> at least not without patching nginx
[07:59:37] yes in the protocol there is a default cert you use when you do not see SNI, but that probably depends on the ssl terminator in use
[08:00:13] <_joe_> jzerebecki: yes I was taking a look, we may be able to do that, I just have to experiment a little with nginx
[08:00:58] <_joe_> http://serverfault.com/questions/488427/how-to-define-which-ssl-certificate-nginx-sends-first-with-sni
[08:01:19] <_joe_> bblack: ^^ so, if we set the default_server to send the unified cert
[08:01:22] <_joe_> it may work
[08:01:27] <_joe_> but we need to do testing
[08:01:55] ah!
[08:01:57] ok
[08:02:03] that doesn't sound too bad then
[08:02:10] <_joe_> no it does not
[08:02:21] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[08:02:26] i wonder if one could offer different ciphers if there is no sni to get around breaking ie6 winxp if one were to switch off undesirable ones
[08:05:04] <_joe_> jzerebecki: mmmh well, given how we basically force clients to DTRT that's not that important
[08:06:11] (PS1) Giuseppe Lavagetto: Make use of dh_auto_install to install php.ini [debs/hhvm] - https://gerrit.wikimedia.org/r/161187
[08:09:00] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:10:35] yes that is true. for java6 that trick would allow one to offer DHE with 2k keys to SNI browsers without breaking it as java6 does not offer SNI.
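This is a minimal sketch of the fallback discussed above, assuming Python 3.7+ and its standard `ssl` module. It is not the nginx `default_server` setup from the serverfault link; it only illustrates the same selection logic. A client that sends no SNI (such as Android <= 2.3) keeps the default context and sees the unified certificate. An SNI-capable client whose name matches a star certificate is switched to that certificate's context inside `SSLContext.sni_callback`. All hostnames, certificate file names and helper names below (`STAR_CERTS`, `DEFAULT_CERT`, `choose_cert`, `build_server_context`) are hypothetical.

```python
import ssl
from typing import Dict, Optional, Tuple

CertPair = Tuple[str, str]  # (certfile, keyfile)

# Hypothetical per-project star certs; the file names are placeholders.
STAR_CERTS: Dict[str, CertPair] = {
    "*.wikipedia.org": ("star.wikipedia.org.pem", "star.wikipedia.org.key"),
    "*.wiktionary.org": ("star.wiktionary.org.pem", "star.wiktionary.org.key"),
}
# Hypothetical unified (multi-SAN) cert, presented to non-SNI clients
# and to any name that no star cert covers.
DEFAULT_CERT: CertPair = ("unified.pem", "unified.key")


def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Single-label wildcard match: '*.example.org' covers 'a.example.org' only."""
    hostname = hostname.lower().rstrip(".")
    pattern = pattern.lower()
    if not pattern.startswith("*."):
        return pattern == hostname
    label, dot, rest = hostname.partition(".")
    return bool(label) and dot == "." and "." + rest == pattern[1:]


def choose_cert(server_name: Optional[str]) -> CertPair:
    """Pick the cert to present; server_name is None when the client sent no SNI."""
    if server_name:
        for pattern, pair in STAR_CERTS.items():
            if wildcard_matches(pattern, server_name):
                return pair
    return DEFAULT_CERT


def build_server_context() -> ssl.SSLContext:
    """Build a TLS server context whose default cert is the unified one.

    Requires the hypothetical cert/key files above to exist on disk.
    """
    default_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    default_ctx.load_cert_chain(*DEFAULT_CERT)

    per_cert: Dict[CertPair, ssl.SSLContext] = {}
    for pair in STAR_CERTS.values():
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        ctx.load_cert_chain(*pair)
        per_cert[pair] = ctx

    def sni_callback(ssl_obj, server_name, _ctx):
        # server_name is None when the Client Hello carries no SNI; the
        # default (unified) context is then left in place.
        pair = choose_cert(server_name)
        if pair != DEFAULT_CERT:
            ssl_obj.context = per_cert[pair]
        return None  # None lets the handshake continue

    default_ctx.sni_callback = sni_callback
    return default_ctx
```

Here the context is only swapped when a star certificate matches, so a client without SNI never leaves the unified certificate. Whether nginx behaves the same way was, per the discussion, still to be tested.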
[08:15:52] (PS1) Jeremyb: raise account creation throttle: Puget Sound [mediawiki-config] - https://gerrit.wikimedia.org/r/161191 (https://bugzilla.wikimedia.org/70953)
[08:16:13] (PS1) Jeremyb: add www.soumaya.com.mx to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/161192 (https://bugzilla.wikimedia.org/70986)
[08:17:40] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds
[08:20:20] (PS1) Giuseppe Lavagetto: Imported Upstream version 3.3+20140918 [debs/hhvm] - https://gerrit.wikimedia.org/r/161190
[08:20:45] <_joe_> it only took me 20 mins to upload this :P
[08:22:36] <_joe_> so now I'm facing an interesting problem
[08:22:52] (PS1) BBlack: Add role::cache::ssl::sni [puppet] - https://gerrit.wikimedia.org/r/161193
[08:23:33] <_joe_> which is - how do I submit a rebase to gerrit?
[08:23:38] <_joe_> is that even possible?
[08:23:57] you mean manually rebase a patchset that it can't do with the rebase button?
[08:24:16] <_joe_> bblack: rebasing the "master" branch onto the "upstream" branch
[08:24:21] <_joe_> not a patchset
[08:24:25] <_joe_> an entire tree
[08:24:27] oh debs
[08:24:32] <_joe_> yep
[08:24:47] <_joe_> also, https://gerrit.wikimedia.org/r/#/c/161190/ won't load
[08:24:51] every time I have something scary like that to do that involves a ton of new commits (which has only ever happened with varnish upstream merges)
[08:25:04] I just pushed directly to the branch after rebase, instead of going through gerrit refs/for :)
[08:25:30] <_joe_> oh so push --force directly? is that possible?
[08:25:32] I think I may have had to edit gerrit permissions to make it work, too
[08:25:35] not --force
[08:25:49] just push origin master instead of push origin HEAD:refs/for/master
[08:26:01] <_joe_> oh fuck. I hate gerrit and git review
[08:26:35] in cases like debs with upstream merges/rebases, gerrit doesn't make a lot of sense anyways. treat it like real git and bypass where possible
[08:26:52] then go back through gerrit for your final commit afterwards for the debian/changelog update or whatever so that it's noted that something happened
[08:27:01] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[08:27:28] (at least, that's my take on it)
[08:27:47] <_joe_> bblack: fair enough, I think it makes sense
[08:28:03] (Abandoned) Giuseppe Lavagetto: Imported Upstream version 3.3+20140918 [debs/hhvm] - https://gerrit.wikimedia.org/r/161190 (owner: Giuseppe Lavagetto)
[08:28:32] I don't really remember the gerrit perms thing. likely the push attempt will fail, and you'll have to go into the project permissions in the gerrit web interface and allow something
[08:28:47] (I did it temporarily and then backed it out when I was done, before)
[08:28:53] <_joe_> bblack: thanks
[08:34:58] (CR) BBlack: "^ You mean like Ia4bbdb00113dfe8c6740568c6ed0fa16e2c338a1 ?" [puppet] - https://gerrit.wikimedia.org/r/161180 (owner: BBlack)
[08:36:10] (CR) Filippo Giunchedi: [C: 1] lists.wm.org - raise HSTS max-age to 1 year [puppet] - https://gerrit.wikimedia.org/r/161177 (https://bugzilla.wikimedia.org/38516) (owner: Chmarkine)
[08:40:32] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[08:42:50] (PS3) Filippo Giunchedi: admin: add subbu and gwicke to ocg-render-admins [puppet] - https://gerrit.wikimedia.org/r/160497 (owner: Matanya)
[08:42:56] (CR) Filippo Giunchedi: [C: 2 V: 2] admin: add subbu and gwicke to ocg-render-admins [puppet] - https://gerrit.wikimedia.org/r/160497 (owner: Matanya)
[08:47:14] (PS3) Filippo Giunchedi: metrics: move from stat1001 to varnish [puppet] - https://gerrit.wikimedia.org/r/160926
[08:48:18] (CR) Filippo Giunchedi: [C: 2 V: 2] metrics: move from stat1001 to varnish [puppet] - https://gerrit.wikimedia.org/r/160926 (owner: Filippo Giunchedi)
[08:48:55] (CR) Dan-nl: "commented in the bug as well:" [mediawiki-config] - https://gerrit.wikimedia.org/r/161192 (https://bugzilla.wikimedia.org/70986) (owner: Jeremyb)
[08:56:33] !log xtrabackup clone db1016 to db2010
[08:56:37] Logged the message, Master
[08:58:36] (PS3) Filippo Giunchedi: metrics: point to misc-web-lb.eqiad [dns] - https://gerrit.wikimedia.org/r/160925
[08:58:41] (CR) Filippo Giunchedi: [C: 2 V: 2] metrics: point to misc-web-lb.eqiad [dns] - https://gerrit.wikimedia.org/r/160925 (owner: Filippo Giunchedi)
[09:00:46] !log updated authdns to 0c2225d
[09:00:50] Logged the message, Master
[09:10:57] (CR) Filippo Giunchedi: "what's the relationship between this and using salt in https://gerrit.wikimedia.org/r/#/c/160953/2 ? seem to be going in two different di" [puppet] - https://gerrit.wikimedia.org/r/159636 (owner: Hoo man)
[09:15:56] (PS1) Mark Bergsma: Revert "disable wmfusercontent.org site on misc for now, nginx borked" [puppet] - https://gerrit.wikimedia.org/r/161200
[09:16:37] (PS2) Mark Bergsma: Revert "disable wmfusercontent.org site on misc for now, nginx borked" [puppet] - https://gerrit.wikimedia.org/r/161200
[09:17:24] (CR) Mark Bergsma: [C: 2] Revert "disable wmfusercontent.org site on misc for now, nginx borked" [puppet] - https://gerrit.wikimedia.org/r/161200 (owner: Mark Bergsma)
[09:40:42] (PS1) Filippo Giunchedi: swift: lower container availability threshold to >1 host [puppet] - https://gerrit.wikimedia.org/r/161203
[09:42:05] (PS2) Filippo Giunchedi: swift: lower container availability threshold to >1 host [puppet] - https://gerrit.wikimedia.org/r/161203
[09:42:11] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: lower container availability threshold to >1 host [puppet] - https://gerrit.wikimedia.org/r/161203 (owner: Filippo Giunchedi)
[09:45:11] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[09:51:23] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[09:56:40] (CR) Nemo bis: [C: 1] "Hm. https://en.wikipedia.org/w/index.php?title=Special:Log&page=User:Coffee/Userpage" [puppet] - https://gerrit.wikimedia.org/r/161098 (owner: Dzahn)
[10:00:11] (PS1) Mark Bergsma: Add protoproxy::localssl server_aliases parameter [puppet] - https://gerrit.wikimedia.org/r/161205
[10:01:17] (PS2) Mark Bergsma: Add protoproxy::localssl server_aliases parameter [puppet] - https://gerrit.wikimedia.org/r/161205
[10:03:04] (CR) Mark Bergsma: [C: 2] Add protoproxy::localssl server_aliases parameter [puppet] - https://gerrit.wikimedia.org/r/161205 (owner: Mark Bergsma)
[10:06:59] _joe_: hphpize injecting PWD is apparently fixed upstream :-) might want to cherry pick their patch. I posted some details at https://bugzilla.wikimedia.org/show_bug.cgi?id=68944#c7
[10:09:34] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[10:18:30] (PS1) Mark Bergsma: Remove protoproxy::localssl enabled parameter [puppet] - https://gerrit.wikimedia.org/r/161208
[10:20:00] (CR) Mark Bergsma: [C: 2] Remove protoproxy::localssl enabled parameter [puppet] - https://gerrit.wikimedia.org/r/161208 (owner: Mark Bergsma)
[10:21:40] _joe_: you can also push all those patches to a new branch directly (which needs additional permissions) and just push the merge of that branch into review.
[10:25:43] (PS3) Filippo Giunchedi: metrics: disable SSL virtualhost and cert [puppet] - https://gerrit.wikimedia.org/r/160927
[10:43:19] (CR) Mark Bergsma: [C: 2] nginx: drop 'enabled' parameter [puppet/nginx] - https://gerrit.wikimedia.org/r/160256 (owner: Ori.livneh)
[10:45:40] (PS1) Giuseppe Lavagetto: Updating patches to sync with upstream [debs/hhvm] - https://gerrit.wikimedia.org/r/161215
[10:45:42] (PS1) Giuseppe Lavagetto: Version bump [debs/hhvm] - https://gerrit.wikimedia.org/r/161216
[10:48:34] (CR) Alexandros Kosiaris: "I am up for it. Could we do something though with a GET ? As far as monitoring goes it tends to be a bit more intuitive and easy to reprod" [puppet] - https://gerrit.wikimedia.org/r/160412 (owner: Alexandros Kosiaris)
[10:48:40] (PS1) Mark Bergsma: Update submodule nginx [puppet] - https://gerrit.wikimedia.org/r/161217
[10:49:02] (CR) Mark Bergsma: [C: 2] Update submodule nginx [puppet] - https://gerrit.wikimedia.org/r/161217 (owner: Mark Bergsma)
[10:49:14] (CR) Mark Bergsma: [V: 2] Update submodule nginx [puppet] - https://gerrit.wikimedia.org/r/161217 (owner: Mark Bergsma)
[10:49:40] mark: time to take a quick look into our changed exim regex ? https://gerrit.wikimedia.org/r/#/c/155753/
[10:51:21] (PS4) Filippo Giunchedi: metrics: disable SSL virtualhost and cert [puppet] - https://gerrit.wikimedia.org/r/160927
[10:51:27] (CR) Filippo Giunchedi: [C: 2 V: 2] metrics: disable SSL virtualhost and cert [puppet] - https://gerrit.wikimedia.org/r/160927 (owner: Filippo Giunchedi)
[10:53:42] PROBLEM - puppet last run on ssl1008 is CRITICAL: CRITICAL: Epic puppet fail
[10:54:12] PROBLEM - puppet last run on ssl1005 is CRITICAL: CRITICAL: Epic puppet fail
[10:54:32] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:54:52] PROBLEM - puppet last run on ssl1007 is CRITICAL: CRITICAL: Epic puppet fail
[10:55:07] PROBLEM - puppet last run on ssl1001 is CRITICAL: CRITICAL: Epic puppet fail
[10:56:00] (PS2) Giuseppe Lavagetto: Make use of dh_auto_install to install php.ini [debs/hhvm] - https://gerrit.wikimedia.org/r/161187
[10:56:13] (PS1) Mark Bergsma: Remove protoproxy::init enabled parameter [puppet] - https://gerrit.wikimedia.org/r/161219
[10:56:35] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Make use of dh_auto_install to install php.ini [debs/hhvm] - https://gerrit.wikimedia.org/r/161187 (owner: Giuseppe Lavagetto)
[10:56:37] PROBLEM - puppet last run on ssl1006 is CRITICAL: CRITICAL: Epic puppet fail
[10:56:37] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Epic puppet fail
[10:56:54] (CR) Alexandros Kosiaris: [C: 2] "Totally agreed. One minor point is that the sysctls that webserver::apache would set via webserver::base which probably are not so importa" [puppet] - https://gerrit.wikimedia.org/r/161149 (owner: Dzahn)
[10:57:12] (CR) Mark Bergsma: [C: 2] Remove protoproxy::init enabled parameter [puppet] - https://gerrit.wikimedia.org/r/161219 (owner: Mark Bergsma)
[10:57:35] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:58:19] (PS15) Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626
[10:58:23] PROBLEM - puppet last run on ssl1003 is CRITICAL: CRITICAL: Epic puppet fail
[10:58:29] (PS2) Giuseppe Lavagetto: Updating patches to sync with upstream [debs/hhvm] - https://gerrit.wikimedia.org/r/161215
[10:59:53] RECOVERY - puppet last run on ssl1005 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[11:07:25] (CR) Alexandros Kosiaris: [C: 2] Ensure snmp package on icinga for PDU monitoring [puppet] - https://gerrit.wikimedia.org/r/161182 (owner: Alexandros Kosiaris)
[11:09:23] (PS2) Giuseppe Lavagetto: Version bump [debs/hhvm] - https://gerrit.wikimedia.org/r/161216
[11:10:27] (Abandoned) Alexandros Kosiaris: Just modularize webserver.pp [puppet] - https://gerrit.wikimedia.org/r/137682 (owner: Alexandros Kosiaris)
[11:11:45] RECOVERY - puppet last run on ssl1008 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[11:12:52] RECOVERY - puppet last run on ssl1007 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[11:13:03] RECOVERY - puppet last run on ssl1001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[11:13:33] RECOVERY - puppet last run on ssl1006 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[11:15:43] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[11:17:33] RECOVERY - puppet last run on ssl1003 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[11:20:00] (PS16) Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626
[11:20:19] (PS2) Alexandros Kosiaris: nfs cleanups [puppet] - https://gerrit.wikimedia.org/r/160984
[11:20:44] (CR) Alexandros Kosiaris: [C: 2] swift: remove ganglia stats via ganglia-logtailer [puppet] - https://gerrit.wikimedia.org/r/159705 (owner: Filippo Giunchedi)
[11:33:10] (PS17) Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626
[11:40:34] !log forgot to log this earlier: manually started salt minion on radon, elastic1015, searchidx1001, it wasn't running there
[11:40:41] Logged the message, Master
[11:44:04] (PS18) Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626
[11:45:53] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: Epic puppet fail
[11:49:11] godog: any good doc on how image scaling is done on the cluster ?
[11:49:37] <_joe_> matanya: with convert(1) and a lot of grease
[11:49:40] matanya: btw, I got my first proper shinken alert! \o/ (http://en.wikipedia.beta.wmflabs.org/ is down)
[11:50:04] YuviPanda: yoohooo!!
[11:50:09] I realized I can't actually use resource collection at all, since the labs hosts run a different puppetmaster than the machine shinken will live in (labmon1001)
[11:50:14] <_joe_> at least, that's my understanding
[11:50:15] _joe_: that is a good doc! :)
[11:50:56] YuviPanda: you can still store the resources and pull, don't you?
[11:51:03] matanya: I can, yeah
[11:51:08] still...
[11:51:15] <_joe_> YuviPanda: ugh what the fuck's going on with beta?
[11:51:19] less than optimal, but doable
[11:51:30] _joe_: it's been dead since a while, the docroot changes fucked it up
[11:51:39] it's serving them as plain files rather than php
[11:51:58] because apache has to reconfigure paths or something? unsure
[11:53:40] <_joe_> YuviPanda: looking at the backends, they work
[11:53:59] _joe_: backends as in the mediawiki-* machines?
[11:54:09] <_joe_> YuviPanda: it's deployment-mediawiki03 which is fucked up
[11:54:12] <_joe_> I dunno why
[11:54:15] ah I see
[11:54:26] <_joe_> which should NOT be in the backends pool
[11:54:32] <_joe_> so, turning down apache there
[11:54:44] ok
[11:55:02] <_joe_> http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page wmf now
[11:55:24] it does, but http://en.wikipedia.beta.wmflabs.org/ still doesn't
[11:56:57] also hehe 'wmf' now, you've been assimilated? :)
[11:59:44] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:00:02] (PS1) Giuseppe Lavagetto: Updating patches to sync with upstream [debs/hhvm] - https://gerrit.wikimedia.org/r/161221
[12:00:24] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:00:36] PROBLEM - puppet last run on mw1058 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:00:45] PROBLEM - puppet last run on virt0 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:00:53] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:01:16] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:01:33] PROBLEM - puppet last run on fenari is CRITICAL: CRITICAL: Puppet has 1 failures
[12:03:50] <_joe_> salt hiccups
[12:06:22] <_joe_> YuviPanda: I purged the varnish cache, should be ok now
[12:06:36] _joe_: indeed, thanks!
[12:08:53] (CR) Mark Bergsma: [C: 1] Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[12:09:53] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[12:17:53] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[12:18:13] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[12:18:24] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[12:18:45] RECOVERY - puppet last run on mw1058 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[12:18:54] RECOVERY - puppet last run on tmh1002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[12:18:54] RECOVERY - puppet last run on virt0 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[12:19:44] RECOVERY - puppet last run on fenari is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[12:42:32] mark: thanks for that. hope to get that deployed today !
[12:47:03] matanya: not afaik, there might be something on wikitech in the swift section, but in short there's a swift middleware that looks at the URLs and forwards requests for thumbs to the imagescaler fleet IIRC
[12:47:57] godog: thanks, i'm more interested in the video scaling, the only page on wikitech i found was: https://wikitech.wikimedia.org/wiki/Media_storage
[12:48:08] which isn't very informative
[12:48:51] indeed
[12:54:30] (PS19) Yuvipanda: [WIP] Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626
[12:54:31] matanya: it doesn't seem horribly wrong though
[12:55:02] that is good.
[12:57:44] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 10484 seconds
[12:58:51] oh
[13:33:05] (PS1) Manybubbles: Add stats group for mwgrep [puppet] - https://gerrit.wikimedia.org/r/161233
[13:34:27] (CR) Ottomata: "I believe puppet will schedule a ferm refresh, but won't actually restart it until all of the resources that the ferm service subscribes t" [puppet] - https://gerrit.wikimedia.org/r/160802 (owner: Ottomata)
[13:34:52] * aude coming to swat today :)
[13:36:46] (CR) Ottomata: metrics: move from stat1001 to varnish (1 comment) [puppet] - https://gerrit.wikimedia.org/r/160926 (owner: Filippo Giunchedi)
[13:42:39] (PS20) Yuvipanda: Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626
[13:43:16] (CR) jenkins-bot: [V: -1] Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626 (owner: Yuvipanda)
[13:43:45] (PS21) Yuvipanda: Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626
[13:47:51] (PS22) Yuvipanda: Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626
[13:51:15] (CR) Hoo man: "This would then only allow deployers to restart apache on individual machines (or by per hand invoking dsh which is evil), but I think thi" [puppet] - https://gerrit.wikimedia.org/r/159636 (owner: Hoo man)
[13:52:56] (PS23) Yuvipanda: Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626
[13:58:02] (CR) Filippo Giunchedi: [C: 1] Add stats group for mwgrep [puppet] - https://gerrit.wikimedia.org/r/161233 (owner: Manybubbles)
[14:00:08] manybubbles: can I search by file type with cirrus ?
[14:11:33] matanya: you can search the file extension in title :)
[14:11:55] thanks Nemo_bis, not quite what i'm looking for :)
[14:12:26] too bad
[14:12:31] there's a bug report though
[14:12:36] (and this channel is totally wrong)
[14:13:51] mark: ping
[14:14:22] (PS1) Yuvipanda: beta: Move monitoring role into module [puppet] - https://gerrit.wikimedia.org/r/161240
[14:18:04] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds
[14:26:19] (CR) Filippo Giunchedi: [C: 1] Updating patches to sync with upstream [debs/hhvm] - https://gerrit.wikimedia.org/r/161215 (owner: Giuseppe Lavagetto)
[14:31:29] matanya: hmmm - i don't think so unfortunately. I remember we were going to do it but then we didn't - probably just forgot
[14:34:44] manybubbles: let me know if you'd need to merge 161233 soonish!
[14:38:54] godog: no harm in merging it!
[14:39:51] (CR) Filippo Giunchedi: [C: 2 V: 2] Add stats group for mwgrep [puppet] - https://gerrit.wikimedia.org/r/161233 (owner: Manybubbles)
[14:39:56] manybubbles: ack, going for it
[14:44:14] manybubbles, marktraceur, ^d: I'll SWAT today unless one of you really wants it.
[14:48:02] mutante: anything of interest on https://rt.wikimedia.org/Ticket/Display.html?id=2846 ?
[14:49:45] Reedy: greg-g: anyone else: Do you think you could help me get this EducatoionProgram patch on today's deployment train? https://gerrit.wikimedia.org/r/#/c/160901/
[14:50:26] It fixes an issue currently now present on production, not urgent enough for a special fix deploy, but it would be really great to get it in the pipeline
[14:50:31] aude, hoo, MatmaRex: Ping for SWAT in 10 minutes
[14:50:37] yes
[14:52:57] anomie: No ty. ;) good luck!
[14:55:12] anomie: here
[15:00:04] aude: Let's do the Wikidata fix first
[15:00:05] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140918T1500). Please do the needful.
[15:00:16] ok
[15:01:58] !log anomie Synchronized php-1.24wmf21/extensions/Wikidata/: SWAT: Update Wikidata to fix broken xml api output [[gerrit:161232]] (duration: 00m 38s)
[15:02:00] aude: ^ test please
[15:02:04] MatmaRex: You're next
[15:02:04] Logged the message, Master
[15:02:09] yes, yes, yes
[15:02:21] ok
[15:02:43] anomie: looks good :)
[15:05:25] (CR) Alexandros Kosiaris: [C: 1] partially enable outbound SMTP STARTTLS support [puppet] - https://gerrit.wikimedia.org/r/160632 (owner: Filippo Giunchedi)
[15:06:37] !log anomie Synchronized php-1.24wmf21/resources/src/mediawiki.action/mediawiki.action.view.redirectPage.css: SWAT: mediawiki.action.view.redirectPage: Correct a CSS selector [[gerrit:161239]] (duration: 00m 23s)
[15:06:39] MatmaRex: ^ test please
[15:06:43] Logged the message, Master
[15:07:17] anomie: works as expected. thanks
[15:07:19] * anomie is done with SWAT!
[15:08:20] mark: ping
[15:08:26] any chance of reaching him today?
[15:09:19] Reedy: greg-g: quick update: the patch is now merged to master (thx legoktm) so nothing special needed for EP deployment train now :)
[15:09:32] Jeff_Green: ^ around ?
[15:10:31] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Updating patches to sync with upstream [debs/hhvm] - https://gerrit.wikimedia.org/r/161221 (owner: Giuseppe Lavagetto)
[15:11:02] (Abandoned) Giuseppe Lavagetto: Updating patches to sync with upstream [debs/hhvm] - https://gerrit.wikimedia.org/r/161215 (owner: Giuseppe Lavagetto)
[15:11:45] (PS1) Giuseppe Lavagetto: Version bump [debs/hhvm] - https://gerrit.wikimedia.org/r/161244
[15:12:11] (Abandoned) Giuseppe Lavagetto: Version bump [debs/hhvm] - https://gerrit.wikimedia.org/r/161216 (owner: Giuseppe Lavagetto)
[15:12:26] tonythomas: yep,
[15:13:48] time to roll in our
[15:13:53] gerrit.wikimedia.org/r/#/c/155753/ :)
[15:14:27] mark: gave his +1 some hours before
[15:14:43] (CR) Alexandros Kosiaris: [C: -1] "@Daniel, no this is not possible with ferm. ferm will evaluate the new fules and if everything is correct it will try to enforce as a tran" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/160802 (owner: Ottomata)
[15:15:12] tonythomas: ok
[15:15:38] I think what I'd like to do is first generate the expected config somewhere, and copy it over to an mx
[15:15:58] we are getting this to deployment.wikimedia.beta.wmflabs.org right ?
[15:16:00] k
[15:16:10] I don't know if it's possible to override realm in labs or something?
[15:16:24] (PS2) Giuseppe Lavagetto: Version bump [debs/hhvm] - https://gerrit.wikimedia.org/r/161244
[15:16:46] in labs ?
[15:16:47] i can hack it by hand if necessary, i just want to test the syntax and see how it behaves before merge
[15:16:55] okey
[15:17:08] i.e. on your labs puppet test host
[15:17:15] its running and up in our labs instance 'mediawiki-verp'
[15:17:41] right, now we just need to generate exim4.conf with production values
[15:18:07] with production values. ok.
[15:18:55] generating 'labs' was easy, but 'production' - you will have the access right
[15:20:05] (PS2) BBlack: Add role::cache::ssl::sni [puppet] - https://gerrit.wikimedia.org/r/161193
[15:20:43] (PS3) Giuseppe Lavagetto: Version bump [debs/hhvm] - https://gerrit.wikimedia.org/r/161244
[15:21:15] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Version bump [debs/hhvm] - https://gerrit.wikimedia.org/r/161244 (owner: Giuseppe Lavagetto)
[15:23:01] _joe_: \O/ :-]
[15:23:25] will upgrade hhvm on contint whenever that lands in
[15:23:59] omg
[15:24:36] i wonder if we'll be able to merge wikibase, though good to force us to fix whatever issues there might be
[15:24:52] <_joe_> hashar: eh just realized I screwed up something spectacularly
[15:25:16] <_joe_> not *so* much
[15:25:23] <_joe_> meaning, it's recoverable
[15:25:32] hashar: or will we be able to run both or one/other ?
[15:25:50] aude: what do you mean by merging wikibase ?
[15:25:54] Wikibase to core ?
[15:25:55] https://travis-ci.org/wikimedia/mediawiki-extensions-Wikibase/jobs/35629216
[15:26:12] i think i looked at that but not sure what the issue is now
[15:26:15] or same
[15:26:17] bouh :-(
[15:26:25] I have yet to add hhvm to the contint system
[15:26:29] tonythomas: maybe we just live-hack the manifest that sets realm on that labs instance
[15:26:31] need a few more weeks
[15:26:37] hhvm does not like our class_aliases for one, i think memory leak or something
[15:26:47] we will eliminate them but it's not that simple
[15:32:00] Jeff_Green: oh. was out. yeah. we live hack in
[15:38:29] !log restarting Zuul just to be safe
[15:38:33] Logged the message, Master
[15:38:44] (CR) Ottomata: Add nickel to $MONITORING_HOSTS network, rename ferm::rule icinga-all to monitoring-all (1 comment) [puppet] - https://gerrit.wikimedia.org/r/160802 (owner: Ottomata)
[15:39:47] robh: Any chance I can catch mark over here today... or shall I rather go via mail?
[15:39:54] (PS2) Ottomata: Add nickel to $MONITORING_HOSTS network, rename ferm::rule icinga-all to monitoring-all [puppet] - https://gerrit.wikimedia.org/r/160802
[15:40:14] ?
[15:40:31] hoo: i dont know what you mean, are you asking me how you should contact mark bergsma?
[15:40:43] robh: yep
[15:40:52] uh, well, he got pinged but yea, i guess email
[15:40:57] Saying "hey mark" is a good start
[15:41:07] meh, not sure I can make my point via mail as good, but whatever
[15:42:20] * robh is presently tring to fix his irc bouncer which decided to die a horrible death overnight
[15:50:30] (PS3) BBlack: Add role::cache::ssl::sni [puppet] - https://gerrit.wikimedia.org/r/161193
[15:57:59] (PS3) Alexandros Kosiaris: Remove the now defunct backup::server class [puppet] - https://gerrit.wikimedia.org/r/159284
[15:59:32] (CR) Alexandros Kosiaris: [C: 2] Add nickel to $MONITORING_HOSTS network, rename ferm::rule icinga-all to monitoring-all [puppet] - https://gerrit.wikimedia.org/r/160802 (owner: Ottomata)
[16:00:26] (CR) Alexandros Kosiaris: [C: 2] Remove the now defunct backup::server class [puppet] - https://gerrit.wikimedia.org/r/159284 (owner: Alexandros Kosiaris)
[16:02:37] (PS4) BBlack: Add role::cache::ssl::sni [puppet] - https://gerrit.wikimedia.org/r/161193
[16:09:34] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[16:09:38] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[16:10:16] ottomata, jgage, can hadooplogstash04 and hadooplogstash06 survive a bit of downtime? I'd like to move them to a different labs host.
[16:10:46] jgage: knows, but I betcha the answer is yes
[16:11:34] ottomata: I phrased that question like it was optional, but really the options are me doing it now while you're expecting it or at a totally unpredictable time over the weekend :)
[16:13:14] andrewbogott: jgage uses (used?) those for testing logstash stuff
[16:13:23] ok, I'll wait for him to get in
[16:13:28] just for testing stuff, so if they go down I doubt it will be a problem
[16:13:32] i'm 95% it is fine
[16:13:49] if it isn't, he should be able to recreate them very easily
[16:13:57] he was usin git for testing deb packages and puppet stuff
[16:14:01] all of which has been committed
[16:14:11] (PS46) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[16:14:25] (PS47) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[16:14:35] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[16:14:35] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[16:15:05] greg-g, anyone deploying? need an i18n message pushed out asap - partner is testing in real time :(
[16:15:30] https://gerrit.wikimedia.org/r/#/c/161255/
[16:18:38] yurikR: ori has next deploy is less than two hours so no deploys til then according to wikitech at least
[16:19:05] JohnFLewis, it should be fundraising right now, but i can't find them online
[16:19:19] yurikR: #wikimedia-fundraising
[16:19:29] shouldn't they be here when deploying? :)
[16:19:41] Are they deploying?
[16:19:49] they are on schedule
[16:20:08] yurikR: which? Wikitech shows a clear gap for now
[16:20:22] Yep, I think you're wrong there
[16:20:37] https://wikitech.wikimedia.org/wiki/Deployments#Thursday.2C.C2.A0September.C2.A018
[16:20:54] oh, wrong timezone. GRR
[16:20:55] sorry
[16:21:11] Thu Sep 18 16:20:57 UTC 2014
[16:21:13] (date --utc)
[16:21:28] :p
[16:21:39] yep, my bad. Ok, will push. If i am adding a new translation to portugeese, do i need to scap?
[16:21:47] yes :S
[16:22:05] i'm not adding a new message, only translating existing that wasn't translated before
[16:22:17] any way to quickly push it?
[16:22:25] if not, scap it is
[16:22:54] mw-update-l10n ?
[16:22:59] Not sure that still works
[16:23:21] bd808: ^
[16:23:44] scap scap scap scap
[16:23:46] scap
[16:23:50] jsut scap
[16:23:51] bleh, ok
[16:24:19] Does anyone know about the brand-new deployment-pdf02 instance?
[16:24:20] bd808: So what about mw-update-l10n is it broken?
[16:24:35] scap.UpdateL10n.run()
[16:25:15] it builds the cache in the staging area on tin but doesn't ship to the cluster
[16:25:29] or wait no
[16:25:54] it puts the json files back together as cdbs on each host?
[16:25:54] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Alternative_to_scap
[16:26:09] I have never dared to do that
[16:26:38] as per https://gerrit.wikimedia.org/r/#/c/155753/45/manifests/role/mail.pp we made the 'production' to point to the host ( http://meta.wikimedia.org/w/api.php ) as a dummy, but it looks like /usr/bin/curl -H 'Host: appservers.svc.eqiad.wmnet' http://meta.wikimedia.org/w/api.php -d "action=bouncehandler" --data-urlencode "jgreen_just_testing@wikimedia.org"
[16:26:38] gives not the desirable output
[16:26:54] hoo: it would technically work but I'd be willing to bet $50 that it is actually slower than running scap
[16:27:16] Oh... wow
[16:27:20] update wikitech then?
[16:27:27] Or test it during some quite time
[16:27:39] Jeff_Green thinks that the webserver running at http://meta.wikimedia.org doesn't seem to recognize the hostname appservers.svc.eqiad.wmnet
[16:27:51] hoo, bd808 any thoughts on this ?
[16:27:56] hoo: sync-l10nupdate-1 rsyncs the cdb files as binary blobs [16:28:19] rsync is not good at that [16:28:48] tonythomas: the webservers actually are at appservers.svc.eqiad.wmnet [16:29:17] Thats the internal name for pybal right? [16:29:24] hoo: looks like our command was correct then? [16:29:27] hoo@tin:~$ curl -H 'Host: meta.wikimedia.org' appservers.svc.eqiad.wmnet/wiki/Special:Version [16:29:32] works [16:29:55] of course that only works from production [16:29:58] not even beta or something [16:30:50] tonythomas: why would the Host be an internal service alias? [16:30:52] hoo: do you think we interchanged the Host: url ? [16:31:09] It looks like those may be backwards to me [16:31:09] tonythomas: Oh... dunno, haven't looked that much at the recent PSs [16:31:26] bd808: For internal access those are the way to go according to mark (or so) [16:31:32] Those are the LVS backends, yeah [16:31:47] pybal is the thing we have on top of LVS to depool stuffs [16:31:51] Right, post to lvs but give a real Host for apache [16:32:25] bd808: so something like. like /usr/bin/curl -H 'Host: meta.wikimedia.org' appservers.svc.eqiad.wmnet/w/api.php -d "action=bouncehandler" --data-urlencode "testing" [16:32:26] tonythomas: Is verp_bounce_post_url only evaluated by the exim server? [16:32:33] yeah. [16:32:57] yup. okey. let me try the prod ones in labs [16:33:09] typo -- lab ones [16:33:10] That would hit *some* MW host via the lvs and ask to talk to the meta apache vhost [16:33:17] k [16:34:53] (03PS5) 10BBlack: Add role::cache::ssl::sni [puppet] - 10https://gerrit.wikimedia.org/r/161193 [16:35:33] (03CR) 10jenkins-bot: [V: 04-1] Add role::cache::ssl::sni [puppet] - 10https://gerrit.wikimedia.org/r/161193 (owner: 10BBlack) [16:37:02] godog: hi. re: 2846 so, first of all, very nice! 2012 hah, so i forgot most of it but good cleanup to ping me. what i remember is this: we wanted to let non-root users read /var/log/syslog and /var/log/messages on labs. 
we puppetized it in base / define syslogs::readable() {. then somebody, i guess Ryan, said it's ok for now but we should also fix the root cause (not sure anymore which one exactly, i guess the permissions being like that [16:37:10] godog: ugh, was that too long :) [16:38:17] bd808: curl -H 'Host: meta.wikimedia.org' appservers.svc.eqiad.wmnet/w/api.php -d "action=bouncehandler" doesnt works from deployment-mediawiki02 [16:38:30] mutante: haha well the other thing is that users in labs could just have root no? but without a definite root cause is hard to tell [16:38:31] can you or hoo try that in a prod server ? [16:38:40] try what? [16:38:49] curl -H 'Host: meta.wikimedia.org' appservers.svc.eqiad.wmnet/w/api.php -d "action=bouncehandler" [16:38:57] oh, that wont work from beta, as stated above [16:39:21] <error code="unknown_action" info="Unrecognized value for parameter &#039;action&#039;: bouncehandler" xml:space="preserve"> [16:39:24] true that. I just want to conifrm before PS 47 :D [16:39:33] I guess the API is not yet on prod. right? [16:39:37] not yet [16:39:42] looks good then [16:39:44] (03PS6) 10BBlack: Add role::cache::ssl::sni [puppet] - 10https://gerrit.wikimedia.org/r/161193 [16:40:12] anyway, it wont hit the API by any chance, and do you think I should be adding that in the PS ? [16:40:36] (03PS1) 10Andrew Bogott: Revert "(Re-)enable VisualEditor for Wikitech (labswiki)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161262 [16:40:41] ok, deploying zero and scaping [16:40:45] greg-g, ^ [16:41:20] (03PS7) 10BBlack: Add role::cache::ssl::sni [puppet] - 10https://gerrit.wikimedia.org/r/161193 [16:41:35] we have 2 steps of checking - one the regex check for the VERP pattern, ans secondly, a verp_domain check which is a check to null currently ( in prod ) [16:42:15] (03CR) 10Jforrester: "Can we help fix Parsoid then?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/161262 (owner: 10Andrew Bogott) [16:42:29] !log yurik Started scap: (no message) [16:42:35] Logged the message, Master [16:42:56] yurikR: g'morning, just reading scrollback. One thing, put a message in your scap ;) [16:43:14] greg-g, lol, was tying the log command :) sorry [16:43:18] (03CR) 10Andrew Bogott: "I disabled parsoid on Reedy's advice. I welcome a coherent effort to enable this and make it work, but throwing arbitrary settings into t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161262 (owner: 10Andrew Bogott) [16:43:26] yurikR: no worries [16:43:42] !log yurik scaping zero - partner needs an l10n message asap [16:43:47] Logged the message, Master [16:44:19] andrewbogott: What is a work sprint? [16:44:34] andrewbogott: We put in quite a bit of work to get Parsoid working on wikitech, and yet weren't contacted when it was removed because of issues. [16:44:37] Hm, and reedy +2'd the patch to enable it. [16:44:41] curious [16:44:42] Yeah. [16:44:56] It was creating job queue jobs that weren't being processed [16:45:02] Easiest thing at the time was to just disable it [16:45:12] Reedy: ok, but we can't disable parsoid and enable VE, can we? [16:45:21] Ah, right. Is the Parsoid PHP extension not installed? [16:45:21] (03PS8) 10BBlack: Add role::cache::ssl::sni [puppet] - 10https://gerrit.wikimedia.org/r/161193 [16:45:21] As we were dealing with other job queue issues [16:45:25] Does wikitech need a parsoid server [16:45:27] godog: they can only have root if they are project admins, not if they are mere project members [16:45:32] bd808: Yes. [16:45:43] James_F: what I mean by 'work sprint' is: I don't know enough to mess with this, and I need y'all to discuss and get on the same page :) [16:45:55] godog: i think we want to let everybody read the logs but not make everybody root [16:46:17] godog: i dunno, could check what the current status is in labs.. [16:47:18] andrewbogott: Sure. 
:-) [16:47:28] andrewbogott: I was a bit surprised it was merged, TBH. [16:47:46] mutante: ah ok, the current status looks like the same as in prod, syslog:adm [16:48:12] jeremyb: Due to a combination of me being wrong and OpenStack being wrong, the new deployment-pdf02 instance got scheduled on virt1006, the host slated for demolition. Mind if I move it out of the way? A bit of downtime OK? [16:48:58] (This is aggravating -- the more things I move off of virt1006, the more inviting it looks to the scheduler for new instances. It's like bailing a leaky boat) [16:49:53] (03PS48) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - 10https://gerrit.wikimedia.org/r/155753 [16:50:34] hoo: this one looks good ? https://gerrit.wikimedia.org/r/#/c/155753/48/manifests/role/mail.pp [16:51:46] tonythomas: That looks better to me [16:52:14] bd808: yay. :) [16:52:19] tonythomas: Yeah [16:52:41] ok. we hope to get this deployed today :) [16:53:58] godog: hmm.. let's close it for being too unspecific before it gets imported into phab, or let labs people decide [16:54:36] tonythomas: Ok... but before that hits production you still have to fix the issue that is always poking metawiki [16:54:44] although that's not really desired [16:55:10] hoo: you mean the fact that it points to meta.wikimedia.org ? [16:55:20] mutante: sounds like a plan! [16:55:38] tonythomas: yep [16:55:44] not everyone even has an account tehre [16:56:15] hoo: yeah. actually, we were thinking of putting a random wiki name as of now, and I put meta.wikimedia.org [16:56:33] currently we plan to test only on beta. right ? [16:56:38] Sure [16:56:40] just saying [16:57:09] yeah. so do you think I should change that to null ? We are so sure that no bounce will ever make that curl request to meta though [16:57:19] as per the sec measures we put in [16:57:57] the domains to accept from -- or the +verp_domains, will be = null in prod. 
so it make sure that no bounce reach this transport
[16:58:23] yeah, that's ok for beta
[16:58:37] but once you plan to actually hit production you have to figure that
[16:58:43] or maybe use loginwiki there
[16:58:59] true. I thought CentralAuth should help.
[16:59:05] loginwiki ?
[16:59:38] got it. login.wikimedia.org
[16:59:41] yeah
[16:59:53] not sure we auto-create accoutns on meta htese days
[17:00:09] we may do, but the assumption that everyone has an account there doesn't hold
[17:00:16] also not every account is global (yet)
[17:00:38] okey. login wiki seems more like a target, than meta
[17:00:50] would be nice if I change it rightaway ?
[17:01:15] would take the risk( due to some worst case) off from meta though
[17:01:20] I'm not entirely sure what your extension does, so that will need some more consideration with csteipp
[17:01:56] we have enabled centralAuth support in our extension though
[17:02:13] and made it work if CentralAuth class never exists too
[17:02:14] Keep it for beta for now
[17:02:33] but those are points that need to be considered before hitting production
[17:02:41] you might want to add a comment into the role, though
[17:02:56] true. but I think this change would be reflecting places in prod too
[17:03:21] so - better change to loginwiki ?
[17:04:16] ( only because, every time I show people the PS - they become curious on seeing meta.wikimedia.org ) :P
[17:05:56] !log yurik Finished scap: (no message) (duration: 23m 26s)
[17:06:00] Logged the message, Master
[17:06:37] tonythomas: Do as you favour... I'm just saying that this point shouldn't be forgotten... you can either change it now and keep a comment that this needs to be checked
[17:06:44] or just keep it and add a comment
[17:06:49] or do nothing and keep it in mind
[17:06:52] I don't really care
[17:06:55] just keep it in mind, right
[17:07:16] hoo: ha :) I will change and keep a comment.
[17:08:35] !log replacing failed disk es1005
[17:08:40] Logged the message, Master
[17:11:01] AaronSchulz: "Number of mediawiki jobs queued" CRITICAL: Anomaly detected: 47 data above and 0 below the confidence bounds . duration: 5d 18h ... should that not really be critical. i'm not sure how the check works
[17:11:51] in the old check that just displayed number of jobs, 47 would have been nothing
[17:15:53] <_joe_> mutante: that's the number of anomalous datapoints
[17:16:00] <_joe_> but I guess we should remove that alert
[17:16:04] <_joe_> the data are too messy
[17:16:42] _joe_: ah, ok, thanks
[17:18:07] _joe_: i have one other question for you. should we keep all the changes that Reedy uploaded and are about "Apache config for .. using mod_fcgi"
[17:18:17] mod_proxy_fcgi
[17:18:30] <_joe_> mutante: yes please for the moment
[17:18:39] _joe_: ok, alright
[17:18:41] <_joe_> next week I'll wrap that up
[17:18:56] cool
[17:19:16] (PS49) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753
[17:24:17] mutante: STOP TRYING TO ABANDON ALL MY CHANGES! :P
[17:24:53] Reedy: hahah
[17:28:50] (PS2) BBlack: Unified nginx ssl on varnish at all sites [puppet] - https://gerrit.wikimedia.org/r/161180
[17:29:26] (PS3) BBlack: SNI nginx ssl on varnish boxes at all sites [puppet] - https://gerrit.wikimedia.org/r/161180
[17:29:59] (CR) BBlack: [C: -1] "Needs ssl::sni commit first, and testing of that."
[puppet] - https://gerrit.wikimedia.org/r/161180 (owner: BBlack)
[17:30:23] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[17:32:58] (PS1) Giuseppe Lavagetto: Fix hhvm-dev paths [debs/hhvm] - https://gerrit.wikimedia.org/r/161270
[17:40:24] (PS2) Alexandros Kosiaris: WIP: module/role for url-downloader [puppet] - https://gerrit.wikimedia.org/r/159738
[17:40:59] (PS1) Reedy: Add symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/161271
[17:41:01] (PS1) Reedy: testwiki to 1.24wmf22 [mediawiki-config] - https://gerrit.wikimedia.org/r/161272
[17:41:03] (PS1) Reedy: Wikipedias to 1.24wmf21 [mediawiki-config] - https://gerrit.wikimedia.org/r/161273
[17:41:05] (PS1) Reedy: group0 to 1.24wmf22 [mediawiki-config] - https://gerrit.wikimedia.org/r/161274
[17:41:45] (CR) Reedy: [C: 2] Add symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/161271 (owner: Reedy)
[17:41:49] (Merged) jenkins-bot: Add symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/161271 (owner: Reedy)
[17:42:04] (CR) Reedy: [C: 2] testwiki to 1.24wmf22 [mediawiki-config] - https://gerrit.wikimedia.org/r/161272 (owner: Reedy)
[17:42:09] (Merged) jenkins-bot: testwiki to 1.24wmf22 [mediawiki-config] - https://gerrit.wikimedia.org/r/161272 (owner: Reedy)
[17:44:22] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Fix hhvm-dev paths [debs/hhvm] - https://gerrit.wikimedia.org/r/161270 (owner: Giuseppe Lavagetto)
[17:45:23] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[17:48:59] !log reedy Started scap: testwiki to 1.24wmf22 and build l10n cache
[17:49:04] Logged the message, Master
[17:50:13] godog: https://gerrit.wikimedia.org/r/#/c/161162/ is docs + a notify, if you ahve a sec
[17:51:00] (PS1) BryanDavis: Stop making so many virtual resources [puppet] -
https://gerrit.wikimedia.org/r/161276 (https://bugzilla.wikimedia.org/70971)
[17:51:13] (PS1) Dzahn: add wikimania.org zone [dns] - https://gerrit.wikimedia.org/r/161277
[17:51:39] ori: You'll love that "feature" we stumbled on with labs-vagrant ^^
[17:52:04] bd808: hahaha WHOA.
[17:52:29] (CR) Filippo Giunchedi: [C: 1] mediawiki::cgroup: add docs [puppet] - https://gerrit.wikimedia.org/r/161162 (owner: Ori.livneh)
[17:52:29] 2,293,353 failed stat calls per puppet run is a bit excessive
[17:52:40] ori: yup, LGTM
[17:52:44] thanks
[17:52:51] (CR) Dzahn: [C: 1] "wow @ "full Puppet run on the same" [puppet] - https://gerrit.wikimedia.org/r/161276 (https://bugzilla.wikimedia.org/70971) (owner: BryanDavis)
[17:52:59] (CR) Andrew Bogott: "Does this load properly on labs? I would've thought it had to be in a file named monitoring/graphite.pp" [puppet] - https://gerrit.wikimedia.org/r/161240 (owner: Yuvipanda)
[17:53:02] bd808: haha "full Puppet run on the same
[17:53:04] testing host generated 2,293,353 failed stat calls"
[17:53:07] that was nice
[17:53:17] (CR) BBlack: [C: 1] add wikimania.org zone [dns] - https://gerrit.wikimedia.org/r/161277 (owner: Dzahn)
[17:53:55] bd808: ahahaha
[17:55:23] Anybody want to merge that and make disks and cpus on labs-vagrant hosts much happier?
[17:55:47] (CR) Ori.livneh: [C: 1] Stop making so many virtual resources [puppet] - https://gerrit.wikimedia.org/r/161276 (https://bugzilla.wikimedia.org/70971) (owner: BryanDavis)
[17:56:15] i'm trying to take it easy with the puppet merges
[17:56:25] bd808: is File[$install_directory] still defined someplace?
[17:56:47] It is but I shouldn't be relying on it. Did I miss one?
[17:57:01] doh Will amend
[17:57:22] (PS2) BryanDavis: Stop making so many virtual resources [puppet] - https://gerrit.wikimedia.org/r/161276 (https://bugzilla.wikimedia.org/70971)
[17:58:01] bd808: any idea what scap actually does if your ssh connection dies?
[17:58:11] At first I thought I had to keep it to make the initial directory, but git::clone takes care of that
[17:58:31] Reedy: it will die a horrible death because it needs to get to your agent
[17:58:35] (PS9) BBlack: Add role::cache::ssl::sni [puppet] - https://gerrit.wikimedia.org/r/161193
[17:59:25] Reedy: Last thing logged was "17:55:35 tin INFO - Started sync-apache"
[17:59:26] sorta expected
[17:59:27] doesn't !log it though
[17:59:43] It is probably hanging waiting for the agent to respond
[17:59:58] I don't remember what the timeout for that is
[18:00:03] ah
[18:00:05] Reedy, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140918T1800).
[18:00:56] andrewbogott: it should load fine, that's how the other monitoring classes are setup too
[18:01:29] (CR) Andrew Bogott: [C: 2] Stop making so many virtual resources [puppet] - https://gerrit.wikimedia.org/r/161276 (https://bugzilla.wikimedia.org/70971) (owner: BryanDavis)
[18:02:00] Thanks andrewbogott
[18:02:18] bd808: I'm also building a fresh labs-vagrant instance to verify that everything will still build from scratch
[18:02:20] (PS10) BBlack: Add role::cache::ssl::sni [puppet] - https://gerrit.wikimedia.org/r/161193
[18:02:26] perfect
[18:02:43] (CR) Andrew Bogott: [C: 2] beta: Move monitoring role into module [puppet] - https://gerrit.wikimedia.org/r/161240 (owner: Yuvipanda)
[18:05:15] (CR) Faidon Liambotis: "I think a dryrun => true mode that would ACCEPT by default, possibly have LOG as the last action too, would help adoption.
If other review" [puppet] - https://gerrit.wikimedia.org/r/160480 (owner: Ottomata)
[18:07:01] (PS4) Ori.livneh: mediawiki::cgroup: add docs [puppet] - https://gerrit.wikimedia.org/r/161162
[18:07:08] (CR) Ori.livneh: [C: 2 V: 2] mediawiki::cgroup: add docs [puppet] - https://gerrit.wikimedia.org/r/161162 (owner: Ori.livneh)
[18:07:55] andrewbogott: ty
[18:08:00] (PS3) Alexandros Kosiaris: WIP: module/role for url-downloader [puppet] - https://gerrit.wikimedia.org/r/159738
[18:08:56] (CR) Ori.livneh: [C: -1] "No dashes in names; convert them to underscores, please." [puppet] - https://gerrit.wikimedia.org/r/159738 (owner: Alexandros Kosiaris)
[18:13:04] (CR) Dzahn: "a README.md in the module root is nice to have because it gets parsed by puppet doc and shows up on doc.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/159738 (owner: Alexandros Kosiaris)
[18:18:58] !log reedy Started scap: testwiki to 1.24wmf22 and build l10n cache
[18:21:11] (CR) Andrew Bogott: [C: 2] Initial shinken setup for labs [puppet] - https://gerrit.wikimedia.org/r/160626 (owner: Yuvipanda)
[18:22:36] (CR) Dzahn: [C: 2] add wikimania.org zone [dns] - https://gerrit.wikimedia.org/r/161277 (owner: Dzahn)
[18:24:22] (PS3) Faidon Liambotis: Allocate sandbox vlans for codfw and ulsfo [dns] - https://gerrit.wikimedia.org/r/158636 (owner: Mark Bergsma)
[18:24:24] (PS2) Faidon Liambotis: Allocate IPv4/IPv6 for RIPE Atlas codfw/ulsfo [dns] - https://gerrit.wikimedia.org/r/158939
[18:24:33] (PS1) Dzahn: add wikimania.com, link to wikimania.org [dns] - https://gerrit.wikimedia.org/r/161284
[18:24:43] wooh.. faidon is back
[18:25:34] not really :)
[18:25:45] dang!
[18:25:49] (CR) Dzahn: [C: -1] "typo" [dns] - https://gerrit.wikimedia.org/r/161284 (owner: Dzahn)
[18:26:36] paravoid: aaww
[18:27:45] (PS2) Dzahn: add wikimania.com, link to wikimania.org [dns] - https://gerrit.wikimedia.org/r/161284
[18:32:56] (CR) Dzahn: [C: 2] add wikimania.com, link to wikimania.org [dns] - https://gerrit.wikimedia.org/r/161284 (owner: Dzahn)
[18:44:14] !log testing exim configuration change on lead.wm.o
[18:44:19] Logged the message, Master
[18:44:45] (PS1) Yuvipanda: shinken: Experimental monitoring for betacluster [puppet] - https://gerrit.wikimedia.org/r/161289
[18:44:49] Jeff_Green: changed ?
[18:45:00] not yet. disabling puppet
[18:45:04] okey
[18:46:32] tonythomas: now changed
[18:47:07] great.
[18:47:16] (PS2) Yuvipanda: shinken: Experimental monitoring for betacluster [puppet] - https://gerrit.wikimedia.org/r/161289
[18:47:39] no notifications from the bot yet :D
[18:49:21] !log reedy Finished scap: testwiki to 1.24wmf22 and build l10n cache (duration: 30m 23s)
[18:49:26] Logged the message, Master
[18:50:44] so far things look normal
[18:51:01] (PS3) Yuvipanda: shinken: Experimental monitoring for betacluster [puppet] - https://gerrit.wikimedia.org/r/161289
[18:52:18] (PS4) Yuvipanda: shinken: Experimental monitoring for betacluster [puppet] - https://gerrit.wikimedia.org/r/161289
[18:53:41] (PS11) BBlack: Add role::cache::ssl::sni [puppet] - https://gerrit.wikimedia.org/r/161193
[18:55:27] (CR) BBlack: [C: 2] "Compiler says no effect via refactor of ::unified on current localssl hosts (ulsfo caches). Pushing this through so I can test it on a de" [puppet] - https://gerrit.wikimedia.org/r/161193 (owner: BBlack)
[18:56:36] Is fenari highly loaded/in swap death again?
[18:57:10] load average: 17.01, 13.97, 10.48
[18:57:26] it certainly doesn't seem very responsive? what's the again?
and why would a server that's having services pulled from it in general be getting worse?
[18:57:44] (PS5) Yuvipanda: shinken: Experimental monitoring for betacluster [puppet] - https://gerrit.wikimedia.org/r/161289
[18:57:46] it's angry because it feels abandoned
[18:57:53] ha, puppet
[18:58:14] well puppet's at the top of load
[18:58:21] but it's the apaches eating all the memory
[18:58:26] yeah
[18:59:06] (CR) Reedy: [C: 2] Wikipedias to 1.24wmf21 [mediawiki-config] - https://gerrit.wikimedia.org/r/161273 (owner: Reedy)
[18:59:12] (Merged) jenkins-bot: Wikipedias to 1.24wmf21 [mediawiki-config] - https://gerrit.wikimedia.org/r/161273 (owner: Reedy)
[18:59:24] !log restarting apache on fenari
[18:59:29] Logged the message, Master
[18:59:33] are people putting weird things in public_html now? :)
[18:59:41] hmmm
[18:59:41] /etc/init.d/apache2: 55: [: nice: unexpected operator
[18:59:45] (PS6) Yuvipanda: shinken: Experimental monitoring for betacluster [puppet] - https://gerrit.wikimedia.org/r/161289
[18:59:49] what is that crap?
[19:00:10] that error looks old/familiar
[19:01:20] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf21
[19:01:25] Logged the message, Master
[19:01:55] the apache procs immediately jumped back to ~650MB of virt, odd
[19:01:59] better than they were, though.
[19:02:54] (CR) Reedy: [C: 2] group0 to 1.24wmf22 [mediawiki-config] - https://gerrit.wikimedia.org/r/161274 (owner: Reedy)
[19:02:56] (PS7) Yuvipanda: shinken: Experimental monitoring for betacluster [puppet] - https://gerrit.wikimedia.org/r/161289
[19:03:00] (Merged) jenkins-bot: group0 to 1.24wmf22 [mediawiki-config] - https://gerrit.wikimedia.org/r/161274 (owner: Reedy)
[19:03:58] (PS8) Yuvipanda: shinken: Experimental monitoring for betacluster [puppet] - https://gerrit.wikimedia.org/r/161289
[19:04:16] (PS4) Alexandros Kosiaris: module/role for url-downloader [puppet] - https://gerrit.wikimedia.org/r/159738
[19:04:35] (PS9) Yuvipanda: shinken: Experimental monitoring for betacluster [puppet] - https://gerrit.wikimedia.org/r/161289
[19:04:58] (CR) Alexandros Kosiaris: "Both done. Testing on the catalog compiler for linne" [puppet] - https://gerrit.wikimedia.org/r/159738 (owner: Alexandros Kosiaris)
[19:05:00] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: rest of group0 to 1.24wmf22
[19:05:05] Logged the message, Master
[19:05:49] (PS10) Yuvipanda: shinken: Experimental monitoring for betacluster [puppet] - https://gerrit.wikimedia.org/r/161289
[19:07:19] still a lot of traffic on fenari for various conf/ things
[19:07:22] 1 Warning: Error 0 executing convert +repage -trim -border 3 -bordercolor '#004C92' -background '#004C92' -fill '#FFFFFF' -font '/usr/share/fonts/truetype/ttf-de
[19:07:23] javu/DejaVuSans.ttf' -pointsize '10' 'label:Free Wikipedia from Aircel' gif:-: mkdir: cannot create directory `/sys/fs/cgroup/memory/mediawiki/job/14615': No such file
[19:07:23] or directory#012limit.sh: failed to create the cgroup.#012convert: memory allocation failed `-' @ error/quantize.c/QuantizeImage/2656.#012convert: memory allocation fai
[19:07:23] led `-' @ error/gif.c/WriteGIFImage/1621.#012GIF89a in /srv/mediawiki/php-1.24wmf21/includes/debug/MWDebug.php on line 302
[19:07:25] ffs
[19:07:38] yurikR1: 1 Warning: Error 0 executing convert +repage -trim -border 3 -bordercolor '#004C92' -background '#004C92' -fill '#FFFFFF' -font '/usr/share/fonts/truetype/ttf-dejavu/DejaVuSans.ttf' -pointsize '10' 'label:Free Wikipedia from Aircel' gif:-: mkdir: cannot create directory `/sys/fs/cgroup/memory/mediawiki/job/14615': No such file or directory#012limit.sh: failed to create the cgroup.#012convert: memory allocation failed `-'
[19:07:38] @ error/quantize.c/QuantizeImage/2656.#012convert: memory allocation failed `-' @ error/gif.c/WriteGIFImage/1621.#012GIF89a in /srv/mediawiki/php-1.24wmf21/includes/debug/MWDebug.php on line 302
[19:08:08] Reedy, thx. Grrr
[19:09:00] Reedy, does convert use some sort of a temp dir by any chance?
[19:09:07] presumably
[19:09:25] does it have the proper rights for that?
[19:09:55] everyone knows there's a hard date coming up very shortly where fenari will die, right?
[19:10:18] (PS5) Alexandros Kosiaris: module/role for url-downloader [puppet] - https://gerrit.wikimedia.org/r/159738
[19:10:20] long live fenari
[19:10:41] Reedy, that command executes fine in shell, i presume some machine has no right to create a temp file
[19:11:25] I'm getting an error on Special:Contributions at mediawiki.org
[19:11:44] PHP fatal error in /srv/mediawiki/php-1.24wmf22/extensions/Flow/includes/Data/Utils/MultiDimArray.php line 95:
[19:11:44] Cannot create references to/from string offsets nor overloaded objects
[19:11:49] bblack: yes. and not only that but somehow the last 2 days apache leaks memory. Way too much memory
[19:12:01] ragesoss: :S
[19:12:13] re-copying the important bits of mark's email to engineering@ here, just in case someone didn't get the memo:
[19:12:16] ragesoss: which user?
[19:12:16] On October 1st we plan to shutdown the few remaining Wikimedia servers left in Tampa.
[19:12:19] This gives us a few days of leeway to deal with unexpected issues, as on October 6th, Chris Johnson will start actually removing equipment from the racks.
[19:12:21] ebernhardson: ragesoss.
[19:12:22] He’ll be working there for a few days, and by October 12th we expect to be out of the facility completely.
[19:12:25] (Even if we were to cancel/move this plan, by October 17th, we’d lose all remaining connectivity to Tampa and it would drop off the Internet...)
[19:12:28] fenari is in tampa
[19:12:29] I just submitted an OAuth request.
[19:12:59] Is the OAuth proposal action what is choking it?
[19:13:00] ebernhardson: https://www.mediawiki.org/wiki/Special:RecentChanges
[19:13:04] any user
[19:13:05] I still see a bunch of apache traffic on fenari for various .cdb and .dblist files
[19:14:19] aude: not *any*. This works for me. https://www.mediawiki.org/wiki/Special:Contributions/Guillom
[19:14:30] aude: sigh ... sec will have afix
[19:14:31] ooo
[19:14:39] i'll make a bug with stacktrace
[19:15:39] https://bugzilla.wikimedia.org/show_bug.cgi?id=71014
[19:15:39] (PS1) BryanDavis: Remove legacy manifests.d/vagrant-managed.pp [puppet] - https://gerrit.wikimedia.org/r/161325
[19:15:42] ebernhardson: ^
[19:16:04] !log disabling puppet on polonium, lead, sodium, iridium, magnesium, and iodine to monitor rollout of https://gerrit.wikimedia.org/r/155753
[19:16:09] Logged the message, Master
[19:16:56] ebernhardson: it's probably unrelated but note that i deleted a flow post recently
[19:17:04] spam*
[19:21:44] i think this is a difference in php versions and reference handling, but not sure what yet :S only fatals on 5.3.10 in the cluster, not on 5.5.9 or hhvm (that ship in vagrant)
[19:21:53] sec gotta figure out how to get 5.3.10 into my vagrant
[19:22:07] !log reedy Synchronized php-1.24wmf22: (no message) (duration: 00m 57s)
[19:22:12] Logged the message, Master
[19:22:17] ebernhardson: good luck
[19:23:16] * aude wants them to be roles or something
easy to use
[19:24:41] ebernhardson: You could do it with phpenv I suppose. Or use the precise-compat branch in MWV to build a 12.04 VM
[19:25:12] (CR) Jgreen: [C: 2 V: 1] Added the bouncehandler router to catch in all bounce emails [puppet] - https://gerrit.wikimedia.org/r/155753 (owner: 01tonythomas)
[19:26:21] aude: Part of our problem is making things work without anyone specifying roles. I haven't figured out how to trick puppet into having a default setup that is replaced when you enable a different role.
[19:26:33] i see
[19:27:20] Jeff_Green: yay ! and were in !
[19:27:33] The php5 role Ori made doesn't really remove hhvm. It just sort of moves it out of the way.
[19:27:34] *we're :)
[19:27:40] (CR) Andrew Bogott: [C: 2] Remove legacy manifests.d/vagrant-managed.pp [puppet] - https://gerrit.wikimedia.org/r/161325 (owner: BryanDavis)
[19:27:52] tonythomas: yup. now I go through each host and vet the change
[19:28:38] Jeff_Green: ok.
[19:29:42] tonythomas: curl -H 'Host: login.wikimedia.org' appservers.svc.eqiad.wmnet/w/api.php -d "action=bouncehandler" ....
[19:29:58] (CR) Andrew Bogott: [C: 2] shinken: Experimental monitoring for betacluster [puppet] - https://gerrit.wikimedia.org/r/161289 (owner: Yuvipanda)
[19:30:03] was the [http://]appservers intentionally removed?
[19:30:36] I think hoo and bd808 agreed on that
[19:30:57] but you have it for $verp_bounce_post_url in labs
[19:31:32] Jeff_Green: true that. But this one seems to work without the 'http://' though
[19:31:33] mh?
[19:31:34] curl default I guess, works anyway
[19:32:06] curl defaults to http, so no need to specify that
[19:32:09] but also doesn't hurt
[19:32:12] if that was the question
[19:32:22] yeah
[19:32:41] ok
[19:33:00] and also - that looked like a weird url to give the prefix, and I cut that one down
[19:33:30] "appservers.svc.${::mw_primary}.wmnet/w/api.php"
[19:33:44] ok
[19:35:47] !log lead.wm.o exim conf checked, puppet reenabled
[19:35:51] Logged the message, Master
[19:41:52] (PS1) Ori.livneh: mediawiki::hhvm: add additional documentation [puppet] - https://gerrit.wikimedia.org/r/161331
[19:41:54] (PS1) Ori.livneh: salt: make grain-ensure.py operate locally [puppet] - https://gerrit.wikimedia.org/r/161332
[19:42:55] (CR) Ori.livneh: [C: 2] "comments-only change" [puppet] - https://gerrit.wikimedia.org/r/161331 (owner: Ori.livneh)
[19:43:21] (PS1) Yuvipanda: shinken: Add hashar and cmcmahon to betacluster alert group [puppet] - https://gerrit.wikimedia.org/r/161333
[19:43:24] andrewbogott: ^
[19:44:18] (CR) Andrew Bogott: [C: 2] shinken: Add hashar and cmcmahon to betacluster alert group [puppet] - https://gerrit.wikimedia.org/r/161333 (owner: Yuvipanda)
[19:44:19] !log polonium.wm.o exim conf checked, puppet reenabled
[19:44:24] Logged the message, Master
[19:44:31] YuviPanda: I am already defined in icinga somewhere
[19:45:03] hashar: indeed, but icinga != shinken :)
[19:45:09] shinken is shinier icinga
[19:45:10] (CR) Hashar: "Should use my other email address instead: hashar@free.fr" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/161333 (owner: Yuvipanda)
[19:45:16] !log iodine.wm.o exim conf checked, puppet reenabled
[19:45:20] hashar: ah, damn.
let me modify
[19:45:21] Logged the message, Master
[19:45:21] YuviPanda: yeah figured that out eventually
[19:45:22] oops , sorry hashar
[19:45:37] my wmf address is only for internal stuff
[19:45:49] not a big deal, just need 1 more commit hehe
[19:45:56] (PS1) Yuvipanda: shinken: Use preferred email to notify hashar [puppet] - https://gerrit.wikimedia.org/r/161334
[19:45:58] andrewbogott: ^
[19:46:14] if there is not too many alert, we might consider spamming the qa-alerts mailing list as well
[19:46:16] (CR) Andrew Bogott: [C: 2] shinken: Use preferred email to notify hashar [puppet] - https://gerrit.wikimedia.org/r/161334 (owner: Yuvipanda)
[19:46:30] hashar: once things have stabilized, I suppose :)
[19:46:44] YuviPanda: yeah
[19:46:53] and ideally poke a lot more people
[19:48:37] !log reedy Synchronized php-1.24wmf22/extensions/Flow/: (no message) (duration: 00m 16s)
[19:48:41] Logged the message, Master
[19:49:00] hashar: yeah. ideally, it will figure out which particular thing failed, and poke the people responsible :)
[19:50:08] !log sodium.wm.o exim conf checked, puppet reenabled
[19:50:12] Logged the message, Master
[19:54:24] !log magnesium.wm.o exim conf checked, puppet reenabled
[19:54:28] Logged the message, Master
[19:57:38] !log iridium.wm.o exim conf checked, puppet reenabled
[19:57:43] Logged the message, Master
[20:02:56] * YuviPanda wonders if symlinks in the puppet repo are frowned upon
[20:03:07] basically, we already have a check_graphite for icinga that I also want to use for shinken
[20:03:14] question is should I copy or symlink
[20:03:33] * YuviPanda wonders if ori has an opinion
[20:03:33] neither?
[20:03:51] Jeff_Green: well, I could refer to the other one directly, but it's 'files/icinga/check_graphite', which feels wrong on *two* levels
[20:03:57] (the icinga reference, and it's not in a module)
[20:04:19] and if the efforts to make icinga into a module succeed, then this will have to change as well
[20:04:29] I don't fully understand
[20:04:41] Jeff_Green: so, shinken is an icinga replacement we're trying out for labs
[20:04:54] another nagios fork?
[20:04:59] Jeff_Green: nope, a rewrite in python
[20:05:12] but it reuses nagios checks?
[20:05:14] http://www.shinken-monitoring.org/
[20:05:21] it is compatible with them, yeah
[20:05:24] bleh
[20:05:51] imo plugins written for nagios belong under a nagios module
[20:05:52] ottomata: Can I migrate wikimetrics1 now? That'll be a reboot and 10 mins or so of downtime. (I've moved quite a lot of other instances now, they generally seem to survive.)
[20:06:07] and if we run something else that uses them, we'd reference them in the nagios module
[20:06:17] not under an icinga or shrinken module
[20:06:19] hmm, that does make sense.
[20:06:34] but we don't have a nagios module, and I'm hesitant to create one just to put our custom plugins in
[20:08:26] people got too enthusiastic about switching to icinga :-P
[20:08:33] RECOVERY - RAID on es1005 is OK: OK: optimal, 1 logical, 2 physical
[20:08:38] Jeff_Green: heh
[20:08:56] it would reference them at the current location then
[20:09:10] Jeff_Green: yeah, that's what I'm going to do, and leave a note on the file itself
[20:09:25] hmm, or I suppose whoever breaks it will notice
[20:09:43] Jeff_Green: we still have a manifests/nagios.pp and a manifests/misc/icinga.pp :)
[20:09:44] then when we get too enthusiastic about shinken we can move them
[20:09:54] heh
[20:10:00] YuviPanda: yeah
[20:10:12] it's still nagios on the clients, no?
[20:10:49] * Jeff_Green not so secretly thinks everything nagios or derived from nagios should be banned
[20:11:31] grumblegrumblegrumble
[20:11:50] Jeff_Green: it could, yeah. although I think for now we'll just use graphite based checks for labs stuff instead of NRPE/etc based ones
[20:12:41] ya, that's a step in the right direction imho
[20:12:59] * YuviPanda agrees
[20:13:21] Jeff_Green: so right now, labs shinken will only do graphite based checks + some 'active' ones (check_http mostly)
[20:14:04] Jeff_Green: we've diamond running on almost all labs hosts sending data to graphite.wmflabs.org, and most passive checks will just be graphite based ones
[20:14:14] and writing a new diamond collector is simple, so we'll just write new ones for metrics we want
[20:14:20] * YuviPanda should write one for toollabs OGE soon
[20:15:06] cool
[20:26:20] (PS1) RobH: setting ms-be2005's mac info in lease file [puppet] - https://gerrit.wikimedia.org/r/161347
[20:39:18] ori: instance 'maps-dj' in project 'maps' -- ok if I move and reboot? (You created it, it looks like)
[20:40:24] (CR) RobH: [C: 2] setting ms-be2005's mac info in lease file [puppet] - https://gerrit.wikimedia.org/r/161347 (owner: RobH)
[20:52:39] andrewbogott: it's thedj's
[20:52:42] thedj: ping?
[20:56:42] (PS1) BryanDavis: labs-vagrant: Fix initial clone and ownership [puppet] - https://gerrit.wikimedia.org/r/161353
[20:56:46] when we do this "+2 but don't submit"-thing on patch sets of others, it means when it _actually_ gets merged we dont see anything on IRC
[20:57:18] grrrit-wm: please also report a "submit" ?
[20:59:26] mutante: are we willing to have it ping us twice as it outputs one for the +2 and another for the submit?
[20:59:34] cuz i betcha thats the result
[20:59:34] cuz i betcha thats the result [20:59:41] (03CR) 10Dzahn: "yea, merging this was a noop on zirconium, but yes, we have these:" [puppet] - 10https://gerrit.wikimedia.org/r/161149 (owner: 10Dzahn) [20:59:41] (im cool with it, cuz what you point out also annoys me) [21:00:53] robh: yes, i think we do, it's 2 separate actions [21:01:28] (03PS1) 10BBlack: Add cp1008.wikimedia.org DNS [dns] - 10https://gerrit.wikimedia.org/r/161354 [21:02:16] (03CR) 10BBlack: [C: 032] Add cp1008.wikimedia.org DNS [dns] - 10https://gerrit.wikimedia.org/r/161354 (owner: 10BBlack) [21:02:29] (03CR) 10Dzahn: [C: 032] "hmm, yea, the wiki log looks a bit weird. wiki break?" [puppet] - 10https://gerrit.wikimedia.org/r/161098 (owner: 10Dzahn) [21:02:34] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: Epic puppet fail [21:04:31] (03CR) 10Dzahn: "one is about the normal restarting as part of a deploy, which would only mean executing something from tin that then runs on all appserver" [puppet] - 10https://gerrit.wikimedia.org/r/159636 (owner: 10Hoo man) [21:07:53] andre__: fyi, andrewbogott is moving phab-01 to a different host [21:08:07] mutante: fine fine. [21:08:07] mutante: I sent a message in -dev [21:08:24] Better to get two notifications than zero. :) [21:09:10] andre__: I'm going to wait until earlier tomorrow, upon yuvi's request [21:09:36] Tss. I always thought that Yuvi knows no timezones and never sleeps. But okay. 
:)
[21:10:08] (PS1) BBlack: Add cp1008.wm.o to puppet [puppet] - https://gerrit.wikimedia.org/r/161355
[21:10:56] (CR) BryanDavis: "I don't have a project setup with a self-hosted puppetmaster to test this on, but I was getting a "Permission denied" setting up a new lab" [puppet] - https://gerrit.wikimedia.org/r/161353 (owner: BryanDavis)
[21:11:01] (PS2) BBlack: Add cp1008.wm.o to puppet [puppet] - https://gerrit.wikimedia.org/r/161355
[21:11:20] (CR) BBlack: [C: 2 V: 2] Add cp1008.wm.o to puppet [puppet] - https://gerrit.wikimedia.org/r/161355 (owner: BBlack)
[21:12:13] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[21:13:13] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: No changes to merge.
[21:21:20] (CR) Dzahn: [C: 1] lists.wm.org - raise HSTS max-age to 1 year [puppet] - https://gerrit.wikimedia.org/r/161177 (https://bugzilla.wikimedia.org/38516) (owner: Chmarkine)
[21:22:33] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[21:23:55] (CR) John F. Lewis: [C: 1] lists.wm.org - raise HSTS max-age to 1 year [puppet] - https://gerrit.wikimedia.org/r/161177 (https://bugzilla.wikimedia.org/38516) (owner: Chmarkine)
[21:30:54] (CR) Dzahn: "these files are identical to contacts files for icinga/nagios. when it comes to them we argue they can't be public and put them into priva" [puppet] - https://gerrit.wikimedia.org/r/161333 (owner: Yuvipanda)
[21:32:00] (CR) Yuvipanda: "Feel free to do that. I do not even have access to the private repo, so unsure :) Also this runs on labs, so can't access the private repo" [puppet] - https://gerrit.wikimedia.org/r/161333 (owner: Yuvipanda)
[21:32:04] (CR) Dzahn: "is it really a good idea to have 2 separate competing places to define contacts in the identical syntax?"
[puppet] - https://gerrit.wikimedia.org/r/161333 (owner: Yuvipanda)
[21:35:20] (CR) Ori.livneh: "I asked Ryan Lane to take a look and he said that it should work." [puppet] - https://gerrit.wikimedia.org/r/161332 (owner: Ori.livneh)
[21:35:22] (PS1) BBlack: oops, not cp1008.e.wm.o ... [puppet] - https://gerrit.wikimedia.org/r/161356
[21:35:39] (CR) BBlack: [C: 2 V: 2] oops, not cp1008.e.wm.o ... [puppet] - https://gerrit.wikimedia.org/r/161356 (owner: BBlack)
[21:36:46] (CR) Dzahn: "hmm. you just asked me personally to create contacts several times in the last couple days and i think you do shinken for the reason that " [puppet] - https://gerrit.wikimedia.org/r/161333 (owner: Yuvipanda)
[21:37:23] (PS1) BBlack: fix sni_star def [puppet] - https://gerrit.wikimedia.org/r/161357
[21:37:35] (CR) BBlack: [C: 2 V: 2] fix sni_star def [puppet] - https://gerrit.wikimedia.org/r/161357 (owner: BBlack)
[21:38:09] (CR) Yuvipanda: "The fact that config files are reusable isn't too big a factor, IMO. It's more that it scales horizontally better." [puppet] - https://gerrit.wikimedia.org/r/161333 (owner: Yuvipanda)
[21:39:13] (CR) Dzahn: "jzerebecki: see above for the discussion if email address can be public or not ..^" [puppet] - https://gerrit.wikimedia.org/r/161333 (owner: Yuvipanda)
[21:40:01] (CR) Yuvipanda: "re: emails. All the people involved were ok with it being public, and I'll check to make sure before adding anyone." [puppet] - https://gerrit.wikimedia.org/r/161333 (owner: Yuvipanda)
[21:41:22] (PS1) BBlack: fix server_aliases for sni_star [puppet] - https://gerrit.wikimedia.org/r/161358
[21:42:06] (CR) Cmcmahon: "I fully expect my professional email address to be public.
It is easily found: http://www.mediawiki.org/wiki/User:Cmcmahon(WMF)" [puppet] - 10https://gerrit.wikimedia.org/r/161333 (owner: 10Yuvipanda) [21:43:06] (03CR) 10BBlack: [C: 032] fix server_aliases for sni_star [puppet] - 10https://gerrit.wikimedia.org/r/161358 (owner: 10BBlack) [21:46:04] (03PS1) 10BBlack: Fix monitoring for star cert stuff [puppet] - 10https://gerrit.wikimedia.org/r/161359 [21:46:16] (03CR) 10BBlack: [C: 032 V: 032] Fix monitoring for star cert stuff [puppet] - 10https://gerrit.wikimedia.org/r/161359 (owner: 10BBlack) [22:10:08] ori: I just emailed ops@ about how to best share code between icinga/shinken in puppet, and since it's a puppet code quality/style issue, do respond if you've any thoughts :) [22:48:20] hey, ops folks, how's SWATting? [22:48:40] i'd like to piggy back on swat and do a quick OCG deploy, if nothing's on fire. [22:49:23] also, i'm hoping that salt on beta has recovered since my last deploy? [23:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140918T2300). Please do the needful. [23:01:20] I'll do it [23:03:24] cscott: Ouch. How long does OGC merge normally take? [23:04:00] cscott: https://integration.wikimedia.org/zuul/ – lots of things backed up waiting for it to finish. [23:04:17] cscott: (Including, selfishly, a patch I've got for SWAT.) [23:04:18] !log catrope Synchronized php-1.24wmf21/extensions/UploadWizard/: SWAT (duration: 00m 08s) [23:04:23] Logged the message, Master [23:04:46] It runs npm test [23:04:51] Which I suppose might be slow [23:04:54] Reedy, does wikitech use the same interwiki stuff as production now? [23:07:40] James_F: well, it completes in less than two minutes on travis.... [23:07:57] cscott: Hmm. Is it stuck? 
[23:08:39] tgr: Your UploadWizard SWAT went out a few minutes ago
[23:09:51] cscott, James_F: I killed the Jenkins job, it had been running for 20 minutes
[23:10:06] yeah, it was probably stuck. i'm looking at it.
[23:10:16] RoanKattouw: it's working, thanks!
[23:11:10] does jenkins set any environment variables so I know i'm running on jenkins?
[23:13:07] RoanKattouw: 161370
[23:13:16] RoanKattouw: (It's in the Deployments calendar.)
[23:13:55] !log catrope Synchronized php-1.24wmf22/extensions/VisualEditor/: SWAT (duration: 00m 08s)
[23:13:59] Logged the message, Master
[23:23:04] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/161375/
[23:25:14] !log catrope Synchronized php-1.24wmf22/resources/lib/oojs-ui/: oojs-ui bugfixes (duration: 00m 06s)
[23:25:19] Logged the message, Master
[23:31:44] (PS1) Aaron Schulz: Set $wgBloomFilterStores in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/161382
[23:45:11] RoanKattouw: I have an extension update to push (with Greg's OK) whenever you're done
[23:46:08] (PS1) Ori.livneh: Add HHVM to beta feature whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/161388
[23:46:20] (CR) Ori.livneh: [C: 1] Set $wgBloomFilterStores in labs [mediawiki-config] - https://gerrit.wikimedia.org/r/161382 (owner: Aaron Schulz)
[23:46:35] ori: I'm done, go for it
[23:46:42] RoanKattouw: cool, thanks
[23:47:07] (PS1) BBlack: fix server_name for star-cert stuff [puppet] - https://gerrit.wikimedia.org/r/161389
[23:48:04] (CR) BBlack: [C: 2] fix server_name for star-cert stuff [puppet] - https://gerrit.wikimedia.org/r/161389 (owner: BBlack)
[23:50:23] (CR) Ori.livneh: [C: 2] Add HHVM to beta feature whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/161388 (owner: Ori.livneh)
[23:50:28] (Merged) jenkins-bot: Add HHVM to beta feature whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/161388 (owner: Ori.livneh)
[23:50:46] ori: umm, should it be 'HHVM' uppercase?
[23:51:36] !log ori Synchronized php-1.24wmf21/extensions/WikimediaEvents: Update WikimediaEvents for cherry-picks (duration: 00m 06s)
[23:51:38] legoktm: doesn't matter
[23:51:40] Logged the message, Master
[23:52:15] it doesn't? !in_array( $key, $wgBetaFeaturesWhitelist )
[23:52:22] !log ori Synchronized php-1.24wmf22/extensions/WikimediaEvents: Update WikimediaEvents for cherry-picks (duration: 00m 06s)
[23:52:28] Logged the message, Master
[23:52:28] legoktm: oh, for the feature name. ahem. yep.
[23:53:23] (PS1) Ori.livneh: Correct case of HHVM beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/161390
[23:53:25] legoktm: ^
[23:53:40] (CR) Legoktm: [C: 1] Correct case of HHVM beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/161390 (owner: Ori.livneh)
[23:53:53] (CR) Ori.livneh: [C: 2] Correct case of HHVM beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/161390 (owner: Ori.livneh)
[23:53:57] (Merged) jenkins-bot: Correct case of HHVM beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/161390 (owner: Ori.livneh)
[23:54:50] also everything on https://www.mediawiki.org/wiki/HHVM/About looked fine/correct?
[23:54:51] !log ori Synchronized wmf-config/InitialiseSettings.php: I2466f6b6e: Add HHVM to beta feature whitelist (duration: 00m 08s)
[23:54:55] Logged the message, Master
[23:55:35] ori: ahhh, needs scap for messages.
[23:55:36] !log ori Started scap: Add HHVM as a beta feature
[23:55:41] Logged the message, Master
[23:55:42] :D
[23:55:54] legoktm: yeah, I know. I wanted to check if everything else looked OK before doing that.
[23:56:20] thanks