[00:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141219T0000). Please do the needful. [00:00:13] Thank you! [00:00:40] All done :) [00:01:53] * James_F waits for merges. :-) [00:04:31] * aude here for swat [00:05:26] !log Reloading Zuul to deploy I3333f5e45 [00:05:29] Grumble grumble slow jenkins grumble. :-) [00:05:32] Logged the message, Master [00:05:34] Krinkle: Argh. [00:05:43] Krinkle: During a SWAT? Seriously? [00:05:44] Zuul is graceful. You won't notice a thing [00:05:52] Not good timing, though. [00:05:57] takes exactly 4ms for the actual reload. [00:06:33] I also breath during a SWAT. I can log it in a different channel :P [00:06:41] Krinkle: :-P [00:07:50] Krinkle: Hmm. https://gerrit.wikimedia.org/r/#/c/180912/ definitely shouldn't be blocked by https://gerrit.wikimedia.org/r/#/c/180701/ – they're in different branches. [00:08:20] James_F: the new 'mediawiki-gate' blocks all mw-related repos on each other. Regardless of branch or direction. [00:09:38] Krinkle: That seems… inefficient. [00:11:16] What on Earth? [00:14:37] mhm, no one to deploy again? [00:15:40] okay, I can do it again [00:15:52] MaxSem: thanks [00:15:54] :/ [00:15:58] thanks [00:16:57] (03CR) 10MaxSem: [C: 032] Turning off WikiGrok test on en.wiki, turning on WikiGrok on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180867 (owner: 10Kaldari) [00:17:13] (03Merged) 10jenkins-bot: Turning off WikiGrok test on en.wiki, turning on WikiGrok on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180867 (owner: 10Kaldari) [00:18:26] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/180867 (duration: 00m 06s) [00:18:30] Logged the message, Master [00:18:35] kaldari|2, ^ [00:19:00] checking [00:21:34] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [00:21:50] MaxSem: hmm, enwiki is still returning true for mw.config.get( 'wgMFEnableWikiGrok' ); [00:22:30] wtf, I made pulled [00:22:52] maxsem@tin:/srv/mediawiki-staging$ git log -1 [00:22:59] Merge "Turning off WikiGrok test on en.wiki, turning on WikiGrok on test2" [00:23:38] try debug mode? [00:26:51] mw.config.get( 'wgMFEnableWikiGrok' ); is false for me [00:26:55] on enwiki [00:26:58] yep, caching [00:28:31] !log maxsem Synchronized php-1.25wmf13/extensions/Wikidata/: https://gerrit.wikimedia.org/r/#q,181003,n,z (duration: 00m 12s) [00:28:35] aude, ^^ [00:28:38] Logged the message, Master [00:28:42] checking [00:28:52] it works [00:30:02] !log maxsem Synchronized php-1.25wmf12/extensions/Wikidata/: https://gerrit.wikimedia.org/r/#q,181000,n,z (duration: 00m 13s) [00:30:09] Logged the message, Master [00:30:22] yay [00:30:25] https://www.wikidata.org/wiki/Special:SetSiteLink/Q1/dewiktionary [00:30:26] looks good [00:30:44] (03CR) 10MaxSem: [C: 032] Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:32:50] ffuck you jerkins [00:34:17] (03CR) 10MaxSem: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:34:47] (03CR) 10Ori.livneh: "Not sure what is going on in this patch. It doesn't disable xhprof profiling; it disables xenon traces. 
Xenon has nothing to do with xhpro" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180791 (owner: 10Giuseppe Lavagetto) [00:35:21] (03PS3) 10MaxSem: Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:35:33] (03PS1) 10Ori.livneh: Revert "Temporarily disable xhprof profiling, due to stability issues" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181007 [00:38:30] Krinkle, zuul appears down [00:38:51] MaxSem: Based on what? [00:39:00] MaxSem: It seems to be sending jobs to Jenkins as of 2 seconds ago [00:39:29] he queue looks locked up, no ^^^ mediawiki-config change can be seen [00:39:38] s/he/the/ [00:40:12] Lets wait for the queue to empty. It's definitely still up or that page would show an error (there's no caching) [00:40:31] MaxSem: This one? -- https://gerrit.wikimedia.org/r/#/c/180818/ [00:40:40] yep [00:40:52] Try reviewing with 0 and the +2 again [00:40:57] *and then [00:41:27] it processed a +2 in 5(!) minutes, but still I don't see it in the queue [00:41:39] It doesn't just "miss" events. That won't help. It'll show up after the queue is empty. It's quite rigorous about events. Unless someone shut Zuul down for a restart, it won't miss events. [00:41:50] (adding more events will only make it wait longer) [00:42:30] There's a MF job and a core job in the qeuee being run just fine. Zuul is very inefficient in how it handles the queue, always been that way, don't know why. if they don't show up after these 2 items are finished I'll restart. [00:43:40] A donationinterface commit just showed up so I guess its' working [00:44:07] Which was submitted to gerrit less than a minute ago. [00:44:20] So whatever is older than 1 minute and not on https://integration.wikimedia.org/zuul/ is not gonna show up [00:44:30] Aka "not working"? :-) [00:44:59] it's working. If I restart it now it will no nothing other than remove what it's in there and continue listening the same way. Those events mus've been missed somehow. [00:45:18] The DI commit entered the queue a minute ago and is already running in Jenkins. [00:45:34] the config changes never take long at all [00:45:46] must have been missed [00:49:24] (03CR) 10MaxSem: Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:49:29] (03CR) 10MaxSem: [C: 032] Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:52:36] (03Merged) 10jenkins-bot: Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:53:37] !log demon Synchronized php-1.25wmf13/includes/Html.php: (no message) (duration: 00m 05s) [00:53:44] Logged the message, Master [00:53:52] <^d> greg-g: that's a thing I heard you were wondering about ^ [00:54:17] ^d: yeppers, thanks man [00:54:22] <^d> np [00:54:35] alright, i think we can shut 'er down for the holidays now :) [00:54:54] <^d> You kidding? It's time to Do All The Things! [00:55:04] shhhhhHh! [00:56:41] !log maxsem Synchronized php-1.25wmf12/resources/lib/oojs-ui/oojs-ui.js: https://gerrit.wikimedia.org/r/#/c/180860/ (duration: 00m 08s) [00:56:48] Logged the message, Master [00:57:06] !log maxsem Synchronized php-1.25wmf12/extensions/VisualEditor/: https://gerrit.wikimedia.org/r/#/c/180860/ (duration: 00m 07s) [00:57:12] Logged the message, Master [00:57:14] James_F, ^^^ [00:57:23] MaxSem: Thanks! 
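[Editor's note: the deploy pattern repeated throughout this SWAT — jenkins-bot merges the change, then the deployer pulls it onto the staging host and syncs individual files — looked roughly like the sketch below as of late 2014. The host, path, and change URL are taken from the log itself; treat the exact invocation as illustrative rather than authoritative.]

    # On the deployment host, after jenkins-bot reports "Merged":
    ssh tin.eqiad.wmnet
    cd /srv/mediawiki-staging
    git pull    # pick up the merged mediawiki-config commit
    # Sync one file to the cluster; the second argument becomes the !log entry,
    # e.g. "Synchronized wmf-config/InitialiseSettings.php: ..." above.
    sync-file wmf-config/InitialiseSettings.php 'https://gerrit.wikimedia.org/r/180867'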
[01:00:12] MaxSem: (Confirmed working.) [01:06:31] (03CR) 10Bartosz Dziewoński: "Is this happening? I'd love to see it happen this year. :)" [puppet] - 10https://gerrit.wikimedia.org/r/179407 (owner: 1020after4) [01:08:55] (03CR) 10MZMcBride: "Rush: this is a paper cut bug. Can this change please be merged and deployed?" [puppet] - 10https://gerrit.wikimedia.org/r/179407 (owner: 1020after4) [01:10:17] (03PS2) 10Krinkle: phabricator: Change security_topic from "default: none" to "default: default" [puppet] - 10https://gerrit.wikimedia.org/r/179407 (owner: 1020after4) [01:10:37] (03CR) 10Krinkle: [C: 031] phabricator: Change security_topic from "default: none" to "default: default" [puppet] - 10https://gerrit.wikimedia.org/r/179407 (owner: 1020after4) [01:10:57] greg-g, I'm resigning as a SWAT deployer [01:12:06] MaxSem: Did you sync our configuration change? [01:12:21] ? [01:12:22] the one that's not merged yet? [01:12:29] it merged [01:12:31] (Merged) jenkins-bot: Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/180818 (owner: Glaisher) [01:12:34] that one [01:13:28] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/180818 (duration: 00m 05s) [01:13:35] Logged the message, Master [01:13:41] thanks [01:13:42] aude, hoo ^^^ [01:13:46] thanks [01:13:57] looks good (tried w/o js) [01:14:07] since they have a js gadget for this [01:14:15] that is obsolete now [01:14:22] Can we please kill that one? [01:14:29] https://it.wikipedia.org/wiki/MediaWiki:Gadgets-definition [01:14:33] is a giantic mess, btw [01:14:36] it's not a gadget actually [01:14:43] just in common.js [01:14:50] yikes [01:15:05] i'm sure they will take care of that soon [01:15:17] anyway, it works :) [01:16:48] aude: I can't find that damn script [01:16:49] link? [01:17:00] Not directly in https://it.wikipedia.org/wiki/MediaWiki:Common.js right? [01:17:53] https://it.wikipedia.org/wiki/MediaWiki:InterProject.js [01:18:07] linked from common js [01:18:43] where does that get its data from? [01:19:37] {{Interprogetto|commons=Category:Berlin|n=Categoria:Berlino|q=Berlino|q_preposizione=su|s=Ich bin ein Berliner|s_preposizione=su|wikt=Berlino|voy=Berlino|etichetta=Berlino}} [01:19:40] not for real [01:19:41] wow [01:21:00] !log maxsem Synchronized php-1.25wmf13/extensions/MobileFrontend/: (no message) (duration: 00m 05s) [01:21:09] Logged the message, Master [01:21:13] !log maxsem Synchronized php-1.25wmf13/extensions/VisualEditor/: (no message) (duration: 00m 07s) [01:21:15] James_F, ^^^ [01:21:17] Logged the message, Master [01:21:19] kaldari|2, ^^^ [01:21:23] MaxSem: Testing. [01:23:32] MaxSem: Well, it's not worse. Good to go. [01:37:53] MaxSem, still deploying? [01:38:05] greg-g, i need to push a minor zero portal fix [01:38:05] nope, done [01:38:17] ok, will do it now unless someone is deploying [01:53:59] MaxSem, i can't merge with wmf13 for some reason, and yours is last. Any thoughts? [01:54:55] yurikR, you need to kick jerkins outta this change first - fixed [01:55:19] MaxSem, thx [01:56:39] !log yurik Synchronized php-1.25wmf13/extensions/ZeroPortal: (no message) (duration: 00m 06s) [01:56:48] Logged the message, Master [02:03:54] PROBLEM - puppet last run on search1003 is CRITICAL: CRITICAL: Puppet has 1 failures [02:18:21] (03CR) 10Ori.livneh: [C: 032] "I read the backlog on #wikimedia-operations and it is clear that disabling Xenon was not intentional and did not fix the problem. 
Since Xe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181007 (owner: 10Ori.livneh) [02:18:31] (03Merged) 10jenkins-bot: Revert "Temporarily disable xhprof profiling, due to stability issues" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181007 (owner: 10Ori.livneh) [02:19:16] RECOVERY - puppet last run on search1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:19:32] !log ori Synchronized wmf-config/StartProfiler.php: re-enable xenon (duration: 00m 06s) [02:19:39] Logged the message, Master [02:28:56] PROBLEM - Host pay-lvs1001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [02:29:40] that seems bad [02:29:46] it's me [02:29:48] i think that may be bad [02:29:50] cool :) [02:29:52] it's fine [02:29:59] it paged [02:30:00] i screwed up and power cycled the wrong box ;-( [02:30:05] oops! [02:30:12] thank god for HA [02:30:26] i tried to mute it but icinga is useless [02:34:12] (03PS1) 10Springle: repool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181024 [02:34:52] (03CR) 10Springle: [C: 032 V: 032] repool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181024 (owner: 10Springle) [02:35:13] !log pay-lvs1001inadvertently power cycled [02:35:16] Logged the message, Master [02:35:24] RECOVERY - Host pay-lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [02:36:03] !log springle Synchronized wmf-config/db-eqiad.php: repool db1055, warm up (duration: 00m 06s) [02:36:11] Logged the message, Master [02:37:47] !log yurik Synchronized php-1.25wmf13/extensions/ZeroPortal: (no message) (duration: 00m 05s) [02:37:51] Logged the message, Master [02:38:31] (03PS1) 10Ori.livneh: mediawiki::hhvm: re-enable xenon [puppet] - 10https://gerrit.wikimedia.org/r/181026 [02:39:23] (03CR) 10Ori.livneh: "@Giuseppe: I reverted the wmf-config change which disabled xenon, but the extension is still disabled until this change is merged, too. If" [puppet] - 10https://gerrit.wikimedia.org/r/181026 (owner: 10Ori.livneh) [03:34:59] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:35:10] (03PS1) 10Springle: depool db1028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181028 [03:35:50] (03CR) 10Springle: [C: 032] depool db1028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181028 (owner: 10Springle) [03:36:33] !log springle Synchronized wmf-config/db-eqiad.php: depool db1028 (duration: 00m 06s) [03:36:40] Logged the message, Master [03:44:19] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 57848 bytes in 0.048 second response time [04:40:02] (03Abandoned) 10Jackmcbarn: Re-enable the Lua profiler on production HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161468 (owner: 10Jackmcbarn) [04:48:22] (03PS1) 10Springle: upgrade db1028 to trusty [puppet] - 10https://gerrit.wikimedia.org/r/181031 [04:58:39] (03CR) 10Springle: [C: 032] upgrade db1028 to trusty [puppet] - 10https://gerrit.wikimedia.org/r/181031 (owner: 10Springle) [05:09:33] springle: heya. [05:10:42] hi [05:10:47] springle: See: https://phabricator.wikimedia.org/T78775 - we just need config like Beta here? Please add any missing bits when you've time. [05:11:09] (specially, how we can add DB once we're ready) [05:14:17] kart_: i have not been watching. i presume you had Dev input to sort out MW config in beta? [05:14:26] the DB is already created [05:15:05] springle: Beta is done :) [05:15:13] springle: This is for Production. 
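[Editor's note: springle's repool/depool changes above (181024, 181028) amount to editing the replica weight map in wmf-config/db-eqiad.php and syncing that file. A minimal sketch of the structure, with hypothetical weights — the real file is far larger and its exact layout may differ:]

    // db-eqiad.php (illustrative fragment): each section maps database
    // servers to read-query weights.
    $wgLBFactoryConf['sectionLoads']['s1'] = array(
        'db1052' => 0,    // master; weight 0 keeps general reads off it
        'db1055' => 50,   // freshly repooled, low weight while caches warm up
        'db1061' => 200,  // depooling = removing/commenting out a line
    );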
[05:16:28] springle: Feel free to add input on how to proceed for adding tables to wikishared DB for production. For Beta, I did that myself. [05:17:02] kart_: https://wikitech.wikimedia.org/wiki/Schema_changes [05:18:13] so, gerrit changesets, ask for review, well before code deploy, etc [05:18:18] springle: Thanks. I'll tracking ticket. [05:19:22] springle: We've contentranslation.sql in code, so you can directly review it. Will add that in ticket. [05:27:56] springle: https://phabricator.wikimedia.org/T84969 for you now :) [05:29:32] kart_: cool [05:29:53] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: puppet fail [05:42:48] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:00:27] (03PS1) 10Andrew Bogott: Turn off nova-compute on virt1012 -- we're keeping this one in reserve. [puppet] - 10https://gerrit.wikimedia.org/r/181033 [06:01:12] (03CR) 10jenkins-bot: [V: 04-1] Turn off nova-compute on virt1012 -- we're keeping this one in reserve. [puppet] - 10https://gerrit.wikimedia.org/r/181033 (owner: 10Andrew Bogott) [06:03:04] (03PS2) 10Andrew Bogott: Turn off nova-compute on virt1012 -- we're keeping this one in reserve. [puppet] - 10https://gerrit.wikimedia.org/r/181033 [06:04:37] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [06:10:49] (03PS3) 10Andrew Bogott: Turn off nova-compute on virt1012 -- we're keeping this one in reserve. [puppet] - 10https://gerrit.wikimedia.org/r/181033 [06:12:34] (03CR) 10Andrew Bogott: [C: 032] Turn off nova-compute on virt1012 -- we're keeping this one in reserve. [puppet] - 10https://gerrit.wikimedia.org/r/181033 (owner: 10Andrew Bogott) [06:20:13] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:25:52] PROBLEM - DPKG on virt1011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:33:35] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:18] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:52] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:59] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:51] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:52] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:37:08] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures [06:46:43] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:37] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:47:42] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:47:51] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:47:57] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:37] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:49:10] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:55:06] <_joe_> !log restarted HHVM on mw1184, stuck in 
HPHP::StatCache::refresh [06:55:11] Logged the message, Master [06:57:31] RECOVERY - HHVM rendering on mw1184 is OK: HTTP OK: HTTP/1.1 200 OK - 69017 bytes in 0.122 second response time [06:58:21] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [07:04:41] RECOVERY - HHVM busy threads on mw1184 is OK: OK: Less than 30.00% above the threshold [76.8] [07:04:53] (03PS4) 10Krinkle: gerrit: Don't match Phabricator identifiers within urls [puppet] - 10https://gerrit.wikimedia.org/r/177128 [07:05:03] RECOVERY - HHVM queue size on mw1184 is OK: OK: Less than 30.00% above the threshold [10.0] [07:14:59] !log disabled puppet and nova-compute on virt1010 and virt1011 until I can sort out a libvirt issue. [07:15:04] Logged the message, Master [07:38:16] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail [07:56:42] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:08:29] (03PS4) 10Giuseppe Lavagetto: [WMF] New Package Version with various bugfixes [debs/hhvm] - 10https://gerrit.wikimedia.org/r/180752 [08:11:05] _joe_: when you have time (haha!) we should talk about hhvm on tools :) [08:13:33] <_joe_> YuviPanda: in a few minutes [08:13:42] <_joe_> pinky swear [08:14:10] :D ok [08:39:26] <_joe_> YuviPanda: I'm here [08:39:55] _joe_: so, I know nothing about the memory characteristics of HHVM... [08:40:04] how bad is it? [08:40:06] relative to Zend [08:40:23] <_joe_> YuviPanda: it's not /bad/, but I need to understand how toollabs operates [08:40:29] _joe_: ah, ok [08:40:36] <_joe_> do you have small instances running mod_php at the moment? [08:40:55] _joe_: nope, so… all php web stuff runs with lighttpd and (as I discovered yesterday) php-cgi [08:41:29] _joe_: so we have 5 VMs, each running about 120 different ‘webservices’ as different user accounts. [08:41:47] <_joe_> ok, how are those segregated? [08:41:52] there’s a lot of php, lot of python, and some other languags too (C++, C#, some TCL) [08:42:04] _joe_: each run in their own user account (‘tool’ account) [08:42:10] <_joe_> because that _is_ going to be an issue with hhvm [08:42:43] <_joe_> we don't want to be running N hhvm instances, one per tool I guess [08:43:08] yeah, it’s one per tool now. [08:43:16] not sure how to enforce tool separation without doing that... [08:44:23] <_joe_> ok _this_ is our big issue right now. Let me do a test [08:44:58] greetings [08:45:02] <_joe_> ciao [08:45:17] _joe_: ok! [08:51:41] <_joe_> so the only separation I can think of without running in containers or chroot jails is separating the local repo path [08:51:58] <_joe_> I'm trying to see if an ini_set works for that, which I strongly doubt [08:52:28] <_joe_> but that would mean no FS separation [08:53:12] _joe_: hmm, but then you still have problems with file permissions etc. [08:53:27] _joe_: I suspect not being able to use per-user processes will cause way too many problems. [08:53:53] <_joe_> ok [08:53:57] <_joe_> so let's think of that [08:55:13] _joe_: also all processes are scheduled / managed by GridEngine, and it will kill the process if it goes above a certain amount of VMEM [08:55:50] currently 4G for most of them, and some tools have 8G [08:55:51] <_joe_> YuviPanda: we have a setting for that in HHVM [08:56:02] <_joe_> YuviPanda: 4G is _a_lot_ [08:56:04] <_joe_> !! 
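[Editor's note: the 4G/8G figures are gridengine virtual-memory limits (h_vmem) enforced per job on Tool Labs; a job that grows past its limit is killed. A hypothetical submission showing where that number lives — the tool name and script path are made up:]

    # jsub wraps gridengine's qsub; -mem sets the h_vmem request/limit,
    # -continuous makes the grid restart the job if it dies.
    jsub -mem 4g -continuous /data/project/mytool/bin/webservice.sh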
[08:56:10] <_joe_> we can work with that I guess [08:56:28] <_joe_> oh beware that HHVM is _much_ better than zend in enforcing memory limits [08:56:57] _joe_: tools *regularly* OOM with 4G as well, btw. we’ve a deamon that restarts when they go down as well, because that just keeps happening. Code quality issues + php-cgi [08:56:58] <_joe_> YuviPanda: how do you supervise the services? [08:57:22] <_joe_> YuviPanda: if we have 4G as a base it can be done I guess [08:57:41] _joe_: that’s the custom perl script that Coren wrote (I want to use Monit at some point). scheduling and accounting is taken care of by grid engine [08:58:07] <_joe_> ok [08:58:27] _joe_: it’s 4G of VMEM tho, because it is mostly lighty+php shared mem is a lot of it so things are ok. [08:58:42] I guess lighty+hhvm is also ok [08:58:50] <_joe_> so my suggestion is: create a template for hhvm.ini and the startup script [08:59:41] <_joe_> then change the hhvm.repo.central.path variable and any tmp directories to match some place in the workplace of your users [09:00:03] <_joe_> can you schedule hhvm to be restarted at regular intervals? [09:00:21] can do if required. [09:00:25] right now they get restarted if they die. [09:00:26] <_joe_> YuviPanda: also, where do you confugure the env of a tool? [09:00:45] _joe_: users configure it, but we can restrict it too with GridEngine. [09:01:10] _joe_: right now it is configured by setting params in the users’ custom lighttpd config [09:01:28] <_joe_> YuviPanda: so the do configure their php.ini? [09:01:41] <_joe_> they configure how to run the php cgi? [09:02:25] _joe_: nope, can’t change their php.ini outside of ini_set. [09:02:36] <_joe_> ok [09:02:40] _joe_: so we merge their custom config with a ‘default’ lighty config [09:02:53] _joe_: that sets up php-cgi and some others (minor static file serving, etc) [09:03:26] <_joe_> so in the case of hhvm, you'd need to provision: the fcgi config AND an hhvm startup script and an ini file [09:03:29] <_joe_> per user [09:04:10] _joe_: hmm, I wonder if I can just make lighty itself start the hhvm process if necessary. [09:04:21] I know we do somewhat similar things with python/fastcgi [09:04:22] <_joe_> YuviPanda: nah please :) [09:04:28] <_joe_> srsly? [09:04:39] <_joe_> oh, man, that is gross [09:04:41] yup, no uwsgi or anything [09:05:05] <_joe_> (disclaimer: I do the same in some particular setups :P) [09:05:17] _joe_: lighty starting it is more gross than php-cgi? :P [09:05:25] <_joe_> nah [09:05:39] <_joe_> but still, I think that is not acceptable [09:05:47] <_joe_> in the case of hhvm [09:05:59] <_joe_> it cannot be execute separately for every request [09:06:23] <_joe_> or do you have some fastcgi supervisor within lighty? [09:06:29] <_joe_> and btw, show me the code :) [09:06:45] looking at code, moment [09:07:15] https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/bigbrother is the restarter [09:07:38] https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/lighttpd-starter is the thing that sets up lighty [09:07:46] <_joe_> ook [09:08:12] was before my time, etc :) [09:08:13] <_joe_> bash for templating, perl for supervising [09:08:40] https://github.com/valhallasw/gerrit-patch-uploader/blob/master/lighttpd.conf is how most python/fcgi apps are run [09:08:46] (https://github.com/valhallasw/gerrit-patch-uploader/blob/master/app.fcgi) [09:09:09] _joe_: it’s all a jolly mess, although much nicer than toolserver. 
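[Editor's note: a minimal sketch of the per-tool template _joe_ proposes. The setting names are real HHVM 3.x options, but the tool name, paths, and port are hypothetical; hhvm.repo.central.path is the shared bytecode cache he wants relocated into each tool's own directory so tools don't share writable state:]

    ; /data/project/mytool/.hhvm.ini (hypothetical per-tool template)
    hhvm.server.type = fastcgi
    hhvm.server.port = 9000          ; a fixed, per-tool port (see the later discussion)
    hhvm.repo.central.path = /data/project/mytool/.hhvm.hhbc
    hhvm.log.use_log_file = true
    hhvm.log.file = /data/project/mytool/logs/hhvm.log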
[09:09:18] <_joe_> ok maybe it's not bad [09:09:38] <_joe_> I have to check how lighty works with fastcgi.server [09:10:32] <_joe_> YuviPanda: this is enough information to get me started [09:10:41] <_joe_> would you care to put all this in phab? [09:10:52] yeah, I already filed a ticket, let me put all this info there as well [09:11:14] https://phabricator.wikimedia.org/T78783 is the task [09:11:45] <_joe_> ok thanks [09:18:02] _joe_: done, I think. let me know if I missed anything [09:18:44] <_joe_> YuviPanda: ok thanks, the major blocker is my zero knowledge of lighty as a fastcgi env [09:20:18] _joe_: right. technically we *can* use nginx here too if we want, but then I don’t know if running 200nginx processes on a VM is better or worse than running 200 lighty ones [09:21:02] <_joe_> why do you have 200 lighty processes? [09:21:17] <_joe_> I thought lighty was common, with virtualhosts [09:21:21] no [09:21:23] it isn't [09:21:26] <_joe_> oh ok [09:21:29] _joe_: I edited my phab comment to add that [09:21:35] _joe_: user account separation again. [09:22:11] _joe_: we have nginx running on a separate machine that acts as a reverse proxy, routing to the appropriate lighty [09:22:15] (also does SSL) [09:22:23] <_joe_> YuviPanda: if the web server is just acting as a revers proxy to fastcgi apps... [09:22:31] _joe_: it’s also serving static files [09:22:37] _joe_: and doing url rewriting. [09:22:58] <_joe_> YuviPanda: ok, all things that don't require anything more than a separate virtual hosts [09:23:22] <_joe_> *host [09:23:26] <_joe_> but i digress [09:23:34] <_joe_> how is the frontend nginx configured? [09:24:03] wait what? [09:24:11] we have nginx that proxies to lighttpd? [09:24:26] paravoid: yes, but it’s more complicated than that! [09:24:26] <_joe_> that proxies to fastcgi [09:24:30] so when a web tool starts... [09:24:34] nginx is awesome [09:24:35] idk [09:24:41] it opens up a socket [09:24:48] to this ‘proxylistener’ we have running on the nginx server. [09:25:03] and then tells the server, ‘route things for /toolname/ to this port on this host' [09:25:17] and then the server routes that route to that port on that host as long as that socket is open. [09:25:41] <_joe_> ok [09:25:43] (https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/proxylistener.py) [09:26:08] <_joe_> what about for hhvm instances we do proxy nginx -> fastcgi directly instead of http? [09:26:14] <_joe_> it's like super easy [09:26:20] <_joe_> we cut a middleman [09:26:31] _joe_: static files and url rewriting? [09:26:40] <_joe_> YuviPanda: static files is easy [09:26:59] <_joe_> and we set a rule: for hhvm all apps must have an index.php entry point [09:27:08] <_joe_> which does url routing [09:27:15] <_joe_> it's pretty easy to do [09:27:32] <_joe_> easier than writing rewrite rules btw [09:28:04] <_joe_> the simplest form is a switch statement on $_SERVER['REQUEST_URI'] [09:28:41] lots of php apps don’t actually care about url rewriting anyway, so I guess that’s ok [09:28:43] <_joe_> YuviPanda: why do we need the proxylistener btw? [09:29:06] <_joe_> because the ip of the tool instance may change? [09:29:16] _joe_: because tools die / OOM all the time, and when they respawn they aren’t on the same port or host. so we need something to track what is running where. [09:29:43] <_joe_> oh, and GE doesn't have an api for that? [09:30:04] _joe_: not to find what port things are running in, no. 
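[Editor's note: the "switch statement on $_SERVER['REQUEST_URI']" routing _joe_ sketches a few lines up, as a runnable illustration. The tool name and handler files are invented:]

    <?php
    // index.php: the single entry point per tool. HHVM only ever executes
    // this file; it dispatches on the request path.
    $path = parse_url( $_SERVER['REQUEST_URI'], PHP_URL_PATH );
    switch ( $path ) {
        case '/mytool/':
            require __DIR__ . '/pages/home.php';
            break;
        case '/mytool/api':
            require __DIR__ . '/pages/api.php';
            break;
        default:
            http_response_code( 404 );
            echo 'Not found';
    }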
[09:30:35] _joe_: GE doesn’t actually know about webservices, that’s just an abstraction coren hacked on top of GridEngine. [09:30:57] <_joe_> YuviPanda: why the port should change? [09:31:22] _joe_: because it’s just a random high numbered port? [09:31:39] <_joe_> I mean we can decide "hhvm" for the "attila" tool runs on port 9000? [09:31:49] <_joe_> or is there a specific reason we can't do that? [09:31:59] only reason would be something else is already running on port 9000 [09:32:11] oh [09:32:14] you mean reserve ports per tool? [09:32:22] <_joe_> yes [09:32:36] <_joe_> reserve a range for "hhvm tools" for now, and assign them [09:32:50] <_joe_> but in general, yes [09:33:08] you still need something to assign them [09:33:10] <_joe_> I do get you have multiple tools running on the same host, right? [09:33:24] <_joe_> yeah in your present execution model that may be hard to do [09:33:32] yeah. [09:33:49] <_joe_> but hhvm needs a port set when it starts [09:33:55] _joe_: oh yeah, that’s fine... [09:34:07] <_joe_> so... another thing we need to tackle [09:34:09] _joe_: there’s… portgranter, which picks an unused port and gives it to you. [09:34:31] <_joe_> what's portgranter? a GE tool? [09:34:32] Coren explained to me a while ago why we need portgranter, but I have since forgotten... [09:34:37] _joe_: nope, somethign Coren wrote. [09:34:44] <_joe_> oh, ok [09:34:48] <_joe_> where is that? [09:35:01] https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/portgranter [09:35:06] https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/portgrabber [09:35:23] <_joe_> so task 0 would be - make portgranter reserve a fixed set of ports for hhvm services [09:35:34] <_joe_> and use fixed ports for HHVM [09:35:37] hmm [09:36:20] <_joe_> then generate an nginx config per-hhvm-tool, and a startup wrapper that can be launched by bigbrother [09:37:46] hmm, so make dynamicproxy also be able to proxy back fastcgi:// [09:37:50] and then have a hhvm-wrapper [09:37:55] <_joe_> yes [09:38:06] and static files? [09:38:09] <_joe_> it's easy and it simplifies your execution model a lot [09:38:54] <_joe_> where are your static files? on disk? on NFS? on some shared storage? [09:39:04] <_joe_> if so, make them available to dinamicproxy [09:39:06] <_joe_> :) [09:39:34] _joe_: noooooooooooooooooo [09:39:35] :) [09:39:40] that seems terrible [09:39:42] they’re on NFS [09:39:52] I’m trying to remember why that felt terrible [09:39:57] <_joe_> why that seems terrible? [09:40:30] hmm [09:40:43] <_joe_> I mean what's more terrible, have files served statically from nfs by lighty and forwarded to nginx, or cut the middleman? [09:40:45] shouldn’t be too hard to say ‘hey, you have a static folder here, and everything in that will come out of /static' [09:40:52] <_joe_> I guess that was privilege separation again [09:41:13] _joe_: yeah, the important part is that it is hard for one terribly coded tool to fuck up other terribly coded tools [09:41:27] <_joe_> YuviPanda: read != write [09:41:31] true, true... [09:41:49] well, with OGE, if one tool takes up too much memory, just that one gets killed. [09:42:03] while if we have anything that’s shared across all tools, that becomes harder to ‘just kill' [09:42:16] nginx static file serving doesn’t seem *that* bad [09:42:21] <_joe_> well, isn't dynamicproxy already shared? [09:42:47] yeah, but it’s just reverse proxying [09:43:00] which is fairly simple and trivial. 
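[Editor's note: roughly what "proxy nginx -> fastcgi directly" plus nginx-served static files would look like in the dynamicproxy config. Tool name, grid host, and port are hypothetical, following the reserved-port scheme discussed above:]

    # One block per HHVM tool: PHP requests go straight to the tool's
    # FastCGI port, cutting out the per-tool lighttpd middleman.
    location /mytool/ {
        include        fastcgi_params;
        fastcgi_param  SCRIPT_FILENAME /data/project/mytool/public_html/index.php;
        fastcgi_pass   tools-webgrid-01.eqiad.wmflabs:9000;
    }
    # Static assets served by nginx itself, read from NFS:
    location /mytool/static/ {
        alias /data/project/mytool/public_html/static/;
    }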
[09:43:20] <_joe_> well, serving static files is even more trivial [09:43:43] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: puppet fail [09:45:24] _joe_: alright. I’m still slightly worried about hitting NFS, but perhaps NFS isn’t as bad now as it used to be. [09:45:40] _joe_: so, fastcgi://, hhvm wrapper, and static stuff goes in public_html/static [09:45:50] (and is served out of /toolname/static) [09:45:53] <_joe_> YuviPanda: sorry but... aren't the files already served via nfs? [09:46:05] _joe_: I mean, hitting NFS from the nginx proxy. [09:46:05] <_joe_> I mean it's NFS->lighty->nginx right now [09:46:26] <_joe_> well, try to see how much static files traffic we have [09:46:39] yes, and if NFS locks up and doesn’t return for a long time, it’s the lighty that hands and not nginx. [09:46:43] hmm, let me lok [09:46:45] *look [09:46:52] *hangs [09:47:26] <_joe_> ok [09:47:38] <_joe_> so you block the individual tool and not nginx, right [09:47:58] <_joe_> the solution for that may be serving static files for all those tools from one single separate instance [09:48:38] yeah, that was what I was thinking as well [09:48:50] have tools-staticfiles as a simple nginx server, and just serve static files out of that [09:48:54] <_joe_> or two, or whatever [09:49:01] right. [09:50:04] _joe_: so, let me file a bunch of subtasks. [09:50:21] <_joe_> YuviPanda: thanks a lot [09:50:34] _joe_: thanks for helping think this through :) [09:50:36] <_joe_> so I was looking at portgrabber and portgranter [09:50:48] <_joe_> mmmh [09:58:35] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:37:11] (03CR) 10Alexandros Kosiaris: "How did this manage to break parsoid on beta ? It was already in deployment-salt for days." [puppet] - 10https://gerrit.wikimedia.org/r/169622 (owner: 10Catrope) [10:43:14] (03CR) 10Alexandros Kosiaris: "I am a little bit unclear on why the pickle protocol does not allow to route metrics around, got a link handy ?" [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/180786 (owner: 10Filippo Giunchedi) [10:56:06] (03CR) 10Hashar: "In short:" [puppet] - 10https://gerrit.wikimedia.org/r/169622 (owner: 10Catrope) [11:29:50] (03PS4) 10Filippo Giunchedi: txstatsd: add support for graphite line-protocol [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/180786 [11:31:09] (03CR) 10Filippo Giunchedi: "basically lack of availability of tools that I could find, also carbon-c-relay is plaintext-only. I've reworded the commit message" [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/180786 (owner: 10Filippo Giunchedi) [11:38:09] (03CR) 10QChris: [C: 04-1] "My comments for PS3 still apply." 
[puppet] - 10https://gerrit.wikimedia.org/r/177128 (owner: 10Krinkle) [11:42:11] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [11:53:36] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:20:11] (03PS1) 10Yuvipanda: toollabs: Add class and role for static file server [puppet] - 10https://gerrit.wikimedia.org/r/181053 [12:21:00] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Add class and role for static file server [puppet] - 10https://gerrit.wikimedia.org/r/181053 (owner: 10Yuvipanda) [12:22:43] (03PS1) 10Merlijn van Deen: Flake8-ify everything [debs/adminbot] - 10https://gerrit.wikimedia.org/r/181054 [12:23:04] (03PS2) 10Yuvipanda: toollabs: Add class and role for static file server [puppet] - 10https://gerrit.wikimedia.org/r/181053 [12:25:41] (03PS2) 10Merlijn van Deen: Flake8-ify everything [debs/adminbot] - 10https://gerrit.wikimedia.org/r/181054 [12:26:29] (03PS3) 10Merlijn van Deen: Flake8-ify everything [debs/adminbot] - 10https://gerrit.wikimedia.org/r/181054 [12:27:15] (03PS4) 10Merlijn van Deen: Flake8-ify everything [debs/adminbot] - 10https://gerrit.wikimedia.org/r/181054 [12:29:23] (03CR) 10Yuvipanda: [C: 032] toollabs: Add class and role for static file server [puppet] - 10https://gerrit.wikimedia.org/r/181053 (owner: 10Yuvipanda) [12:54:25] !log aude Synchronized php-1.25wmf12/extensions/Wikidata/extensions/Wikibase/lib/resources/jquery.wikibase: js caching issues (duration: 00m 05s) [12:54:29] Logged the message, Master [13:38:07] (03PS1) 10Yuvipanda: tools: Use autoindex instead of root for static-file server [puppet] - 10https://gerrit.wikimedia.org/r/181058 [13:39:59] (03PS2) 10Yuvipanda: tools: Use alias instead of root for static-file server [puppet] - 10https://gerrit.wikimedia.org/r/181058 [13:50:33] (03CR) 10Filippo Giunchedi: [C: 031] Make ircecho run as ircecho user [debs/ircecho] - 10https://gerrit.wikimedia.org/r/176333 (owner: 10Yuvipanda) [14:03:29] (03CR) 10Yuvipanda: [C: 032] tools: Use alias instead of root for static-file server [puppet] - 10https://gerrit.wikimedia.org/r/181058 (owner: 10Yuvipanda) [14:27:21] (03PS1) 10Yuvipanda: toollabs: Remove stray duplicate line in static-server [puppet] - 10https://gerrit.wikimedia.org/r/181066 [14:33:27] (03PS1) 10KartikMistry: Add Kartik Mistry to Beta Cluster alert [puppet] - 10https://gerrit.wikimedia.org/r/181067 [15:06:23] (03PS1) 10KartikMistry: Fix indentation in various files [puppet] - 10https://gerrit.wikimedia.org/r/181071 [15:08:38] (03CR) 10KartikMistry: "Santhosh, can we go with empty provider where we want MT disabled by default? I tested in config.js locally and it seems working. I can up" [puppet] - 10https://gerrit.wikimedia.org/r/180724 (owner: 10KartikMistry) [15:14:56] (03CR) 10Alexandros Kosiaris: [C: 032] Fix indentation in various files [puppet] - 10https://gerrit.wikimedia.org/r/181071 (owner: 10KartikMistry) [15:19:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] "As far as I am concerned, this looks fine, but we should lose the WIP in the commit message before merging" [puppet] - 10https://gerrit.wikimedia.org/r/180724 (owner: 10KartikMistry) [15:19:36] (03PS1) 10Yuvipanda: tools: Allos origin protocols other than http [puppet] - 10https://gerrit.wikimedia.org/r/181073 [15:24:33] Coren: ^ [15:24:51] doesn’t require a restart of any tools, just requires restart of the admin tool... [15:26:17] (03CR) 10KartikMistry: "Yeah. I'm waiting Santhosh to reply. 
Should be fix on Monday!" [puppet] - 10https://gerrit.wikimedia.org/r/180724 (owner: 10KartikMistry) [15:29:54] YuviPanda: Is https://gerrit.wikimedia.org/r/181067 okay? [15:32:27] kart_: looking [15:53:16] (03CR) 10coren: [C: 031] "Moar flexible." [puppet] - 10https://gerrit.wikimedia.org/r/181073 (owner: 10Yuvipanda) [15:55:20] (03PS2) 10Yuvipanda: Add Kartik Mistry to Beta Cluster alert [puppet] - 10https://gerrit.wikimedia.org/r/181067 (owner: 10KartikMistry) [15:59:12] (03CR) 10Yuvipanda: [C: 032] Add Kartik Mistry to Beta Cluster alert [puppet] - 10https://gerrit.wikimedia.org/r/181067 (owner: 10KartikMistry) [15:59:22] (03PS2) 10Yuvipanda: tools: Allow origin protocols other than http [puppet] - 10https://gerrit.wikimedia.org/r/181073 [15:59:36] (03PS2) 10Yuvipanda: toollabs: Remove stray duplicate line in static-server [puppet] - 10https://gerrit.wikimedia.org/r/181066 [15:59:48] YuviPanda: Thanks! [16:00:44] (03CR) 10Yuvipanda: [C: 032] toollabs: Remove stray duplicate line in static-server [puppet] - 10https://gerrit.wikimedia.org/r/181066 (owner: 10Yuvipanda) [16:00:56] (03PS3) 10Yuvipanda: tools: Allow origin protocols other than http [puppet] - 10https://gerrit.wikimedia.org/r/181073 [16:02:45] (03CR) 10Yuvipanda: [C: 032] tools: Allow origin protocols other than http [puppet] - 10https://gerrit.wikimedia.org/r/181073 (owner: 10Yuvipanda) [16:05:34] (03PS1) 10Andrew Bogott: Remove qemu-kvm in favor of qemu-system. [puppet] - 10https://gerrit.wikimedia.org/r/181077 [16:06:46] RECOVERY - NTP on stat1003 is OK: NTP OK: Offset -0.06606340408 secs [16:06:50] YuviPanda: ^ might fix it. Running a couple more tests [16:07:06] andrewbogott: w000t [16:07:56] (03CR) 10Andrew Bogott: [C: 032] Remove qemu-kvm in favor of qemu-system. [puppet] - 10https://gerrit.wikimedia.org/r/181077 (owner: 10Andrew Bogott) [16:08:22] YuviPanda: oops, merge my patch while you're in there? [16:08:31] sure [16:10:23] (03PS1) 10Filippo Giunchedi: graphite: introduce local c-relay [puppet] - 10https://gerrit.wikimedia.org/r/181080 [16:10:55] RECOVERY - puppet last run on virt1010 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures [16:11:17] hm, um, having trouble remembering. how does $cluster get set, so that ganglia knows what multicast group to use? [16:11:22] i see it set sometimes in site.pp [16:11:24] $cluster = [16:12:01] OH is it in hiera now HMMM [16:13:21] RECOVERY - DPKG on virt1011 is OK: All packages OK [16:13:22] ah ha, i get it. [16:14:13] (03PS1) 10Ottomata: Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 [16:14:41] _joe_: does that look right to you? ^ [16:15:32] andrewbogott: merged ‘em, btw [16:17:54] (03PS2) 10Ottomata: Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 [16:18:26] <_joe_> ottomata: and you declare the "mainrole" where? [16:18:46] role/statistics.pp [16:18:49] <_joe_> btw, hold on a sec with that, we may have a simpler way to do all this by monday :) [16:19:02] <_joe_> it doesn't work like that right now [16:19:09] oh, k. ganglia is busted for these hosts since we moved them into the stats vlan [16:19:13] shoudl I just set globals in site.pp for now? [16:19:18] analytics vlan* [16:19:40] <_joe_> or add those hosts to regex.yaml and declare there [16:19:45] <_joe_> mainrole: statistics [16:24:10] oh. [16:24:15] that's how mainrole gets set? [16:24:18] it has to be a varaible? [16:24:28] it isn't ia the system::role? 
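[Editor's note: "mainrole" is not set by system::role; at the time it was a plain hiera key, matched per-host via the regex backend _joe_ mentions. A hypothetical hieradata/regex.yaml stanza of the kind being discussed — the host pattern and values are illustrative:]

    statistics:
      __regex: !ruby/regexp /^stat100[0-9]\.eqiad\.wmnet$/
      mainrole: statistics
      cluster: analytics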
[16:25:17] ook [16:25:25] (03PS1) 10Yuvipanda: tools: Fix stupid typos [puppet] - 10https://gerrit.wikimedia.org/r/181085 [16:25:29] Coren: ^ [16:26:41] (03PS3) 10Ottomata: Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 [16:26:51] (03PS4) 10Ottomata: Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 [16:26:56] _joe_, like that? ^ [16:32:51] _joe_ ^^ ? :) [16:34:57] <_joe_> ottomata: I might have missed the commit message, I was disconnected [16:35:37] https://gerrit.wikimedia.org/r/181083 [16:36:28] (03CR) 10Giuseppe Lavagetto: [C: 031] Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 (owner: 10Ottomata) [16:36:35] danke [16:36:42] (03CR) 10Ottomata: [C: 032] Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 (owner: 10Ottomata) [16:58:10] (03PS2) 10Ori.livneh: mediawiki::hhvm: re-enable xenon [puppet] - 10https://gerrit.wikimedia.org/r/181026 [16:58:22] (03CR) 10Ori.livneh: [C: 032 V: 032] "per joe" [puppet] - 10https://gerrit.wikimedia.org/r/181026 (owner: 10Ori.livneh) [17:15:08] (03PS1) 10Chad: Kill search pools 2,4,5 from LVS [puppet] - 10https://gerrit.wikimedia.org/r/181091 [17:15:30] (03PS1) 10Chad: Remove lsearchd pools 2,4,5 from DNS [dns] - 10https://gerrit.wikimedia.org/r/181092 [17:15:49] <^d> Hehe :) [17:16:17] * YuviPanda delicately merges patches, trying to make sure tools is unfucked [17:16:18] what's that all about? [17:16:33] all what? [17:16:35] (03PS2) 10Yuvipanda: tools: Fix stupid typos [puppet] - 10https://gerrit.wikimedia.org/r/181085 [17:16:58] the removal of 3/5 search pools? [17:17:03] <^d> bblack: lsearchd is dead! long live elasticsearch! [17:17:08] heh ok [17:17:15] (03CR) 10Yuvipanda: [C: 032] tools: Fix stupid typos [puppet] - 10https://gerrit.wikimedia.org/r/181085 (owner: 10Yuvipanda) [17:17:18] <^d> Pools 2 4 and 5 are now unused since yesterday. [17:17:26] <^d> (had zero traffic so I disabled them in MW) [17:17:32] they did? [17:17:44] oh ok, two different LVS levels here to think about, ok [17:18:03] it just seems odd to see it being removed from LVS when it still has backends pooled for pybal currently. [17:18:19] <^d> All of which needs to go away too :) [17:18:30] I'm all for simplification :) [17:18:33] <^d> The whole lsearchd system will come tumbling down [17:22:59] <^d> bblack: If you want to depool all of them in those pools you could. Like I said, the cluster's not talking to them anymore. [17:23:15] as long as you're sure of that, it doesn't matter. [17:24:03] <^d> https://gerrit.wikimedia.org/r/#/c/180838/3/wmf-config/lucene-production.php, specifically [17:24:40] (03PS1) 10RobH: setting production dns for haedus and capella servers [dns] - 10https://gerrit.wikimedia.org/r/181093 [17:25:07] so the LVS pooling is just for en, nl, ru now? [17:25:35] (03CR) 10RobH: [C: 032] setting production dns for haedus and capella servers [dns] - 10https://gerrit.wikimedia.org/r/181093 (owner: 10RobH) [17:25:38] <^d> Yep [17:25:58] <^d> nlwiki has some gadget or tool still using it, which I need to track down and help fix. [17:26:12] <^d> enwiki has some spider scraping the API, trying to find contact info. 
[17:26:21] <^d> ruwiki might be able to go, but I saw like 1 or 2 api hits for it [17:26:29] <^d> So I erred and kept it [17:26:29] ok [17:28:47] ok I can push those today if we're ready, they require some pybal restarts to take effect [17:29:12] <^d> cc manybubbles. [17:29:40] or we can just leave them till after the holidays if we don't want the risk. I think it's ok if we get it done early today while we're still looking for any fallout [17:29:42] oh hai - you are trying to shut down lsearchd for wikis? [17:29:59] <^d> I wrote some patches to kill lvs for pools 2 4 and 5. [17:30:01] I don't really object. I was thinking after the holidays [17:30:28] I guess in practice it's just cleanup rather than a functional improvement, so yeah maybe we put off the cleanup step [17:31:22] <^d> I guess we could just file a Phab task so we don't forget but do it after the holidays. [17:31:51] yeah. I'll add myself to those two commits two, I tend to re-check my gerrit queue periodically. [17:32:35] <^d> Do they go in #operations or #ops-requests now? [17:32:55] (03PS1) 10RobH: setting mac address info for haedus and capella servers [puppet] - 10https://gerrit.wikimedia.org/r/181095 [17:34:01] we're not sure, I think :) [17:34:21] (03CR) 10RobH: [C: 032] setting mac address info for haedus and capella servers [puppet] - 10https://gerrit.wikimedia.org/r/181095 (owner: 10RobH) [17:34:30] but just assign it to me explicitly either way, I generally look at a lot of the LVS/DNS layer stuff anyways [17:34:32] <^d> I'll do #operations [17:34:49] <^d> T85009 [17:36:28] ^d: no objections to your pool choices. I think we should announce it before we drop support for them though. [17:36:54] ^d: also, uh, so, if folks on those wikis put srbackend=lucene then they'll just get errors. which I assume isn't too bad [17:37:01] greg-g: you probably want to read ^^ too [17:37:31] <^d> srbackend is already gone on the API since it only has one search backend. [17:39:01] so, yeah, get it all ready and do it post holidaze? [17:39:07] manybubbles: greg-g is gone and not reading email, I think [17:39:12] well, at least trying to not read email :) [17:39:14] YuviPanda: thanks [17:39:18] :) [17:39:25] YuviPanda: almost, 8 more hours [17:39:30] oh [17:39:35] timezones [17:39:38] :) [17:39:56] ^d: oh - uh - wait - doesn't that have to still work for the people still trying to use it? [17:40:21] <^d> No? That's not how that setting worked. [17:46:08] <^d> manybubbles, greg-g: better late than never I suppose :) https://lists.wikimedia.org/pipermail/wikitech-l/2014-December/079884.html [17:47:03] thanks! [17:54:32] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: puppet fail [17:54:51] !log Manually transferred the email from enwiki account "Hob Gadling" to the centralauth account of the same name (after a partially failed account creation). [17:54:57] Logged the message, Master [18:09:36] (03PS1) 10Nemo bis: [English Planet] Add Geni, en.wiki/Commons sysop etc. [puppet] - 10https://gerrit.wikimedia.org/r/181104 [18:09:56] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:30:15] (03PS1) 10Ori.livneh: Add abacist module & role; provision on stat1001 [puppet] - 10https://gerrit.wikimedia.org/r/181110 [18:31:43] bd808: have you met nuria__ from analytics? she was just reporting that logstash appears to not be getting log data. [18:31:59] bd808: hello! 
[18:32:07] hi nuria__ [18:32:15] bd808: let me know if this is something you can help with [18:32:35] (btw, are you the right person to ping, or should it be someone in ops?) [18:32:45] I probably can help. [18:33:02] ori: Me, Reedy or jgage are probably the folks to ask [18:33:41] nuria__: Which data is missing? I see a lot of stuff in logstash for the last 15 minutes globally [18:33:57] nuria__: hadoop data [18:34:20] otto recently showed me this error from hadoop: [18:34:21] java.io.IOException: Cannot send data to logstash1002.eqiad.wmnet/10.64.32.137:12201 [18:34:22] https://www.irccloud.com/pastebin/b9nI7hyj [18:34:30] 124,313 events logged in the last 15 mintues [18:34:39] then i found a number of these in logstash1002:/var/log/logstash/logstash.log: message=>"Gelfd failed to parse a message skipping", :exception=># [18:34:45] bd808: from hadoop? [18:35:12] TooManyChunksError is a client problem. The stack traces you are sending are too big for the protocol [18:35:18] i did notice a spike in network traffic to logstash over the last couple days, perhaps we are hitting a capacity issue [18:35:24] hah! [18:35:40] there is an option to avoid sending stack traces, but that would be a bummer [18:35:54] jgage: right....that would not be very useful [18:35:56] Yeah. Java stacks can get nuts [18:36:03] jgage: but in log4j you can configure [18:36:15] jgage: bd808 how deep your traces are [18:36:19] Maybe it's time to change transport? [18:36:47] bd808 you mean to tcp? or redis? [18:36:50] I do see lots of stuff in https://logstash.wikimedia.org/#/dashboard/elasticsearch/hadoop [18:36:53] jgage, bd808 : if we configure stack trace "deepness" that can also help [18:37:19] nuria__ i'll take a look at the log4j config, i don't recall offhand how to set the stack trace depth [18:37:29] bd808: I see heatbeat kind of info but errors? [18:37:30] jgage: redis would be my first choice today I think. kafka maybe later when we get that working [18:37:50] bd808 how close are we to deploying the redis queue in prod? [18:38:14] jgage: It's there an working. MW is using it for the group0 servers [18:38:22] *and working [18:38:33] oh sweet. so i can go ahead and change the hadoop config? [18:39:24] Yeah. give it a shot. In the MW config I'm randomly selecting a hosts from 100[123] per-request. [18:39:35] great ok [18:39:40] Hopefully the java client has a better way to spread out the load [18:40:03] I think "%throwable{n} " is the log4j way to limit a trace [18:40:17] (03PS1) 10RobH: setting partman for capella/haedus [puppet] - 10https://gerrit.wikimedia.org/r/181111 [18:40:32] But that may only work for pattern layouts :( [18:40:33] jgage, bd808 : let me know when you think you are ready to test and we can send a hadoop job that fails and try to serach for errors in logstash to make sure they are there [18:40:57] (03CR) 10RobH: [C: 032] setting partman for capella/haedus [puppet] - 10https://gerrit.wikimedia.org/r/181111 (owner: 10RobH) [18:41:16] thanks nuria__. i also have a hive query that fails by design that i can use to test. [18:41:30] jgage: ok, you let me know. Thank you! [18:43:19] jgage: the filtering for stack traces i remember i used in tomcat, not sure if something similar is available to you: http://openutils.sourceforge.net/openutils-log4j/filteredlayout.html [18:44:08] thank you :) [18:45:54] I would guess that the giant payloads are from nested exceptions. 
128 * 1420 is a lot of log output [18:46:33] Switching to just sticking a json message in redis is probably best though [18:46:48] it avoids the problem of udp too [18:50:07] * jgage is making a phab task :D [18:50:23] jgage: :o [18:51:25] augh i accidentally hit back+forward and it cleared the form [18:54:08] (03PS1) 10Jforrester: gdash: Fix VE dashboard to point to new names of data streams [puppet] - 10https://gerrit.wikimedia.org/r/181114 [19:52:44] (03PS1) 10Mattflaschen: Make flow-bot grantable/removable on testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181120 [19:52:59] (03CR) 10Mattflaschen: [C: 04-1] Make flow-bot grantable/removable on testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181120 (owner: 10Mattflaschen) [20:44:25] I am a global System administrator according to https://meta.wikimedia.org/wiki/Special:GlobalUsers/sysadmin and really shouldn't be ! Can someone ( MaxSem ?) kick me out of the club [20:45:12] spagewmf: Reedy can, or you can ask a steward in -stewards that you want it removed or prod the LCA of hell :) [20:45:38] (those are the methods I've seen work anyway) [20:45:56] JohnLewis: thanks, I was going to make a request at https://meta.wikimedia.org/wiki/Steward_requests/Global_permissions [20:46:12] that works but -stewards is easier :p [20:47:24] I just wonder how you got there in the first place. [20:48:16] spagewmf, actually I think you could remove yourself [20:49:01] spagewmf, https://meta.wikimedia.org/wiki/Special:GlobalUserRights/User:SPage_(WMF) ? [20:49:17] ^^ you should be able to according to the group rights [20:55:26] (03PS1) 10BryanDavis: Add 'wikipedia' to $wikiTags for SiteConfiguration checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181129 [20:55:28] (03PS1) 10BryanDavis: monolog: enable for group0 + group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 [20:55:36] (03CR) 10jenkins-bot: [V: 04-1] monolog: enable for group0 + group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 (owner: 10BryanDavis) [20:58:14] (03CR) 10BryanDavis: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 (owner: 10BryanDavis) [21:00:20] (03PS13) 10Krinkle: contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 (https://bugzilla.wikimedia.org/72063) [21:00:27] (03PS14) 10Krinkle: contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 (https://bugzilla.wikimedia.org/72063) [21:00:35] (03PS2) 10Krinkle: contint: Move tmpfs Require to caller to support labs' jenkins-deploy [puppet] - 10https://gerrit.wikimedia.org/r/173511 [21:00:44] (03PS2) 10Krinkle: [WIP] contint: Add tmpfs mount in jenkins-deploy homedir for labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) [21:00:53] (03PS3) 10Krinkle: contint: Add tmpfs mount in jenkins-deploy homedir for labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) [21:10:36] (03PS2) 10BryanDavis: monolog: enable for group0 + group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 [21:11:42] (03Abandoned) 10BryanDavis: Add 'wikipedia' to $wikiTags for SiteConfiguration checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181129 (owner: 10BryanDavis) [21:23:44] spagewmf: I see you got it swapped out from staff rights [21:24:33] yes it's in the log for my old S Page (WMF). 
I made a request on the noticeboard to remove it. Not urgent [21:24:48] I saw :p [21:24:58] replying with a diff to the log [21:26:06] spagewmf: the mystical James Alexander is poking stewards to deal with your request because LCA likes controlling this stuff :) [21:28:13] JohnLewis: here's where I confess to not knowing what "LCA" is (and https://meta.wikimedia.org/wiki/Glossary#L) and every other glossary doesn't help [21:29:03] spagewmf: sorry; Legal and Community Advocacy. You know, Philippe, James and Maggie (mostly) under Geoff [21:29:12] o7 [21:29:38] hola Jamesofur [21:29:41] :p [21:30:01] Jamesofur: you mean 007 :) [21:31:23] heh [22:21:02] PROBLEM - puppet last run on calcium is CRITICAL: CRITICAL: Puppet has 1 failures [22:33:48] RECOVERY - puppet last run on calcium is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [22:34:17] Looks like Zuul needs a kick?? [22:35:53] ^d: Krinkle: ^^ [22:36:06] PROBLEM - tcpircbot_service_running on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:36:08] <^d> anytime I touch it I break it worse. [22:36:12] Zuul is "idle" with 31 queued jobs [22:36:13] * ^d doesn't touch it anymore [22:36:14] sigh [22:36:27] ^d: probably a good approach, that's what I do at least [22:36:47] I'm off until January 2nd starting tonight. [22:37:13] anyone here I can ping about this? [22:37:31] the test pipeline says "0" for the last 8 hours, actually [22:37:50] donno if that graph is usually empty or not [22:38:07] It's stuck as of 14 minutes ago [22:38:17] oh... not as bad as it looked [22:38:20] awight: graph 2 and 3 are down because graphite is broken [22:38:34] the graphs have no meaning. [22:38:35] I don't know why [22:39:08] sad! [22:39:09] RECOVERY - tcpircbot_service_running on neon is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [22:39:14] Krinkle: go on vacation, quick! [22:39:24] but if you want to name a backup, I can harass that person [22:39:32] Antoine [22:39:35] argh [22:39:37] hashar: [22:39:44] he's been on IRC strike [22:39:49] No, he's just asleep [22:39:54] or at least done working today [22:39:59] k [22:40:03] He's been on IRC last few days. Just started again [22:40:10] I don't know when his vacation starts [22:40:42] hah [22:40:59] ok don't worry, I'm sure I can find someone to throw the server rack power switch or something :p [22:41:32] I'm looking into it now, and trying to pretend I don't know anything and only follow documentation [22:42:27] :p [22:42:33] don't do that to yourself [22:44:52] Krinkle: looks like you made something come back to life--now there are 174 jobs queued :) [22:47:25] awight: Those jobs came from an independent schedule. Every x minutes, there's a cronjob triggering 100s of jobs for browser test and beta labs scap. Because I don't know, it's easier to say do it every 5 minutes and waste 100s of jobs than do it a normal way.. [22:47:31] Not triggered by Gerrit [22:48:04] In case something changed [22:50:03] yes I noticed [22:50:38] Also, it's not 5 minutes, but something like twice per day for most jobs, which is both higher and lower granularity than u'd wish [22:51:08] Krinkle: total tangent, I noticed that we're lacking a mechanism to provision database test fixtures for the browser tests. [22:51:20] Lemme en-phabricate that while I'm thinking about it... [22:51:48] awight: Well, most of those tests use test2 or beta labs instead of a plain mediawiki install. So the pages they depend on already exist.
[22:51:59] But yeah, they should use simple local installs (like our unit tests do) and use fixtures. [22:52:23] We have a use case for fixtures, testing CentralNotice, and I'm sure there are others who could benefit... [22:52:29] Or create the pages with a maintenance script or API client [22:52:33] (so taht we can run it against beta labs still) [22:52:44] yep [22:52:50] awight: I'm not involved with browser tests though. [22:52:58] awight: as a consumer of CI I maintain front-end unit tests. [22:53:00] I balked at writing a r/w api for CentralNotice though... [22:53:15] lemme stop distracting you ;) [22:53:16] as maintainer of CI I maintain the infrastructure in general, not individual jobs. that's responsibility of individual engineering teams or QA. [22:53:28] Krinkle: did you do anytihng with Zuul recently? [22:53:50] greg-g: Nope. Zuul and Jenkins commit suicide every 10-40 hours as usual for the past few months. [22:53:59] and just come back up? [22:54:14] No, they escalate until Antoine or I wake up and push every button we see until it comes back [22:54:40] I've just disconnected and relaunched the Gearman manager on gallium [22:54:43] That usually brings it back [22:54:46] * greg-g nods [22:55:24] Anyone with wmf-ldap can restart Gearman via Jenkins web control panel. I'll write that up in an e-mail (it's like 3 steps) [22:55:44] Yep, that was the culprit this time [22:56:03] Managed to preserve the queue so fundraising jobs shoudl start running now [22:56:08] * have started running. [22:56:13] nice work, Krinkle [22:57:20] Is jenkins another system where ldap/wmf is granted way too much power? [22:58:05] The bug by the way is https://phabricator.wikimedia.org/T72597 (Jenkins Gearman plugin has deadlock on executor threads) [22:59:52] Krenair: the level of access matches all other internal CI tools I've ever run. Basically the "trusted engineers" can keep the CI tool running and control the jobs that run there. [23:00:30] and see lots of passwords? :/ [23:02:39] No, I don't think you can see lots of passwords. Maybe one for the irc bot. [23:03:06] the user passwords are in the ldap server, not jenkins [23:03:10] bd808, https://integration.wikimedia.org/ci/configure [23:03:54] those testing passwords are on wiki too I think. [23:07:37] they're not terribly sensitive, nothing we shouldn't be ok with ldap/wmf seeing [23:14:49] yeah, we have the passwords for test users on beta labs and test2wiki in Jenkins. we also limit the powers of test users to the minimum necessary [23:14:59] I'm getting strange CI errors on a mediawiki-core build: https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm/389/consoleFull [23:15:46] ie there are some automated tests we don't perform because they require privileges we are not comfortable giving to test users [23:19:45] bd808: just to be clear, no passwords on wikis, just Jenkins. if you know different, let me know. [23:21:55] awight: What's the gerrit link for that patch? [23:22:28] chrismcmahon: you should probably look at that instead of bd808 :) [23:22:31] ^ [23:22:44] Those look like hhvm related errors that should be fixed in master [23:24:13] bd808: https://gerrit.wikimedia.org/r/#/c/181207/ i forced the submit, nothing to see here :) [23:25:31] awight: Oh... you need to get hhvm tests turned off for that branch. 
They will fail badly always [23:26:09] awight: File a bug in phab against the ci project [23:26:22] ah bd808 good to know [23:26:57] anything older then 1.24wmf12 will not pass the test suite under hhvm [23:30:32] (03PS1) 10John F. Lewis: admin: grant twentyafterfour gallium [puppet] - 10https://gerrit.wikimedia.org/r/181211 [23:33:13] (03CR) 10Greg Grossmeier: [C: 031] admin: grant twentyafterfour gallium [puppet] - 10https://gerrit.wikimedia.org/r/181211 (owner: 10John F. Lewis) [23:33:57] bd808: oic, do you happen to know which release becomes fully hhvm compatible? [23:35:06] awight: 1.25 alpha should pass tests under hhvm :P [23:35:33] hehe ok thx [23:35:38] 1.24 claims beta support, but I don't think all the tests pass [23:35:39] awight: 1.24wmf14 (next internal branch) was the first to pass all tests under hhvm. I backported fixes for wmf12 and wmf13 yesterday. hashar made the hhvm test voting instead of non-voting yesterday [23:35:53] legoktm: They do. hhvm is voting now in master [23:36:02] but it should NOT be voting on older branches [23:36:08] bd808: oh great. We can probably bump ourselves up to REL1_24 [23:36:14] but it is right now [23:36:34] oh wait is master 1.25? [23:36:47] 1.25 then, sorry [23:36:49] numbers are hard [23:37:05] so not "stable" hhvm rel yet [23:37:28] but master is and will stay hhvm compliant [23:37:38] bd808: fyi I made a task here, https://phabricator.wikimedia.org/T85036 [23:49:49] (03PS2) 10Alex Monk: admin: grant twentyafterfour gallium [puppet] - 10https://gerrit.wikimedia.org/r/181211 (owner: 10John F. Lewis) [23:50:49] Krenair: I didn't know I forgot the bug prefix; thanks :) [23:50:59] :) [23:51:08] I dislike how it shows the patch uploader on the task [23:51:15] rather than the change owner [23:51:48] yeah [23:51:58] most common case is the proper author being the commit owner [23:52:07] (change owner, even) [23:53:27] but I forget where the gerritadmin code is [23:53:30] maybe ^d knows [23:56:55] (03CR) 10Ori.livneh: [C: 032] gdash: Fix VE dashboard to point to new names of data streams [puppet] - 10https://gerrit.wikimedia.org/r/181114 (owner: 10Jforrester)
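[Editor's note: awight's T85036 (stop running the hhvm job on branches that cannot pass it) would land in integration/config's Zuul layout, where a job can be limited by branch. A hypothetical Zuul 2.x layout.yaml fragment — the job name is from the log above, the branch regex is illustrative:]

    jobs:
      - name: mediawiki-phpunit-hhvm
        # only run (and vote) where the suite can actually pass
        branch: ^master$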