[00:00:05] RoanKattouw, ^d, marktraceur, MaxSem: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141219T0000). Please do the needful. [00:00:13] Thank you! [00:00:40] All done :) [00:01:53] * James_F waits for merges. :-) [00:04:31] * aude here for swat [00:05:26] !log Reloading Zuul to deploy I3333f5e45 [00:05:29] Grumble grumble slow jenkins grumble. :-) [00:05:32] Logged the message, Master [00:05:34] Krinkle: Argh. [00:05:43] Krinkle: During a SWAT? Seriously? [00:05:44] Zuul is graceful. You won't notice a thing [00:05:52] Not good timing, though. [00:05:57] takes exactly 4ms for the actual reload. [00:06:33] I also breath during a SWAT. I can log it in a different channel :P [00:06:41] Krinkle: :-P [00:07:50] Krinkle: Hmm. https://gerrit.wikimedia.org/r/#/c/180912/ definitely shouldn't be blocked by https://gerrit.wikimedia.org/r/#/c/180701/ – they're in different branches. [00:08:20] James_F: the new 'mediawiki-gate' blocks all mw-related repos on each other. Regardless of branch or direction. [00:09:38] Krinkle: That seems… inefficient. [00:11:16] What on Earth? [00:14:37] mhm, no one to deploy again? [00:15:40] okay, I can do it again [00:15:52] MaxSem: thanks [00:15:54] :/ [00:15:58] thanks [00:16:57] (03CR) 10MaxSem: [C: 032] Turning off WikiGrok test on en.wiki, turning on WikiGrok on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180867 (owner: 10Kaldari) [00:17:13] (03Merged) 10jenkins-bot: Turning off WikiGrok test on en.wiki, turning on WikiGrok on test2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180867 (owner: 10Kaldari) [00:18:26] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/180867 (duration: 00m 06s) [00:18:30] Logged the message, Master [00:18:35] kaldari|2, ^ [00:19:00] checking [00:21:34] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [00:21:50] MaxSem: hmm, enwiki is still returning true for mw.config.get( 'wgMFEnableWikiGrok' ); [00:22:30] wtf, I made pulled [00:22:52] maxsem@tin:/srv/mediawiki-staging$ git log -1 [00:22:59] Merge "Turning off WikiGrok test on en.wiki, turning on WikiGrok on test2" [00:23:38] try debug mode? [00:26:51] mw.config.get( 'wgMFEnableWikiGrok' ); is false for me [00:26:55] on enwiki [00:26:58] yep, caching [00:28:31] !log maxsem Synchronized php-1.25wmf13/extensions/Wikidata/: https://gerrit.wikimedia.org/r/#q,181003,n,z (duration: 00m 12s) [00:28:35] aude, ^^ [00:28:38] Logged the message, Master [00:28:42] checking [00:28:52] it works [00:30:02] !log maxsem Synchronized php-1.25wmf12/extensions/Wikidata/: https://gerrit.wikimedia.org/r/#q,181000,n,z (duration: 00m 13s) [00:30:09] Logged the message, Master [00:30:22] yay [00:30:25] https://www.wikidata.org/wiki/Special:SetSiteLink/Q1/dewiktionary [00:30:26] looks good [00:30:44] (03CR) 10MaxSem: [C: 032] Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:32:50] ffuck you jerkins [00:34:17] (03CR) 10MaxSem: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:34:47] (03CR) 10Ori.livneh: "Not sure what is going on in this patch. It doesn't disable xhprof profiling; it disables xenon traces. 
Xenon has nothing to do with xhpro" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180791 (owner: 10Giuseppe Lavagetto) [00:35:21] (03PS3) 10MaxSem: Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:35:33] (03PS1) 10Ori.livneh: Revert "Temporarily disable xhprof profiling, due to stability issues" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181007 [00:38:30] Krinkle, zuul appears down [00:38:51] MaxSem: Based on what? [00:39:00] MaxSem: It seems to be sending jobs to Jenkins as of 2 seconds ago [00:39:29] he queue looks locked up, no ^^^ mediawiki-config change can be seen [00:39:38] s/he/the/ [00:40:12] Lets wait for the queue to empty. It's definitely still up or that page would show an error (there's no caching) [00:40:31] MaxSem: This one? -- https://gerrit.wikimedia.org/r/#/c/180818/ [00:40:40] yep [00:40:52] Try reviewing with 0 and the +2 again [00:40:57] *and then [00:41:27] it processed a +2 in 5(!) minutes, but still I don't see it in the queue [00:41:39] It doesn't just "miss" events. That won't help. It'll show up after the queue is empty. It's quite rigorous about events. Unless someone shut Zuul down for a restart, it won't miss events. [00:41:50] (adding more events will only make it wait longer) [00:42:30] There's a MF job and a core job in the qeuee being run just fine. Zuul is very inefficient in how it handles the queue, always been that way, don't know why. if they don't show up after these 2 items are finished I'll restart. [00:43:40] A donationinterface commit just showed up so I guess its' working [00:44:07] Which was submitted to gerrit less than a minute ago. [00:44:20] So whatever is older than 1 minute and not on https://integration.wikimedia.org/zuul/ is not gonna show up [00:44:30] Aka "not working"? :-) [00:44:59] it's working. If I restart it now it will no nothing other than remove what it's in there and continue listening the same way. Those events mus've been missed somehow. [00:45:18] The DI commit entered the queue a minute ago and is already running in Jenkins. [00:45:34] the config changes never take long at all [00:45:46] must have been missed [00:49:24] (03CR) 10MaxSem: Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:49:29] (03CR) 10MaxSem: [C: 032] Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:52:36] (03Merged) 10jenkins-bot: Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180818 (owner: 10Glaisher) [00:53:37] !log demon Synchronized php-1.25wmf13/includes/Html.php: (no message) (duration: 00m 05s) [00:53:44] Logged the message, Master [00:53:52] <^d> greg-g: that's a thing I heard you were wondering about ^ [00:54:17] ^d: yeppers, thanks man [00:54:22] <^d> np [00:54:35] alright, i think we can shut 'er down for the holidays now :) [00:54:54] <^d> You kidding? It's time to Do All The Things! [00:55:04] shhhhhHh! [00:56:41] !log maxsem Synchronized php-1.25wmf12/resources/lib/oojs-ui/oojs-ui.js: https://gerrit.wikimedia.org/r/#/c/180860/ (duration: 00m 08s) [00:56:48] Logged the message, Master [00:57:06] !log maxsem Synchronized php-1.25wmf12/extensions/VisualEditor/: https://gerrit.wikimedia.org/r/#/c/180860/ (duration: 00m 07s) [00:57:12] Logged the message, Master [00:57:14] James_F, ^^^ [00:57:23] MaxSem: Thanks! 
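[Editor's note: the deploy pattern repeated throughout this SWAT — jenkins-bot merges the change, then the deployer pulls it onto the staging host and syncs individual files — looked roughly like the sketch below as of late 2014. The host, path, and change URL are taken from the log itself; treat the exact invocation as illustrative rather than authoritative.]

    # On the deployment host, after jenkins-bot reports "Merged":
    ssh tin.eqiad.wmnet
    cd /srv/mediawiki-staging
    git pull    # pick up the merged mediawiki-config commit
    # Sync one file to the cluster; the second argument becomes the !log entry,
    # e.g. "Synchronized wmf-config/InitialiseSettings.php: ..." above.
    sync-file wmf-config/InitialiseSettings.php 'https://gerrit.wikimedia.org/r/180867'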
[01:00:12] MaxSem: (Confirmed working.) [01:06:31] (03CR) 10Bartosz Dziewoński: "Is this happening? I'd love to see it happen this year. :)" [puppet] - 10https://gerrit.wikimedia.org/r/179407 (owner: 1020after4) [01:08:55] (03CR) 10MZMcBride: "Rush: this is a paper cut bug. Can this change please be merged and deployed?" [puppet] - 10https://gerrit.wikimedia.org/r/179407 (owner: 1020after4) [01:10:17] (03PS2) 10Krinkle: phabricator: Change security_topic from "default: none" to "default: default" [puppet] - 10https://gerrit.wikimedia.org/r/179407 (owner: 1020after4) [01:10:37] (03CR) 10Krinkle: [C: 031] phabricator: Change security_topic from "default: none" to "default: default" [puppet] - 10https://gerrit.wikimedia.org/r/179407 (owner: 1020after4) [01:10:57] greg-g, I'm resigning as a SWAT deployer [01:12:06] MaxSem: Did you sync our configuration change? [01:12:21] ? [01:12:22] the one that's not merged yet? [01:12:29] it merged [01:12:31] (Merged) jenkins-bot: Enable otherProjectsLinks on it.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/180818 (owner: Glaisher) [01:12:34] that one [01:13:28] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/180818 (duration: 00m 05s) [01:13:35] Logged the message, Master [01:13:41] thanks [01:13:42] aude, hoo ^^^ [01:13:46] thanks [01:13:57] looks good (tried w/o js) [01:14:07] since they have a js gadget for this [01:14:15] that is obsolete now [01:14:22] Can we please kill that one? [01:14:29] https://it.wikipedia.org/wiki/MediaWiki:Gadgets-definition [01:14:33] is a giantic mess, btw [01:14:36] it's not a gadget actually [01:14:43] just in common.js [01:14:50] yikes [01:15:05] i'm sure they will take care of that soon [01:15:17] anyway, it works :) [01:16:48] aude: I can't find that damn script [01:16:49] link? [01:17:00] Not directly in https://it.wikipedia.org/wiki/MediaWiki:Common.js right? [01:17:53] https://it.wikipedia.org/wiki/MediaWiki:InterProject.js [01:18:07] linked from common js [01:18:43] where does that get its data from? [01:19:37] {{Interprogetto|commons=Category:Berlin|n=Categoria:Berlino|q=Berlino|q_preposizione=su|s=Ich bin ein Berliner|s_preposizione=su|wikt=Berlino|voy=Berlino|etichetta=Berlino}} [01:19:40] not for real [01:19:41] wow [01:21:00] !log maxsem Synchronized php-1.25wmf13/extensions/MobileFrontend/: (no message) (duration: 00m 05s) [01:21:09] Logged the message, Master [01:21:13] !log maxsem Synchronized php-1.25wmf13/extensions/VisualEditor/: (no message) (duration: 00m 07s) [01:21:15] James_F, ^^^ [01:21:17] Logged the message, Master [01:21:19] kaldari|2, ^^^ [01:21:23] MaxSem: Testing. [01:23:32] MaxSem: Well, it's not worse. Good to go. [01:37:53] MaxSem, still deploying? [01:38:05] greg-g, i need to push a minor zero portal fix [01:38:05] nope, done [01:38:17] ok, will do it now unless someone is deploying [01:53:59] MaxSem, i can't merge with wmf13 for some reason, and yours is last. Any thoughts? [01:54:55] yurikR, you need to kick jerkins outta this change first - fixed [01:55:19] MaxSem, thx [01:56:39] !log yurik Synchronized php-1.25wmf13/extensions/ZeroPortal: (no message) (duration: 00m 06s) [01:56:48] Logged the message, Master [02:03:54] PROBLEM - puppet last run on search1003 is CRITICAL: CRITICAL: Puppet has 1 failures [02:18:21] (03CR) 10Ori.livneh: [C: 032] "I read the backlog on #wikimedia-operations and it is clear that disabling Xenon was not intentional and did not fix the problem. 
Since Xe" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181007 (owner: 10Ori.livneh) [02:18:31] (03Merged) 10jenkins-bot: Revert "Temporarily disable xhprof profiling, due to stability issues" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181007 (owner: 10Ori.livneh) [02:19:16] RECOVERY - puppet last run on search1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [02:19:32] !log ori Synchronized wmf-config/StartProfiler.php: re-enable xenon (duration: 00m 06s) [02:19:39] Logged the message, Master [02:28:56] PROBLEM - Host pay-lvs1001 is DOWN: CRITICAL - Plugin timed out after 15 seconds [02:29:40] that seems bad [02:29:46] it's me [02:29:48] i think that may be bad [02:29:50] cool :) [02:29:52] it's fine [02:29:59] it paged [02:30:00] i screwed up and power cycled the wrong box ;-( [02:30:05] oops! [02:30:12] thank god for HA [02:30:26] i tried to mute it but icinga is useless [02:34:12] (03PS1) 10Springle: repool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181024 [02:34:52] (03CR) 10Springle: [C: 032 V: 032] repool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181024 (owner: 10Springle) [02:35:13] !log pay-lvs1001inadvertently power cycled [02:35:16] Logged the message, Master [02:35:24] RECOVERY - Host pay-lvs1001 is UP: PING OK - Packet loss = 0%, RTA = 0.72 ms [02:36:03] !log springle Synchronized wmf-config/db-eqiad.php: repool db1055, warm up (duration: 00m 06s) [02:36:11] Logged the message, Master [02:37:47] !log yurik Synchronized php-1.25wmf13/extensions/ZeroPortal: (no message) (duration: 00m 05s) [02:37:51] Logged the message, Master [02:38:31] (03PS1) 10Ori.livneh: mediawiki::hhvm: re-enable xenon [puppet] - 10https://gerrit.wikimedia.org/r/181026 [02:39:23] (03CR) 10Ori.livneh: "@Giuseppe: I reverted the wmf-config change which disabled xenon, but the extension is still disabled until this change is merged, too. If" [puppet] - 10https://gerrit.wikimedia.org/r/181026 (owner: 10Ori.livneh) [03:34:59] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:35:10] (03PS1) 10Springle: depool db1028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181028 [03:35:50] (03CR) 10Springle: [C: 032] depool db1028 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181028 (owner: 10Springle) [03:36:33] !log springle Synchronized wmf-config/db-eqiad.php: depool db1028 (duration: 00m 06s) [03:36:40] Logged the message, Master [03:44:19] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 57848 bytes in 0.048 second response time [04:40:02] (03Abandoned) 10Jackmcbarn: Re-enable the Lua profiler on production HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161468 (owner: 10Jackmcbarn) [04:48:22] (03PS1) 10Springle: upgrade db1028 to trusty [puppet] - 10https://gerrit.wikimedia.org/r/181031 [04:58:39] (03CR) 10Springle: [C: 032] upgrade db1028 to trusty [puppet] - 10https://gerrit.wikimedia.org/r/181031 (owner: 10Springle) [05:09:33] springle: heya. [05:10:42] hi [05:10:47] springle: See: https://phabricator.wikimedia.org/T78775 - we just need config like Beta here? Please add any missing bits when you've time. [05:11:09] (specially, how we can add DB once we're ready) [05:14:17] kart_: i have not been watching. i presume you had Dev input to sort out MW config in beta? [05:14:26] the DB is already created [05:15:05] springle: Beta is done :) [05:15:13] springle: This is for Production. 
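[Editor's note: springle's repool/depool changes above (181024, 181028) amount to editing the replica weight map in wmf-config/db-eqiad.php and syncing that file. A minimal sketch of the structure, with hypothetical weights — the real file is far larger and its exact layout may differ:]

    // db-eqiad.php (illustrative fragment): each section maps database
    // servers to read-query weights.
    $wgLBFactoryConf['sectionLoads']['s1'] = array(
        'db1052' => 0,    // master; weight 0 keeps general reads off it
        'db1055' => 50,   // freshly repooled, low weight while caches warm up
        'db1061' => 200,  // depooling = removing/commenting out a line
    );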
[05:16:28] springle: Feel free to add input on how to proceed for adding tables to wikishared DB for production. For Beta, I did that myself. [05:17:02] kart_: https://wikitech.wikimedia.org/wiki/Schema_changes [05:18:13] so, gerrit changesets, ask for review, well before code deploy, etc [05:18:18] springle: Thanks. I'll tracking ticket. [05:19:22] springle: We've contentranslation.sql in code, so you can directly review it. Will add that in ticket. [05:27:56] springle: https://phabricator.wikimedia.org/T84969 for you now :) [05:29:32] kart_: cool [05:29:53] PROBLEM - puppet last run on elastic1031 is CRITICAL: CRITICAL: puppet fail [05:42:48] RECOVERY - puppet last run on elastic1031 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:00:27] (03PS1) 10Andrew Bogott: Turn off nova-compute on virt1012 -- we're keeping this one in reserve. [puppet] - 10https://gerrit.wikimedia.org/r/181033 [06:01:12] (03CR) 10jenkins-bot: [V: 04-1] Turn off nova-compute on virt1012 -- we're keeping this one in reserve. [puppet] - 10https://gerrit.wikimedia.org/r/181033 (owner: 10Andrew Bogott) [06:03:04] (03PS2) 10Andrew Bogott: Turn off nova-compute on virt1012 -- we're keeping this one in reserve. [puppet] - 10https://gerrit.wikimedia.org/r/181033 [06:04:37] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [06:10:49] (03PS3) 10Andrew Bogott: Turn off nova-compute on virt1012 -- we're keeping this one in reserve. [puppet] - 10https://gerrit.wikimedia.org/r/181033 [06:12:34] (03CR) 10Andrew Bogott: [C: 032] Turn off nova-compute on virt1012 -- we're keeping this one in reserve. [puppet] - 10https://gerrit.wikimedia.org/r/181033 (owner: 10Andrew Bogott) [06:20:13] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:25:52] PROBLEM - DPKG on virt1011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:33:35] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:18] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:52] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 2 failures [06:35:59] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:51] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:52] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:37:08] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures [06:46:43] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:37] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:47:42] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:47:51] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:47:57] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:37] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:49:10] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:55:06] <_joe_> !log restarted HHVM on mw1184, stuck in 
HPHP::StatCache::refresh [06:55:11] Logged the message, Master [06:57:31] RECOVERY - HHVM rendering on mw1184 is OK: HTTP OK: HTTP/1.1 200 OK - 69017 bytes in 0.122 second response time [06:58:21] RECOVERY - Apache HTTP on mw1184 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [07:04:41] RECOVERY - HHVM busy threads on mw1184 is OK: OK: Less than 30.00% above the threshold [76.8] [07:04:53] (03PS4) 10Krinkle: gerrit: Don't match Phabricator identifiers within urls [puppet] - 10https://gerrit.wikimedia.org/r/177128 [07:05:03] RECOVERY - HHVM queue size on mw1184 is OK: OK: Less than 30.00% above the threshold [10.0] [07:14:59] !log disabled puppet and nova-compute on virt1010 and virt1011 until I can sort out a libvirt issue. [07:15:04] Logged the message, Master [07:38:16] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail [07:56:42] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:08:29] (03PS4) 10Giuseppe Lavagetto: [WMF] New Package Version with various bugfixes [debs/hhvm] - 10https://gerrit.wikimedia.org/r/180752 [08:11:05] _joe_: when you have time (haha!) we should talk about hhvm on tools :) [08:13:33] <_joe_> YuviPanda: in a few minutes [08:13:42] <_joe_> pinky swear [08:14:10] :D ok [08:39:26] <_joe_> YuviPanda: I'm here [08:39:55] _joe_: so, I know nothing about the memory characteristics of HHVM... [08:40:04] how bad is it? [08:40:06] relative to Zend [08:40:23] <_joe_> YuviPanda: it's not /bad/, but I need to understand how toollabs operates [08:40:29] _joe_: ah, ok [08:40:36] <_joe_> do you have small instances running mod_php at the moment? [08:40:55] _joe_: nope, so… all php web stuff runs with lighttpd and (as I discovered yesterday) php-cgi [08:41:29] _joe_: so we have 5 VMs, each running about 120 different ‘webservices’ as different user accounts. [08:41:47] <_joe_> ok, how are those segregated? [08:41:52] there’s a lot of php, lot of python, and some other languags too (C++, C#, some TCL) [08:42:04] _joe_: each run in their own user account (‘tool’ account) [08:42:10] <_joe_> because that _is_ going to be an issue with hhvm [08:42:43] <_joe_> we don't want to be running N hhvm instances, one per tool I guess [08:43:08] yeah, it’s one per tool now. [08:43:16] not sure how to enforce tool separation without doing that... [08:44:23] <_joe_> ok _this_ is our big issue right now. Let me do a test [08:44:58] greetings [08:45:02] <_joe_> ciao [08:45:17] _joe_: ok! [08:51:41] <_joe_> so the only separation I can think of without running in containers or chroot jails is separating the local repo path [08:51:58] <_joe_> I'm trying to see if an ini_set works for that, which I strongly doubt [08:52:28] <_joe_> but that would mean no FS separation [08:53:12] _joe_: hmm, but then you still have problems with file permissions etc. [08:53:27] _joe_: I suspect not being able to use per-user processes will cause way too many problems. [08:53:53] <_joe_> ok [08:53:57] <_joe_> so let's think of that [08:55:13] _joe_: also all processes are scheduled / managed by GridEngine, and it will kill the process if it goes above a certain amount of VMEM [08:55:50] currently 4G for most of them, and some tools have 8G [08:55:51] <_joe_> YuviPanda: we have a setting for that in HHVM [08:56:02] <_joe_> YuviPanda: 4G is _a_lot_ [08:56:04] <_joe_> !! 
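[Editor's note: the 4G/8G figures are gridengine virtual-memory limits (h_vmem) enforced per job on Tool Labs; a job that grows past its limit is killed. A hypothetical submission showing where that number lives — the tool name and script path are made up:]

    # jsub wraps gridengine's qsub; -mem sets the h_vmem request/limit,
    # -continuous makes the grid restart the job if it dies.
    jsub -mem 4g -continuous /data/project/mytool/bin/webservice.sh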
[08:56:10] <_joe_> we can work with that I guess [08:56:28] <_joe_> oh beware that HHVM is _much_ better than zend in enforcing memory limits [08:56:57] _joe_: tools *regularly* OOM with 4G as well, btw. we’ve a deamon that restarts when they go down as well, because that just keeps happening. Code quality issues + php-cgi [08:56:58] <_joe_> YuviPanda: how do you supervise the services? [08:57:22] <_joe_> YuviPanda: if we have 4G as a base it can be done I guess [08:57:41] _joe_: that’s the custom perl script that Coren wrote (I want to use Monit at some point). scheduling and accounting is taken care of by grid engine [08:58:07] <_joe_> ok [08:58:27] _joe_: it’s 4G of VMEM tho, because it is mostly lighty+php shared mem is a lot of it so things are ok. [08:58:42] I guess lighty+hhvm is also ok [08:58:50] <_joe_> so my suggestion is: create a template for hhvm.ini and the startup script [08:59:41] <_joe_> then change the hhvm.repo.central.path variable and any tmp directories to match some place in the workplace of your users [09:00:03] <_joe_> can you schedule hhvm to be restarted at regular intervals? [09:00:21] can do if required. [09:00:25] right now they get restarted if they die. [09:00:26] <_joe_> YuviPanda: also, where do you confugure the env of a tool? [09:00:45] _joe_: users configure it, but we can restrict it too with GridEngine. [09:01:10] _joe_: right now it is configured by setting params in the users’ custom lighttpd config [09:01:28] <_joe_> YuviPanda: so the do configure their php.ini? [09:01:41] <_joe_> they configure how to run the php cgi? [09:02:25] _joe_: nope, can’t change their php.ini outside of ini_set. [09:02:36] <_joe_> ok [09:02:40] _joe_: so we merge their custom config with a ‘default’ lighty config [09:02:53] _joe_: that sets up php-cgi and some others (minor static file serving, etc) [09:03:26] <_joe_> so in the case of hhvm, you'd need to provision: the fcgi config AND an hhvm startup script and an ini file [09:03:29] <_joe_> per user [09:04:10] _joe_: hmm, I wonder if I can just make lighty itself start the hhvm process if necessary. [09:04:21] I know we do somewhat similar things with python/fastcgi [09:04:22] <_joe_> YuviPanda: nah please :) [09:04:28] <_joe_> srsly? [09:04:39] <_joe_> oh, man, that is gross [09:04:41] yup, no uwsgi or anything [09:05:05] <_joe_> (disclaimer: I do the same in some particular setups :P) [09:05:17] _joe_: lighty starting it is more gross than php-cgi? :P [09:05:25] <_joe_> nah [09:05:39] <_joe_> but still, I think that is not acceptable [09:05:47] <_joe_> in the case of hhvm [09:05:59] <_joe_> it cannot be execute separately for every request [09:06:23] <_joe_> or do you have some fastcgi supervisor within lighty? [09:06:29] <_joe_> and btw, show me the code :) [09:06:45] looking at code, moment [09:07:15] https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/bigbrother is the restarter [09:07:38] https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/lighttpd-starter is the thing that sets up lighty [09:07:46] <_joe_> ook [09:08:12] was before my time, etc :) [09:08:13] <_joe_> bash for templating, perl for supervising [09:08:40] https://github.com/valhallasw/gerrit-patch-uploader/blob/master/lighttpd.conf is how most python/fcgi apps are run [09:08:46] (https://github.com/valhallasw/gerrit-patch-uploader/blob/master/app.fcgi) [09:09:09] _joe_: it’s all a jolly mess, although much nicer than toolserver. 
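[Editor's note: a minimal sketch of the per-tool template _joe_ proposes. The setting names are real HHVM 3.x options, but the tool name, paths, and port are hypothetical; hhvm.repo.central.path is the shared bytecode cache he wants relocated into each tool's own directory so tools don't share writable state:]

    ; /data/project/mytool/.hhvm.ini (hypothetical per-tool template)
    hhvm.server.type = fastcgi
    hhvm.server.port = 9000          ; a fixed, per-tool port (see the later discussion)
    hhvm.repo.central.path = /data/project/mytool/.hhvm.hhbc
    hhvm.log.use_log_file = true
    hhvm.log.file = /data/project/mytool/logs/hhvm.log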
[09:09:18] <_joe_> ok maybe it's not bad [09:09:38] <_joe_> I have to check how lighty works with fastcgi.server [09:10:32] <_joe_> YuviPanda: this is enough information to get me started [09:10:41] <_joe_> would you care to put all this in phab? [09:10:52] yeah, I already filed a ticket, let me put all this info there as well [09:11:14] https://phabricator.wikimedia.org/T78783 is the task [09:11:45] <_joe_> ok thanks [09:18:02] _joe_: done, I think. let me know if I missed anything [09:18:44] <_joe_> YuviPanda: ok thanks, the major blocker is my zero knowledge of lighty as a fastcgi env [09:20:18] _joe_: right. technically we *can* use nginx here too if we want, but then I don’t know if running 200nginx processes on a VM is better or worse than running 200 lighty ones [09:21:02] <_joe_> why do you have 200 lighty processes? [09:21:17] <_joe_> I thought lighty was common, with virtualhosts [09:21:21] no [09:21:23] it isn't [09:21:26] <_joe_> oh ok [09:21:29] _joe_: I edited my phab comment to add that [09:21:35] _joe_: user account separation again. [09:22:11] _joe_: we have nginx running on a separate machine that acts as a reverse proxy, routing to the appropriate lighty [09:22:15] (also does SSL) [09:22:23] <_joe_> YuviPanda: if the web server is just acting as a revers proxy to fastcgi apps... [09:22:31] _joe_: it’s also serving static files [09:22:37] _joe_: and doing url rewriting. [09:22:58] <_joe_> YuviPanda: ok, all things that don't require anything more than a separate virtual hosts [09:23:22] <_joe_> *host [09:23:26] <_joe_> but i digress [09:23:34] <_joe_> how is the frontend nginx configured? [09:24:03] wait what? [09:24:11] we have nginx that proxies to lighttpd? [09:24:26] paravoid: yes, but it’s more complicated than that! [09:24:26] <_joe_> that proxies to fastcgi [09:24:30] so when a web tool starts... [09:24:34] nginx is awesome [09:24:35] idk [09:24:41] it opens up a socket [09:24:48] to this ‘proxylistener’ we have running on the nginx server. [09:25:03] and then tells the server, ‘route things for /toolname/ to this port on this host' [09:25:17] and then the server routes that route to that port on that host as long as that socket is open. [09:25:41] <_joe_> ok [09:25:43] (https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/proxylistener.py) [09:26:08] <_joe_> what about for hhvm instances we do proxy nginx -> fastcgi directly instead of http? [09:26:14] <_joe_> it's like super easy [09:26:20] <_joe_> we cut a middleman [09:26:31] _joe_: static files and url rewriting? [09:26:40] <_joe_> YuviPanda: static files is easy [09:26:59] <_joe_> and we set a rule: for hhvm all apps must have an index.php entry point [09:27:08] <_joe_> which does url routing [09:27:15] <_joe_> it's pretty easy to do [09:27:32] <_joe_> easier than writing rewrite rules btw [09:28:04] <_joe_> the simplest form is a switch statement on $_SERVER['REQUEST_URI'] [09:28:41] lots of php apps don’t actually care about url rewriting anyway, so I guess that’s ok [09:28:43] <_joe_> YuviPanda: why do we need the proxylistener btw? [09:29:06] <_joe_> because the ip of the tool instance may change? [09:29:16] _joe_: because tools die / OOM all the time, and when they respawn they aren’t on the same port or host. so we need something to track what is running where. [09:29:43] <_joe_> oh, and GE doesn't have an api for that? [09:30:04] _joe_: not to find what port things are running in, no. 
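[Editor's note: the "switch statement on $_SERVER['REQUEST_URI']" routing _joe_ sketches a few lines up, as a runnable illustration. The tool name and handler files are invented:]

    <?php
    // index.php: the single entry point per tool. HHVM only ever executes
    // this file; it dispatches on the request path.
    $path = parse_url( $_SERVER['REQUEST_URI'], PHP_URL_PATH );
    switch ( $path ) {
        case '/mytool/':
            require __DIR__ . '/pages/home.php';
            break;
        case '/mytool/api':
            require __DIR__ . '/pages/api.php';
            break;
        default:
            http_response_code( 404 );
            echo 'Not found';
    }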
[09:30:35] _joe_: GE doesn’t actually know about webservices, that’s just an abstraction coren hacked on top of GridEngine. [09:30:57] <_joe_> YuviPanda: why the port should change? [09:31:22] _joe_: because it’s just a random high numbered port? [09:31:39] <_joe_> I mean we can decide "hhvm" for the "attila" tool runs on port 9000? [09:31:49] <_joe_> or is there a specific reason we can't do that? [09:31:59] only reason would be something else is already running on port 9000 [09:32:11] oh [09:32:14] you mean reserve ports per tool? [09:32:22] <_joe_> yes [09:32:36] <_joe_> reserve a range for "hhvm tools" for now, and assign them [09:32:50] <_joe_> but in general, yes [09:33:08] you still need something to assign them [09:33:10] <_joe_> I do get you have multiple tools running on the same host, right? [09:33:24] <_joe_> yeah in your present execution model that may be hard to do [09:33:32] yeah. [09:33:49] <_joe_> but hhvm needs a port set when it starts [09:33:55] _joe_: oh yeah, that’s fine... [09:34:07] <_joe_> so... another thing we need to tackle [09:34:09] _joe_: there’s… portgranter, which picks an unused port and gives it to you. [09:34:31] <_joe_> what's portgranter? a GE tool? [09:34:32] Coren explained to me a while ago why we need portgranter, but I have since forgotten... [09:34:37] _joe_: nope, somethign Coren wrote. [09:34:44] <_joe_> oh, ok [09:34:48] <_joe_> where is that? [09:35:01] https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/portgranter [09:35:06] https://github.com/wikimedia/operations-puppet/blob/production/modules/toollabs/files/portgrabber [09:35:23] <_joe_> so task 0 would be - make portgranter reserve a fixed set of ports for hhvm services [09:35:34] <_joe_> and use fixed ports for HHVM [09:35:37] hmm [09:36:20] <_joe_> then generate an nginx config per-hhvm-tool, and a startup wrapper that can be launched by bigbrother [09:37:46] hmm, so make dynamicproxy also be able to proxy back fastcgi:// [09:37:50] and then have a hhvm-wrapper [09:37:55] <_joe_> yes [09:38:06] and static files? [09:38:09] <_joe_> it's easy and it simplifies your execution model a lot [09:38:54] <_joe_> where are your static files? on disk? on NFS? on some shared storage? [09:39:04] <_joe_> if so, make them available to dinamicproxy [09:39:06] <_joe_> :) [09:39:34] _joe_: noooooooooooooooooo [09:39:35] :) [09:39:40] that seems terrible [09:39:42] they’re on NFS [09:39:52] I’m trying to remember why that felt terrible [09:39:57] <_joe_> why that seems terrible? [09:40:30] hmm [09:40:43] <_joe_> I mean what's more terrible, have files served statically from nfs by lighty and forwarded to nginx, or cut the middleman? [09:40:45] shouldn’t be too hard to say ‘hey, you have a static folder here, and everything in that will come out of /static' [09:40:52] <_joe_> I guess that was privilege separation again [09:41:13] _joe_: yeah, the important part is that it is hard for one terribly coded tool to fuck up other terribly coded tools [09:41:27] <_joe_> YuviPanda: read != write [09:41:31] true, true... [09:41:49] well, with OGE, if one tool takes up too much memory, just that one gets killed. [09:42:03] while if we have anything that’s shared across all tools, that becomes harder to ‘just kill' [09:42:16] nginx static file serving doesn’t seem *that* bad [09:42:21] <_joe_> well, isn't dynamicproxy already shared? [09:42:47] yeah, but it’s just reverse proxying [09:43:00] which is fairly simple and trivial. 
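[Editor's note: roughly what "proxy nginx -> fastcgi directly" plus nginx-served static files would look like in the dynamicproxy config. Tool name, grid host, and port are hypothetical, following the reserved-port scheme discussed above:]

    # One block per HHVM tool: PHP requests go straight to the tool's
    # FastCGI port, cutting out the per-tool lighttpd middleman.
    location /mytool/ {
        include        fastcgi_params;
        fastcgi_param  SCRIPT_FILENAME /data/project/mytool/public_html/index.php;
        fastcgi_pass   tools-webgrid-01.eqiad.wmflabs:9000;
    }
    # Static assets served by nginx itself, read from NFS:
    location /mytool/static/ {
        alias /data/project/mytool/public_html/static/;
    }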
[09:43:20] <_joe_> well, serving static files is even more trivial [09:43:43] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: puppet fail [09:45:24] _joe_: alright. I’m still slightly worried about hitting NFS, but perhaps NFS isn’t as bad now as it used to be. [09:45:40] _joe_: so, fastcgi://, hhvm wrapper, and static stuff goes in public_html/static [09:45:50] (and is served out of /toolname/static) [09:45:53] <_joe_> YuviPanda: sorry but... aren't the files already served via nfs? [09:46:05] _joe_: I mean, hitting NFS from the nginx proxy. [09:46:05] <_joe_> I mean it's NFS->lighty->nginx right now [09:46:26] <_joe_> well, try to see how much static files traffic we have [09:46:39] yes, and if NFS locks up and doesn’t return for a long time, it’s the lighty that hands and not nginx. [09:46:43] hmm, let me lok [09:46:45] *look [09:46:52] *hangs [09:47:26] <_joe_> ok [09:47:38] <_joe_> so you block the individual tool and not nginx, right [09:47:58] <_joe_> the solution for that may be serving static files for all those tools from one single separate instance [09:48:38] yeah, that was what I was thinking as well [09:48:50] have tools-staticfiles as a simple nginx server, and just serve static files out of that [09:48:54] <_joe_> or two, or whatever [09:49:01] right. [09:50:04] _joe_: so, let me file a bunch of subtasks. [09:50:21] <_joe_> YuviPanda: thanks a lot [09:50:34] _joe_: thanks for helping think this through :) [09:50:36] <_joe_> so I was looking at portgrabber and portgranter [09:50:48] <_joe_> mmmh [09:58:35] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:37:11] (03CR) 10Alexandros Kosiaris: "How did this manage to break parsoid on beta ? It was already in deployment-salt for days." [puppet] - 10https://gerrit.wikimedia.org/r/169622 (owner: 10Catrope) [10:43:14] (03CR) 10Alexandros Kosiaris: "I am a little bit unclear on why the pickle protocol does not allow to route metrics around, got a link handy ?" [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/180786 (owner: 10Filippo Giunchedi) [10:56:06] (03CR) 10Hashar: "In short:" [puppet] - 10https://gerrit.wikimedia.org/r/169622 (owner: 10Catrope) [11:29:50] (03PS4) 10Filippo Giunchedi: txstatsd: add support for graphite line-protocol [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/180786 [11:31:09] (03CR) 10Filippo Giunchedi: "basically lack of availability of tools that I could find, also carbon-c-relay is plaintext-only. I've reworded the commit message" [debs/txstatsd] - 10https://gerrit.wikimedia.org/r/180786 (owner: 10Filippo Giunchedi) [11:38:09] (03CR) 10QChris: [C: 04-1] "My comments for PS3 still apply." 
[puppet] - 10https://gerrit.wikimedia.org/r/177128 (owner: 10Krinkle) [11:42:11] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [11:53:36] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:20:11] (03PS1) 10Yuvipanda: toollabs: Add class and role for static file server [puppet] - 10https://gerrit.wikimedia.org/r/181053 [12:21:00] (03CR) 10jenkins-bot: [V: 04-1] toollabs: Add class and role for static file server [puppet] - 10https://gerrit.wikimedia.org/r/181053 (owner: 10Yuvipanda) [12:22:43] (03PS1) 10Merlijn van Deen: Flake8-ify everything [debs/adminbot] - 10https://gerrit.wikimedia.org/r/181054 [12:23:04] (03PS2) 10Yuvipanda: toollabs: Add class and role for static file server [puppet] - 10https://gerrit.wikimedia.org/r/181053 [12:25:41] (03PS2) 10Merlijn van Deen: Flake8-ify everything [debs/adminbot] - 10https://gerrit.wikimedia.org/r/181054 [12:26:29] (03PS3) 10Merlijn van Deen: Flake8-ify everything [debs/adminbot] - 10https://gerrit.wikimedia.org/r/181054 [12:27:15] (03PS4) 10Merlijn van Deen: Flake8-ify everything [debs/adminbot] - 10https://gerrit.wikimedia.org/r/181054 [12:29:23] (03CR) 10Yuvipanda: [C: 032] toollabs: Add class and role for static file server [puppet] - 10https://gerrit.wikimedia.org/r/181053 (owner: 10Yuvipanda) [12:54:25] !log aude Synchronized php-1.25wmf12/extensions/Wikidata/extensions/Wikibase/lib/resources/jquery.wikibase: js caching issues (duration: 00m 05s) [12:54:29] Logged the message, Master [13:38:07] (03PS1) 10Yuvipanda: tools: Use autoindex instead of root for static-file server [puppet] - 10https://gerrit.wikimedia.org/r/181058 [13:39:59] (03PS2) 10Yuvipanda: tools: Use alias instead of root for static-file server [puppet] - 10https://gerrit.wikimedia.org/r/181058 [13:50:33] (03CR) 10Filippo Giunchedi: [C: 031] Make ircecho run as ircecho user [debs/ircecho] - 10https://gerrit.wikimedia.org/r/176333 (owner: 10Yuvipanda) [14:03:29] (03CR) 10Yuvipanda: [C: 032] tools: Use alias instead of root for static-file server [puppet] - 10https://gerrit.wikimedia.org/r/181058 (owner: 10Yuvipanda) [14:27:21] (03PS1) 10Yuvipanda: toollabs: Remove stray duplicate line in static-server [puppet] - 10https://gerrit.wikimedia.org/r/181066 [14:33:27] (03PS1) 10KartikMistry: Add Kartik Mistry to Beta Cluster alert [puppet] - 10https://gerrit.wikimedia.org/r/181067 [15:06:23] (03PS1) 10KartikMistry: Fix indentation in various files [puppet] - 10https://gerrit.wikimedia.org/r/181071 [15:08:38] (03CR) 10KartikMistry: "Santhosh, can we go with empty provider where we want MT disabled by default? I tested in config.js locally and it seems working. I can up" [puppet] - 10https://gerrit.wikimedia.org/r/180724 (owner: 10KartikMistry) [15:14:56] (03CR) 10Alexandros Kosiaris: [C: 032] Fix indentation in various files [puppet] - 10https://gerrit.wikimedia.org/r/181071 (owner: 10KartikMistry) [15:19:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] "As far as I am concerned, this looks fine, but we should lose the WIP in the commit message before merging" [puppet] - 10https://gerrit.wikimedia.org/r/180724 (owner: 10KartikMistry) [15:19:36] (03PS1) 10Yuvipanda: tools: Allos origin protocols other than http [puppet] - 10https://gerrit.wikimedia.org/r/181073 [15:24:33] Coren: ^ [15:24:51] doesn’t require a restart of any tools, just requires restart of the admin tool... [15:26:17] (03CR) 10KartikMistry: "Yeah. I'm waiting Santhosh to reply. 
Should be fix on Monday!" [puppet] - 10https://gerrit.wikimedia.org/r/180724 (owner: 10KartikMistry) [15:29:54] YuviPanda: Is https://gerrit.wikimedia.org/r/181067 okay? [15:32:27] kart_: looking [15:53:16] (03CR) 10coren: [C: 031] "Moar flexible." [puppet] - 10https://gerrit.wikimedia.org/r/181073 (owner: 10Yuvipanda) [15:55:20] (03PS2) 10Yuvipanda: Add Kartik Mistry to Beta Cluster alert [puppet] - 10https://gerrit.wikimedia.org/r/181067 (owner: 10KartikMistry) [15:59:12] (03CR) 10Yuvipanda: [C: 032] Add Kartik Mistry to Beta Cluster alert [puppet] - 10https://gerrit.wikimedia.org/r/181067 (owner: 10KartikMistry) [15:59:22] (03PS2) 10Yuvipanda: tools: Allow origin protocols other than http [puppet] - 10https://gerrit.wikimedia.org/r/181073 [15:59:36] (03PS2) 10Yuvipanda: toollabs: Remove stray duplicate line in static-server [puppet] - 10https://gerrit.wikimedia.org/r/181066 [15:59:48] YuviPanda: Thanks! [16:00:44] (03CR) 10Yuvipanda: [C: 032] toollabs: Remove stray duplicate line in static-server [puppet] - 10https://gerrit.wikimedia.org/r/181066 (owner: 10Yuvipanda) [16:00:56] (03PS3) 10Yuvipanda: tools: Allow origin protocols other than http [puppet] - 10https://gerrit.wikimedia.org/r/181073 [16:02:45] (03CR) 10Yuvipanda: [C: 032] tools: Allow origin protocols other than http [puppet] - 10https://gerrit.wikimedia.org/r/181073 (owner: 10Yuvipanda) [16:05:34] (03PS1) 10Andrew Bogott: Remove qemu-kvm in favor of qemu-system. [puppet] - 10https://gerrit.wikimedia.org/r/181077 [16:06:46] RECOVERY - NTP on stat1003 is OK: NTP OK: Offset -0.06606340408 secs [16:06:50] YuviPanda: ^ might fix it. Running a couple more tests [16:07:06] andrewbogott: w000t [16:07:56] (03CR) 10Andrew Bogott: [C: 032] Remove qemu-kvm in favor of qemu-system. [puppet] - 10https://gerrit.wikimedia.org/r/181077 (owner: 10Andrew Bogott) [16:08:22] YuviPanda: oops, merge my patch while you're in there? [16:08:31] sure [16:10:23] (03PS1) 10Filippo Giunchedi: graphite: introduce local c-relay [puppet] - 10https://gerrit.wikimedia.org/r/181080 [16:10:55] RECOVERY - puppet last run on virt1010 is OK: OK: Puppet is currently enabled, last run 7 minutes ago with 0 failures [16:11:17] hm, um, having trouble remembering. how does $cluster get set, so that ganglia knows what multicast group to use? [16:11:22] i see it set sometimes in site.pp [16:11:24] $cluster = [16:12:01] OH is it in hiera now HMMM [16:13:21] RECOVERY - DPKG on virt1011 is OK: All packages OK [16:13:22] ah ha, i get it. [16:14:13] (03PS1) 10Ottomata: Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 [16:14:41] _joe_: does that look right to you? ^ [16:15:32] andrewbogott: merged ‘em, btw [16:17:54] (03PS2) 10Ottomata: Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 [16:18:26] <_joe_> ottomata: and you declare the "mainrole" where? [16:18:46] role/statistics.pp [16:18:49] <_joe_> btw, hold on a sec with that, we may have a simpler way to do all this by monday :) [16:19:02] <_joe_> it doesn't work like that right now [16:19:09] oh, k. ganglia is busted for these hosts since we moved them into the stats vlan [16:19:13] shoudl I just set globals in site.pp for now? [16:19:18] analytics vlan* [16:19:40] <_joe_> or add those hosts to regex.yaml and declare there [16:19:45] <_joe_> mainrole: statistics [16:24:10] oh. [16:24:15] that's how mainrole gets set? [16:24:18] it has to be a varaible? [16:24:28] it isn't ia the system::role? 
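[Editor's note: "mainrole" is not set by system::role; at the time it was a plain hiera key, matched per-host via the regex backend _joe_ mentions. A hypothetical hieradata/regex.yaml stanza of the kind being discussed — the host pattern and values are illustrative:]

    statistics:
      __regex: !ruby/regexp /^stat100[0-9]\.eqiad\.wmnet$/
      mainrole: statistics
      cluster: analytics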
[16:25:17] ook [16:25:25] (03PS1) 10Yuvipanda: tools: Fix stupid typos [puppet] - 10https://gerrit.wikimedia.org/r/181085 [16:25:29] Coren: ^ [16:26:41] (03PS3) 10Ottomata: Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 [16:26:51] (03PS4) 10Ottomata: Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 [16:26:56] _joe_, like that? ^ [16:32:51] _joe_ ^^ ? :) [16:34:57] <_joe_> ottomata: I might have missed the commit message, I was disconnected [16:35:37] https://gerrit.wikimedia.org/r/181083 [16:36:28] (03CR) 10Giuseppe Lavagetto: [C: 031] Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 (owner: 10Ottomata) [16:36:35] danke [16:36:42] (03CR) 10Ottomata: [C: 032] Move stat role servers into the 'analytics' $cluster [puppet] - 10https://gerrit.wikimedia.org/r/181083 (owner: 10Ottomata) [16:58:10] (03PS2) 10Ori.livneh: mediawiki::hhvm: re-enable xenon [puppet] - 10https://gerrit.wikimedia.org/r/181026 [16:58:22] (03CR) 10Ori.livneh: [C: 032 V: 032] "per joe" [puppet] - 10https://gerrit.wikimedia.org/r/181026 (owner: 10Ori.livneh) [17:15:08] (03PS1) 10Chad: Kill search pools 2,4,5 from LVS [puppet] - 10https://gerrit.wikimedia.org/r/181091 [17:15:30] (03PS1) 10Chad: Remove lsearchd pools 2,4,5 from DNS [dns] - 10https://gerrit.wikimedia.org/r/181092 [17:15:49] <^d> Hehe :) [17:16:17] * YuviPanda delicately merges patches, trying to make sure tools is unfucked [17:16:18] what's that all about? [17:16:33] all what? [17:16:35] (03PS2) 10Yuvipanda: tools: Fix stupid typos [puppet] - 10https://gerrit.wikimedia.org/r/181085 [17:16:58] the removal of 3/5 search pools? [17:17:03] <^d> bblack: lsearchd is dead! long live elasticsearch! [17:17:08] heh ok [17:17:15] (03CR) 10Yuvipanda: [C: 032] tools: Fix stupid typos [puppet] - 10https://gerrit.wikimedia.org/r/181085 (owner: 10Yuvipanda) [17:17:18] <^d> Pools 2 4 and 5 are now unused since yesterday. [17:17:26] <^d> (had zero traffic so I disabled them in MW) [17:17:32] they did? [17:17:44] oh ok, two different LVS levels here to think about, ok [17:18:03] it just seems odd to see it being removed from LVS when it still has backends pooled for pybal currently. [17:18:19] <^d> All of which needs to go away too :) [17:18:30] I'm all for simplification :) [17:18:33] <^d> The whole lsearchd system will come tumbling down [17:22:59] <^d> bblack: If you want to depool all of them in those pools you could. Like I said, the cluster's not talking to them anymore. [17:23:15] as long as you're sure of that, it doesn't matter. [17:24:03] <^d> https://gerrit.wikimedia.org/r/#/c/180838/3/wmf-config/lucene-production.php, specifically [17:24:40] (03PS1) 10RobH: setting production dns for haedus and capella servers [dns] - 10https://gerrit.wikimedia.org/r/181093 [17:25:07] so the LVS pooling is just for en, nl, ru now? [17:25:35] (03CR) 10RobH: [C: 032] setting production dns for haedus and capella servers [dns] - 10https://gerrit.wikimedia.org/r/181093 (owner: 10RobH) [17:25:38] <^d> Yep [17:25:58] <^d> nlwiki has some gadget or tool still using it, which I need to track down and help fix. [17:26:12] <^d> enwiki has some spider scraping the API, trying to find contact info. 
[17:26:21] <^d> ruwiki might be able to go, but I saw like 1 or 2 api hits for it [17:26:29] <^d> So I erred and kept it [17:26:29] ok [17:28:47] ok I can push those today if we're ready, they require some pybal restarts to take effect [17:29:12] <^d> cc manybubbles. [17:29:40] or we can just leave them till after the holidays if we don't want the risk. I think it's ok if we get it done early today while we're still looking for any fallout [17:29:42] oh hai - you are trying to shut down lsearchd for wikis? [17:29:59] <^d> I wrote some patches to kill lvs for pools 2 4 and 5. [17:30:01] I don't really object. I was thinking after the holidays [17:30:28] I guess in practice it's just cleanup rather than a functional improvement, so yeah maybe we put off the cleanup step [17:31:22] <^d> I guess we could just file a Phab task so we don't forget but do it after the holidays. [17:31:51] yeah. I'll add myself to those two commits two, I tend to re-check my gerrit queue periodically. [17:32:35] <^d> Do they go in #operations or #ops-requests now? [17:32:55] (03PS1) 10RobH: setting mac address info for haedus and capella servers [puppet] - 10https://gerrit.wikimedia.org/r/181095 [17:34:01] we're not sure, I think :) [17:34:21] (03CR) 10RobH: [C: 032] setting mac address info for haedus and capella servers [puppet] - 10https://gerrit.wikimedia.org/r/181095 (owner: 10RobH) [17:34:30] but just assign it to me explicitly either way, I generally look at a lot of the LVS/DNS layer stuff anyways [17:34:32] <^d> I'll do #operations [17:34:49] <^d> T85009 [17:36:28] ^d: no objections to your pool choices. I think we should announce it before we drop support for them though. [17:36:54] ^d: also, uh, so, if folks on those wikis put srbackend=lucene then they'll just get errors. which I assume isn't too bad [17:37:01] greg-g: you probably want to read ^^ too [17:37:31] <^d> srbackend is already gone on the API since it only has one search backend. [17:39:01] so, yeah, get it all ready and do it post holidaze? [17:39:07] manybubbles: greg-g is gone and not reading email, I think [17:39:12] well, at least trying to not read email :) [17:39:14] YuviPanda: thanks [17:39:18] :) [17:39:25] YuviPanda: almost, 8 more hours [17:39:30] oh [17:39:35] timezones [17:39:38] :) [17:39:56] ^d: oh - uh - wait - doesn't that have to still work for the people still trying to use it? [17:40:21] <^d> No? That's not how that setting worked. [17:46:08] <^d> manybubbles, greg-g: better late than never I suppose :) https://lists.wikimedia.org/pipermail/wikitech-l/2014-December/079884.html [17:47:03] thanks! [17:54:32] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: puppet fail [17:54:51] !log Manually transferred the email from enwiki account "Hob Gadling" to the centralauth account of the same name (after a partially failed account creation). [17:54:57] Logged the message, Master [18:09:36] (03PS1) 10Nemo bis: [English Planet] Add Geni, en.wiki/Commons sysop etc. [puppet] - 10https://gerrit.wikimedia.org/r/181104 [18:09:56] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:30:15] (03PS1) 10Ori.livneh: Add abacist module & role; provision on stat1001 [puppet] - 10https://gerrit.wikimedia.org/r/181110 [18:31:43] bd808: have you met nuria__ from analytics? she was just reporting that logstash appears to not be getting log data. [18:31:59] bd808: hello! 
[18:32:07] hi nuria__ [18:32:15] bd808: let me know if this is something you can help with [18:32:35] (btw, are you the right person to ping, or should it be someone in ops?) [18:32:45] I probably can help. [18:33:02] ori: Me, Reedy or jgage are probably the folks to ask [18:33:41] nuria__: Which data is missing? I see a lot of stuff in logstash for the last 15 minutes globally [18:33:57] nuria__: hadoop data [18:34:20] otto recently showed me this error from hadoop: [18:34:21] java.io.IOException: Cannot send data to logstash1002.eqiad.wmnet/10.64.32.137:12201 [18:34:22] https://www.irccloud.com/pastebin/b9nI7hyj [18:34:30] 124,313 events logged in the last 15 mintues [18:34:39] then i found a number of these in logstash1002:/var/log/logstash/logstash.log: message=>"Gelfd failed to parse a message skipping", :exception=># [18:34:45] bd808: from hadoop? [18:35:12] TooManyChunksError is a client problem. The stack traces you are sending are too big for the protocol [18:35:18] i did notice a spike in network traffic to logstash over the last couple days, perhaps we are hitting a capacity issue [18:35:24] hah! [18:35:40] there is an option to avoid sending stack traces, but that would be a bummer [18:35:54] jgage: right....that would not be very useful [18:35:56] Yeah. Java stacks can get nuts [18:36:03] jgage: but in log4j you can configure [18:36:15] jgage: bd808 how deep your traces are [18:36:19] Maybe it's time to change transport? [18:36:47] bd808 you mean to tcp? or redis? [18:36:50] I do see lots of stuff in https://logstash.wikimedia.org/#/dashboard/elasticsearch/hadoop [18:36:53] jgage, bd808 : if we configure stack trace "deepness" that can also help [18:37:19] nuria__ i'll take a look at the log4j config, i don't recall offhand how to set the stack trace depth [18:37:29] bd808: I see heatbeat kind of info but errors? [18:37:30] jgage: redis would be my first choice today I think. kafka maybe later when we get that working [18:37:50] bd808 how close are we to deploying the redis queue in prod? [18:38:14] jgage: It's there an working. MW is using it for the group0 servers [18:38:22] *and working [18:38:33] oh sweet. so i can go ahead and change the hadoop config? [18:39:24] Yeah. give it a shot. In the MW config I'm randomly selecting a hosts from 100[123] per-request. [18:39:35] great ok [18:39:40] Hopefully the java client has a better way to spread out the load [18:40:03] I think "%throwable{n} " is the log4j way to limit a trace [18:40:17] (03PS1) 10RobH: setting partman for capella/haedus [puppet] - 10https://gerrit.wikimedia.org/r/181111 [18:40:32] But that may only work for pattern layouts :( [18:40:33] jgage, bd808 : let me know when you think you are ready to test and we can send a hadoop job that fails and try to serach for errors in logstash to make sure they are there [18:40:57] (03CR) 10RobH: [C: 032] setting partman for capella/haedus [puppet] - 10https://gerrit.wikimedia.org/r/181111 (owner: 10RobH) [18:41:16] thanks nuria__. i also have a hive query that fails by design that i can use to test. [18:41:30] jgage: ok, you let me know. Thank you! [18:43:19] jgage: the filtering for stack traces i remember i used in tomcat, not sure if something similar is available to you: http://openutils.sourceforge.net/openutils-log4j/filteredlayout.html [18:44:08] thank you :) [18:45:54] I would guess that the giant payloads are from nested exceptions. 
128 * 1420 is a lot of log output [18:46:33] Switching to just sticking a json message in redis is probably best though [18:46:48] it avoids the problem of udp too [18:50:07] * jgage is making a phab task :D [18:50:23] jgage: :o [18:51:25] augh i accidentally hit back+forward and it cleared the form [18:54:08] (03PS1) 10Jforrester: gdash: Fix VE dashboard to point to new names of data streams [puppet] - 10https://gerrit.wikimedia.org/r/181114 [19:52:44] (03PS1) 10Mattflaschen: Make flow-bot grantable/removable on testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181120 [19:52:59] (03CR) 10Mattflaschen: [C: 04-1] Make flow-bot grantable/removable on testwiki and test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181120 (owner: 10Mattflaschen) [20:44:25] I am a global System administrator according to https://meta.wikimedia.org/wiki/Special:GlobalUsers/sysadmin and really shouldn't be ! Can someone ( MaxSem ?) kick me out of the club [20:45:12] spagewmf: Reedy can, or you can ask a steward in -stewards that you want it removed or prod the LCA of hell :) [20:45:38] (those are the methods I've seen work anyway) [20:45:56] JohnLewis: thanks, I was going to make a request at https://meta.wikimedia.org/wiki/Steward_requests/Global_permissions [20:46:12] that works but -stewards is easier :p [20:47:24] I just wonder how you got there in the first place. [20:48:16] spagewmf, actually I think you could remove yourself [20:49:01] spagewmf, https://meta.wikimedia.org/wiki/Special:GlobalUserRights/User:SPage_(WMF) ? [20:49:17] ^^ you should be able to according to the group rights [20:55:26] (03PS1) 10BryanDavis: Add 'wikipedia' to $wikiTags for SiteConfiguration checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181129 [20:55:28] (03PS1) 10BryanDavis: monolog: enable for group0 + group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 [20:55:36] (03CR) 10jenkins-bot: [V: 04-1] monolog: enable for group0 + group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 (owner: 10BryanDavis) [20:58:14] (03CR) 10BryanDavis: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 (owner: 10BryanDavis) [21:00:20] (03PS13) 10Krinkle: contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 (https://bugzilla.wikimedia.org/72063) [21:00:27] (03PS14) 10Krinkle: contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 (https://bugzilla.wikimedia.org/72063) [21:00:35] (03PS2) 10Krinkle: contint: Move tmpfs Require to caller to support labs' jenkins-deploy [puppet] - 10https://gerrit.wikimedia.org/r/173511 [21:00:44] (03PS2) 10Krinkle: [WIP] contint: Add tmpfs mount in jenkins-deploy homedir for labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) [21:00:53] (03PS3) 10Krinkle: contint: Add tmpfs mount in jenkins-deploy homedir for labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/173512 (https://bugzilla.wikimedia.org/72063) [21:10:36] (03PS2) 10BryanDavis: monolog: enable for group0 + group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181130 [21:11:42] (03Abandoned) 10BryanDavis: Add 'wikipedia' to $wikiTags for SiteConfiguration checks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181129 (owner: 10BryanDavis) [21:23:44] spagewmf: I see you got it swapped out from staff rights [21:24:33] yes it's in the log for my old S Page (WMF). 
I made a request on the noticeboard to remove it. Not urgent [21:24:48] I saw :p [21:24:58] replying with a diff to the log [21:26:06] spagewmf: the mystical James Alexander is poking stewards to deal with your request because LCA likes controlling this stuff :) [21:28:13] JohnLewis: here's where I confess to not knowing what "LCA" is (and https://meta.wikimedia.org/wiki/Glossary#L) and every other glossary doesn't help [21:29:03] spagewmf: sorry; Legal and Community Advocacy. You know, Philippe, James and Maggie (mostly) under Geoff [21:29:12] o7 [21:29:38] hola Jamesofur [21:29:41] :p [21:30:01] Jamesofur: you mean 007 :) [21:31:23] heh [22:21:02] PROBLEM - puppet last run on calcium is CRITICAL: CRITICAL: Puppet has 1 failures [22:33:48] RECOVERY - puppet last run on calcium is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [22:34:17] Looks like Zuul needs a kick?? [22:35:53] ^d: Krinkle: ^^ [22:36:06] PROBLEM - tcpircbot_service_running on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:36:08] <^d> anytime I touch it I break it worse. [22:36:12] Zuul is "idle" with 31 queued jobs [22:36:13] * ^d doesn't touch it anymore [22:36:14] sigh [22:36:27] ^d: probably a good approach, that's what I do at least [22:36:47] I'm off until January 2nd starting tonight. [22:37:13] anyone here I can ping about this? [22:37:31] the test pipeline says "0" for the last 8 hours, actually [22:37:50] donno if that graph is usually empty or not [22:38:07] It's stuck as of 14 minutes ago [22:38:17] oh... not as bad as it looked [22:38:20] awight: graph 2 and 3 are down because graphite is broken [22:38:34] the graphs have no meaning. [22:38:35] I don't know why [22:39:08] sad! [22:39:09] RECOVERY - tcpircbot_service_running on neon is OK: PROCS OK: 1 process with command name python, args tcpircbot.py [22:39:14] Krinkle: go on vacation, quick! [22:39:24] but if you want to name a backup, I can harass that person [22:39:32] Antoine [22:39:35] argh [22:39:37] hashar: [22:39:44] he's been on IRC strike [22:39:49] No, he's just asleep [22:39:54] or at least done working today [22:39:59] k [22:40:03] He's been on IRC last few days. Just started again [22:40:10] I don't know when his vacation starts [22:40:42] hah [22:40:59] ok don't worry, I'm sure I can find someone to throw the server rack power switch or something :p [22:41:32] I'm looking into it now, and trying to pretend I don't know anything and only follow documentation [22:42:27] :p [22:42:33] don't do that to yourself [22:44:52] Krinkle: looks like you made something come back to life--now there are 174 jobs queued :) [22:47:25] awight: Those jobs came from an independent schedule. Every x minutes, there's a cronjob triggering 100s of jobs for browser test and beta labs scap. Because I don't know, it's easier to say do it every 5 minutes and waste 100s of jobs than do it a normal way.. [22:47:31] Not triggered by Gerrit [22:48:04] In case something changed [22:50:03] yes I noticed [22:50:38] Also, it's not 5 minutes, but something like twice per day for most jobs, which is both higher and lower granularity than u'd wish [22:51:08] Krinkle: total tangent, I noticed that we're lacking a mechanism to provision database test fixtures for the browser tests. [22:51:20] Lemme en-phabricate that while I'm thinking about it... [22:51:48] awight: Well, most of those tests use test2 or beta labs instead of a plain mediawiki install. So the pages they depend on already exist.
[22:51:59] But yeah, they should use simple local installs (like our unit tests do) and use fixtures. [22:52:23] We have a use case for fixtures, testing CentralNotice, and I'm sure there are others who could benefit... [22:52:29] Or create the pages with a maintenance script or API client [22:52:33] (so taht we can run it against beta labs still) [22:52:44] yep [22:52:50] awight: I'm not involved with browser tests though. [22:52:58] awight: as a consumer of CI I maintain front-end unit tests. [22:53:00] I balked at writing a r/w api for CentralNotice though... [22:53:15] lemme stop distracting you ;) [22:53:16] as maintainer of CI I maintain the infrastructure in general, not individual jobs. that's responsibility of individual engineering teams or QA. [22:53:28] Krinkle: did you do anytihng with Zuul recently? [22:53:50] greg-g: Nope. Zuul and Jenkins commit suicide every 10-40 hours as usual for the past few months. [22:53:59] and just come back up? [22:54:14] No, they escalate until Antoine or I wake up and push every button we see until it comes back [22:54:40] I've just disconnected and relaunched the Gearman manager on gallium [22:54:43] That usually brings it back [22:54:46] * greg-g nods [22:55:24] Anyone with wmf-ldap can restart Gearman via Jenkins web control panel. I'll write that up in an e-mail (it's like 3 steps) [22:55:44] Yep, that was the culprit this time [22:56:03] Managed to preserve the queue so fundraising jobs shoudl start running now [22:56:08] * have started running. [22:56:13] nice work, Krinkle [22:57:20] Is jenkins another system where ldap/wmf is granted way too much power? [22:58:05] The bug by the way is https://phabricator.wikimedia.org/T72597 (Jenkins Gearman plugin has deadlock on executor threads) [22:59:52] Krenair: the level of access matches all other internal CI tools I've ever run. Basically the "trusted engineers" can keep the CI tool running and control the jobs that run there. [23:00:30] and see lots of passwords? :/ [23:02:39] No, I don't think you can see lots of passwords. Maybe one for the irc bot. [23:03:06] the user passwords are in the ldap server, not jenkins [23:03:10] bd808, https://integration.wikimedia.org/ci/configure [23:03:54] those testing passwords are on wiki too I think. [23:07:37] they're not terribly sensitive, nothing we shouldn't be ok with ldap/wmf seeing [23:14:49] yeah, we have the passwords for test users on beta labs and test2wiki in Jenkins. we also limit the powers of test users to the minimum necessary [23:14:59] I'm getting strange CI errors on a mediawiki-core build: https://integration.wikimedia.org/ci/job/mediawiki-phpunit-hhvm/389/consoleFull [23:15:46] ie there are some automated tests we don't perform because they require privileges we are not comfortable giving to test users [23:19:45] bd808: just to be clear, no passwords on wikis, just Jenkins. if you know different, let me know. [23:21:55] awight: What's the gerrit link for that patch? [23:22:28] chrismcmahon: you should probably look at that instead of bd808 :) [23:22:31] ^ [23:22:44] Those look like hhvm related errors that should be fixed in master [23:24:13] bd808: https://gerrit.wikimedia.org/r/#/c/181207/ i forced the submit, nothing to see here :) [23:25:31] awight: Oh... you need to get hhvm tests turned off for that branch. 
They will fail badly always [23:26:09] awight: File a bug in phab against the ci project [23:26:22] ah bd808 good to know [23:26:57] anything older then 1.24wmf12 will not pass the test suite under hhvm [23:30:32] (03PS1) 10John F. Lewis: admin: grant twentyafterfour gallium [puppet] - 10https://gerrit.wikimedia.org/r/181211 [23:33:13] (03CR) 10Greg Grossmeier: [C: 031] admin: grant twentyafterfour gallium [puppet] - 10https://gerrit.wikimedia.org/r/181211 (owner: 10John F. Lewis) [23:33:57] bd808: oic, do you happen to know which release becomes fully hhvm compatible? [23:35:06] awight: 1.25 alpha should pass tests under hhvm :P [23:35:33] hehe ok thx [23:35:38] 1.24 claims beta support, but I don't think all the tests pass [23:35:39] awight: 1.24wmf14 (next internal branch) was the first to pass all tests under hhvm. I backported fixes for wmf12 and wmf13 yesterday. hashar made the hhvm test voting instead of non-voting yesterday [23:35:53] legoktm: They do. hhvm is voting now in master [23:36:02] but it should NOT be voting on older branches [23:36:08] bd808: oh great. We can probably bump ourselves up to REL1_24 [23:36:14] but it is right now [23:36:34] oh wait is master 1.25? [23:36:47] 1.25 then, sorry [23:36:49] numbers are hard [23:37:05] so not "stable" hhvm rel yet [23:37:28] but master is and will stay hhvm compliant [23:37:38] bd808: fyi I made a task here, https://phabricator.wikimedia.org/T85036 [23:49:49] (03PS2) 10Alex Monk: admin: grant twentyafterfour gallium [puppet] - 10https://gerrit.wikimedia.org/r/181211 (owner: 10John F. Lewis) [23:50:49] Krenair: I didn't know I forgot the bug prefix; thanks :) [23:50:59] :) [23:51:08] I dislike how it shows the patch uploader on the task [23:51:15] rather than the change owner [23:51:48] yeah [23:51:58] most common case is the proper author being the commit owner [23:52:07] (change owner, even) [23:53:27] but I forget where the gerritadmin code is [23:53:30] maybe ^d knows [23:56:55] (03CR) 10Ori.livneh: [C: 032] gdash: Fix VE dashboard to point to new names of data streams [puppet] - 10https://gerrit.wikimedia.org/r/181114 (owner: 10Jforrester)
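[Editor's note: awight's T85036 (stop running the hhvm job on branches that cannot pass it) would land in integration/config's Zuul layout, where a job can be limited by branch. A hypothetical Zuul 2.x layout.yaml fragment — the job name is from the log above, the branch regex is illustrative:]

    jobs:
      - name: mediawiki-phpunit-hhvm
        # only run (and vote) where the suite can actually pass
        branch: ^master$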