[00:04:45] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.67% of data above the critical threshold [1000.0] [00:10:57] !log ori Synchronized php-1.26wmf5/extensions/Gadgets: cbb9b1e475: Update Gadgets for cherry-pick (duration: 00m 12s) [00:11:02] Logged the message, Master [00:11:36] PROBLEM - High load average on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0] [00:12:52] !log ori Synchronized php-1.26wmf4/extensions/Gadgets: 7539873979: Update Gadgets for cherry-pick (duration: 00m 12s) [00:12:59] Logged the message, Master [00:13:06] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [00:13:22] !log ori Synchronized php-1.26wmf4/includes/jobqueue/jobs/RefreshLinksJob.php: 914d71f3cc: Temporary hack to drain excess refreshLinks jobs (duration: 00m 14s) [00:13:27] Logged the message, Master [00:14:37] (03PS1) 10Ori.livneh: Set $wgGadgetsCacheType to CACHE_ACCEL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210628 [00:14:41] (03CR) 10jenkins-bot: [V: 04-1] Set $wgGadgetsCacheType to CACHE_ACCEL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210628 (owner: 10Ori.livneh) [00:15:07] (03PS2) 10Ori.livneh: Set $wgGadgetsCacheType to CACHE_ACCEL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210628 [00:15:26] (03CR) 10Ori.livneh: [C: 032] Set $wgGadgetsCacheType to CACHE_ACCEL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210628 (owner: 10Ori.livneh) [00:16:07] !log ori Synchronized wmf-config/CommonSettings.php: I5ebedfdfb: Set $wgGadgetsCacheType to CACHE_ACCEL (duration: 00m 12s) [00:16:12] Logged the message, Master [00:23:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [00:28:59] Anyone else having long delays while connecting to production (e.g. tin, stat1003)? [00:29:50] matt_flaschen: from where? [00:30:08] gwicke, Philadelphia. [00:31:09] matt_flaschen: run mtr and check for packet loss [00:31:40] matt_flaschen: ATT? ;) [00:31:51] ori: Thanks, never heard of that before. I have had some other slight internet problems today, so it could definitely be on my side. [00:33:24] gwicke, nope, Comcast. [00:34:07] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [00:34:21] kk; was in a similar situation with ATT ~1½ months ago [00:34:53] turned out to be congested backbones close to eqiad [00:34:57] gwicke getting fancy with the unicode [00:35:14] compose ftw® [00:36:42] maybe matt_flaschen's ☏ is ⌛ because a cable was ✄, or because ☁ ❅ in philadelphia [00:38:04] There was snow on the first day of spring here, but it's May now. ;) [00:38:16] ¯\(°_o)/¯ [00:38:20] haha [00:38:43] so i took the xenon logs and aggregated by the function that is at the top of the stack (i.e., on-cpu) [00:38:48] top 5: [00:38:52] (any guesses?) [00:39:15] composer class loader? [00:39:21] (spoiler: it's all services!) [00:39:32] bd808: that's #8 [00:39:45] MWDebug emit? ;) [00:39:50] nope [00:40:00] Kidding, but that was really high up on a similar local profile I saw. [00:40:02] #1: DatabaseMysqli::doQuery (335287) [00:40:07] #2: Memcached::getByKey (185538) [00:40:13] #3: AutoLoader::autoload (103382) [00:40:18] #4: Hooks::run (71006) [00:40:24] #5: DatabaseMysqli::mysqlConnect (67280) [00:40:46] That actually makes sense for the most part. [00:40:47] in other words the app servers are waiting on io [00:40:51] the db ones are obvious [00:41:06] hooks is 47% of MediaWiki code so ... 
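What ori describes above, aggregating the xenon traces by whichever function sits at the top of each sampled stack, is easy to reproduce offline. The sketch below is not the actual xenon tooling; it assumes the traces have already been exported in a collapsed one-stack-per-line format (frames joined by semicolons, optionally followed by a sample count), which is an assumption made here purely for illustration.

    <?php
    // Count on-CPU (top-of-stack) frames from collapsed stack traces.
    // Assumed input format, one stack per line:
    //   main;MediaWiki::run;DatabaseMysqli::doQuery 42
    // Usage: php top-of-stack.php traces.txt
    $file = isset( $argv[1] ) ? $argv[1] : 'php://stdin';
    $counts = array();
    foreach ( file( $file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES ) as $line ) {
        $samples = 1;
        if ( preg_match( '/^(.*?)\s+(\d+)$/', $line, $m ) ) {
            // A trailing integer is treated as a sample count (flamegraph-style input).
            $line = $m[1];
            $samples = (int)$m[2];
        }
        $frames = explode( ';', $line );
        $top = trim( end( $frames ) ); // last frame = top of stack = on-CPU
        if ( $top === '' ) {
            continue;
        }
        $counts[$top] = ( isset( $counts[$top] ) ? $counts[$top] : 0 ) + $samples;
    }
    arsort( $counts );
    foreach ( array_slice( $counts, 0, 10, true ) as $func => $n ) {
        printf( "%-60s %d\n", $func, $n );
    }

Fed a trace export in that shape, this produces exactly the kind of ranking pasted above and continued just below (DatabaseMysqli::doQuery, Memcached::getByKey, AutoLoader::autoload, and so on).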
[00:41:11] who'd thought [00:41:31] next 5: [00:41:36] #6: ReplacementArray::replace (65590) [00:41:40] #7: CurlHttpRequest::execute (57829) [00:41:43] #8: Composer\Autoload\includeFile (43680) [00:41:46] #9: Elastica\Transport\Http::exec (36494) [00:41:52] #10: Preprocessor_Hash::preprocessToObj (35010) [00:42:07] if we switched to node the #1 would be BPromise.Await or something similar [00:42:30] no, it's not on the stack [00:42:42] (the JS stack) [00:43:07] the point of async programming is that you don't block while waiting for IO [00:43:09] * ori makes a mental note to look into ReplacementArray::replace [00:43:23] I was just about to open it up [00:43:59] gwicke: [00:44:02] "As we teased last month, HHVM 3.6 comes with lots of new goodies. In particular, many new Async features are now available by default including AsyncMySQL and MCRouter (memcache) support. Combined with existing support for asynchronous cURL, applications are now ready to deeply parallelize many common forms of costly data access." [00:44:07] we're running 3.6 as of a couple of hours ago [00:44:43] bd808: i'm actually on my way out so don't let me stop you from looking :) [00:44:52] catching up, one small step at a time ;) [00:44:53] looks like it is used by tidy and htmlformatter [00:45:00] that's great to hear [00:45:07] how b/c is this with Zend? [00:45:11] is that still a goal? [00:45:22] no, the await stuff is all hack [00:45:24] MW is a PHP 5.3.3+ application [00:45:32] for now [00:45:42] but we could make libs that do something different [00:45:47] there are more JS SLOC in mediawiki/core than PHP [00:46:04] hack is less interesting tbh [00:46:32] ori: ReplacementArray is "Replacement array for FSS with fallback to strtr()" [00:46:41] gwicke: http://docs.hhvm.com/manual/en/install.hack.h2tp.php [00:46:57] you can write code in hack and transpile it back to php for distribution [00:47:17] though i don't quite fathom how the async facilities can be transpiled to php -- i suspect they cannot [00:47:29] ori: any production experience with the new async features? ;) [00:47:30] was the ICU issue on hhvm3.6 fixed? [00:47:38] nope [00:47:50] uhh [00:47:56] is that not an issue in production? [00:48:33] task #? [00:48:40] https://phabricator.wikimedia.org/T98882 [00:48:42] https://phabricator.wikimedia.org/T98882 [00:49:16] this only occurred with pathological test input, no? [00:49:30] bascially yeah [00:50:05] iconv //IGNORE will return an empty string if there are any bad chars in the input [00:50:17] Yo, does everyone already know that the Commons upload wizard is broken? [00:50:24] "You can only upload files with a size of up to -1 B. You tried to upload a file that is 133 KB" [00:50:29] Tried with two different browsers [00:50:57] well, or, ok, probably just my account is broken [00:51:03] I thought that was fixed? https://phabricator.wikimedia.org/T97415 [00:51:34] hmm, that wasn't backported [00:51:36] Alls I knows is that it is happening to me right now [00:51:46] I'm getting it too [00:52:19] I’m about to be late… can someone who isn’t me follow up? [00:52:57] andrewbogott: i'll just paste some chat lines to that ticket, k? [00:53:44] mutante: ok. It seems moderately urgent to me, but maybe using the wizard is a niche case? [00:54:01] hhvm error quoted in original bug -- https://github.com/facebook/hhvm/issues/4993 [00:54:39] oh man that looks… much worse than I would’ve guessed. [00:54:52] andrewbogott: it's not a niche case. reopening the ticket. 
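On the iconv //IGNORE point a few lines up (T98882): the behaviour is easy to probe on any given PHP or HHVM build by feeding it a string containing a deliberately invalid UTF-8 byte. This is only a local test sketch; it reports whatever the runtime actually does rather than assuming the outcome.

    <?php
    // Probe iconv( 'UTF-8', 'UTF-8//IGNORE', ... ) with deliberately broken input.
    // The bug discussed above (T98882) shows up as an empty string for the
    // invalid-byte case; other builds typically strip the bad byte or return false.
    $inputs = array(
        'valid'        => "caf\xc3\xa9",            // well-formed UTF-8 ("café")
        'invalid byte' => "caf\xc3\xa9 \xff oops",  // contains a stray 0xFF byte
    );
    foreach ( $inputs as $label => $input ) {
        $out = @iconv( 'UTF-8', 'UTF-8//IGNORE', $input );
        printf(
            "%-13s in=%2d bytes  out=%s\n",
            $label,
            strlen( $input ),
            $out === false ? 'false' : ( $out === '' ? "'' (empty)" : var_export( $out, true ) )
        );
    }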
[00:55:01] ori: hhvm bug may be causing upload problems [00:55:30] mutante: I only thought it was a niche case because otherwise it seems insanely coincidental that I was the first to notice it :) [00:56:14] well I think bawolff's fix needs to be backported [00:56:18] but that's only gwtoolset [00:57:53] * andrewbogott => the pub [01:01:40] mutante: the existing bug is about GWToolset, a different uploading tool [01:02:37] legoktm: but it sounds like the same root cause? [01:02:50] yeah, but the fix will only fix GWToolset [01:03:42] legoktm: so we need a patch to upload/UploadFromFile.php, specials/SpecialUpload.php and WebRequest.php too apaprently [01:03:43] 'maxPhpUploadSize' => min( [01:03:43] wfShorthandToInteger( ini_get( 'upload_max_filesize' ) ), [01:03:43] wfShorthandToInteger( ini_get( 'post_max_size' ) ) [01:03:43] ), [01:03:46] hmm, so reopening ticket about gwtoolset is right but additionally a new one for uploadwizard? [01:03:50] probably [01:03:57] so UW needs a similar fix [01:04:48] marktraceur: around? [01:05:28] imaginarybot: !irclogs_tophab --10 -97415 [01:06:20] heh [01:07:39] !log added commons to supported projects in RESTBase API [01:07:44] Logged the message, Master [01:11:18] hmmm.. this works as expected on mw1070 -- php -r 'echo ini_get( "post_max_size" ), "\n";' [01:11:31] where php == HipHop VM 3.6.1 (rel) [01:12:13] ah. broken one is -- php -r 'echo ini_get( "max_post_size" ), "\n";' [01:13:30] bd808: I don't see that one in UploadWizard [01:13:38] me neither [01:13:41] just post_max_size and upload_max_filesize [01:14:29] php -r 'echo ini_get( "upload_max_filesize" ), "\n";' == "100M" [01:15:03] php -r 'echo ini_get( "post_max_size" ), "\n";' == 104857600 [01:16:07] bd808: mw.UploadWizard.config.maxPhpUploadSize (in JS) is -1 on commons [01:16:21] what sets that? [01:16:48] I think it's just 'maxPhpUploadSize' => min( [01:16:48] wfShorthandToInteger( ini_get( 'upload_max_filesize' ) ), [01:16:48] wfShorthandToInteger( ini_get( 'post_max_size' ) ) [01:16:48] ), [01:18:10] if ( $string === '' ) { return -1; } [01:18:24] that's in wfShorthandToInteger [01:18:52] but from CLI both of those are returning values [01:19:08] yeah. I wonder if fcgi is different? [01:19:46] looks like that might be it -- https://github.com/facebook/hhvm/issues/4993#issuecomment-82597836 [01:19:51] aw crap [01:19:55] I backported to the wrong branch [01:20:59] so upstream bug is still open :/ [01:21:14] !log legoktm Synchronized php-1.26wmf4/extensions/GWToolset/: Check php max_file_size limit directly from PHP $_FILES (duration: 00m 12s) [01:21:21] Logged the message, Master [01:22:29] so either we roll back to the older hhvm or try to hack around things somehow [01:23:00] !log legoktm Synchronized php-1.26wmf5/extensions/GWToolset/: Check php max_file_size limit directly from PHP $_FILES (duration: 00m 12s) [01:23:02] we can just set it in CommonSettings for now [01:23:04] the limit that is sent to js from uploadwizard could be hot fixed on WMF branches [01:23:06] Logged the message, Master [01:23:07] bd808: "hack"? ;) [01:23:34] bd808: $wgUploadWizardConfig['maxPhpUploadSize'] = 'whatever' I think will work [01:23:44] *nod* [01:23:51] what should whatever be? :P [01:24:04] 104857600? 
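To spell out where that -1 comes from: under the affected HHVM FastCGI setup ini_get() returns an empty string for the upload limits (per the upstream report linked above), wfShorthandToInteger() maps an empty string to -1 (the branch quoted at 01:18 above), and the min() that UploadWizard exports to the client therefore becomes -1. A self-contained illustration, using a local stand-in for wfShorthandToInteger() since MediaWiki itself is not loaded here:

    <?php
    // Stand-in for wfShorthandToInteger(), mirroring the behaviour quoted above:
    // an empty string short-circuits to -1, otherwise "100M"-style shorthand is expanded.
    function shorthandToInteger( $string ) {
        $string = trim( $string );
        if ( $string === '' ) {
            return -1;
        }
        $value = (int)$string;
        switch ( strtolower( substr( $string, -1 ) ) ) {
            case 'g': $value *= 1024; // deliberate fall-through
            case 'm': $value *= 1024; // deliberate fall-through
            case 'k': $value *= 1024;
        }
        return $value;
    }

    // What a healthy runtime reports versus what the broken FastCGI path reported:
    $cases = array(
        'ok'         => array( '100M', '104857600' ),
        'buggy fcgi' => array( '', '' ),
    );
    foreach ( $cases as $case => $ini ) {
        $max = min( shorthandToInteger( $ini[0] ), shorthandToInteger( $ini[1] ) );
        echo "$case: maxPhpUploadSize = $max\n";
    }
    // Expected output:
    //   ok: maxPhpUploadSize = 104857600
    //   buggy fcgi: maxPhpUploadSize = -1   (the value UploadWizard ended up sending to the client)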
[01:24:18] that's what mw1070 reports [01:24:38] 100M bacically [01:24:43] *basically [01:26:26] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [01:26:55] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add jdouglas to researchers admin group - https://phabricator.wikimedia.org/T98536#1281391 (10Dzahn) [01:27:14] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add jdouglas to researchers admin group - https://phabricator.wikimedia.org/T98536#1270765 (10Dzahn) @milimetric thanks for the clarification. i renamed the ticket accordingly. [01:27:21] (03PS1) 10Legoktm: Hardcode UploadWizard max upload size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210636 (https://phabricator.wikimedia.org/T98933) [01:27:27] bd808: ^ [01:28:12] (03CR) 10BryanDavis: [C: 031] Hardcode UploadWizard max upload size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210636 (https://phabricator.wikimedia.org/T98933) (owner: 10Legoktm) [01:28:55] (03CR) 10Legoktm: [C: 032 V: 032] Hardcode UploadWizard max upload size [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210636 (https://phabricator.wikimedia.org/T98933) (owner: 10Legoktm) [01:29:42] !log legoktm Synchronized wmf-config/CommonSettings.php: Hardcode UploadWizard max upload size - T98933 (duration: 00m 12s) [01:29:50] Logged the message, Master [01:30:03] > mw.UploadWizard.config.maxPhpUploadSize [01:30:04] 104857600 [01:30:23] you fixed it [01:30:24] legoktm: [01:30:26] :)) [01:30:34] "fixed" [01:30:34] will an upload actually work now? or do we have a backend problem too? [01:30:40] i tested by uploading the screenshot of the failed upload :) [01:30:55] All uploads were successful! [01:30:58] it tells me [01:31:03] I didn't try finishing the uplaod, but I got passed the error screen and it said it was successful [01:31:47] https://commons.wikimedia.org/wiki/File:Legoktm_fixed_T98933.png [01:31:48] works :p [01:32:13] https://commons.wikimedia.org/wiki/File:Example_en_legoktm_test.svg [01:32:48] such a good wiki citizen [01:33:26] legoktm: we should put in for a raise due to collaboration between Ops, Reading and Editing to fix this bug ;) [01:33:50] hehe [01:34:08] does core special:upload still work? [01:34:31] I've typoed uplaod 3 times now >.> [01:35:06] it's working on testwiki [01:35:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [01:35:15] anywhere else we let people upload stuff? [01:35:46] not as far as I know... [01:35:54] office? [01:36:23] that would be UW and/or special:upload [01:37:54] https://commons.wikimedia.org/wiki/Commons:Village_pump#File_size_problem_when_uploading_from_mobile_phone left a note there [01:37:59] uploading is part of "editing" ? [01:38:24] yup. editing has the new multimedia team [01:39:07] legoktm is in editing (flow) and I'm in reading (infrastructure) [01:39:22] legoktm: perfect @ VillagePump [01:39:53] imma go eat dinner [01:39:54] bd808: *nod* ok [01:40:06] bd808: enjoy! 
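The hotfix merged above simply pins the value in configuration. A slightly more defensive variant, sketched here for illustration and not what Gerrit change 210636 actually does, would keep trusting the ini values when they look sane and only fall back to the hard-coded 100 MB when the runtime reports nothing:

    // Sketch of a guarded fallback for wmf-config/CommonSettings.php.
    // wfShorthandToInteger() is available there because MediaWiki is loaded;
    // 104857600 matches the 100M limit discussed above.
    $maxUpload = min(
        wfShorthandToInteger( ini_get( 'upload_max_filesize' ) ),
        wfShorthandToInteger( ini_get( 'post_max_size' ) )
    );
    if ( $maxUpload <= 0 ) {
        // HHVM 3.6 over FastCGI returned '' for these ini keys (T98933, upstream
        // facebook/hhvm#4993), which wfShorthandToInteger() turns into -1.
        $maxUpload = 104857600; // 100 MB
    }
    $wgUploadWizardConfig['maxPhpUploadSize'] = $maxUpload;

Either way this only needs to live in CommonSettings.php until the upstream ini_get() regression is fixed.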
[01:41:10] o/ [01:48:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [01:56:07] (03PS1) 10Dzahn: nagios plugin checks if wikitech-static is in sync [puppet] - 10https://gerrit.wikimedia.org/r/210637 (https://phabricator.wikimedia.org/T89323) [01:56:38] !log Started 'jobs' screen in tin to drain refreshLinks for enwiki using --nothrottle (T98621) [01:56:48] Logged the message, Master [01:58:51] (03PS2) 10Dzahn: nagios plugin checks if wikitech-static is in sync [puppet] - 10https://gerrit.wikimedia.org/r/210637 (https://phabricator.wikimedia.org/T89323) [02:02:43] (03CR) 10Dzahn: [C: 032] nagios plugin checks if wikitech-static is in sync [puppet] - 10https://gerrit.wikimedia.org/r/210637 (https://phabricator.wikimedia.org/T89323) (owner: 10Dzahn) [02:12:12] (03PS1) 10Dzahn: wikitech: add monitoring::service for static sync [puppet] - 10https://gerrit.wikimedia.org/r/210638 (https://phabricator.wikimedia.org/T89323) [02:12:55] (03CR) 10jenkins-bot: [V: 04-1] wikitech: add monitoring::service for static sync [puppet] - 10https://gerrit.wikimedia.org/r/210638 (https://phabricator.wikimedia.org/T89323) (owner: 10Dzahn) [02:17:03] (03PS2) 10Dzahn: wikitech: add monitoring::service for static sync [puppet] - 10https://gerrit.wikimedia.org/r/210638 (https://phabricator.wikimedia.org/T89323) [02:22:35] 6operations, 7Monitoring, 5Patch-For-Review: Monitor the up-to-date status of wikitech-static - https://phabricator.wikimedia.org/T89323#1281445 (10Dzahn) a:3Dzahn [02:26:10] hi all, i wanna contribute to improve wikimedia server softwares or services.. [02:26:17] how can i start? [02:26:22] or where ? [02:27:37] Nephil: hey [02:28:24] Nephil: so we use puppet for configuration management of the servers and you can git clone the puppet code [02:29:40] hi mutante, that sounds bit technical [02:30:00] can u show me how can i start [02:30:06] Nephil: check this out https://wikitech.wikimedia.org/wiki/Get_involved [02:30:40] yes actually i am here following that link [02:31:06] oh what is the git url for the puppet code? [02:31:11] Nephil: you could start by looking at the code review tool and what others are uploading [02:31:19] Nephil: that would be gerrit.wikimedia.org [02:31:35] ok [02:32:09] is it similar to github ? [02:32:52] Nephil: gerrit is a tool for code review, there are many projects on it. the puppet code is one of them [02:32:59] to get the puppet code git clone https://gerrit.wikimedia.org/r/operations/puppet [02:33:37] that link says not found [02:34:56] Nephil: uhm.. it works for me [02:35:14] also see https://en.wikipedia.org/wiki/Gerrit_%28software%29 [02:35:45] Nephil: that last link was not meant to be opened in a brower [02:36:11] so how can i open that? [02:36:31] using the git command [02:36:34] git clone https://gerrit.wikimedia.org/r/operations/puppet [02:37:08] you can also look at it on the web here: [02:37:32] oh i think i have to login to first? [02:37:39] https://phabricator.wikimedia.org/diffusion/OPUP/ [02:38:20] sorry.. so what is this puppet is for? [02:38:36] is it the core system for wikimedia? [02:39:01] pls, can u explain just basic [02:39:12] as i m not quite familiar with them [02:39:15] !log l10nupdate Synchronized php-1.26wmf4/cache/l10n: (no message) (duration: 10m 08s) [02:39:27] Logged the message, Master [02:39:30] thouht i have been editing Wikipedia a lot [02:39:48] puppet is a system to manage all the config files of the servers [02:40:13] i want to move slowly into servers and stuffs.. 
as i like programming very much [02:41:14] what kind of config files? [02:41:41] is it like system files of OS? [02:42:28] also, what language do u prefer to learn [02:42:28] Nephil: it's more like letting a tool use regedit [02:42:42] m currently using PHP for my work [02:42:48] https://en.wikipedia.org/wiki/Puppet_%28software%29 [02:43:43] Puppet is written in Ruby [02:44:50] Can i commit and fork ? here https://github.com/puppetlabs [02:46:31] !log LocalisationUpdate completed (1.26wmf4) at 2015-05-13 02:45:28+00:00 [02:46:38] Logged the message, Master [02:46:55] Nephil: well, if you wanted to fix something in puppet the software itself, but mostly you would start by using it [02:47:32] Nephil: there is also a copy of the wikimedia puppet code on github https://github.com/wikimedia/operations-puppet gotta go though [02:47:54] ok thanks [02:48:24] Nephil: also: http://www.catb.org/~esr/faqs/hacker-howto.html#basic_skills cya around [02:48:34] ok [02:48:41] thanks [02:50:25] PROBLEM - puppet last run on mw2015 is CRITICAL Puppet has 1 failures [02:58:55] (03CR) 10GWicke: "This might have broken the VE stats:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210394 (owner: 10Ori.livneh) [03:06:26] RECOVERY - puppet last run on mw2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [03:07:05] !log l10nupdate Synchronized php-1.26wmf5/cache/l10n: (no message) (duration: 05m 43s) [03:07:20] Logged the message, Master [03:11:35] !log LocalisationUpdate completed (1.26wmf5) at 2015-05-13 03:10:31+00:00 [03:11:40] Logged the message, Master [04:09:41] (03PS1) 10Ori.livneh: Labs: set $wgMessageCacheType to CACHE_ACCEL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210640 [04:09:45] (03CR) 10jenkins-bot: [V: 04-1] Labs: set $wgMessageCacheType to CACHE_ACCEL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210640 (owner: 10Ori.livneh) [04:09:53] (03PS2) 10Ori.livneh: Labs: set $wgMessageCacheType to CACHE_ACCEL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210640 [04:10:04] (03CR) 10Ori.livneh: [C: 032] "Labs-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210640 (owner: 10Ori.livneh) [04:10:10] (03Merged) 10jenkins-bot: Labs: set $wgMessageCacheType to CACHE_ACCEL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210640 (owner: 10Ori.livneh) [04:23:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [04:29:55] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60587 bytes in 0.234 second response time [04:40:03] (03PS17) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [04:40:19] (03CR) 10jenkins-bot: [V: 04-1] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [04:40:38] (03PS2) 10Ori.livneh: Revert "contint: Move jenkins/tmpfs from slave::labs to slave::labs::common" [puppet] - 10https://gerrit.wikimedia.org/r/206863 (owner: 10Krinkle) [04:40:57] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "contint: Move jenkins/tmpfs from slave::labs to slave::labs::common" [puppet] - 10https://gerrit.wikimedia.org/r/206863 (owner: 10Krinkle) [04:48:25] (03PS1) 10Springle: depool db1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210643 [04:49:17] (03CR) 10Springle: [C: 032] depool db1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210643 (owner: 10Springle) [04:49:23] 
(03Merged) 10jenkins-bot: depool db1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210643 (owner: 10Springle) [04:50:19] !log springle Synchronized wmf-config/db-eqiad.php: depool db1018 (duration: 00m 12s) [04:50:28] Logged the message, Master [04:56:47] PROBLEM - puppet last run on db2030 is CRITICAL puppet fail [04:57:46] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [05:14:26] RECOVERY - puppet last run on db2030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [05:14:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [05:19:17] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60589 bytes in 0.084 second response time [05:32:53] (03PS18) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [05:34:35] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [05:45:06] (03PS19) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [05:45:39] (03PS1) 10Springle: Upgrade db1018 to trusty and MariaDB 10 [puppet] - 10https://gerrit.wikimedia.org/r/210646 [05:46:24] (03PS20) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [05:48:33] (03CR) 10Springle: [C: 032] Upgrade db1018 to trusty and MariaDB 10 [puppet] - 10https://gerrit.wikimedia.org/r/210646 (owner: 10Springle) [05:53:22] !log reinstall db1018 [05:53:31] Logged the message, Master [06:14:38] (03PS21) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [06:14:51] (03CR) 10jenkins-bot: [V: 04-1] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [06:16:47] <_joe_> yuvipanda: I forgot to submit my review to that, sorry [06:16:58] :) please do! 
[06:17:02] <_joe_> lemme see the new version [06:21:02] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed May 13 06:19:59 UTC 2015 (duration 19m 58s) [06:21:10] Logged the message, Master [06:27:09] (03PS22) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [06:27:47] (03PS19) 10Paladox: Adding task support instead of using Bug: which was for bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/209741 [06:27:49] (03CR) 10jenkins-bot: [V: 04-1] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [06:28:04] (03CR) 10Paladox: Adding task support instead of using Bug: which was for bugzilla (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/209741 (owner: 10Paladox) [06:28:26] (03PS23) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [06:29:17] PROBLEM - puppet last run on mw2056 is CRITICAL puppet fail [06:29:57] PROBLEM - puppet last run on mw1177 is CRITICAL puppet fail [06:30:08] PROBLEM - puppet last run on cp3037 is CRITICAL Puppet has 1 failures [06:30:28] (03PS24) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [06:31:06] PROBLEM - puppet last run on cp4014 is CRITICAL Puppet has 1 failures [06:31:47] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures [06:31:50] (03PS25) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [06:32:08] PROBLEM - puppet last run on db1051 is CRITICAL Puppet has 1 failures [06:33:07] PROBLEM - puppet last run on ms-fe2001 is CRITICAL Puppet has 1 failures [06:33:07] PROBLEM - puppet last run on wtp2012 is CRITICAL Puppet has 1 failures [06:33:10] (03CR) 10Yuvipanda: Initial commit (0314 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [06:33:57] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures [06:34:06] PROBLEM - puppet last run on mw2093 is CRITICAL Puppet has 1 failures [06:34:06] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures [06:34:17] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures [06:34:57] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 1 failures [06:35:13] (03CR) 10Yuvipanda: "@valhallasw most of your comments have been addressed, I think. And webservice-new and webservice-runner share almost no code now, except " [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [06:35:17] PROBLEM - puppet last run on mw2003 is CRITICAL Puppet has 1 failures [06:35:18] PROBLEM - puppet last run on mw2096 is CRITICAL Puppet has 1 failures [06:35:30] so with taht I go home [06:35:40] oh, I’ve to come to the office tomorrow anyway, heh. 
[06:35:49] * yuvipanda waves [06:40:28] yuvipanda: hello [06:45:24] (03PS2) 10Giuseppe Lavagetto: monitoring: add proper way to check systemd units [puppet] - 10https://gerrit.wikimedia.org/r/210396 [06:45:33] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] monitoring: add proper way to check systemd units [puppet] - 10https://gerrit.wikimedia.org/r/210396 (owner: 10Giuseppe Lavagetto) [06:46:17] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:17] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:47:03] (03CR) 10Aklapper: [C: 04-1] "What QChris wrote -> -1." [puppet] - 10https://gerrit.wikimedia.org/r/209741 (owner: 10Paladox) [06:47:08] RECOVERY - puppet last run on cp4014 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:51:04] (03PS1) 10Giuseppe Lavagetto: monitoring: install libraries for nrpe systemd scripts [puppet] - 10https://gerrit.wikimedia.org/r/210650 [07:04:28] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 810.402515811 [07:05:57] RECOVERY - puppet last run on db1051 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [07:06:56] RECOVERY - puppet last run on ms-fe2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:06:57] RECOVERY - puppet last run on wtp2012 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:07:07] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [07:07:28] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:07:37] RECOVERY - puppet last run on mw2096 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:07:38] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:07:48] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:07:48] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:07:48] RECOVERY - puppet last run on mw2056 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [07:08:07] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:08:26] RECOVERY - puppet last run on mw1177 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [07:11:22] (03CR) 10Giuseppe Lavagetto: [C: 032] monitoring: install libraries for nrpe systemd scripts [puppet] - 10https://gerrit.wikimedia.org/r/210650 (owner: 10Giuseppe Lavagetto) [07:17:57] (03PS2) 10Giuseppe Lavagetto: hiera: use the proxy backend, rationalize the hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/207129 [07:31:16] PROBLEM - RAID on es2010 is CRITICAL 1 failed LD(s) (Degraded) [08:08:57] PROBLEM - puppet last run on cp3014 is CRITICAL puppet fail [08:23:57] PROBLEM - puppet last run on mw2106 is CRITICAL puppet fail [08:26:48] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [08:39:35] is there a syntax for a commit to show automatically on phabricator? 
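The question just above is answered a little further down with a pointer to the Gerrit commit message guidelines; the piece that makes a commit show up on a Phabricator task is a footer line of the form "Bug: Txxxxx" at the bottom of the commit message. A minimal example, with a made-up task number:

    Short one-line summary of the change

    Optional longer explanation of what changed and why.

    Bug: T12345

Gerrit's Change-Id footer is added automatically by the commit-msg hook and sits alongside it.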
[08:41:06] 6operations, 6WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1281694 (10Qgil) (Just for reference) >>! In T98722#1281026, @Dzahn wrote: >>>! In T98722#1278472, @Qgil wrote:> >>Someone with permissions could update the description of #WMF-NDA and add @ZhouZ. >... [08:41:47] RECOVERY - puppet last run on mw2106 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [08:48:05] 6operations, 6Engineering-Community, 3ECT-May-2015: date/budget proposal for 2015 Ops Offsite - https://phabricator.wikimedia.org/T89023#1281714 (10Qgil) [08:50:06] PROBLEM - puppet last run on mw1003 is CRITICAL Puppet has 1 failures [08:53:07] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1281721 (10Nemo_bis) The dump will be almost totally useless without full user IDs (i.e. email address) for votes, subscribers and reports, as that's the main public data being lost i... [08:55:22] jynus: yes https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines [08:56:13] Thank you, Nemo_bis [09:06:26] RECOVERY - puppet last run on mw1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [09:06:33] 6operations: Degraded RAID-1 arrays on new logstash hosts: [UU__] - https://phabricator.wikimedia.org/T98620#1281729 (10faidon) What's the status of this? RAID is currently broken on all three hosts, this is something that needs to be fixed as soon as possible. [09:20:35] !log inserting FDC election encryption key [09:20:40] decryption too [09:20:43] Logged the message, Master [09:20:44] but close enough [09:26:56] (03PS1) 10Jcrespo: Adding a new MariaDB master node, part of a new shard (m5) for miscelaneous services in labs Bug: T92693 [puppet] - 10https://gerrit.wikimedia.org/r/210660 (https://phabricator.wikimedia.org/T92693) [09:35:37] (03CR) 10Hashar: "Couple nitpicks. Might want to allow more process in parallels." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) (owner: 10Thcipriani) [09:42:51] going to restart Jenkins [09:58:41] 6operations, 6Labs, 10Labs-Infrastructure: Migrate Labs NFS storage from RAID6 to RAID10 - https://phabricator.wikimedia.org/T96063#1281802 (10mark) This seems reasonable yes. Let's move ahead with this after the new backups have finished (codfw + eqiad). [10:29:03] http://status.wikimedia.org/ [10:29:04] wikis down [10:29:07] hashar [10:29:11] apergos [10:30:28] here [10:30:30] 6operations: Wikis down - https://phabricator.wikimedia.org/T98952#1281888 (10Steinsplitter) 3NEW [10:31:48] 6operations: Wikis down - https://phabricator.wikimedia.org/T98952#1281897 (10Steinsplitter) [10:34:19] confirmed, catchpoint also reported problems [10:35:14] yeah [10:35:43] page eventually loads e.g. fi wp but very very slowly [10:37:57] PROBLEM - puppet last run on maerlant is CRITICAL Puppet has 1 failures [10:41:37] now responding normally to me [10:41:48] seems recovered here too, Steinsplitter ? [10:41:54] same here [10:42:18] yes, recovered now [10:42:57] looks like it was esams unhappy for a little [10:52:38] RECOVERY - puppet last run on maerlant is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [10:54:38] 6operations: Wikis down - https://phabricator.wikimedia.org/T98952#1281923 (10fgiunchedi) confirmed, looks like temporary network disruption at esams, now recovered. 
will keep an eye and depool if we see it again [10:56:54] FYI ams-ix just had a major crash so you'll probably get some people complaining that sites were not reachable in Europe (cc apergos) [10:57:54] multichill: already been there and had that [10:58:02] got a linky? [10:58:18] only mildly ironic that RIPE conference is on now [10:58:24] heh nice isn't it [10:58:27] Strangly enough I haven't seen an ams-ix ticket [10:58:39] if you do, can you ping me with a link or toss the info on uh [10:58:50] - https://phabricator.wikimedia.org/T98952 ? [10:58:54] might as well have the record [10:59:18] apergos: Check the graph on http://tweakers.net/nieuws/103067/internetknooppunt-ams-ix-kampt-met-uitval.html (their own site is dead slow atm) [10:59:47] owie! [11:00:15] 6operations: Wikis down - https://phabricator.wikimedia.org/T98952#1281928 (10Multichill) http://tweakers.net/nieuws/103067/internetknooppunt-ams-ix-kampt-met-uitval.html <- In Dutch. Ams-ix was gone. No Ams-ix ticket yet. [11:00:26] thanks! [11:00:31] thanks multichill [11:00:37] Working from home and of course my connection is over ams-ix [11:00:50] duh of course it would be [11:01:07] tech-l just exploded :P [11:01:29] well they can unexplode themselves then [11:01:49] Most seems to be restored [11:02:01] All those old Cisco routers are probably still catching up [11:02:23] poor things [11:03:24] godog: Someone's presentation went wrong? ;-) [11:03:53] hehe I'm picturing many pagers going off in the same room now [11:03:55] I do remember that on one RIPE at some point they switched the wireless to ipv6 only [11:04:50] * godog lunch, brb [11:05:03] Seems about time for that [11:08:37] 6operations: Wikis down - https://phabricator.wikimedia.org/T98952#1281937 (10Multichill) FYI https://twitter.com/AMS_IX/status/598440987552305152 [11:15:39] 6operations: Activate OAuth Server Application in Phabricator for phragile login - https://phabricator.wikimedia.org/T98954#1281938 (10Abraham) 3NEW [11:41:13] 6operations: Wikis down - https://phabricator.wikimedia.org/T98952#1281969 (10faidon) 5Open>3Resolved a:3faidon This was indeed caused by a brief AMS-IX outage (AMS-IX ticket #176289, see below). This affected us in the following ways: - There would have been traffic disruption between the time the loop wa... [11:43:55] multichill: how did the ipv6-only wifi worked out? [11:46:10] Not good, most people didn't have ipv6 [11:47:26] multichill: at least decent reporting [11:47:42] "we f'cked up" :) [11:47:59] single engineer point of failure [11:51:09] I still see quite a few sessions down, but eveerything seems to be stable [11:52:18] Our Juniper routers had to really work for it. I bet a lot of old Cisco 7600 routers out there crashed because of this :P [11:53:11] yeah, it's just the long tail of recovery [11:53:50] i see some overall slowness still though (probably due to the 'noise') [11:56:00] 10Ops-Access-Requests, 6operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1281978 (10ArielGlenn) So it looks like we have around 980 domains, x 2 if we want https for them all. We've been looking for a way to add them that doesn't require going through the web browser one... 
[12:11:54] 6operations: Wikis down - https://phabricator.wikimedia.org/T98952#1281992 (10Krinkle) http://stats.ams-ix.net/ {F164340} [12:23:13] (03CR) 10Springle: [C: 031] Adding a new MariaDB master node, part of a new shard (m5) for miscelaneous services in labs Bug: T92693 [puppet] - 10https://gerrit.wikimedia.org/r/210660 (https://phabricator.wikimedia.org/T92693) (owner: 10Jcrespo) [12:28:44] (03PS1) 10Springle: depool db1060 for thread pool reconfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210676 [12:29:11] (03CR) 10Springle: [C: 032] depool db1060 for thread pool reconfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210676 (owner: 10Springle) [12:31:21] (03Merged) 10jenkins-bot: depool db1060 for thread pool reconfig [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210676 (owner: 10Springle) [12:39:20] !log upgrade and restart db1060 [12:39:25] Logged the message, Master [12:45:03] !log xtrabackup clone db1060 to db1018 [12:45:08] Logged the message, Master [12:47:56] (03PS1) 10Glaisher: Prevent indexing of User: namespace on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210680 (https://phabricator.wikimedia.org/T98926) [13:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150513T1300). [13:03:57] * apergos peeks in [13:04:15] just in case it's a deployment that uses git deploy... not though, it's scap based right? [13:10:36] apergos: just config and running maintenance scripts [13:10:46] ok [13:10:47] and some backports [13:11:07] so, scap-based [13:11:11] yeah, not something I need to pay attention to more than usual [13:11:12] thanks [13:11:39] I'm here as clinic duty person but not as 'omg maybe trebuchet is broken' person :-) [13:11:45] ps. when you get a chance, can you look at https://gerrit.wikimedia.org/r/#/c/210072/ [13:11:52] and https://gerrit.wikimedia.org/r/#/c/210081/ [13:12:13] yeah saw that in my queue [13:12:16] ok [13:12:26] not exactly urgent but would be nice to have soonish [13:12:45] if what i did is sane [13:13:58] 6operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 5Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1282078 (10Nemo_bis) Now en.wiki is merely at 21 millions. According to https://wikiapiary.com/wiki/Wikipedia_%28en%29 , it started dropping t... 
[13:14:53] only thing I' say is that's quite a long description, if you can trim it down a bit that would be nice [13:15:50] i can do that [13:20:44] (03PS2) 10Aude: Add wbc_entity_usage table to xml dumps [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) [13:20:46] (03PS2) 10Aude: Add wb_changes_subscription table to xml dumps [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210072 (https://phabricator.wikimedia.org/T98742) [13:21:06] PROBLEM - puppet last run on es2006 is CRITICAL puppet fail [13:22:05] that one line geets displayed above the ile name on the index page so it's nice if it isn't so long it wraps around [13:23:25] (03PS3) 10Krinkle: beta: Add script from Jenkins beta-update-databases [puppet] - 10https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) (owner: 10Thcipriani) [13:23:31] (03CR) 10Aude: [C: 04-1] "has dependency on config change" [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) (owner: 10Aude) [13:24:07] *file name [13:24:08] geez [13:24:56] (03PS3) 10Aude: Add wbc_entity_usage table to xml dumps [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) [13:25:09] made them shorter [13:25:22] saw, thanks [13:25:38] the second patch depends on a config change, which i prefer to give hoo a chance to see [13:25:43] okey dokey [13:26:04] I'll put them through all at once [13:26:09] ok thanks [13:26:17] (03CR) 10Aude: "depends on config change" [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) (owner: 10Aude) [13:26:18] do you think he'll look at them today? If so I can do them tomorrow morning [13:26:41] i can ask him, if he's around [13:26:55] ok [13:27:02] otherwise, imho i think the config change is okay [13:29:20] is there a wikidataclientlist dblist file ? [13:31:00] ah there is, goo [13:31:10] as long as all the clients in that list have the table, we're good [13:31:20] if not, then they need to have. [13:31:40] aude: [13:33:39] there is wikidataclientlist but they don't all have the table [13:33:57] one option instead would be just to create it everywhere, even if not used yet [13:34:13] might be more simple [13:34:48] * aude will look at that after this deploy [13:37:17] RECOVERY - puppet last run on es2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:41:16] (03Abandoned) 10Aude: Add dblist for wikidatausagetracking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210080 (owner: 10Aude) [13:45:15] !log aude Synchronized php-1.26wmf5/extensions/Wikidata: Update maintenance script (duration: 00m 20s) [13:45:23] Logged the message, Master [13:52:06] If nagios starts to cry for db1009, that is my fault: https://phabricator.wikimedia.org/T98958#1282124 [13:54:00] ^I also should have used the bot for editing that [13:54:16] (03PS1) 10Aude: Enable usage tracking on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210687 [13:54:36] jynus: just copy paste with the !log prefix ? :D [13:54:56] yeah, I am commenting things aloud :-) [13:55:12] 6operations, 6Phabricator: Activate OAuth Server Application in Phabricator for phragile login - https://phabricator.wikimedia.org/T98954#1282126 (10Qgil) [13:55:28] jynus: when are you doing that? 
[13:55:52] hopefuly today [13:55:55] ok [13:56:10] It is non-production, so it shouln't be a problem [13:56:17] * aude just added new tables for a bunch of wikis [13:56:21] sounds ok [13:56:46] !log jcrespo Disabling puppet agent in db1009.eqiad in preparation for reinstall [13:56:47] can't possibly matter anyway with a reinstall of a server [13:56:52] Logged the message, Master [13:57:24] aude, does everybody have icinga web rights? [13:57:33] !log added wbc_entity_usage table on all Wikibase Client wikis [13:57:38] Logged the message, Master [13:57:53] jynus: i think if you are in the nda group, then you have icinga access [13:59:24] I think icinga disagrees: db1009 N/A Not Authorized :-) [14:00:04] chasemp: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150513T1400). Please do the needful. [14:00:24] "do the needful" :-) [14:03:16] (03CR) 10Aude: [C: 032] Enable usage tracking on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210687 (owner: 10Aude) [14:03:24] (03Merged) 10jenkins-bot: Enable usage tracking on Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210687 (owner: 10Aude) [14:04:49] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable usage tracking for Wikisource (duration: 00m 14s) [14:04:54] Logged the message, Master [14:06:13] * aude is done [14:07:45] (03CR) 10Hashar: [C: 031] puppetmaster: remove extraneous empty line [puppet] - 10https://gerrit.wikimedia.org/r/209264 (owner: 10Alexandros Kosiaris) [14:10:41] (03CR) 10Hashar: [C: 031] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/207785 (https://phabricator.wikimedia.org/T87594) (owner: 10Filippo Giunchedi) [14:11:43] (03CR) 10Aude: "was slightly confused. this doesn't need the config change but instead needed the table added to all clients. that is done now, so this pa" [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210081 (https://phabricator.wikimedia.org/T98743) (owner: 10Aude) [14:14:47] aude, jynus: so the nda, wmf and ops groups give you basic access to icinga [14:14:59] but I think there's still icinga-internal access controls [14:16:45] thank you, Krenair [14:16:56] I will contact so else [14:18:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1282153 (10Krenair) > [x] icinga user and permissions (icinga commands, paging/notifications) Some issues encountered in T98958 Also, what's the difference between those logins a... [14:19:40] https://wikitech.wikimedia.org/wiki/Icinga - bah, pmtpa stuff here :( [14:19:43] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1282163 (10jcrespo) 5Resolved>3Open [14:20:17] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277284 (10jcrespo) a:5jcrespo>3Dzahn [14:21:03] jynus, on the other hand, looking at that page, it might be possible for you to schedule downtime via the cli on neon [14:21:54] yes, I saw it, trying to learn it- but it seems like a permanent solution via configuration [14:23:03] jynus: I'll look in puppet at fixing your icinga access [14:24:34] oh someone already did it, my checkout was out of date [14:25:00] it could be case-sensitivity. 
I forget which way that goes since my login is always cached, but whether you login with all-lowercase or mixed-case or whatever on the username [14:25:19] you get basic access either way, but the case stuff will break executing commands for downtime and such [14:25:41] I think since the config has you has "Jcrespo", you may have to use that exact case [14:25:49] oh, ok [14:25:51] bblack, is there a difference between login to icinga via ldap vs. accounts from puppet? [14:26:07] or is it just that login is via ldap, and permissions are configured in puppet? [14:26:13] yes, that [14:26:18] which [14:26:30] the latter: permissions to execute commands and such are in puppet based on your ldap login name [14:26:54] fun... so the login is case insensitive, but permissions are case sensitive? [14:27:00] yeah :) [14:27:05] that's silly [14:27:22] 10Ops-Access-Requests, 6operations: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1282181 (10hashar) 3NEW [14:27:39] yes, but it's about item #9,231 on the prioritized list of silly things to fix around here, so .... :) [14:27:53] wow, that solved it [14:28:13] (03PS1) 10Hashar: Add Jan Zerebecki to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/210692 (https://phabricator.wikimedia.org/T98961) [14:28:21] more important than solving it it is at least documenting it, I will [14:28:22] (03CR) 10Manybubbles: [C: 032 V: 032] "Today we deploy these. Time to merge and sync." [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/209541 (owner: 10Manybubbles) [14:28:50] thanks! [14:28:53] jynus, it sounds like there are a lot of things wrong with the docs on the icinga page [14:28:57] (03CR) 10Hashar: "Pending ops approval via T90275" [puppet] - 10https://gerrit.wikimedia.org/r/210692 (https://phabricator.wikimedia.org/T98961) (owner: 10Hashar) [14:30:39] (03CR) 10Alexandros Kosiaris: [C: 032] puppetmaster: Remove the package installed site [puppet] - 10https://gerrit.wikimedia.org/r/209259 (owner: 10Alexandros Kosiaris) [14:31:29] jynus, to be honest, there are a lot of things wrong with docs on wikitech in general [14:31:54] new hire req: Ops Documentation Engineer :) [14:32:03] heh [14:32:31] but, yeah, never assume wikitech has up to date info on anything, sadly. [14:32:44] it's often somewhat useful in spite of that, though [14:32:44] new workflow specification: docs are now required to get a patch merged [14:32:52] !log syncing new versions of elsaticsearch plugins to prod. no restarts yet. [14:32:58] Logged the message, Master [14:33:25] good luck with that :) [14:33:43] (03CR) 10Dzahn: [C: 031] Add Jan Zerebecki to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/210692 (https://phabricator.wikimedia.org/T98961) (owner: 10Hashar) [14:34:02] hashar, yeah... the pmtpa migration would have been so much more fun with that requirement :) [14:34:06] hashar: this is how elasticsearch works. the docs are _better_ than most open source projects but not great [14:34:14] a lot of the docs are outdated at least because of that [14:34:54] I see 7 references to /home/wikipedia and one to /home/w on that icinga page [14:35:15] I don't think that is used now.. [14:35:31] !log first attempt at syncing elasticsearch plugins didn't work 100%. syncing again. 
gitfit/gitdeploy is betraying me [14:35:37] Logged the message, Master [14:35:52] manybubbles: \o/ also lol at gitfit [14:36:11] !log s/gitfit/gitfat/ oh well [14:36:16] Logged the message, Master [14:36:47] well to put in numerical perspective, the main ops/puppet repo has had 1214 commits merged since roughly April 1. (then there's DNS and puppet submodules and such) [14:37:10] changes to wikitech in that timeframe that aren't standard nova-resource stuff or deployment updates? I don't know, but probably well under 100. [14:38:06] manybubbles, oh, git deploy? [14:38:25] Krenair: yeah - git deploy + git fat usually takes two or three tries to resolve [14:38:27] I think apergos was saying something about that earlier [14:38:31] okay... [14:38:47] its a sad thing but no one care enough to fix it I think [14:38:56] well not necessarily [14:39:13] so manybubbles, what sort of errors out did you see? [14:39:34] did some of them sync and others not? [14:39:45] apergos: you know how it creates these stub files that are then unstubed? some don't unstub [14:39:54] ok [14:39:57] but if you do it twice or three times it gets them all [14:40:10] ah so it is actually doing the work correctly eventually [14:40:28] if there's a ticket for that please point me at it so I can add myself [14:40:54] its been so long - let me see if I can find it [14:45:46] PROBLEM - puppet last run on uranium is CRITICAL Puppet last ran 1 day ago [14:47:19] 6operations: git fat/git deploy doesn't always unstub files - https://phabricator.wikimedia.org/T98962#1282203 (10Manybubbles) 3NEW [14:47:26] RECOVERY - puppet last run on uranium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:47:28] apergos: https://phabricator.wikimedia.org/T98962 filed a new one [14:48:02] !log ok - time to start the rolling restart. I'm going to to elastic1001 first non-automated and watch it [14:48:10] (03PS1) 10BBlack: Revert abusive-search block from 51e15f3be [puppet] - 10https://gerrit.wikimedia.org/r/210695 [14:48:10] Logged the message, Master [14:48:12] (03PS1) 10BBlack: remove temporary intel-microcode purge [puppet] - 10https://gerrit.wikimedia.org/r/210696 [14:48:38] (03CR) 10BBlack: [C: 032 V: 032] Revert abusive-search block from 51e15f3be [puppet] - 10https://gerrit.wikimedia.org/r/210695 (owner: 10BBlack) [14:48:50] (03CR) 10BBlack: [C: 032 V: 032] remove temporary intel-microcode purge [puppet] - 10https://gerrit.wikimedia.org/r/210696 (owner: 10BBlack) [14:48:54] (03CR) 10Alexandros Kosiaris: [C: 032] "Ran in catalog compiler against the entire fleet with a noop." [puppet] - 10https://gerrit.wikimedia.org/r/208630 (owner: 10Alexandros Kosiaris) [14:49:01] (03PS4) 10Alexandros Kosiaris: hieraize nrpe [puppet] - 10https://gerrit.wikimedia.org/r/208630 [14:49:45] 6operations: git fat/git deploy doesn't always unstub files - https://phabricator.wikimedia.org/T98962#1282219 (10ArielGlenn) [14:50:39] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1282220 (10jcrespo) 5Open>3Resolved [14:50:52] (03CR) 10Alexandros Kosiaris: [C: 032] hieraize nrpe [puppet] - 10https://gerrit.wikimedia.org/r/208630 (owner: 10Alexandros Kosiaris) [14:50:54] (03CR) 10Andrew Bogott: "This is great!" 
[puppet] - 10https://gerrit.wikimedia.org/r/210637 (https://phabricator.wikimedia.org/T89323) (owner: 10Dzahn) [14:51:04] * anomie sees nothing for SWAT this morning [14:51:27] jouncebot: next [14:51:27] In 0 hour(s) and 8 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150513T1500) [14:51:54] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1277284 (10jcrespo) The permissions issue was authenticating on Icinga as jcrespo instead of Jcrespo. We should document this issue. [14:52:05] (03PS3) 10Andrew Bogott: wikitech: add monitoring::service for static sync [puppet] - 10https://gerrit.wikimedia.org/r/210638 (https://phabricator.wikimedia.org/T89323) (owner: 10Dzahn) [14:52:06] well there's still 8 minutes to go. don't the last-minute SWAT additions usually come in the final 3 minutes? :) [14:53:24] (03CR) 10Andrew Bogott: [C: 032] wikitech: add monitoring::service for static sync [puppet] - 10https://gerrit.wikimedia.org/r/210638 (https://phabricator.wikimedia.org/T89323) (owner: 10Dzahn) [14:53:40] It means I'm not going to worry about doing SWAT or making sure someone else is doing it. [14:53:47] !log elasticsearch restart on elastic1001 going well. cluster still in recovering state as expect. I'll give it an hour to soak. [14:53:52] Logged the message, Master [14:54:15] 6operations, 10Deployment-Systems: git fat/git deploy doesn't always unstub files [Trebuchet] - https://phabricator.wikimedia.org/T98962#1282225 (10ArielGlenn) [14:55:47] jynus: you ran into the same issue many of us ran into. indeed the LDAP auth is not case-sensitive but the Icinga permissions are, so you can be logged in and still not be able to send commands .. [14:55:55] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1282228 (10ArielGlenn) it has been, and at one point I normalized all names so that everyone used lower case and knew to use it. Now everyone does what they like... should fix aut... [14:55:59] happened to me just like that, Dzahn vs. dzahn [14:56:21] anomie, you could debug and fix nostalgiawiki if you like :p [14:57:02] 6operations, 10Deployment-Systems: git fat/git deploy doesn't always unstub files [Trebuchet] - https://phabricator.wikimedia.org/T98962#1282232 (10Manybubbles) [14:57:17] nostalgiawiki??? [14:57:36] bblack: yes, nostalgia.wikipedia.org is the original skin [14:57:50] meh, and that got deleted?? [14:58:24] no, I think there is some issues loading the skin [14:58:44] 6operations, 7Monitoring, 5Patch-For-Review: Monitor the up-to-date status of wikitech-static - https://phabricator.wikimedia.org/T89323#1282235 (10Andrew) I guess now we should break it on purpose to test the test? [14:58:46] it says the skin is there but disabled [14:59:04] https://phabricator.wikimedia.org/T98956 [14:59:24] I guess I'm less ??? about the skin and more about wtf nostalgiawiki would be heh [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150513T1500). Please do the needful. [15:00:24] Still nothing for SWAT. Yay, easy! [15:00:44] Huzzah! [15:01:43] bblack: seen this before? 
http://sep11.wikipedia.org/ [15:02:01] bblack: nostalgia wiki is like frozen at end of 2001 [15:02:41] https://en.wikipedia.org/wiki/Wikipedia:Nostalgia [15:04:12] <^d> Aww, poor nostalgia [15:04:15] <^d> What's up with it? [15:04:23] ^d, see https://phabricator.wikimedia.org/T98956 [15:04:28] skin not loaded properly [15:04:39] I fiddled around with it on tin earlier and couldn't really figure out why [15:05:03] be back in a little [15:05:09] <^d> Wonder when it broke [15:05:11] <^d> It worked before [15:05:25] could be related to the URL path changes for static assets? [15:05:29] Probably when legoktm made it use extension registration at the end of last month [15:05:36] bblack, I don't think so [15:05:42] this is a server-side thing [15:05:47] hmmm [15:06:13] <^d> require_once "$IP/skins/Nostalgia/Nostalgia.php"; [15:06:18] <^d> Does it still use that entry? [15:07:46] ^d, yep [15:07:55] no wfLoadSkin calls in wm conf yet [15:08:11] also I noticed the entry still shows up in wgMessagesDirs [15:12:10] 6operations: Decommission virt1001-1009 - https://phabricator.wikimedia.org/T98376#1282245 (10Andrew) p:5Triage>3Normal [15:12:28] 6operations: Fix all .erb variable warnings - https://phabricator.wikimedia.org/T97251#1282248 (10Andrew) p:5Triage>3Low [15:13:10] <^d> Krenair: Nostalgia.php has wfLoadSkin() call and $wgMessageDirs setting [15:13:47] yeah [15:13:49] <^d> Ahhh, is it that case-sensitive bug? [15:15:08] !log demon Synchronized wmf-config/InitialiseSettings.php: trying something (duration: 00m 12s) [15:15:21] Logged the message, Master [15:15:21] <^d> Nope [15:15:41] > var_dump( ExtensionRegistry::getInstance()->isLoaded( 'Vector' ) ); [15:15:41] bool(true) [15:15:45] > var_dump( ExtensionRegistry::getInstance()->isLoaded( 'Nostalgia' ) ); [15:15:45] bool(false) [15:17:49] !log demon Synchronized wmf-config/InitialiseSettings.php: didn't work, undoing previous sync (duration: 00m 12s) [15:17:56] Logged the message, Master [15:19:29] <^d> Ah, hmm [15:19:30] ^d, it shows up in wgValidSkinNames if you wfLoadSkin( 'Nostalgia' ); ExtensionRegistry::getInstance()->loadFromQueue(); [15:21:07] PROBLEM - puppet last run on mw2177 is CRITICAL puppet fail [15:21:21] !log I think the elasticsearch cluster got stuck with alloation disabled after the rolling restart. Funky. Haven't seen that one before. Probably a problem with our instructions. Anyway, unstuck it and recovery is going faster now [15:21:26] Logged the message, Master [15:22:49] <^d> Dur wha? [15:22:58] <^d> primary-only, or all allocation disabled? [15:23:54] mhhh sounds similar to the issue I came across last rolling restart? [15:25:13] ^d: primary only [15:25:35] I just hit it with es-tool start-replication and it took [15:25:35] <^d> Hmm. [15:25:49] its like es-tool start-replication in the "script" didn't take the first time [15:26:49] ah [15:26:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [15:27:02] es-tool health returns a non-zero exit code when the cluster is in yellow state [15:27:24] <^d> Ah yeah, I thought we'd fixed that. [15:27:53] it looks like the script wants to wait until es-tool health returns a 0 exit code before it sets the replication [15:27:57] PROBLEM - puppet last run on mw2080 is CRITICAL Puppet has 1 failures [15:28:45] ^d: looks intentional. line 108 [15:29:12] <^d> Ah grrr, because I made the mistake of reusing code here [15:29:44] <^d> The instructions are wrong. Did you try es-tool restart-fast? 
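For anyone reconstructing what happened here: "start-replication" essentially flips the cluster-wide allocation setting back from primaries-only to all, after which the cluster is left to recover to green on its own. The sketch below talks to the Elasticsearch HTTP API directly instead of going through es-tool (whose internals may differ) and assumes an ES 1.x cluster answering on localhost:9200:

    <?php
    // Re-enable shard allocation and poll cluster health until green.
    // Assumes Elasticsearch 1.x reachable on localhost:9200; this is a sketch of
    // the idea, not a copy of es-tool.
    function esRequest( $method, $path, $body = null ) {
        $ch = curl_init( 'http://localhost:9200' . $path );
        curl_setopt( $ch, CURLOPT_CUSTOMREQUEST, $method );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        if ( $body !== null ) {
            curl_setopt( $ch, CURLOPT_POSTFIELDS, json_encode( $body ) );
        }
        $response = curl_exec( $ch );
        curl_close( $ch );
        return json_decode( $response, true );
    }

    // Roughly what "es-tool start-replication" does: allow all shard allocation again.
    var_dump( esRequest( 'PUT', '/_cluster/settings', array(
        'transient' => array( 'cluster.routing.allocation.enable' => 'all' ),
    ) ) );

    // Then wait until the cluster reports green, as the restart scripts do.
    do {
        sleep( 10 );
        $health = esRequest( 'GET', '/_cluster/health' );
        printf( "status=%s relocating=%d unassigned=%d\n",
            $health['status'],
            $health['relocating_shards'],
            $health['unassigned_shards'] );
    } while ( $health['status'] !== 'green' );

The stuck state described above is consistent with the wrapper waiting for a zero exit code from "es-tool health", which stays non-zero while the cluster is still yellow, before it ever gets to this step.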
[15:30:00] not yet, no [15:30:05] will read it [15:30:16] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [15:30:56] <^d> manybubbles: It's basically the same process, only uses cluster_health() instead of es_health() [15:31:08] <^d> es_health() isn't all that useful tbh [15:31:17] I see it - I'll do it the next time [15:31:24] once we're green [15:31:30] * ^d nods [15:31:40] I want to give this one an hour after green just to make sure nothing weird happens. [15:31:56] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [15:31:58] paranoia++ [15:39:16] RECOVERY - puppet last run on mw2177 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:42:17] PROBLEM - Router interfaces on cr2-knams is CRITICAL host 91.198.174.246, interfaces up: 53, down: 1, dormant: 0, excluded: 2, unused: 0BRxe-0/0/0: down - Core: csw2-knams:xe-2/1/1 (GBLX leg 1) {#14006} [10Gbps DF CWDM C61]BR [15:42:58] <^d> cr2-knams - known? [15:43:50] yes [15:44:06] !log Disregard cr2-knams:xe-0/0/0; we're working on it [15:44:14] Logged the message, Master [15:44:54] <^d> mark: ok just making sure [15:44:59] thanks :) [15:46:07] RECOVERY - puppet last run on mw2080 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:08] 6operations: Degraded RAID-1 arrays on new logstash hosts: [UU__] - https://phabricator.wikimedia.org/T98620#1282390 (10bd808) `active raid1 sda2[0] sdb2[1]` -- This is a RAID1 volume. If we want it to span 4 disks instead of 2 wouldn't it need to be RAID10? I guess technically it should be possible to have a RA... [15:53:38] (03PS4) 10Thcipriani: beta: Add script from Jenkins beta-update-databases [puppet] - 10https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) [15:54:31] (03CR) 10jenkins-bot: [V: 04-1] beta: Add script from Jenkins beta-update-databases [puppet] - 10https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) (owner: 10Thcipriani) [15:57:55] (03CR) 10Legoktm: [C: 031] Add Jan Zerebecki to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/210692 (https://phabricator.wikimedia.org/T98961) (owner: 10Hashar) [16:00:34] (03PS5) 10Thcipriani: beta: Add script from Jenkins beta-update-databases [puppet] - 10https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) [16:02:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:05:50] Krenair: I don't see why nostalgia's skin.json isn't being loaded. [16:07:06] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [16:08:27] 6operations, 10Wikimedia-General-or-Unknown, 7Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#1282461 (10Glaisher) Can we close this now? It seems to be much better now. https://wikitech.wikimedia.org/w/index.php?title=Add_a_wiki&oldid=158259 [16:10:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:12:06] 6operations, 10Wikimedia-General-or-Unknown, 7Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#1282469 (10greg) >>! In T87588#1282461, @Glaisher wrote: > Can we close this now? It seems to be much better now. 
https://wikitech.wikimedia.org/w/index.ph... [16:12:13] is jcrespo on IRC someplace? [16:12:24] andrewbogott, jcrespo is me [16:12:32] hello! [16:12:34] well, not my nick, but I am Jaime Crespo [16:12:53] 6operations, 10Wikimedia-General-or-Unknown, 7Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#1282471 (10demon) It's still wrong as it tells you to add wikis to dblists prior to running addwiki. You can't do that. [16:12:57] I don’t think you’ve had a hiring announcement sent yet. Where are you located? [16:13:12] spain [16:13:17] yep [16:13:32] Mark is going to announce it tomorrow [16:13:51] but am already putting down servers! [16:13:53] ok. I’m waiting in anticipation for https://phabricator.wikimedia.org/T92693 but if you’re in Spain I will not plan on that happening today :) [16:13:58] Welcome aboard! [16:14:02] thank you [16:14:25] andrewbogott, I am actually working on it [16:14:39] however, it may take some time due to unrelated issues [16:15:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [16:16:48] (03CR) 10Manybubbles: [C: 031] "Note to anyone looking - these are currently disabled in cirrussearch by default but we'll be enabling it in cirrus by default for our use" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210620 (https://phabricator.wikimedia.org/T91666) (owner: 10EBernhardson) [16:16:55] jynus: unrelated issues? [16:17:12] andrewbogott, pending work on the remote cluster [16:17:20] ah, ok. [16:17:21] so that we can provide HA [16:17:39] (high availability) [16:18:11] jynus: ok. Once we have a proper db is it possible to set it up master/master so that I can switch from my local db to the new one without interrupting service? I don’t really know how that works. [16:19:22] yeah, I will have to talk to you for migration - usually, that should be a regular replication and a small downtime (read-only mode) [16:19:43] but I will ping on the ticket when I get there [16:20:03] jynus: sounds good, thanks [16:20:42] you're welcome. Also, please be easy on me as the new guy :-) Even if the ticket is older [16:21:34] (03CR) 10Hashar: "argparse can be used to set sane default instead of having similar logic scattered in different methods (get_cores, get_dblist..)." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) (owner: 10Thcipriani) [16:23:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:24:16] (03PS1) 10Ori.livneh: HHVM: enable DNS cache [puppet] - 10https://gerrit.wikimedia.org/r/210706 [16:28:46] !log disabling puppet on labnet1001 to tinker with nova config [16:28:52] Logged the message, Master [16:28:57] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [16:29:35] the RAID check on silver is OK but the actual status line is: NRPE: Unable to read output [16:29:58] shouldn't be OK, had something similar before that we fixed [16:30:32] that would be wikitech [16:31:55] same thing on californium, dbproxy1001-1004, eventlog2001,... hmm [16:36:43] zuul.eqiad.wmnet[0: 208.80.154.135]: errno=Connection timed out [16:36:45] known issue? [16:37:01] our jenkins tests are failing due to what appear to be network issues.
eg: https://integration.wikimedia.org/ci/job/parsoidsvc-deploy-parse-tool-check/320/console [16:38:44] !log ori Synchronized php-1.26wmf5/includes/resourceloader/ResourceLoader.php: I30b490e5b: ResourceLoader::filter: use APC when running under HHVM (duration: 00m 14s) [16:38:53] Logged the message, Master [16:39:58] !log ori Synchronized php-1.26wmf4/includes/resourceloader/ResourceLoader.php: I30b490e5b: ResourceLoader::filter: use APC when running under HHVM (duration: 00m 11s) [16:40:04] Logged the message, Master [16:40:55] 6operations, 7Monitoring: Icinga RAID monitoring status "NRPE: Unable to read output " reported as OK - https://phabricator.wikimedia.org/T98978#1282528 (10Dzahn) 3NEW [16:41:01] !log Enabling puppet agent in db1009.eqiad after reinstall [16:41:06] Logged the message, Master [16:41:35] 6operations, 7Monitoring: Icinga RAID monitoring status "NRPE: Unable to read output " reported as OK - https://phabricator.wikimedia.org/T98978#1282536 (10Dzahn) [16:44:48] ori, still deploying? [16:44:54] no [16:44:55] or all done? [16:44:56] okay [16:44:58] thanks [16:45:35] (03CR) 10Alex Monk: [C: 032] Update nlwiki ContactPage recipient user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209967 (owner: 10Glaisher) [16:45:45] 6operations, 10ops-codfw: es2010 Failed Hard Drive - https://phabricator.wikimedia.org/T86588#1282543 (10Dzahn) 5Resolved>3Open saw the RAID check of es2010 as CRITICAL in Icinga. it started about 9 hours ago. so it looks like another disk just died. should i make a new ticket or can we just keep using t... [16:46:26] !log es2010 failed disk, reopening ticket for last fail in January [16:46:33] Logged the message, Master [16:46:46] (03CR) 10jenkins-bot: [V: 04-1] Update nlwiki ContactPage recipient user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209967 (owner: 10Glaisher) [16:47:30] ugh [16:47:32] :/ [16:47:32] that's broken [16:47:38] ACKNOWLEDGEMENT - RAID on es2010 is CRITICAL 1 failed LD(s) (Degraded) daniel_zahn https://phabricator.wikimedia.org/T86588 [16:48:29] 6operations: Degraded RAID-1 arrays on new logstash hosts: [UU__] - https://phabricator.wikimedia.org/T98620#1282548 (10RobH) the partman recipe isnt right, it shoudl be: sda: raid1 of / (shared with sdb) : raid 0 for remainder of data (across all 4 disks) sdb: raid1 of / (shared with sda) : raid 0 for remaind... [16:49:03] ERROR: Error cloning remote repo 'origin' : Could not clone git://zuul.eqiad.wmnet/operations/mediawiki-config ?? [16:49:09] hashar: ^ ? [16:49:21] I can clone from that on tin... [16:49:28] papaul: it looks like another disk died on es2010. i saw a ticket that this happened before in January and you replaced it back then [16:49:47] !log re-enabling puppet on labnet1001 [16:49:47] PROBLEM - puppet last run on labnet1001 is CRITICAL puppet fail [16:49:48] 6operations: Degraded RAID-1 arrays on new logstash hosts: [UU__] - https://phabricator.wikimedia.org/T98620#1282551 (10RobH) references to initial setup tasks: T96692 (install) & T84958 (hardware-request and disk discussion) [16:49:53] Logged the message, Master [16:50:07] papaul: what is better? new ticket for each event or reopening the same one so we have history in one place [16:50:09] Glaisher: bug filling it [16:50:23] mutante: new ticket will be great please [16:50:29] papaul: ok, will do [16:50:34] will the full error [16:50:37] "git clone git://zuul.eqiad.wmnet/operations/mediawiki-config ~/mw-cfg" works.. 
[16:50:46] papaul: ok [16:51:03] mutante: thanks [16:51:27] RECOVERY - puppet last run on labnet1001 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:51:41] !log Zuul clone failure https://phabricator.wikimedia.org/T98980 [16:51:46] Logged the message, Master [16:52:08] (03PS2) 10Alex Monk: Update nlwiki ContactPage recipient user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209967 (owner: 10Glaisher) [16:52:19] (03CR) 10Alex Monk: [V: 032] Update nlwiki ContactPage recipient user [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209967 (owner: 10Glaisher) [16:53:13] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/209967/ (duration: 00m 14s) [16:53:19] Glaisher, ^ [16:53:20] Logged the message, Master [16:53:32] Looks like it's working. Thanks. [16:53:37] works for me [16:53:38] yeah [16:55:55] 6operations, 10ops-codfw: es2010 Failed Hard Drive - https://phabricator.wikimedia.org/T86588#972055 (10Dzahn) after talking to @papaul made a new separate ticket for this new event. please see T98982 instead. reclosing here. [16:56:11] 6operations, 10ops-codfw: es2010 Failed Hard Drive - https://phabricator.wikimedia.org/T86588#1282606 (10Dzahn) 5Open>3Resolved [17:02:26] RECOVERY - Router interfaces on cr2-knams is OK host 91.198.174.246, interfaces up: 60, down: 0, dormant: 0, excluded: 2, unused: 0 [17:02:41] 6operations: Degraded RAID-1 arrays on new logstash hosts: [UU__] - https://phabricator.wikimedia.org/T98620#1282623 (10Gage) Eventually I'd like to see the partman receipe fixed and tested by reinstalling one of these hosts, but I've fixed the running config so that the arrays no longer appear as degraded: ```... [17:02:58] !log Zuul clone failures solved. Was due to network traffic being interrupted between labs and prod. [17:03:03] Logged the message, Master [17:03:04] Glaisher: solved :] [17:03:25] Thanks :) [17:06:27] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [17:08:52] !log elastic1001 restarted and rejoined the cluster hapilly while I was at lunch. it looks good - no errors beyond the ones we have fixes in flight for. So I'm going to do elastic1002 [17:08:59] Logged the message, Master [17:09:38] 6operations: Check power supply balance settings on cp3030+ - https://phabricator.wikimedia.org/T98984#1282665 (10Krenair) [17:10:16] (03PS1) 10Legoktm: Load the Nostalgia skin next to all the other skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210714 [17:10:41] ^d: ^ +1? [17:10:47] (03CR) 10Chad: [C: 032] Load the Nostalgia skin next to all the other skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210714 (owner: 10Legoktm) [17:10:48] !log elastic1002 restarted and rejoined the cluster - now the cluster is repaining. hurray. [17:10:53] Logged the message, Master [17:10:55] or that :P [17:11:06] ^d: are you going to deploy it or should I? 
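Before or after syncing a change like the one just proposed, whether the registration actually took can be checked from a maintenance host. The wiki name and the eval.php session below are illustrative, not a record of what was run:

```
# hypothetical debugging session from a deploy/maintenance host
mwscript eval.php --wiki=nostalgiawiki
> var_dump( ExtensionRegistry::getInstance()->isLoaded( 'Nostalgia' ) );
> var_dump( array_key_exists( 'nostalgia', Skin::getSkinNames() ) );
```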
[17:11:13] <^d> I'm already on tin, can do [17:11:19] ok [17:11:27] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [17:13:06] (03Merged) 10jenkins-bot: Load the Nostalgia skin next to all the other skins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210714 (owner: 10Legoktm) [17:14:00] !log demon Synchronized wmf-config/CommonSettings.php: because sometimes moving code helps (duration: 00m 15s) [17:14:01] 6operations: Degraded RAID-1 arrays on new logstash hosts: [UU__] - https://phabricator.wikimedia.org/T98620#1282677 (10bd808) >>! In T98620#1282623, @Gage wrote: > Eventually I'd like to see the partman receipe fixed and tested by reinstalling one of these hosts, but I've fixed the running config so that the ar... [17:14:05] Logged the message, Master [17:14:21] <^d> Nope [17:14:36] no luck? [17:14:50] <^d> Well nostalgiawiki still looks broke to me [17:15:15] <^d> legoktm: What if we did the opposite. Unconditionally load the code but add it to $wgSkipSkins [17:15:20] <^d> For all wikis but nostalgia [17:15:26] getting 503 now [17:15:40] Request: GET http://nostalgia.wikipedia.org/wiki/HomePage, from 10.64.0.103 via cp1066 cp1066 ([10.64.0.103]:3128), Varnish XID 1376492437 [17:15:43] Forwarded for: 71.80.248.167, 10.64.0.103, 10.64.0.103 [17:15:46] Error: 503, Service Unavailable at Wed, 13 May 2015 17:15:10 GMT [17:15:57] wfm? [17:16:07] wfm [17:16:08] unlucky timing [17:16:12] works again [17:16:19] well, like before with the error about the skin [17:17:03] ^d: okay....but if that works something is seriously broken in the extension registry [17:17:11] <^d> Agreed. [17:18:27] hhvm got really sad? 141 core dumps in fatal monitor in the last 5 mins [17:18:43] interesting [17:18:44] * ori looks [17:18:50] 6operations, 10ops-esams: Check power supply balance settings on cp3030+ - https://phabricator.wikimedia.org/T98984#1282696 (10mark) [17:19:20] (03PS1) 10Chad: Reverse nostalgia skin configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210715 [17:19:23] graph looks like it was mostly from 17:14:30 to 17:15:45 or so [17:19:25] <^d> legoktm: ^^ [17:19:38] bd808: all from terbium? [17:19:49] ori: looking.... [17:19:52] (03CR) 10Legoktm: [C: 032] Reverse nostalgia skin configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210715 (owner: 10Chad) [17:19:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0] [17:19:57] (03Merged) 10jenkins-bot: Reverse nostalgia skin configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210715 (owner: 10Chad) [17:20:32] ori: nope all over the farm [17:20:42] did something get deployed just now? [17:20:50] (Merged) jenkins-bot: Load the Nostalgia skin next to all the other skins [mediawiki-config] - G210714 (owner: Legoktm) [17:20:57] !log demon Synchronized wmf-config/CommonSettings.php: something something skins are broken (duration: 00m 11s) [17:20:59] yes, ^d just deployed that [17:21:03] <^d> Yes, we've been debugging [17:21:06] Logged the message, Master [17:21:26] haha wtf it's still broken [17:21:56] what is still broken? [17:21:58] <^d> ori: Nothing new, just moving some nostalgiawiki config about [17:21:58] and yet https://en.wikipedia.org/wiki/Main_Page?useskin=nostalgia works! [17:22:07] https://nostalgia.wikipedia.org/wiki/HomePage [17:22:12] caching? 
[17:22:25] ori: here's the spike -- https://logstash.wikimedia.org/#dashboard/temp/usqtdVWSTXKMrq5Gmu3u6w [17:22:31] seems to have stopped [17:22:33] Krenair: of what? [17:22:42] fallback skin disables varnish caching [17:24:05] oh, right [17:24:07] alright, I don't mean to be a dick, but this is enough of a SNAFU to require a quick postmortem, even if it's only on IRC [17:24:14] what happened, starting at the beginning? [17:24:26] nostalgia or hhvm dumps? [17:24:32] are they related? [17:24:40] <^d> I highly doubt it [17:25:14] tc cold size out of space on mw1241 [17:25:25] <^d> ori: The nostalgia thing is just a skin bug only really affecting nostalgiawiki that we're trying to run down. I'm not sure what's causing hhvm dumps but I doubt it's us. [17:25:36] <^d> We're just moving about existing config to debug, nothing new anywhere. [17:25:38] bd808: where do you get 'all over the map'? i see two hosts really [17:26:44] [fluorine:/a/mw-log] $ grep '2015-05-13 17' fatal.log | head | field 3 | dist [17:26:44] Key|Ct (Pct) Histogram [17:26:44] mw1252|7 (70.00%) █████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ [17:26:46] mw1030|2 (20.00%) ██████████████████████████████████▋ [17:26:48] mw1095|1 (10.00%) █████████████████▍ [17:27:09] ori: mw1009, 1008, 1011, 1003, 1012, 1001, 1006, 1013, ... [17:28:06] <^d> legoktm: Downside of our hack - useskin=nostalgia works on non-nostalgia wikis :p [17:28:09] fatal.log is mw's fatal log. hhvm.log will have the core dump notices [17:28:33] https://logstash.wikimedia.org/#dashboard/temp/hEmTi2TyQwWrKaHN9O43fA [17:29:10] <^d> legoktm: The giant green go button on https://en.wikipedia.org/wiki/Main_Page?useskin=nostalgia is amusing [17:29:51] I don't see a giant green button? [17:30:14] after the dropdown beginning with "Special pages" [17:30:22] mw-ui-button [17:30:28] bd808: all job runners [17:30:35] is there a way to dump APC? [17:30:39] May 13 17:25:56 mw1009 php: #012Fatal error: request has exceeded memory limit in /srv/deployment/jobrunner/jobrunner/redisJobChronService on line 170 [17:30:39] May 13 17:25:57 mw1009 kernel: [10906976.691767] init: jobchron main process (529) terminated with status 255 [17:30:39] May 13 17:25:57 mw1009 kernel: [10906976.691787] init: jobchron main process ended, respawning [17:30:40] <^d> legoktm: https://phabricator.wikimedia.org/F164491 :p [17:30:47] :o [17:31:15] legoktm: curl localhost:9002/dump-apc [17:31:17] on an app server [17:31:28] thanks [17:31:59] (03PS2) 10Ori.livneh: logstash: Exclude api-feature-usage-sanitized from indexing [puppet] - 10https://gerrit.wikimedia.org/r/210277 (https://phabricator.wikimedia.org/T98750) (owner: 10BryanDavis) [17:32:07] (03CR) 10Ori.livneh: [C: 032 V: 032] logstash: Exclude api-feature-usage-sanitized from indexing [puppet] - 10https://gerrit.wikimedia.org/r/210277 (https://phabricator.wikimedia.org/T98750) (owner: 10BryanDavis) [17:32:33] (03PS2) 10Ori.livneh: logstash: Update syslog processing rules [puppet] - 10https://gerrit.wikimedia.org/r/210278 (owner: 10BryanDavis) [17:32:42] (03CR) 10Ori.livneh: [C: 032 V: 032] logstash: Update syslog processing rules [puppet] - 10https://gerrit.wikimedia.org/r/210278 (owner: 10BryanDavis) [17:33:54] andrewbogott: icinga doesnt like that i used a question mark in a service name :p that's re: my patch you merged for wikitech-static. i'll fix this now: [17:33:57] Error: The description string for service 'are wikitech and wt-static in sync?' 
on host 'silver' contains one or more illegal characters. [17:34:10] (03PS2) 10Ori.livneh: HHVM: enable DNS cache [puppet] - 10https://gerrit.wikimedia.org/r/210706 [17:34:14] mutante: so picky! [17:34:33] that is also why "Check correctness of the icinga configuration" is CRIT [17:34:36] on it [17:34:36] (03CR) 10Ori.livneh: [C: 032 V: 032] HHVM: enable DNS cache [puppet] - 10https://gerrit.wikimedia.org/r/210706 (owner: 10Ori.livneh) [17:34:41] (03PS1) 10MaxSem: Add my new key now that I'm back in office [puppet] - 10https://gerrit.wikimedia.org/r/210716 [17:35:00] ori: I tried that on mw1017 and I'm just getting "Done" [17:35:10] legoktm: it probably writes it to /tmp or something [17:35:16] you'll have to grep the hhvm source to see where it writes it [17:35:23] or just ls -lat /tmp | head and see which new files are there [17:35:36] /tmp/apc_dump [17:35:37] thanks [17:36:06] nice [17:36:34] could you document that in https://wikitech.wikimedia.org/wiki/HHVM/Troubleshooting ? [17:36:37] PROBLEM - puppet last run on mw1239 is CRITICAL Puppet has 1 failures [17:40:18] (03CR) 10Mjbmr: [C: 031] "That's the best implementation, I would like to see this get going." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [17:40:41] done [17:40:56] (03CR) 10Mjbmr: [C: 031] Rename chapcomwiki to affcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169939 (https://bugzilla.wikimedia.org/39482) (owner: 10Reedy) [17:42:22] ori: thanks for the merges [17:42:50] bd808: np! excited to see logstash infra becoming more comprehensive and robust [17:47:56] 6operations, 10ops-codfw, 5Patch-For-Review: Set up missing PDUs in codfw and eqiad - https://phabricator.wikimedia.org/T84416#1282792 (10Cmjohnson) [17:47:58] 6operations, 10ops-eqiad, 5Patch-For-Review: humidity sensors in eqiad row c/d showing alarms - https://phabricator.wikimedia.org/T98721#1282790 (10Cmjohnson) 5Open>3Resolved I updated the sensor thresholds to match those on Row A and Row B to 12%. This should quell the notifications. -CJ [17:49:37] (03CR) 10Chad: [C: 04-1] "I disagree. I think it will just introduce confusion." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [17:49:56] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 7242.91288328 [17:50:15] ^ fixed by jgage. thanks [17:51:26] woo [17:52:23] (03PS1) 10Andrew Bogott: Exclude labs private IPs from dmz_cidr. [puppet] - 10https://gerrit.wikimedia.org/r/210720 [17:52:56] 7Blocked-on-Operations, 6operations, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1282827 (10Jdouglas) [17:52:56] RECOVERY - puppet last run on mw1239 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:53:41] 7Blocked-on-Operations, 3Search-and-Discovery-Research-and-Data-Sprint: Create rsync connector to fluorine - https://phabricator.wikimedia.org/T98383#1282830 (10Jdouglas) [17:54:30] (03PS1) 10Dzahn: wikitech-static monitoring, ? is illegal character [puppet] - 10https://gerrit.wikimedia.org/r/210721 (https://phabricator.wikimedia.org/T89323) [17:55:08] ^d: do you know what's causing all the dberror events? "Error connecting to 10.64.48.15: Can't connect to MySQL server on '10.64.48.15' (4)" [17:55:14] (03CR) 10Dzahn: [C: 032] wikitech-static monitoring, ? 
is illegal character [puppet] - 10https://gerrit.wikimedia.org/r/210721 (https://phabricator.wikimedia.org/T89323) (owner: 10Dzahn) [17:55:25] <^d> bd808: nope i don't [17:55:45] They seem to be pretty constant since we got the logstash logs turned back on (unrelated I'm sure) [17:56:04] bd808: we talked about this a few days ago, the theory it's because HHVM has its own db connect timeout that replaces/preempts the one set in our code that worked under zend [17:56:20] and it defaults to 1 second, whereas we had 3 before, and 1s == subject to lots of transient failures [17:56:27] ah [17:56:55] should we change the config to wait longer? [17:56:58] I think it can be overridden in the fastcgi.ini file or whatever for HHVM, but _joe_ may know more [17:57:17] there's no clear docs exactly what the change in the ini file should look like, just inferring from code with an ini lookup call in it [17:57:44] that's the story of hhvm config ;) [17:57:52] jynus, see bd808 above [17:58:06] read the source, guess the right mangled name, try it out [17:58:10] :) [17:58:26] the source line in question is: [17:58:27] https://github.com/facebook/hhvm/blob/e172929443989ef519b4300d88748552dc8241a0/hphp/runtime/ext/mysql/ext_mysql.cpp#L1088 [17:58:42] 15.48.64.10.in-addr.arpa domain name pointer db1062.eqiad.wmnet. [17:58:53] (03CR) 10Mjbmr: "@Chad: How? in what cases? Instead of being ready to oppose this please spend some time and find another way." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [17:59:03] Krenair, reading [17:59:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [17:59:22] (03PS1) 10Dzahn: wikitech-static monitoring, add check command [puppet] - 10https://gerrit.wikimedia.org/r/210724 (https://phabricator.wikimedia.org/T89323) [18:00:02] ganglia for that host looks happy [18:00:05] twentyafterfour, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150513T1800). [18:01:03] cheking it [18:02:03] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1282858 (10akosiaris) a:5fgiunchedi>3akosiaris [18:02:22] <^d> legoktm: Let's revert everything since it didn't work. 
I'm not thrilled with nostalgia being an option elsewhere [18:02:47] * bd808 had forgotten that we have a new DBA [18:03:25] yep, something smells fishy [18:04:00] (03PS1) 10Chad: Revert all the nostalgia config debugging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210726 [18:04:23] (03CR) 10Chad: [C: 032] Revert all the nostalgia config debugging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210726 (owner: 10Chad) [18:04:29] (03Merged) 10jenkins-bot: Revert all the nostalgia config debugging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210726 (owner: 10Chad) [18:04:48] (03CR) 10JanZerebecki: [C: 031] noc - redirect HTTP to HTTPS; enable HSTS 7 days [puppet] - 10https://gerrit.wikimedia.org/r/199515 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:04:53] +1 [18:05:05] !log demon Synchronized wmf-config/CommonSettings.php: undo all the nostalgia (duration: 00m 10s) [18:05:13] Logged the message, Master [18:05:25] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1282860 (10akosiaris) Stealing this, it might take me a bit, as I familiarize myself with labs vagrant. [18:05:35] I can see the aborted clients in the last hour [18:06:06] (03CR) 10BBlack: [C: 031] "I know nothing about the broader context of this, but the subnet math is right." [puppet] - 10https://gerrit.wikimedia.org/r/210720 (owner: 10Andrew Bogott) [18:06:10] jynus: the hhvm side of it is on this logstash dashboard -- https://logstash.wikimedia.org/#dashboard/temp/q5lX3HFST6iiwopEEqgn3w [18:06:37] if it's just a matter of tuning connection timeouts for a busy cluster that should be fixable [18:07:47] bd808, thank you sometimes it is difficult to say something without a context [18:10:47] PROBLEM - HHVM rendering on mw1084 is CRITICAL - Socket timeout after 10 seconds [18:11:21] (03PS1) 10Dzahn: wikitech-static monitoring, install plugin [puppet] - 10https://gerrit.wikimedia.org/r/210730 (https://phabricator.wikimedia.org/T89323) [18:11:27] PROBLEM - Apache HTTP on mw1084 is CRITICAL - Socket timeout after 10 seconds [18:11:39] (03CR) 10Dzahn: [C: 032] wikitech-static monitoring, add check command [puppet] - 10https://gerrit.wikimedia.org/r/210724 (https://phabricator.wikimedia.org/T89323) (owner: 10Dzahn) [18:12:15] (03CR) 10Dzahn: [C: 032] wikitech-static monitoring, install plugin [puppet] - 10https://gerrit.wikimedia.org/r/210730 (https://phabricator.wikimedia.org/T89323) (owner: 10Dzahn) [18:13:57] RECOVERY - HHVM rendering on mw1084 is OK: HTTP OK: HTTP/1.1 200 OK - 64854 bytes in 0.318 second response time [18:14:37] RECOVERY - Apache HTTP on mw1084 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.035 second response time [18:17:12] the load is clearly higher on all read nodes [18:19:07] but I do not think it is timeout-related, but lock-related [18:21:45] there was some traffic recently about a problem with lock_wait_timeout issues [18:21:49] let me see if I can find the ticket... [18:22:24] bblack, hi, i fixed the caching headers. Is varnish still blocking them for graphoid? 
[18:22:51] https://phabricator.wikimedia.org/T90704 [18:23:08] yurik: yes, varnish is configured to pass requests through directly [18:23:49] jynus: well, especially from here on there was recent traffic on that ticket: https://phabricator.wikimedia.org/T90704#1267738 [18:24:07] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:24:21] http://ganglia.wikimedia.org/latest/graph.php?r=day&z=large&c=MySQL+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [18:24:49] (03PS1) 10Dzahn: wikitech-static monitoring, fix template name [puppet] - 10https://gerrit.wikimedia.org/r/210739 (https://phabricator.wikimedia.org/T89323) [18:25:04] I've read it [18:25:45] mysql traffic seems to coincide with 12:39 springle: upgrade and restart db1060 12:45 springle: xtrabackup clone db1060 to db1018 [18:25:48] so probably OK [18:26:07] according to my graphs it should be ok now, or almost ok? [18:26:17] or it would be unrelated [18:27:27] but those are unrelated shards, I think [18:27:53] (03PS2) 10Dzahn: wikitech-static monitoring, fi [puppet] - 10https://gerrit.wikimedia.org/r/210739 (https://phabricator.wikimedia.org/T89323) [18:28:07] elevated bytes in / out on the dbs is something to investigate, not sure what the cause is [18:28:16] sure [18:28:24] (03PS3) 10Dzahn: wikitech-static monitoring,fix check command setup [puppet] - 10https://gerrit.wikimedia.org/r/210739 (https://phabricator.wikimedia.org/T89323) [18:29:00] (03CR) 10Dzahn: [C: 032] wikitech-static monitoring,fix check command setup [puppet] - 10https://gerrit.wikimedia.org/r/210739 (https://phabricator.wikimedia.org/T89323) (owner: 10Dzahn) [18:29:26] the problem is it is difficult to know the exact cause without more investigation - was the extra load the cause or the consequence? [18:30:47] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:32:00] I am worried about the spikes on https://tendril.wikimedia.org/host/view/db1062.eqiad.wmnet/3306, on the InnoDB locks section [18:32:14] I suspect jynus may have reached the magical moment at which he's been looking at our stuff just long enough to realize scope of our problems :) [18:32:32] oh, no, bblack [18:32:52] the problem is that the database is something that you have to "feel" for some days [18:33:05] and look at the applications, etc. [18:33:08] well everything's like that at some level [18:33:36] I'm just watching the messages go by and thinking this must feel pretty horrific by now :) [18:33:39] yes, but with the difference that if a frontend does something strange, you restart [18:34:08] everything in the end goes to a database, both database and app problems [18:34:35] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:34:51] so, first things first, on my side the connection problems are going down [18:35:05] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [18:39:34] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [18:40:47] bblack, have you opened an issue already? [18:41:27] no [18:41:33] you mean for today's db load issues?
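If the earlier theory is right and HHVM's own ~1 second connect timeout is what bites busy shards, the tuning discussed above would be an ini override on the app servers. The key name, the millisecond units and the file path below are inferred from the ext_mysql.cpp lookup linked earlier, not from documentation, so treat the whole thing as an unverified guess:

```
# untested sketch: raise HHVM's MySQL connect timeout (assumed to be in ms)
echo 'hhvm.mysql.connect_timeout = 3000' | sudo tee -a /etc/hhvm/fcgi.ini
sudo service hhvm restart    # on a depooled app server first
```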
[18:41:37] yes [18:41:45] not at all, no :) [18:41:45] even if it is related to a previous issue [18:41:54] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [18:42:09] ok, I will, and feel free to add anything extra [18:42:13] ok [18:43:45] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [18:45:50] 6operations, 7Monitoring, 5Patch-For-Review: Monitor the up-to-date status of wikitech-static - https://phabricator.wikimedia.org/T89323#1283022 (10Dzahn) 5Open>3Resolved it works now: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=silver&service=are+wikitech+and+wt-static+in+sync... [18:46:17] andrewbogott: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=silver&service=are+wikitech+and+wt-static+in+sync [18:46:52] mutante: that’s great! Thanks. [18:46:59] :) [18:47:45] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [18:48:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [18:50:33] (03CR) 10Reedy: "I'm not overly keen on this either. It's a hack, and could lead to various amounts of confusion depending on how you look at it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [18:52:03] (03CR) 10Legoktm: "If you're changing the actual db name you're also going to have to update CentralAuth, any Wikidata items with links, and probably more..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [18:52:06] (03CR) 10BBlack: "The primary reason I've been stalling on this is I have no idea if there are various scripts/tools laying around that access http://noc an" [puppet] - 10https://gerrit.wikimedia.org/r/199515 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:53:18] icinga-wm, wut? [18:53:25] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 0 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [18:53:29] this makes no sense [18:54:04] local commit? [18:55:05] (03CR) 10Alexandros Kosiaris: Exclude labs private IPs from dmz_cidr. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/210720 (owner: 10Andrew Bogott) [18:55:41] 6operations: Changing address of Võro Vikipeediä - https://phabricator.wikimedia.org/T84537#1283059 (10Reedy) [18:55:43] 6operations, 10Hackathon-Lyon-2015, 10Wikimedia-Site-requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1283058 (10Reedy) [18:55:51] 6operations, 7database: Database connection failure issues on s7 shard - https://phabricator.wikimedia.org/T98998#1283060 (10Krenair) [18:56:04] 6operations: Changing address of Võro Vikipeediä - https://phabricator.wikimedia.org/T84537#928556 (10Reedy) TL;DR for this is: Renaming wikis is hard [18:58:09] (03PS2) 10Andrew Bogott: Exclude labs private IPs from dmz_cidr. [puppet] - 10https://gerrit.wikimedia.org/r/210720 [18:58:21] 6operations, 7database: Database connection failure issues on s7 shard - https://phabricator.wikimedia.org/T98998#1283066 (10Reedy) mediawikwiki is on s3, not s7. 
Just FYI [18:58:51] 6operations, 10Wikimedia-Site-requests: Rename Võro Wikipedia, fiu-vro -> vro - https://phabricator.wikimedia.org/T31186#1283070 (10Krenair) [18:58:52] 6operations: Changing address of Võro Vikipeediä - https://phabricator.wikimedia.org/T84537#1283069 (10Krenair) [18:58:53] (03PS1) 10Yurik: Enable Varnish caching for graphoid [puppet] - 10https://gerrit.wikimedia.org/r/210747 [18:58:55] bblack, ^ [18:59:07] (03CR) 10Reedy: "Depends on the actual wiki. Some aren't CA clustered etc, so would be "easier", but point taken, indeed. It'd be something to script, or r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [19:00:11] Reedy, what was the external storage issue again? [19:00:13] !log elastic1002 restart went well - starting elastic1003 [19:00:22] It's WORM effectively [19:00:32] Logged the message, Master [19:01:02] 6operations, 10Hackathon-Lyon-2015, 10Wikimedia-Site-requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1283075 (10Reedy) @jcrespo When you're settled in, and have some time (noting, these aren't urgent issues), would you have any ideas about act... [19:02:23] Reedy, right, so it's read-only [19:02:26] 6operations, 7database: Database connection failure issues on s7 shard - https://phabricator.wikimedia.org/T98998#1283077 (10jcrespo) The bakups were running on that shard at that time, so probably it is just me overreacting @springle [19:02:35] does it actually contain references to the main dbs? [19:03:21] https://github.com/wikimedia/mediawiki/blob/master/maintenance/storage/blobs.sql [19:03:30] IIRC it's under a database with the dbname [19:04:00] Technically, if they can't actually be renamed, making the wiki readonly temporarily, create the new database, and then just SELECT INSERT to copy the data would work [19:04:02] If a little messy [19:04:06] (03PS2) 10Yurik: Enable Varnish caching for graphoid [puppet] - 10https://gerrit.wikimedia.org/r/210747 (https://phabricator.wikimedia.org/T98803) [19:04:28] yurik: yeah really there's several related things that could be cleaned up there... [19:05:16] so fine fine you're pushing on it, and it's not that hard. let me go clean up the related things first, then the graphoid change will make more sense :) [19:05:33] 6operations, 10ops-codfw: degraded RAID / disk fail on es2010 - https://phabricator.wikimedia.org/T98982#1283107 (10Papaul) [19:06:02] why do ops like to create so many tickets without projects? :/ [19:06:20] 6operations, 7Icinga, 7Monitoring: monitoring alerts for "0 unmerged changes in mediawiki_config" - https://phabricator.wikimedia.org/T99001#1283115 (10Krenair) [19:06:42] like what? operations is a project [19:06:57] i added "monitoring" [19:07:00] ah [19:07:04] 6operations, 7Icinga, 7Monitoring: monitoring alerts for "0 unmerged changes in mediawiki_config" - https://phabricator.wikimedia.org/T99001#1283118 (10Reedy) IIRC, it usually means the local is ahead of master. Local commits, etc But I agree, the warning doesn't make sense. Unfortunately, with it not being... 
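On the renaming question: MySQL has no usable RENAME DATABASE, but RENAME TABLE can move tables across schemas, so the loop sketched in that comment would look roughly as below. Database names are illustrative, and as noted above the external storage, CentralAuth and Wikidata references would still need their own migration:

```
# rough sketch only; run on the shard master during a read-only window
mysql -e 'CREATE DATABASE vrowiki'
mysql -N -e 'SHOW TABLES FROM fiu_vrowiki' | while read t; do
  mysql -e "RENAME TABLE fiu_vrowiki.\`$t\` TO vrowiki.\`$t\`"
done
```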
[19:07:36] 6operations, 7Icinga, 7Monitoring: Remove monitoring alerts for "0 unmerged changes in mediawiki_config" - https://phabricator.wikimedia.org/T99001#1283119 (10Krinkle) [19:07:55] 6operations, 6Release-Engineering: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#1283121 (10hashar) There was some rspec incompatibility issue for which I proposed a fix at https://github.com/rodjek/rspec-puppet/pull/243 which is superseded by https://github.co... [19:08:03] (03CR) 10Hashar: "There was some rspec incompatibility issue for which I proposed a fix at https://github.com/rodjek/rspec-puppet/pull/243 which is supersed" [puppet] - 10https://gerrit.wikimedia.org/r/178810 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [19:08:41] 6operations, 6Release-Engineering: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#1283127 (10hashar) 5Open>3stalled Stalled till @dduvall starts looking at building some rspec for operations/puppet.git [19:08:48] 6operations, 10Hackathon-Lyon-2015, 10Wikimedia-Site-requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1283129 (10jcrespo) >>! In T21986#1283075, @Reedy wrote: > @jcrespo When you're settled in, and have some time (noting, these aren't urgent is... [19:09:26] when i look at the project tags on a T in phab, they always say "(Backlog)" after the project name. are there other states that are not Backlog? [19:09:35] mutante, yes [19:09:41] those are workboard column headings [19:09:51] look at https://phabricator.wikimedia.org/tag/editing_department_2014_15_q4_blockers/ [19:09:52] (03CR) 10JanZerebecki: "That is not as exhaustive as inspecting logs or traffic, but the ones I found by grepping for noc\.w in puppet.git are either already usin" [puppet] - 10https://gerrit.wikimedia.org/r/199515 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [19:09:59] lots of different headings there [19:10:00] ok, let me to take some rest now, see you! [19:10:04] * Krenair waves [19:10:10] jynus: cu [19:10:14] Krenair: thanks [19:10:17] 6operations, 10Hackathon-Lyon-2015, 10Wikimedia-Site-requests, 7I18n, 7Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1283130 (10Reedy) Can't you just do CREATE DATABASE newdb; foreach table in old db { RENAME DATABASE olddb.table TO newdb.table } :D [19:11:05] mutante, so those don't show up on all projects IIRC, because not all projects have workboards. Unfortunately anyone can add a workboard to a project (default columns: backlog only) and then you can't delete it [19:11:13] Krenair: yea, i have seen the "Doing" being used by other teams [19:11:33] so.. you already knew the answer to your question? :P technically [19:12:40] yea, i saw workboards, it was more that you had to tell me those are the work board titles shown in brackets with project tags [19:12:54] eh, column headings [19:14:02] ah [19:17:22] workboards really make more sense for teams rather than subject-areas, because they tend to reflect the assignment/progress of actual work in the who's-doing-what sense [19:17:42] but I guess you can use them for anything. I'm abusing them for the traffic stuff for now. 
[19:17:58] bblack, hehe, thanks :))) i'm worried it will skyrocket if someone adds it to a big site [19:18:08] *page [19:19:34] (03PS1) 10BBlack: move *oid/restbase frontend pass to frontend only [puppet] - 10https://gerrit.wikimedia.org/r/210751 [19:19:36] (03PS1) 10BBlack: eliminate redundant hit-for-pass for things already pass in recv [puppet] - 10https://gerrit.wikimedia.org/r/210752 [19:19:39] (03PS1) 10BBlack: get rid of beresp.ttl=0 for non-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/210753 [19:19:40] (03PS1) 10BBlack: enable frontend caching for graphoid [puppet] - 10https://gerrit.wikimedia.org/r/210754 [19:20:20] (03PS1) 1020after4: Remove 1.25wmf24 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210755 [19:20:22] (03PS1) 1020after4: Add 1.26wmf6 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210756 [19:20:24] (03PS1) 1020after4: Wikipedias to 1.26wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210757 [19:20:26] (03PS1) 1020after4: Group0 to 1.26wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210758 [19:21:40] (03CR) 1020after4: [C: 032] Remove 1.25wmf24 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210755 (owner: 1020after4) [19:21:46] (03Merged) 10jenkins-bot: Remove 1.25wmf24 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210755 (owner: 1020after4) [19:21:50] (03CR) 1020after4: [C: 032] Add 1.26wmf6 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210756 (owner: 1020after4) [19:21:56] (03Merged) 10jenkins-bot: Add 1.26wmf6 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210756 (owner: 1020after4) [19:23:12] (03CR) 10BBlack: [C: 032] move *oid/restbase frontend pass to frontend only [puppet] - 10https://gerrit.wikimedia.org/r/210751 (owner: 10BBlack) [19:23:46] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [19:25:09] !log twentyafterfour Started scap: testwiki to php-1.26wmf6 and rebuild l10n cache [19:25:14] Logged the message, Master [19:25:52] (03CR) 10BBlack: [C: 032] eliminate redundant hit-for-pass for things already pass in recv [puppet] - 10https://gerrit.wikimedia.org/r/210752 (owner: 10BBlack) [19:26:00] (03CR) 10BBlack: [C: 032] get rid of beresp.ttl=0 for non-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/210753 (owner: 10BBlack) [19:26:59] (03CR) 10BryanDavis: Add fluorine rsync connector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/209684 (owner: 10OliverKeyes) [19:28:25] (03PS2) 10BBlack: enable frontend caching for graphoid T98803 [puppet] - 10https://gerrit.wikimedia.org/r/210754 [19:30:41] (03CR) 10BBlack: [C: 032] enable frontend caching for graphoid T98803 [puppet] - 10https://gerrit.wikimedia.org/r/210754 (owner: 10BBlack) [19:32:03] yurik: seems to work for me, now [19:32:13] bblack, awesome, thanks!!! [19:32:15] I just hope I didn't break any of the rest of the *oid varnish layer while doing cleanup :) [19:32:35] * yurik is pinging gwicke ^ [19:32:36] :) [19:32:37] (I don't think so, stats look sane, etc) [19:33:04] the rest was intended to be a functional no-op cleanup at the varnish layer, but varnish is notoriously difficult to reason about... [19:34:41] 6operations, 10Graphoid, 10RESTBase, 10Traffic, 5Patch-For-Review: Varnish does not honor Cache-Control for Graphoid - https://phabricator.wikimedia.org/T98803#1283183 (10BBlack) 5Open>3Resolved a:3BBlack Seems to be working now: ``` bblack@palladium:~$ curl -I https://graphoid.wikimedia.org/www.m... 
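A simple way to confirm the frontend is now caching graphoid responses rather than passing them is to compare caching headers across two identical requests; the URL below is a placeholder for the one in the task:

```
# the second request should show a non-zero Age and a frontend hit in X-Cache
for i in 1 2; do
  curl -sI 'https://graphoid.wikimedia.org/<domain>/v1/png/<title>/<revid>/<graph>.png' \
    | grep -iE '^(cache-control|age|x-cache)'
done
```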
[19:36:21] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1283187 (10ArielGlenn) Ah hrm, we need the equivalent of a manager sign-off. In this case since he's WMDE we would.. uh... huh. No idea, so let me say that @... [19:36:35] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1283190 (10ArielGlenn) p:5Triage>3Normal a:3ArielGlenn [19:38:28] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant ebernhardson shell account access to the elasticsearch cluster - https://phabricator.wikimedia.org/T98766#1283195 (10ArielGlenn) Since this is a root privilege ticket we'll need to do the ops meeting chat (should be Monday) and then this can get done. [19:38:56] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1283197 (10Dzahn) Do you want Jan's manager in WMDE to sign off? (If they happen to be on phab too that would be nice) [19:40:25] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add jdouglas to researchers admin group - https://phabricator.wikimedia.org/T98536#1283205 (10ArielGlenn) So since @Tfinc is out.. @Manybubbles? Another access request on your doorstep. [19:42:33] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1283207 (10Dzahn) @Abraham Want to approve this request? [19:49:53] 6operations, 10Citoid, 6Services: Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#1283216 (10ArielGlenn) [19:51:25] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [19:53:14] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add jdouglas to researchers admin group - https://phabricator.wikimedia.org/T98536#1283226 (10Manybubbles) @ArielGlenn, this one is harder than the last one! I _think_ @Jdouglas should be in the researchers group then. Here is my reasoning: I'm pretty su... [19:56:15] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [19:57:45] hm parsoid deploy coming up? or did I get the timezones wrong again? [19:57:56] jouncebot: next [19:57:56] In 0 hour(s) and 2 minute(s): Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150513T2000) [19:58:04] apergos: good timing [19:58:15] PROBLEM - HHVM rendering on mw1190 is CRITICAL - Socket timeout after 10 seconds [19:58:34] ^ ? [19:58:35] PROBLEM - Apache HTTP on mw1190 is CRITICAL - Socket timeout after 10 seconds [19:58:48] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1283238 (10hashar) @greg @Abraham : Jan has been very helpful on CI front as long as I can remember. He effectively maintain the WMDE Jenkins job following u... [20:00:04] gwicke, cscott, arlolra, subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150513T2000). 
[20:00:53] apache's still running, but it's 503-ing requests that should hit hhvm [20:00:56] wikipedia:80 10.64.33.3 - - [13/May/2015:19:58:15 +0000] "GET /w/api.php HTTP/1.0" 503 50065 "-" "Twisted PageGetter" [20:01:21] (03CR) 10OliverKeyes: Add fluorine rsync connector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/209684 (owner: 10OliverKeyes) [20:02:17] (03PS1) 10Milimetric: [WIP] Add parallel kafka pipeline [puppet] - 10https://gerrit.wikimedia.org/r/210765 (https://phabricator.wikimedia.org/T98779) [20:02:33] * subbu gets ready to deploy parsoid [20:02:35] PROBLEM - HHVM queue size on mw1190 is CRITICAL 70.00% of data above the critical threshold [80.0] [20:02:57] May 13 19:51:34 mw1190 hhvm: #012Warning: Recursion detected in RequestContext::getLanguage in /srv/mediawiki/php-1.26wmf4/includes/context/RequestContext.php [20:03:04] yeah but that's pretty normal isn't it? [20:03:20] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1283266 (10greg) {F100025} [20:03:24] horribly normal [20:03:25] PROBLEM - HHVM busy threads on mw1190 is CRITICAL 77.78% of data above the critical threshold [115.2] [20:03:35] that bug has been around for a long long time [20:03:41] should we restart the service? [20:03:44] 7Blocked-on-Operations, 3Search-and-Discovery-Research-and-Data-Sprint: Create rsync connector to fluorine - https://phabricator.wikimedia.org/T98383#1283268 (10ArielGlenn) How much data are we talking about here and how is it expected to grow over time? (Need this for capacity planning, also we have more disk... [20:03:46] I'm gonna try [20:03:52] that's a very normal warning [20:03:55] !log restarting hhvm on mw1190 [20:04:03] Logged the message, Master [20:04:06] it's one of the top warnings on `fatalmonitor` [20:04:27] or was last time I checked a day or so ago [20:04:44] (03PS3) 10OliverKeyes: Add fluorine rsync connector [puppet] - 10https://gerrit.wikimedia.org/r/209684 [20:04:46] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 65131 bytes in 0.209 second response time [20:04:58] guess that worked [20:05:05] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time [20:05:06] 7Blocked-on-Operations, 3Search-and-Discovery-Research-and-Data-Sprint: Create rsync connector to fluorine - https://phabricator.wikimedia.org/T98383#1283270 (10Ironholds) maybe 1kb a day; when I aggregate, I aggregate hard :D [20:05:06] yea [20:05:59] 7Blocked-on-Operations, 3Search-and-Discovery-Research-and-Data-Sprint: Create rsync connector to fluorine - https://phabricator.wikimedia.org/T98383#1283271 (10ArielGlenn) That's what we like to see! Yep we can handle that without waiting. [20:06:15] PROBLEM - Apache HTTP on mw1172 is CRITICAL - Socket timeout after 10 seconds [20:06:49] * apergos hangs around for the deploy [20:07:05] hmmm [20:07:17] do we have a situation with some traffic/request killing these? [20:07:24] PROBLEM - HHVM rendering on mw1172 is CRITICAL - Socket timeout after 10 seconds [20:07:30] because boom there goes another [20:07:45] hhvm upgrade bug? [20:08:20] gwicke, so, git deploy sync is now indicating that we have 44 parsoid minions (used to be 24 so far) .. did something change? [20:08:50] what's the list show? [20:08:51] i stopped sync so i know i am not missing something. 
[20:08:54] bblack: see backlog around 10:30 to 10:32 [20:09:08] you can see the event happening in ganglia for both hosts, e.g. this on 1172 now: http://ganglia.wikimedia.org/latest/?c=Application%20servers%20eqiad&h=mw1172.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [20:09:16] apache/hhvm metrics showing odd patterns [20:09:20] oh .. now, we have codfw servers as well? used to be eqiad only. [20:09:59] mutante: 10:xx what TZ? [20:10:35] bblack: PST [20:10:37] oh if UTC, that was the ams-ix issue [20:10:37] apergos, yes, 20 codfw minions in addition to the usual 24 eqiad ones. [20:10:46] !log elastic1003 restarted elasticsearch just fine. the cluster restart is going awesome. I'm going to rig the other 28 to restart via a script, one after the other. Expect nagios to complain about them some. [20:10:51] Logged the message, Master [20:11:07] <_joe_> subbu: yes, that's akosiaris' work [20:11:17] the list looks legit as I look at the output [20:11:21] let's see someting [20:11:31] _joe_, i see. data center replication? [20:11:44] <_joe_> subbu: of what? [20:11:46] subbu: did any of the parsoid window changes go out yet? or before that first hit on mw1190? [20:11:48] !log cancel that - I just realized I can't do that. [20:11:52] (03CR) 10Merlijn van Deen: "I still see some advantages in merging the two, but keeping it like this for now is also OK I think." (039 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [20:11:53] Logged the message, Master [20:12:03] bblack, no, not deployed anything yet. [20:12:06] oh I guess parsoid window started right *after* the mw1190 issue [20:12:14] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1283297 (10Krenair) https://wikitech.wikimedia.org/wiki/Requesting_shell_access#Escalating_Existing_Shell_Access - "Your manager approval is usually not requ... [20:12:15] RECOVERY - HHVM queue size on mw1190 is OK Less than 30.00% above the threshold [10.0] [20:12:23] _joe_, parsoid cluster. i didn't know we got 20 more servers. that is what threw me off. [20:12:28] well they all have the grain set all right [20:12:31] so ... [20:12:34] !log twentyafterfour Finished scap: testwiki to php-1.26wmf6 and rebuild l10n cache (duration: 47m 24s) [20:12:37] ok. so, i should go ahead then? [20:12:39] Logged the message, Master [20:12:43] <_joe_> subbu: yes, we are building everything in dallas too [20:13:07] bblack: can the parsoid deploy go ahead? [20:13:14] RECOVERY - HHVM busy threads on mw1190 is OK Less than 30.00% above the threshold [76.8] [20:13:17] ok. [20:13:17] (03CR) 10Merlijn van Deen: "And maybe add a comment below each execv saying something like" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [20:13:57] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1283304 (10Krenair) In general I think we need to rethink the references to "manager" or "direct supervisor" on that page and change them so that it only app... [20:14:15] apergos: I think so [20:14:30] subbu: guess you're good to go [20:14:32] ok. [20:14:39] (03CR) 10Yuvipanda: "https://phabricator.wikimedia.org/T93046 for discussion on port allocations. 
It is what is being used right now :)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [20:15:05] root@mw1190:/var/log/hhvm# grep "Fatal error" error.log [20:15:16] May 13 17:28:06 mw1190 hhvm: #012Fatal error: request has exceeded memory limit in /srv/mediawiki/wmf-config/StartProfiler.php on line 61 [20:15:25] that's like what ori said earlier [20:15:33] isnt it [20:15:35] !log restarted elasticsearch on elastic1004 [20:15:40] Logged the message, Master [20:15:55] PROBLEM - HHVM busy threads on mw1172 is CRITICAL 30.00% of data above the critical threshold [115.2] [20:15:57] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1283313 (10hashar) {F164545} [20:18:34] (03CR) 10Yuvipanda: Initial commit (034 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [20:22:48] (03CR) 10Merlijn van Deen: "Also, maybe merge https://gerrit.wikimedia.org/r/#/c/210266/ in here? lighttpd is pretty basic and important to have ;-)" (033 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [20:23:33] (03CR) 10Yuvipanda: "This isn't getting deployed until all the current types are supported, so it can be a separate patch, I think :)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [20:25:05] (03CR) 10Yuvipanda: Initial commit (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [20:26:17] yuvipanda: explicit is better than implicit, etc [20:26:40] valhallasw: it’s explicit in the docs, and it should be in env of every tool, IMO :) [20:26:47] valhallasw: if you want I can setup a comment there. [20:26:54] because the way the port info is transmitted changes per server, I think it should just be in the server-specific part [20:26:55] PROBLEM - HHVM queue size on mw1172 is CRITICAL 40.00% of data above the critical threshold [80.0] [20:26:55] PROBLEM - Parsoid on wtp1011 is CRITICAL - Socket timeout after 10 seconds [20:27:05] 'it's in the docs' != explicit :P [20:27:19] why should it be in the env? [20:27:44] subbu: how's it looking? [20:27:45] valhallasw: because $PORT is what heroku / docker / other PaaS does, and so that’s the easiest way going forward, IMO. [20:27:54] all restarts just finished. [20:28:02] 6operations, 10fundraising-tech-ops: upgrade tellurium.frack.eqiad.wmnet to Trusty - https://phabricator.wikimedia.org/T95294#1283345 (10Cmjohnson) scheduled downtime in Icinga. [20:28:02] see also: http://12factor.net/ and http://12factor.net/config [20:28:15] so, not sure if the codfw cluster is live or needs restarts, but, all the eqiad ones are restarted. [20:28:16] yuvipanda: I don't follow that logic. [20:28:25] !log deployed parsoid version a8108fe6 [20:28:30] Logged the message, Master [20:28:41] yuvipanda: and I'm not sure what that random blog post should tell me :P [20:28:45] _joe_, qn. above about the codfw cluster and restarts. [20:29:01] valhallasw: 12factor.net is good reading, and I suggest you do read it. 
[20:29:11] (03PS1) 10Dzahn: admin: add jdouglas to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/210772 (https://phabricator.wikimedia.org/T98536) [20:29:12] valhallasw: and I am not going to rehash what’s in that website for you over IRC :) [20:29:15] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [20:30:25] yuvipanda: then why aren't you passing {the web server type, the web server command, ...} over an environment variable? [20:30:25] RECOVERY - HHVM rendering on mw1172 is OK: HTTP OK: HTTP/1.1 200 OK - 65123 bytes in 0.126 second response time [20:30:33] anyway. [20:30:37] in that case [20:30:56] run() shouldn't have access to the port number, and the python one should get the info from the env, or something like that. [20:31:16] that’s just convenience, but yeah, if you think that’s better being explicit I can do that [20:31:43] and I really don't agree with you 'it should be in the env', and that's also not what that blog post states [20:31:47] subbu: a lot of hosts seem to have the old process running instead [20:31:49] the blog post says 'don't use config files' [20:31:50] lemme see which ones [20:31:58] but environments are a global solution to that. Passing parameters is a local one. [20:32:49] subbu: it's all the codfw hosts indeed [20:32:54] apergos, _joe_ http://ganglia.wikimedia.org/latest/ doesn't show parsoid codfw cluster [20:33:06] so, i assume that is not live? [20:33:25] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1476 bytes in 0.023 second response time [20:33:26] good question [20:33:59] x [20:34:00] valhallasw: how will you pass port to an external process? your options are commandline params and environment variables, and I think we should use environment variables because it’s config and also that’s what docker / heroku use so a nodejs / custom app that’s built to run there will have one less thing to change [20:34:13] they might not have an aggregator collecting them yet [20:34:19] ok. [20:34:37] passing the port as a variable to run was just convenience, and I disagree that it’s wrong but at this point I have very low energy to discuss this so I’ll just do it if you insist. [20:35:13] <_joe_> use env variables. [20:35:27] <_joe_> yuvipanda: I insist on the contrary. [20:35:42] yuvipanda: I'm completely OK with adding a generic 'docker' webservice which also uses $PORT to pass the port number, but I think it should be the exception rather than the rule [20:35:48] valhallasw: I disagree strongly [20:36:03] yuvipanda: because it's a global and implicit solution for a local problem [20:36:15] valhallasw: I don’t understand that sentence. [20:36:16] <_joe_> valhallasw: why using env variables for external configuration should be bad? [20:36:24] apergos, i follow the deployment process here: https://wikitech.wikimedia.org/wiki/Parsoid#Deploying_changes .. and i used dsh on bast1001 to issue restarts and they restarted the services on the eqiad cluster. [20:36:27] <_joe_> the environment is controlled by the sysadmin. [20:36:51] <_joe_> so that's the correct generic way to get such configuration in something like toollabs. [20:36:52] _joe_: the same way using a global to send variables to a function is not a great idea [20:37:08] ah so they've not been added to the dsh group yet [20:37:18] let's see if there's any status on these in phab [20:37:24] ok. 
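As a concrete illustration of the env-versus-arguments point _joe_ and valhallasw are making: the wrapper can put the port into the environment and then exec the web server, so nothing port-specific has to appear on the command line. This is only a sketch, not the actual webservice-runner code, and the nodejs command and port number are made-up examples.

    import os

    def launch(argv, port):
        # Config goes into the environment (controlled by the wrapper/sysadmin);
        # the web server process reads $PORT itself.
        os.environ["PORT"] = str(port)
        os.execv(argv[0], argv)   # replaces the current process with the server

    launch(["/usr/bin/nodejs", "server.js"], 55001)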
[20:37:25] RECOVERY - HHVM busy threads on mw1172 is OK Less than 30.00% above the threshold [76.8] [20:37:37] yuvipanda: the environment is global state. command line arguments are local state [20:38:02] I think I’m not going to argue this point at this time - I do not have the time nor energy to do it. [20:38:10] ok, I definitely have the time, just not the energy. [20:38:11] sorry. [20:38:14] np [20:38:24] RECOVERY - HHVM queue size on mw1172 is OK Less than 30.00% above the threshold [10.0] [20:38:48] 6operations, 5Patch-For-Review, 5wikis-in-codfw: deploy wtp2001-2020 - https://phabricator.wikimedia.org/T90271#1283386 (10ArielGlenn) Do these hosts need to be added to the dsh group? See https://wikitech.wikimedia.org/wiki/Parsoid#Deploying_changes [20:39:55] valhallasw: I’m about to fix all the other stuff, however. [20:40:21] yuvipanda: I do agree the write-a-config-then-run-lighttpd system is less than ideal [20:41:06] valhallasw: yeah, but I think we’re stuck with that - I don’t think we can do that by commandline (that -> lighttpd config currently in the file) [20:41:23] subbu: it's not goingto kill us if they aren't restarted right now as they're not apparently getting any traffic, I noted on the relevant ticket so we'll see [20:41:45] apergos, ok, great. thanks. i will consider this done then. [20:41:50] sweet [20:42:40] * apergos clocks out for the day [20:42:54] (03PS1) 10Dzahn: add parsoid codfw servers to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/210806 [20:43:10] valhallasw: the second patch I think is just doing that, though. but using jinja templates instead of the bash stuff we have now [20:45:07] yuvipanda: putting env vs argument aside, I do think the setting of $PORT fundamentally belongs in the WebService classes, and not in the wrapper (otherwise you wouldn't be able to correcly run the WS class outside of the wrapper script) [20:45:38] (03PS2) 10Dzahn: add parsoid codfw servers to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/210806 (https://phabricator.wikimedia.org/T90271) [20:46:00] valhallasw: hmm. [20:46:02] valhallasw: let me think [20:46:26] yuvipanda: maybe by calling WebService.setEnv(port) to set the env [20:46:28] (03PS12) 10BryanDavis: [WIP] Add role::mediawiki_vagrant_lxc [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) [20:46:36] yuvipanda: or by adding a def execv() which also sets the port? [20:47:00] valhallasw: hmm, maybe I can move that into the webservice base class. [20:47:20] valhallasw: refactor the ‘run’ into a ‘get_execv_params’, and have run actually set the env variables and do the execv calls [20:47:35] meh. [20:48:46] (03CR) 10BryanDavis: [WIP] Add role::mediawiki_vagrant_lxc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) (owner: 10BryanDavis) [20:49:05] yuvipanda: then just do webservice-runner -> WebService.run(port) -> WhateverWebService.actualRun(port) [20:49:36] hmm? have the subclass call parent class’ run? [20:49:43] but then the parent class can’t actually run! [20:49:44] no, the other way around [20:49:56] oh, right [20:49:58] webservice-runner -> WebService.run(port) -> WhateverWebService.actualRun(port) -> execv [20:50:10] well, same thing as I was doing, except the execv itself would be handled by WebService.run [20:50:37] I don't like that option too much, as it makes the control flow indirect [20:51:08] how so? 
get_exec_params just returns data to do actual execution [20:51:32] it doesn’t do any execution itself [20:51:49] yes, that's exactly what's indirect about it :P [20:52:05] the option webservice-runner -> WhateverWebService.run(port) -> WebService.execv(command, args, port) might be cleaner, but that 'port' parameter is ugly there [20:52:08] hmm [20:52:12] there is one advantage to your option [20:52:25] WhateverWebService.getCommand() can return "port=%(port)s" [20:52:32] hmm [20:52:37] and let the parent do the formatting of that [20:52:46] well it’ll return a list rather than a string, so that’s going to be a bit ugly :D [20:53:04] well, yeah, obviously, but it can just do the formatting for each parameter [20:53:06] I’d let get_exec_command or whatever read that from parameter [20:53:06] yeah [20:54:13] hmm [20:54:14] 6operations: Minor fixes in Server Access document - https://phabricator.wikimedia.org/T99008#1283457 (10gpaumier) 3NEW [20:54:49] right, and everything else comes from self.tool [20:54:51] so that's fine [20:55:57] (03PS26) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [20:56:14] (03CR) 10jenkins-bot: [V: 04-1] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [20:56:15] valhallasw: so get_exec_command also means we are restricting any future ‘run’ subclasses into using execv [20:56:16] and nothing else [20:56:25] yup [20:56:47] well, unless they override the run() function [20:56:57] right [20:57:02] for when someone writes a pure python webserver or something [20:57:06] so that’s flexible enough [20:57:08] I think [20:57:25] (03PS27) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [20:57:32] valhallasw: hmm, or not. take lighttpd for example. [20:57:41] valhallasw: then get_exec_command will have to actually write out the file... [20:58:01] valhallasw: and then that’s not quite right... [20:58:18] yuvipanda: then that can override run(), but then we're overengineering again :P [20:58:23] valhallasw: yes, we are :P [20:58:48] valhallasw: so let’s just move the os.ENVIRON calls to the parent run [20:58:51] and have subclasses call that... [20:58:52] maybe? [20:58:55] sounds good [20:59:00] yeah [20:59:15] valhallasw, yuvipanda: http://12factor.net/config -- "Store config in the environment" [20:59:27] bd808: I already linked him to it :) [20:59:43] bd808: we are storing it in the env, just bikeshedding about where the os.ENVIRON calls should go. [21:00:04] well, before we were discussion env vs command line arguments, technically :P [21:00:09] in the Config class of course;) [21:00:26] bd808: shush you :P [21:00:42] yeah, but I refused to have that argument on IRC :) [21:00:52] (03CR) 10Subramanya Sastry: [C: 031] "I don't understand how this works (i.e. doesn't have eqiad.wmnet or codfw.wmnet), but I assume you all do :) so happy to +1." 
[puppet] - 10https://gerrit.wikimedia.org/r/210806 (https://phabricator.wikimedia.org/T90271) (owner: 10Dzahn) [21:01:07] yuvipanda: all the cool kids do it -- https://github.com/wikimedia/wikimedia-wikimania-scholarships/blob/master/src/Wikimania/Scholarship/Config.php [21:01:21] bd808: yup, it’s fairly standard across PAAS's [21:01:50] bd808: i'm not sure if you can even pass command line arguments to mod_php ;-) [21:02:36] another reason to not use them :P [21:02:48] you can do even grosser things like controlling php.ini settings from the Apache config [21:03:16] (03PS28) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [21:03:29] (03CR) 10jenkins-bot: [V: 04-1] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [21:03:30] yuvipanda: 'yeah, I'm used to languages that only have global variables so I really don't see why I shouldn't just make everything global in python' :P [21:03:42] errr [21:03:47] * yuvipanda continues not arguing this point [21:04:15] https://devcenter.heroku.com/articles/process-model is also a good reasd [21:04:15] *read [21:04:21] support both env variable and cli args :-D [21:04:31] with cli args overriding env var [21:04:42] I am wondering whether argparse has build-in support for that [21:04:43] no :P [21:04:55] well [21:04:59] for CI that is really useful [21:05:16] that let us easily override / inject parameters to a command line which is hardcoded [21:05:47] (03PS29) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [21:06:15] NOSE_VERBOSE=1 NOSE_NOCAPTURE=1 [21:06:21] hashar: default=os.environ.get('...'), or something [21:06:32] hashar: but it's not really built-in [21:06:36] yeah :/ [21:06:51] I could imagine a default_from_env='MYSCRIPT_' [21:07:01] then the arg --verbose would be MYSCRIPT_VERBOSE [21:07:03] magically :] [21:07:21] need a PEP I guess [21:07:26] valhallasw: am looking at https://docs.python.org/2/library/multiprocessing.html#process-and-exceptions now [21:08:42] valhallasw: hmm, so the problem with that is that it ties in the way the job is submitted (start()) with the job itself (run()), and so if we want to, in the far future (or as a param!) change the way the job is submitted, it’ll be a bit of a pain. [21:08:52] also start is synchronous, and ours is not... [21:08:58] so I don’t think we should mimic that interface [21:09:05] yuvipanda: eh, no, it's not synchronous [21:09:13] well, more synchronous than ours... [21:09:19] 7Blocked-on-Operations, 3Search-and-Discovery-Research-and-Data-Sprint: Create rsync connector to fluorine - https://phabricator.wikimedia.org/T98383#1283502 (10Ironholds) 5Open>3Resolved [21:09:28] either way, the first point stands. [21:09:37] valhallasw: > Roughly, a process object is alive from the moment the start() method returns until the child process terminates. [21:10:24] hence them being called request_start and request_stop... [21:10:29] so I don’t think we should mirror that interface :) [21:10:30] yuvipanda: I don't get your first issue. [21:10:37] yuvipanda: I don't get your first issue. [21:10:57] irc, please send my messages? 
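One possible shape of the run()/environment split settled on a little earlier ("move the os.ENVIRON calls to the parent run and have subclasses call that"). Class and method names are assumptions for illustration, not the merged tools-webservice API.

    import os


    class WebService(object):
        def run(self, port):
            # The parent run() owns the environment setup and the exec call;
            # subclasses normally only describe the command to execute.
            os.environ["PORT"] = str(port)
            argv = self.get_exec_command(port)
            os.execv(argv[0], argv)

        def get_exec_command(self, port):
            raise NotImplementedError


    class NodeJSWebService(WebService):
        # Hypothetical subclass: the app reads $PORT itself.
        def get_exec_command(self, port):
            return ["/usr/bin/nodejs", "server.js"]

A subclass that is not exec-based (say, a pure-Python server) could still override run() entirely, which is the escape hatch mentioned in the discussion.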
[21:11:03] fucking IRC cloud [21:11:05] boo [21:11:07] moment [21:11:11] heh, I'm on from Limechat now [21:11:27] haha, me and valhallasw are both on IRCCloud and it just hiccuped for both of us [21:11:39] but my IRCCloud connects to my bouncer so I can still use this when IRCCloud is dead :P [21:11:52] * yuvipanda grins evilly [21:11:58] valhallasw: ping when you're back? [21:12:16] yuvipanda: yeah, alive again [21:12:28] yuvipanda: for the first issue, you mean switching out SGE for something else? [21:12:28] hmm, my IRCCloud is still dead [21:12:29] !log updated OCG to version c7c75e5b03ad9096571dc6dbfcb7022c924ccb4f [21:12:36] Logged the message, Master [21:12:37] valhallasw: yes. [21:12:43] that's a fair point [21:13:05] valhallasw: so imagine down the road we wanted to offer a --executor parameter that people can try out [21:13:14] so mixing these two will make that kindof a pain. [21:13:16] so let's not do that [21:14:03] ok, let me think if there's another existing API we could replicate [21:14:23] Popen comes to mind, but that doesn't really map either [21:14:38] hmm, yeah. [21:14:49] there's DRMMA but that's fucking nuts :P [21:15:29] 6operations, 7Icinga, 7Monitoring: remove (or fix) passive checks for removed hosts - https://phabricator.wikimedia.org/T99012#1283530 (10Dzahn) 3NEW [21:16:09] well, Popen is not entirely crazy, as long as we pass argv instead of a webservice object, and then it's a generic sge starter [21:16:29] I think it's overengineering :P [21:17:11] 6operations, 7Icinga, 7Monitoring: remove (or fix) passive checks for removed hosts - https://phabricator.wikimedia.org/T99012#1283543 (10Dzahn) affects just those 2 hosts, they are fundraising. @neon:~# grep "host could not be found" /var/log/icinga/icinga.log | grep -o "on host.*" | cut -d, -f1 | sort |... [21:17:42] 6operations, 7Icinga, 7Monitoring: remove (or fix) passive checks for removed hosts - https://phabricator.wikimedia.org/T99012#1283547 (10Dzahn) [21:18:09] yuvipanda: yeah, and it doesn't really map because it should also be able to kill a job it hasn't started itself [21:18:10] 6operations, 7Icinga, 7Monitoring: Icinga RAID monitoring status "NRPE: Unable to read output " reported as OK - https://phabricator.wikimedia.org/T98978#1283554 (10Dzahn) [21:18:37] valhallasw: yeah [21:19:00] ok, let's just do this then. [21:19:12] sweet :D [21:19:18] valhallasw: I think I addressed most of the other stuff... [21:19:21] valhallasw: but let me test. [21:20:49] valhallasw: deb packaging will be an additional patch I'll add on after this, I guess. [21:21:01] I'll need it for lighttpd I think [21:21:15] because I'll need to put stuff elsewhere, maybe? (the templates) [21:22:01] *nod*. not sure where, though [21:22:28] valhallasw: let me look at the Linux FHS [21:22:31] err [21:22:45] that's right actually, I thought I had the wrong TLA [21:22:51] yuvipanda: we probably want it either on NFS or in puppet [21:23:02] valhallasw: it as in? [21:23:11] the lighttpd template [21:23:17] err, no? it should be in the package... [21:23:25] and in /usr/share or something. [21:23:26] oh, we don't want it configurable? ok [21:23:30] no, not configurable. [21:23:40] I mean, you can configure it with .lighttpd.conf in your homedir [21:23:41] then it should be somewhere in site-packages, I think. Not sure how this works with python packaging [21:23:47] yeah, I'm unsure. [21:24:00] valhallasw: PackageLoader doesn't seem to work particularly well, but that might be just me not having prodded it enough [21:24:06] ? 
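On the PackageLoader question just above: jinja2's PackageLoader can serve templates that ship inside the Python package itself, which would keep the lighttpd config template out of NFS and out of puppet. The package name, template name, and template variables below are assumptions for illustration.

    from jinja2 import Environment, PackageLoader

    # Load templates from the "templates" directory inside a hypothetical
    # "toolswebservice" package (the directory must be included as package data).
    env = Environment(loader=PackageLoader("toolswebservice", "templates"))
    template = env.get_template("lighttpd.conf.tmpl")
    print(template.render(port=55001, home="/data/project/mytool"))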
[21:24:18] oh, jinja [21:24:34] valhallasw: yeah. [21:24:39] valhallasw: whee, tested and it works fine :) [21:24:41] valhallasw: wanna +2? [21:25:26] yuvipanda: you can also just have it in the .py and use string formatting [21:25:42] valhallasw: hmm, I could - it doesn't do anything terribly fancyl [21:25:43] *fancy [21:25:50] also entry points? [21:25:52] valhallasw: but I'd still like it to be a separate file [21:26:19] valhallasw: can I do that in another patch? :P this works atm with stdeb... [21:27:05] yuvipanda: 'and this is how technical debt is created' :-p [21:27:07] but sure. [21:27:10] valhallasw: :P [21:27:18] also docs [21:27:20] docs [21:27:22] and docs :P [21:27:42] valhallasw: hmm, I wonder: should we enforce docstrings for everything? [21:28:06] I wouldn't mind, and I think we do that in the Android app... [21:28:08] !log restarting elasticsearch on elastic1005 [21:28:12] yuvipanda: dunno, just a manpage would also be ok, I guess. [21:28:13] Logged the message, Master [21:28:33] valhallasw: hmm, so there's two types of docs right - man pages for users, and this for devs. [21:28:46] yeah [21:28:49] I think man pages / user docs is a blocker for pushing this out to final deploy, but doesn't need to come before that [21:29:04] but dev docs, I guess, makes a strong case for being there from the start [21:29:25] I'd rather have a manpage and no dev docs than the other way around in this case [21:29:54] but I'm not against dev docs, obviously. [21:30:23] valhallasw: idunno, right now it's basically going to end up with exactly the same external interface as the webservice tool, which also has no manpage :) [21:31:43] valhallasw: btw, do you have +2 rights on that at all? [21:31:48] valhallasw: I'm going to add a README instead :) [21:31:54] (of a man page) [21:32:25] yuvipanda: errr [21:32:33] who is this readme aimed at? :P [21:32:39] idk, me? :P [21:32:45] hmm, that seems a bit pointless... [21:33:01] if you want to write just one piece of doc, give argparser some info [21:33:09] oh, hmm [21:33:10] good poing [21:33:22] let me do that [21:38:11] (03CR) 10Merlijn van Deen: [C: 04-1] "few minor things left" (036 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [21:40:34] (03PS1) 10Florianschmidtwelzow: Remove Mantle from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210816 (https://phabricator.wikimedia.org/T85890) [21:42:10] (03PS2) 10Florianschmidtwelzow: Remove Mantle from configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210816 (https://phabricator.wikimedia.org/T85890) [21:50:51] (03CR) 10Merlijn van Deen: "and I just read http://12factor.net/logs :P" (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [21:51:50] yuvipanda: and we should probably also add a logrotate for the output file somehow [21:53:03] yuvipanda: if the webserver output file is consistent, we can just do this globally, I think [22:04:24] I'm going to deploy "Wikipedias to 1.26wmf5" now if there is no objection. Already scaped and tested on test wiki, but I still haven't updated the wikiversions.json [22:12:02] (03CR) 10Jdlrobson: "> There are some major issues with mw ui atm." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175406 (https://phabricator.wikimedia.org/T73477) (owner: 10Glaisher) [22:12:47] no objections? 
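Two of the smaller ideas above, combined in one hedged sketch: give argparse a description and per-option help ("give argparser some info"), and let an environment variable provide a default that a command-line flag can override (hashar's default=os.environ.get(...) idea). MYSCRIPT_VERBOSE is a made-up variable name following the prefix scheme he floats; argparse has no built-in default_from_env.

    import argparse
    import os

    parser = argparse.ArgumentParser(
        description="Start a tool's web service on the grid.")
    parser.add_argument("--verbose", action="store_true",
                        default=bool(os.environ.get("MYSCRIPT_VERBOSE")),
                        help="print extra detail about what gets started")
    args = parser.parse_args()
    print("verbose = %s" % args.verbose)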
[22:13:04] (03CR) 1020after4: [C: 032] Wikipedias to 1.26wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210757 (owner: 1020after4) [22:13:10] (03Merged) 10jenkins-bot: Wikipedias to 1.26wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210757 (owner: 1020after4) [22:17:46] !log restarted phd on iridium (phabricator) to sync the daemons' configuration [22:17:51] Logged the message, Master [22:21:39] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.26wmf5 [22:21:46] Logged the message, Master [22:21:54] (03CR) 1020after4: [C: 032] Group0 to 1.26wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210758 (owner: 1020after4) [22:22:00] (03Merged) 10jenkins-bot: Group0 to 1.26wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210758 (owner: 1020after4) [22:23:37] (03PS1) 10Ori.livneh: coal: set show_rollover_text to true for graphs [puppet] - 10https://gerrit.wikimedia.org/r/210819 [22:24:06] (03CR) 10Ori.livneh: [C: 032 V: 032] coal: set show_rollover_text to true for graphs [puppet] - 10https://gerrit.wikimedia.org/r/210819 (owner: 10Ori.livneh) [22:24:59] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: Group 0 to 1.26wmf6 [22:25:05] Logged the message, Master [22:26:01] valhallasw: sorry was away [22:26:12] valhallasw: logrotate on NFS makes me feel eugh. we should just spend that time doing logstash instead :) [22:26:55] !log twentyafterfour Purged l10n cache for 1.26wmf4 [22:26:58] yuvipanda: why does that make you feel eugh? [22:27:00] Logged the message, Master [22:27:07] valhallasw: logs on NFS :P [22:27:19] anyway, that shouldn't block this patch, but I agree it's something we should do. [22:27:22] hihg iops + nfs == genereall sadness [22:27:45] spelling is hard yo [22:28:12] :D [22:28:15] yuvipanda: yeah, okay, then I agree :P I didn't see the issue with logrotate per se [22:28:40] it's just a stopgap and I'd rather have us work on a real solution [22:28:42] and logrotate is something we could do soon-ish, while logstash is more on the order of, you know, half a year away [22:29:04] valhallasw: we can't do logrotate soonish. there's a bug for that too :P [22:29:24] and I don't think logstash is half a year away. my queue is basically: tools-webservice, crontab to service manifests, then logging [22:29:25] (03PS2) 10Gage: Fix logster tracking for CirrusSearch-slow.log [puppet] - 10https://gerrit.wikimedia.org/r/209821 (owner: 10BryanDavis) [22:29:28] so more like a month [22:29:49] yuvipanda: you mean https://phabricator.wikimedia.org/T68623 ? [22:30:18] (03CR) 10Gage: [C: 032] Fix logster tracking for CirrusSearch-slow.log [puppet] - 10https://gerrit.wikimedia.org/r/209821 (owner: 10BryanDavis) [22:30:26] valhallasw: yes, see scfc_de's comment. so it'll involve doing a restart and that's just going to be eugh. and if you want to gzip it then you're writing more data to NFS [22:31:11] can you make SGE log to a pipe? [22:31:22] rather than an inode? [22:32:24] (03CR) 10Yuvipanda: Initial commit (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [22:32:27] bd808: probably. Not sure on which host the actual logging is performed though; I'm guessing the exec host. [22:32:34] it's on the exec host, yeah [22:32:41] so we'll have to ship from exec host to logstash [22:32:49] which shouldn't be too hard, I guess. [22:33:10] twentyafterfour: suppose i didn't miss much? 
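A sketch of how the "ship logs off the exec host" step could look if the route is the local syslog daemon, which rsyslog can then forward to a central collector for rotation or logstash ingestion. The handler and formatter below are standard-library calls; the tool name is a placeholder, and the tag is just a freeform prefix, so nothing here stops one process from writing under another tool's name.

    import logging
    import logging.handlers

    # Hand records to the local syslog socket; rsyslog decides where they go.
    handler = logging.handlers.SysLogHandler(address="/dev/log")
    # "tool.mytool" plays the role of the syslog APP-NAME used for filtering.
    handler.setFormatter(logging.Formatter("tool.mytool: %(message)s"))

    log = logging.getLogger("webservice")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("webservice started")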
[22:33:28] log to syslog, use rsylog forwarding to ship wherever and rotate on the destination [22:33:40] bd808: needs to differentiate and authenticate users properly... [22:33:47] or log to syslog and ship directly to logstash [22:33:49] but then again, I've no idea how syslog does that, so I should just read up. [22:34:10] "authenticate" and "syslog" doesn't compute [22:34:19] are you worried about cross-logging? [22:34:23] yuvipanda: truncate works fine [22:34:39] valhallasw: well, feel free to work on it if you want to, but it's not a long term solution and I don't think we should work on it. [22:34:47] *nod* [22:34:49] like toolA logging "you are a weiner" to toolB's logs? [22:35:14] bd808: yeah, and we need to capture who is the owner of the process doing the logging anyway because we need authentication on the other way around when reading logs [22:36:21] bd808: we could just use a named pipe :) then we get ACLs for free! [22:36:26] if you control the stdout/err to syslog bridge then you can do whatever you want [22:37:10] FUSE is also perhaps an elegant solution to this problem [22:37:11] aude: nothing exciting [22:37:26] APP-NAME is a freeform string and what you would have for syslog for filtering [22:37:38] right [22:37:47] yeah, it shouldn't be hard. it's just a matter of doing it, I guess [22:38:06] In theory you could use the STRUCTURED-DATA segment too [22:38:20] which is just a pile of key=value pairs [22:38:25] right [22:38:56] and then have a proxy server of sorts that checks that for reading [22:39:57] yeah. the auth on that is the important part [22:40:16] PROBLEM - puppet last run on mw2119 is CRITICAL puppet fail [22:40:20] yuvipanda: and how do we prevent people from just reading syslog? it's generally taken to be world-readable, I think. [22:40:28] we do that by not writing to it [22:40:34] ok :-) [22:40:41] anyway, we shouldn't be having this conversation now. [22:40:59] * yuvipanda is carefully escaping conversations because he realizes he's close to burning out and should manage self carefully [22:41:25] twentyafterfour: ok [22:42:23] (03PS30) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [22:42:37] (03CR) 10jenkins-bot: [V: 04-1] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [22:42:48] valhallasw: ^ I added one as a FIXME because I can't find out where the 50000 is from, and a grep at current services shows nothing seems to be at 50000 but I'll verify later when andrew is aroudn [22:42:59] valhallasw: and can I have the 'run' argument in a separate patch? [22:43:13] ? [22:44:21] yuvipanda: oh, super().run() vs .setup_environment()? [22:44:57] yuvipanda: as for the 50000, make sure the < and >= line up :P [22:45:20] oh, right [22:45:26] valhallasw: yeah [22:45:33] I think just saying >= 50000 is sane [22:45:41] because that's what you're testing for [22:45:55] it's not really an important check anyway [22:46:11] because tools.X checks for the most obvious failure [22:46:23] and people can circumvent it if they want anyway [22:47:29] (03PS1) 10Dzahn: icinga: add script to schedule host downtimes [puppet] - 10https://gerrit.wikimedia.org/r/210824 (https://phabricator.wikimedia.org/T79842) [22:47:53] valhallasw: well, the only other test I know is that 499 is the highest system account, so I'd guess <= makes sense, but asking andrew is the better idea... 
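For the boundary discussion this exchange ends on: the consistency point is simply that the check and its error message should use the same convention (>= here). Where 50000 really comes from is still a FIXME above, and 499/500 as the system-account boundary is the guess being checked with andrew, so every number below is a placeholder.

    # Hypothetical sanity check using one consistent convention (>= / <).
    def check_port(port):
        if not (50000 <= port < 65536):
            raise ValueError("port must be >= 50000 and < 65536, got %d" % port)

    check_port(55001)    # fine
    # check_port(8080)   # would raise ValueError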
[22:48:00] true [22:48:17] yuvipanda: as for .run() and .setup_environment() I'd rather not merge before sorting that out [22:48:24] as that gets baked into the API [22:48:30] even though the API is sort of internal, I guess. [22:48:37] it's internal [22:48:41] (03PS2) 10Dzahn: icinga: add script to schedule host downtimes [puppet] - 10https://gerrit.wikimedia.org/r/210824 (https://phabricator.wikimedia.org/T79842) [22:48:52] there's no way for things external to that package to register as webservice subclasses [22:48:53] so, no. [22:48:58] I don't think that should be sorted out. [22:49:13] right now. [22:49:20] and I honestly want to go and fix the tool-lighttpd [22:50:01] (03PS3) 10Dzahn: icinga: add script to schedule host downtimes [puppet] - 10https://gerrit.wikimedia.org/r/210824 (https://phabricator.wikimedia.org/T79842) [22:50:12] I can't exactly stop you from +2'ing your own project :p [22:50:18] valhallasw: well, I don't want to go back there :P [22:50:22] valhallasw: hence I'm convincing you. [22:50:31] trying to at least [22:50:35] (03CR) 10Dzahn: [C: 032] icinga: add script to schedule host downtimes [puppet] - 10https://gerrit.wikimedia.org/r/210824 (https://phabricator.wikimedia.org/T79842) (owner: 10Dzahn) [22:51:03] in any case, make the >= and < consistent [22:51:56] yuvipanda: and https://gerrit.wikimedia.org/r/#/c/210196/29/tools/webservice/__init__.py ? [22:51:56] (03PS31) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [22:51:58] done [22:52:04] no. [22:52:06] let's not do that [22:52:13] these are all things that will change in future changesets [22:52:14] err [22:52:16] patches [22:52:43] yes, that's the point. Make them all be in one place :P [22:52:59] https://gerrit.wikimedia.org/r/#/c/210196/29/tools/webservice/webservice.py exec* comment [22:53:30] jesus man [22:53:33] 'should use execv or execvp' [22:53:39] :P [22:53:43] you can amend it too you know [22:53:51] I'm just going to declare burnout and go home today [22:54:05] I'll come back to this tomorrow [22:54:13] whokay. [22:54:27] bblack: [22:54:28] [neon:~] $ sudo schedule-downtime [22:54:28] usage: /usr/local/bin/schedule-downtime -h -d -r [22:54:34] valhallasw: thanks for all the review! I'm *not* ragequitting or anything - it's just the visas and stuff have taken their toll... [22:55:05] I can imagine [22:56:36] yuvipanda: in a gerrit 2.11 world, I would have juts edited it inline, but no such thing :( [22:57:06] anyway, bed. [22:57:25] valhallasw: heh, true. [22:58:16] RECOVERY - puppet last run on mw2119 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [22:59:05] valhallasw: I'm going to push that patch (the exec command) and I think that's it. [22:59:53] (03PS32) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [22:59:57] valhallasw: ^ [22:59:58] (03PS1) 10Dzahn: icinga: schedule downtime reason can have spaces [puppet] - 10https://gerrit.wikimedia.org/r/210825 (https://phabricator.wikimedia.org/T79842) [23:00:04] RoanKattouw, ^d, ebernhardson, rmoen: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150513T2300). 
[23:00:26] (03CR) 10Dzahn: [C: 032] icinga: schedule downtime reason can have spaces [puppet] - 10https://gerrit.wikimedia.org/r/210825 (https://phabricator.wikimedia.org/T79842) (owner: 10Dzahn) [23:03:57] 6operations, 7Monitoring: have an IRC bot and/or cmdline script that allows easy scheduling of Nagios downtimes - https://phabricator.wikimedia.org/T79842#1283815 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/210824/ https://gerrit.wikimedia.org/r/#/c/210824/3/modules/icinga/files/schedule-downtime [neon:~] $... [23:03:58] 6operations, 7Monitoring: have an IRC bot and/or cmdline script that allows easy scheduling of Nagios downtimes - https://phabricator.wikimedia.org/T79842#1283816 (10Dzahn) 5Open>3Resolved a:3Dzahn ``` # usage: ./schedule_downtime -h -d -r # example: ./schedule_downtime -... [23:04:00] 6operations, 7Monitoring: have an IRC bot and/or cmdline script that allows easy scheduling of Nagios downtimes - https://phabricator.wikimedia.org/T79842#1283820 (10Dzahn) [23:04:25] (03CR) 10Milimetric: [WIP] Add parallel kafka pipeline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/210765 (https://phabricator.wikimedia.org/T98779) (owner: 10Milimetric) [23:05:36] 6operations, 7Monitoring: have an IRC bot and/or cmdline script that allows easy scheduling of Nagios downtimes - https://phabricator.wikimedia.org/T79842#868171 (10Dzahn) [23:06:36] PROBLEM - puppet last run on neon is CRITICAL puppet fail [23:07:09] 6operations, 7Monitoring: have an IRC bot and/or cmdline script that allows easy scheduling of Nagios downtimes - https://phabricator.wikimedia.org/T79842#1283827 (10Dzahn) < bblack> I guess puppetize it as a /usr/local/bin script on neon that root can run, just to wrap over all the details of the nagios.cmd... [23:09:10] So whos doing swat today? RoanKattouw_away ? [23:12:41] yuvipanda: ok, feel free to +2 then [23:12:51] valhallasw: nah, you’re right about the services too. Am moving it. [23:12:55] valhallasw: you should have +2 rights on it... [23:13:01] all toollabs admins should [23:13:07] valhallasw: I’m trying to not have self merges there. [23:13:13] self merging is how we got to where we are... [23:13:27] (03PS1) 10Dzahn: rename icinga-downtime script [puppet] - 10https://gerrit.wikimedia.org/r/210828 (https://bugzilla.wikimedia.org/79842) [23:16:06] (03PS2) 10Dzahn: rename icinga-downtime script [puppet] - 10https://gerrit.wikimedia.org/r/210828 (https://bugzilla.wikimedia.org/79842) [23:16:50] (03CR) 10Dzahn: [C: 032] rename icinga-downtime script [puppet] - 10https://gerrit.wikimedia.org/r/210828 (https://bugzilla.wikimedia.org/79842) (owner: 10Dzahn) [23:18:35] (03PS33) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [23:20:36] 6operations, 7Monitoring: have an IRC bot and/or cmdline script that allows easy scheduling of Nagios downtimes - https://phabricator.wikimedia.org/T79842#1283858 (10Dzahn) renamed to "icinga-downtime" as requested. 
so now it is: root@neon:/usr/local/bin# ./icinga-downtime usage: ./icinga-downtime -h RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [23:24:38] (03PS6) 10Thcipriani: beta: Add script from Jenkins beta-update-databases [puppet] - 10https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) [23:24:51] (03CR) 10Quiddity: "For bugs/problems, see the recent comments at T73477" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175406 (https://phabricator.wikimedia.org/T73477) (owner: 10Glaisher) [23:25:11] 6operations: Minor fixes in Server Access document - https://phabricator.wikimedia.org/T99008#1283865 (10Dzahn) 5Open>3Resolved a:3Dzahn done: https://phabricator.wikimedia.org/legalpad/view/3/#47 [23:25:26] (03CR) 10jenkins-bot: [V: 04-1] beta: Add script from Jenkins beta-update-databases [puppet] - 10https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) (owner: 10Thcipriani) [23:29:36] (03CR) 10GWicke: [C: 031] add parsoid codfw servers to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/210806 (https://phabricator.wikimedia.org/T90271) (owner: 10Dzahn) [23:30:22] (03PS7) 10Thcipriani: beta: Add script from Jenkins beta-update-databases [puppet] - 10https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) [23:30:50] (03CR) 10GWicke: "@subbu, this means that dsh -g parsoid ... will perform that action in both codfw & eqiad. I do believe that the same is true for the depl" [puppet] - 10https://gerrit.wikimedia.org/r/210806 (https://phabricator.wikimedia.org/T90271) (owner: 10Dzahn) [23:32:29] (03CR) 10Subramanya Sastry: "@gwicke, I meant ... how this works without having full internal hostname specified ... I suppose these names are mapped internally elsewh" [puppet] - 10https://gerrit.wikimedia.org/r/210806 (https://phabricator.wikimedia.org/T90271) (owner: 10Dzahn) [23:32:48] nice path: [23:32:50] /srv/deployment/rcstream/rcstream/rcstream/rcstream: Python script, ASCII text executable [23:33:48] (03CR) 10Dzahn: "right, in the dsh file for mediawiki servers we changed from short hostnames to FQDN's not that long ago" [puppet] - 10https://gerrit.wikimedia.org/r/210806 (https://phabricator.wikimedia.org/T90271) (owner: 10Dzahn) [23:34:49] mutante: drives home the point [23:35:13] (03CR) 10Dzahn: "it just works if /etc/resolv.conf contains a "search" line for eqiad.wmnet and also codfw.wmnet, which is puppetized on tin via the "base:" [puppet] - 10https://gerrit.wikimedia.org/r/210806 (https://phabricator.wikimedia.org/T90271) (owner: 10Dzahn) [23:35:22] /buffalo/buffalo/buffalo/buffalo [23:35:35] gwicke: yes, and "labs labs labs" [23:35:47] gwicke: we should probably use FQDNs in dsh group [23:35:51] amending [23:35:52] DEVELOPERS DEVELOPERS DEVELOPERS [23:35:57] :) [23:35:58] *nod*, good catch subbu [23:37:51] rmoen: i'm guessing noone picked up swat deploys today? 
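For context on what a wrapper like icinga-downtime ultimately does: Icinga (like Nagios) accepts external commands written to its command pipe, and SCHEDULE_HOST_DOWNTIME is the one relevant here. The sketch below is not the puppetized script; the command-file path is the conventional default rather than a value verified for neon, and the author and comment fields are placeholders.

    import time

    def schedule_downtime(host, duration_s, reason,
                          cmd_file="/var/lib/icinga/rw/icinga.cmd"):
        # Fields: host;start;end;fixed;trigger_id;duration;author;comment
        now = int(time.time())
        line = "[%d] SCHEDULE_HOST_DOWNTIME;%s;%d;%d;1;0;%d;editor;%s\n" % (
            now, host, now, now + duration_s, duration_s, reason)
        with open(cmd_file, "w") as f:
            f.write(line)

    schedule_downtime("mw1172.eqiad.wmnet", 3600, "HHVM restart")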
[23:38:00] i just realized what time it is :) [23:39:22] (03CR) 10EBernhardson: [C: 032] Disable leading wildcard searches in CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210620 (https://phabricator.wikimedia.org/T91666) (owner: 10EBernhardson) [23:39:29] (03Merged) 10jenkins-bot: Disable leading wildcard searches in CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210620 (https://phabricator.wikimedia.org/T91666) (owner: 10EBernhardson) [23:40:15] rmoen: if your here i'll deploy yours too [23:40:21] !log ebernhardson Synchronized wmf-config/CirrusSearch-common.php: SWAT deploy cirrus config change (duration: 00m 12s) [23:40:24] *you're [23:40:26] Logged the message, Master [23:41:10] (03PS3) 10Dzahn: add parsoid codfw servers to dsh and use FQDN [puppet] - 10https://gerrit.wikimedia.org/r/210806 (https://phabricator.wikimedia.org/T90271) [23:41:27] subbu: it would have worked because of the search domain but FQDN is more failproof [23:41:55] mutante, ok .. thanks. makes sense. [23:42:45] on mw appservers we changed that in https://phabricator.wikimedia.org/T93983 [23:43:21] (03CR) 10Dzahn: [C: 032] "see https://phabricator.wikimedia.org/T93983 for the same switch on mw servers" [puppet] - 10https://gerrit.wikimedia.org/r/210806 (https://phabricator.wikimedia.org/T90271) (owner: 10Dzahn) [23:45:18] (03PS4) 10Dzahn: add parsoid codfw servers to dsh and use FQDN [puppet] - 10https://gerrit.wikimedia.org/r/210806 (https://phabricator.wikimedia.org/T90271) [23:45:49] (03PS1) 10Yuvipanda: [WIP] Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 [23:46:02] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 (owner: 10Yuvipanda) [23:47:57] 6operations, 5Patch-For-Review, 5wikis-in-codfw: deploy wtp2001-2020 - https://phabricator.wikimedia.org/T90271#1283902 (10Dzahn) >>! In T90271#1283386, @ArielGlenn wrote: > Do these hosts need to be added to the dsh group? See https://wikitech.wikimedia.org/wiki/Parsoid#Deploying_changes added, and switch... [23:48:32] rmoen: last call, will run out of time in the swat window [23:49:38] (03PS2) 10Yuvipanda: [WIP] Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 [23:50:56] ebernhardson i'm here [23:51:54] :) ok merging now [23:52:20] ebernhardson, ty [23:53:10] (03PS3) 10Yuvipanda: [WIP] Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 [23:54:42] (03PS4) 10Yuvipanda: [WIP] Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 [23:55:47] (03PS5) 10Yuvipanda: [WIP] Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 [23:58:42] (03PS1) 10Dzahn: tin: set cluster in hiera, not in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/210835
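On the FQDN point in the dsh change above: short hostnames only resolve because the resolver appends the domains from the resolv.conf "search" line, so spelling out the full name removes that dependency. A quick way to see the difference (hostnames taken from the codfw parsoid range being deployed; this only works somewhere those names are resolvable):

    import socket

    # Explicit FQDN: does not depend on the resolver's search list.
    print(socket.gethostbyname("wtp2001.codfw.wmnet"))

    # Short name: only works if the search path includes codfw.wmnet.
    print(socket.gethostbyname("wtp2001"))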