[00:00:05] RoanKattouw ostriches Krenair MaxSem: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160308T0000).
[00:00:05] RoanKattouw AaronSchulz ebernhardson James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:17] * RoanKattouw waves
[00:00:32] \o
[00:00:33] * James_F wibbles too.
[00:00:37] ebernhardson: Mind doing the SWAT today?
[00:00:52] sure i can ship em out
[00:00:56] * ebernhardson +2's all the things
[00:01:10] warning, mw2212 will be slowing you down :/
[00:01:24] :(
[00:01:35] i thought this list had more things in it an hour or two ago. oh well i'm not complaining :)
[00:02:20] heh
[00:02:51] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=cpu_report&c=API+application+servers+codfw&h=mw2212.codfw.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=NOGROUPS it's down I guess
[00:02:59] no ping
[00:03:16] anyone in a real tz can look at it (since a reboot apparently didn't get it)?
[00:03:47] I'm gonna pack it in cause 2 am
[00:04:21] robh: online ^
[00:04:29] ...questionmark?
[00:08:55] (CR) EBernhardson: [C: 2] Enable async swift writes to all wikis except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/273007 (owner: Aaron Schulz)
[00:09:05] * ebernhardson starts shipping some config changes while zuul crunches away...
[00:09:22] AaronSchulz: your swift thing is about to go out, looks like no effect to the website directly?
[00:09:41] (Merged) jenkins-bot: Enable async swift writes to all wikis except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/273007 (owner: Aaron Schulz)
[00:09:51] well, a slight slowdown for uploads
[00:10:18] uh-oh Uncaught exception: Could not open extension /usr/lib/x86_64-linux-gnu/hhvm/extensions/20150212/luasandbox.so: /usr/lib/x86_64-linux-gnu/hhvm/extensions/201
[00:10:18] 50212/luasandbox.so: undefined symbol: _ZN4HPHP18ThreadLocalManager9s_managerE
[00:12:23] greg-g: sorry was afk
[00:12:29] so issue is mw2212 is down?
[00:12:55] yeah
[00:13:01] (PS1) Yuvipanda: labs: Switch wdqs project to use only scratch [puppet] - https://gerrit.wikimedia.org/r/275728 (https://phabricator.wikimedia.org/T128815)
[00:13:17] (PS2) Yuvipanda: labs: Switch wdqs project to use only scratch [puppet] - https://gerrit.wikimedia.org/r/275728 (https://phabricator.wikimedia.org/T128815)
[00:13:47] * greg-g goes toward the bus
[00:14:22] !log ebernhardson@tin Synchronized wmf-config/filebackend-production.php: Enable async swift writes to all wikis except commons (duration: 02m 25s)
[00:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:14:35] well, i dunno whats up with it no console output so i'll reboot again. (not sure if whoever rebooted it last time stayed on console to see if there were errors)
[00:14:40] rebooting it agaiun
[00:14:40] (CR) Yuvipanda: [C: 2] labs: Switch wdqs project to use only scratch [puppet] - https://gerrit.wikimedia.org/r/275728 (https://phabricator.wikimedia.org/T128815) (owner: Yuvipanda)
[00:14:42] again even.
[00:15:07] AaronSchulz: your switch config patch has shipped
[00:15:13] Gosh.
[00:15:16] usually you dont have to stay on it, as 99.99% it fixes it.
[00:15:22] (CR) EBernhardson: [C: 2] Enable completion suggester as default prefix algo on test/test2 [mediawiki-config] - https://gerrit.wikimedia.org/r/275593 (https://phabricator.wikimedia.org/T128774) (owner: EBernhardson)
[00:16:01] doesnt bode well that after sending a reboot there is zero console output...
[00:16:05] (PS1) Andrew Bogott: Define the horizon and wikitech hostnames in hiera. [puppet] - https://gerrit.wikimedia.org/r/275729
[00:16:07] (Merged) jenkins-bot: Enable completion suggester as default prefix algo on test/test2 [mediawiki-config] - https://gerrit.wikimedia.org/r/275593 (https://phabricator.wikimedia.org/T128774) (owner: EBernhardson)
[00:16:09] (CR) EBernhardson: [C: 2] Reduce replica count for commonswiki_file in codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/266658 (owner: EBernhardson)
[00:16:55] (Merged) jenkins-bot: Reduce replica count for commonswiki_file in codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/266658 (owner: EBernhardson)
[00:19:13] RoanKattouw: ok your's finally merged, it's up next
[00:20:26] !log ebernhardson@tin Synchronized wmf-config: T128774 enable completion suggester by default on test/test2wiki (duration: 02m 27s)
[00:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:20:47] Thanks ebernhardson
[00:23:27] Operations, ops-codfw: mw2122 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2097723 (RobH)
[00:23:47] RoanKattouw: shipped out. it will log whenever the other server times out...
[00:23:54] (CR) Andrew Bogott: [C: 2] "Puppet compiler approves!" [puppet] - https://gerrit.wikimedia.org/r/275729 (owner: Andrew Bogott)
[00:23:59] Thanks
[00:24:35] James_F: next up
[00:24:46] (CR) EBernhardson: [C: 2] Set VisualEditorSingleEditTabSwitchTime to correct dates [mediawiki-config] - https://gerrit.wikimedia.org/r/275270 (owner: Alex Monk)
[00:24:51] * James_F nods.
[00:25:24] (Merged) jenkins-bot: Set VisualEditorSingleEditTabSwitchTime to correct dates [mediawiki-config] - https://gerrit.wikimedia.org/r/275270 (owner: Alex Monk)
[00:25:35] !log ebernhardson@tin Synchronized php-1.27.0-wmf.15/extensions/Echo/Hooks.php: T128249 Try and avoid race conditions with thank-you-edit notifications (duration: 02m 23s)
[00:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:26:12] ebernhardson: Should be a no-op.
[00:26:41] (PS1) RobH: mw2122 offline [puppet] - https://gerrit.wikimedia.org/r/275730
[00:27:34] (PS2) RobH: mw2122 offline [puppet] - https://gerrit.wikimedia.org/r/275730
[00:27:47] hrmm, im in rebase race
[00:29:18] (CR) RobH: [C: 2] mw2122 offline [puppet] - https://gerrit.wikimedia.org/r/275730 (owner: RobH)
[00:30:05] !log ebernhardson@tin Synchronized wmf-config/: Set VisualEditorSingleEditTabSwitchTime to correct dates (duration: 02m 27s)
[00:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:30:18] James_F: all shipped out
[00:31:49] ebernhardson: LGTM.
[00:39:20] ebernhardson, are you doing the swat?
[00:40:18] i have last minute patch
[00:41:30] ebernhardson, could you deploy https://gerrit.wikimedia.org/r/#/c/275731/ pls
[00:41:44] MaxSem, +2ed, waiting for jenkins to behave
[00:42:34] yurik, cherrypick is in https://gerrit.wikimedia.org/r/#/c/275732/
[00:43:04] MaxSem, thx, waiting for confirmation from ebernhardson
[00:43:14] i don't want to step on his tows with my own depl :)
[00:46:42] James_F, are you done with your swat stuff?
[00:48:48] (PS3) Dduvall: Fix programdashboard hieradata [puppet] - https://gerrit.wikimedia.org/r/274572
[00:48:50] (PS2) Dduvall: labs: Expand paths for nuyaml hiera lookup under common [puppet] - https://gerrit.wikimedia.org/r/274566
[00:48:52] (PS5) Dduvall: labs: Database server to support Program Dashboard [puppet] - https://gerrit.wikimedia.org/r/275138 (https://phabricator.wikimedia.org/T127105)
[00:48:54] (PS3) Dduvall: labs: Deployer access for programdashboard [puppet] - https://gerrit.wikimedia.org/r/274579 (https://phabricator.wikimedia.org/T105967)
[00:49:33] greg-g, is ebernhardson still doingthe swat?
[00:52:14] yurik: Yes.
[00:52:20] uh?
[00:52:43] MaxSem, it seems the swat was done early, our patch is still pending
[00:53:20] greg-g, can we squeeze a fix in?
[00:53:42] (PS1) Aaron Schulz: Lowered $wgMaxUserDBWriteDuration to 5 [mediawiki-config] - https://gerrit.wikimedia.org/r/275734
[00:55:21] (PS6) Dduvall: labs: Database server to support Program Dashboard [puppet] - https://gerrit.wikimedia.org/r/275138 (https://phabricator.wikimedia.org/T127105)
[00:56:12] yurik: swat was done
[00:56:16] s/was/is/
[00:56:26] ok, I'll do it
[00:56:39] thx MaxSem
[00:56:52] ebernhardson, no worries, thx
[00:58:44] (PS1) Papaul: Adding new partman recipe for 4 disks raid10 with lvm [puppet] - https://gerrit.wikimedia.org/r/275738
[01:00:00] (PS1) Aaron Schulz: Lower "max lag" and $wgAPIMaxLagThreshold to 8/6 [mediawiki-config] - https://gerrit.wikimedia.org/r/275739
[01:00:42] PROBLEM - RAID on db2018 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[01:03:00] !log maxsem@tin Synchronized php-1.27.0-wmf.15/extensions/Kartographer/: https://gerrit.wikimedia.org/r/#/c/275732/ (duration: 02m 25s)
[01:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:03:11] yurik, ^
[01:03:22] MaxSem, woot!
[01:04:11] MaxSem, https://www.mediawiki.org/wiki/User:Yurik/sandbox
[01:04:15] check it out :)
[01:09:27] (PS2) Papaul: Adding new partman recipe for 4 disks raid10 with lvm [puppet] - https://gerrit.wikimedia.org/r/275738
[01:13:05] (PS7) Dduvall: labs: Database server to support Program Dashboard [puppet] - https://gerrit.wikimedia.org/r/275138 (https://phabricator.wikimedia.org/T127105)
[01:19:07] (CR) RobH: [C: -1] "It seems your editor inputs spaces, not tabs. Half of the recipe is tab formatted, and half is space. Convert it to all space format, or" [puppet] - https://gerrit.wikimedia.org/r/275738 (owner: Papaul)
[01:20:32] PROBLEM - mediawiki-installation DSH group on mw2122 is CRITICAL: Host mw2122 is not in mediawiki-installation dsh group
[01:20:59] argh, i need to ack that icinga cuz i totally did that
[01:21:22] does anybody know - do our varnish caches normalize query parameters? I.e. is a=1&b=2 and b=2&a=1 same cache entry or not?
[01:22:01] or, just return it to serivce cuz it magically resurrected itself
[01:22:03] what the helllllll
[01:22:53] i turned it off...
[01:23:00] (PS3) Papaul: Adding new partman recipe for 4 disks raid10 with lvm [puppet] - https://gerrit.wikimedia.org/r/275738
[01:23:43] hrmm, so if i reutnr this to mediawiki-isntallation dsh gropu
[01:23:56] i imagine its out of sync, return to dsh and does puppet fix or do i need to take another step?
[01:25:21] (PS1) RobH: Revert "mw2122 offline" [puppet] - https://gerrit.wikimedia.org/r/275741
[01:26:54] (PS2) Aaron Schulz: Enable async swift writes for remaining backends [mediawiki-config] - https://gerrit.wikimedia.org/r/272922
[01:26:58] (CR) RobH: [C: 2] Adding new partman recipe for 4 disks raid10 with lvm [puppet] - https://gerrit.wikimedia.org/r/275738 (owner: Papaul)
[01:27:44] SMalyshev: they don't normalize currently. it's something we could do, if we're sure the target apps aren't sensitive to order, though.
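[editor's aside] The exchange above asks whether a=1&b=2 and b=2&a=1 end up as the same cache entry; per the reply, they currently do not. As a rough illustration only, here is a minimal Python sketch of what normalizing query-argument order could look like on the client side before a request is sent. It is not the Varnish configuration discussed in the channel; the function name normalize_query_order and the example.org URLs are hypothetical. It sorts by parameter name only and keeps repeated parameters in their original relative order, since the reply notes that some target apps may be sensitive to order.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_query_order(url: str) -> str:
    """Return url with its query parameters sorted by name.

    Hypothetical helper for illustration. The sort is stable and keyed on
    the parameter name only, so repeated parameters (x=2&x=1) keep their
    original relative order.
    """
    parts = urlsplit(url)
    # keep_blank_values=True so that "a=" is not silently dropped.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs.sort(key=lambda pair: pair[0])
    return urlunsplit(parts._replace(query=urlencode(pairs)))


if __name__ == "__main__":
    first = normalize_query_order("https://example.org/w/api.php?a=1&b=2")
    second = normalize_query_order("https://example.org/w/api.php?b=2&a=1")
    # Both orderings now map to one URL, hence one cache key.
    print(first == second)
```

Note that urlencode re-encodes values, so unusual percent-encodings may come out spelled differently than they went in; whether that is acceptable depends on the target application.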
[01:28:14] (PS2) RobH: Revert "mw2122 offline" [puppet] - https://gerrit.wikimedia.org/r/275741
[01:29:24] robh, ssh into it and sync-common
[01:29:32] (PS1) Yurik: Fixed maps referrer header check [puppet] - https://gerrit.wikimedia.org/r/275743 (https://phabricator.wikimedia.org/T129187)
[01:29:41] MaxSem: excellent, thanks! i havent merged that live yet but will in a few seconds
[01:29:50] was just seeing if i can find hw errors, i havent yet
[01:30:20] Operations, ops-codfw: mw2122 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2097964 (RobH) Open>declined It seems it simply took 5+ minutes to boot, not sure whats up but it seems fine now. I've not found any other hw errors, so rejecting this task I made.
[01:30:39] robh, you don't need to repool it for sync-common to work, it just pulls from tin
[01:30:57] could someone please review https://gerrit.wikimedia.org/r/#/c/275743/1/templates/varnish/maps-frontend.inc.vcl.erb
[01:31:18] otherwise poor firefox users cannot see our maps :(
[01:31:24] MaxSem: thats how i thought it worked, but i also thought puppet runs sync-common
[01:31:33] i may be wrong, going to investigate
[01:31:44] simple regex fix ^^^
[01:32:05] (CR) RobH: [C: 2] Revert "mw2122 offline" [puppet] - https://gerrit.wikimedia.org/r/275741 (owner: RobH)
[01:33:14] robh, puppet most definitely shouldn't sync-common in cron because it'll just melt the deployment server
[01:33:24] duly noted
[01:33:39] Operations, Discovery, Kartographer, Maps, and 2 others: No tiles shown in Firefox - https://phabricator.wikimedia.org/T129187#2097969 (Yurik)
[01:33:40] ok, sync-common was run on mw2122 no issues, and its back in the dsh node list
[01:33:59] at least some time ago it used to sync after booting up and only then start apache
[01:34:09] not sure it it's still the case
[01:34:23] I may have just mis-recalled that, yeah.
[01:34:46] robh, if you have a sec afterwards, could you +2 https://gerrit.wikimedia.org/r/#/c/275743/ [01:34:59] bblack: well, target is wiki (wikidata) so it's probably not. But right now we're not doing it? [01:35:03] i get lots of cries for help :) [01:35:15] yurik: i was typing out an answer to you. I dont really ahve any busienss approving varnish vcl changes i dont think [01:35:34] i can compare to other ones and think it looks right, but one shouldnt +2 unless one could fix it if it melted [01:35:37] and i dont think i could. [01:36:00] SMalyshev: correct, we're not currently trying to normalize query arg order [01:36:02] * robh pokes at the misc-web cluster but only in terms of certificate mgmt not in varnish [01:36:09] oh, bblack is here :) [01:36:21] bblack, firefox users are crying :( https://gerrit.wikimedia.org/r/#/c/275743/ [01:36:23] yurik: i was putting lots of apologies in the original reply ;] [01:36:32] robh, no worries :) [01:36:34] I'm not really here, is it really urgent? [01:37:08] bblack, turns out none of firefox users can access maps because that regex is not allowing URLs without trailing slash :( [01:37:15] and we just deployed maps [01:37:38] (03CR) 10BBlack: [C: 031] Fixed maps referrer header check [puppet] - 10https://gerrit.wikimedia.org/r/275743 (https://phabricator.wikimedia.org/T129187) (owner: 10Yurik) [01:37:43] looks pretty ok, and maps-only, and easy to revert [01:37:51] but I can't babysit merge right now, will bbl [01:37:51] robh, ^ :) [01:37:59] bblack, thx, not to worry [01:38:27] yurik: Uh, that doesn't change the fact that if varnish suddenly fell over during the merge I likely am not qualified to fix it... [01:38:39] (03PS1) 10Papaul: netboot:fix partman recipe for sinistra Bug:T128796 [puppet] - 10https://gerrit.wikimedia.org/r/275744 (https://phabricator.wikimedia.org/T128796) [01:38:42] robh, brave new world :D [01:38:52] i wounder if we have anyone except brandon to do it? 
:) [01:39:16] i'm beginning to worry about the bus factor! [01:39:44] yurik: my issue is i can always merge, and if it breaks revert and hope for hte best. unfortunately, with varnish (or most caching) it tends to do things like cache the horrible resutls [01:39:46] my point exactly! ^^ [01:40:00] and our varnish documentation has a lot of 'dont touch this unless you really understand it' r [01:40:14] robh, i am not blaming you, i wouldn't want to break anything either :) [01:40:19] I am wholly understanding of your situation, but not sure if I can fix [01:41:01] the person I would call is brandon, and if unavailable, I'd have to resort to eu opsen [01:41:05] robh, i 'm not trying to get you to do it - afterall, firefox users can wait a bit longer i think if it wasn't working ofr them before. I'm worried that we have only one person to fix it, that's a problem :) [01:41:30] We are lacking a senior level systems admin user past the central tz is all ;] [01:41:58] ok, from now on, the next senior opsen has to come from asia [01:41:58] well, i take that back, daniel may be sr level but i know im not =[ [01:42:14] wfm, we need a caching center over there eventually [01:42:16] so two birds [01:42:41] robh, caching center implies possible legal liability, doesn't it? [01:43:07] well, anywhere has some, but we put caching centers on non-usa-soil [01:43:21] we used to have one in yaseo back when i started [01:43:25] plus of course the netherlands [01:43:38] good point [01:45:06] but swapping harddrives and setting up servers is a bit different job description than configuring global caching and traffic [01:45:15] so we might still need two ppl [01:48:42] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [01:49:31] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[01:49:44] robh: merged your patch [01:49:58] shit, sorry, got pm's mid merge [01:50:00] thank you [01:50:21] (03PS2) 10Dzahn: netboot:fix partman recipe for sinistra Bug:T128796 [puppet] - 10https://gerrit.wikimedia.org/r/275744 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [01:50:22] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [01:50:31] (03CR) 10Dzahn: [C: 032] netboot:fix partman recipe for sinistra Bug:T128796 [puppet] - 10https://gerrit.wikimedia.org/r/275744 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [01:50:58] robh: np [01:51:12] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [01:52:16] 6Operations, 10MobileFrontend, 10Traffic, 7Regression, 7user-notice: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2097986 (10Krinkle) [02:04:17] (03PS2) 10BBlack: Fixed maps referrer header check [puppet] - 10https://gerrit.wikimedia.org/r/275743 (https://phabricator.wikimedia.org/T129187) (owner: 10Yurik) [02:04:30] bblack, thanks! [02:04:38] yurik: yeah but re: all of the above, emergencies are rare, or should be. 
This doesn't really qualify as much of an emergency :P [02:04:45] (03CR) 10BBlack: [C: 032] Fixed maps referrer header check [puppet] - 10https://gerrit.wikimedia.org/r/275743 (https://phabricator.wikimedia.org/T129187) (owner: 10Yurik) [02:04:54] (03CR) 10BBlack: [V: 032] Fixed maps referrer header check [puppet] - 10https://gerrit.wikimedia.org/r/275743 (https://phabricator.wikimedia.org/T129187) (owner: 10Yurik) [02:05:15] bblack, i didn't know if it does or doesn't - so had to explain what it is so you can make a call :) [02:05:26] our operational philosophy is more-geared towards not having constant firefighting, rather than having lots of firefighters on duty :) [02:05:30] but yeah, emergency would be for WP to be out ;) [02:05:52] hehe, true that [02:06:14] bblack, this is what you have helped to build: https://www.mediawiki.org/wiki/Extension:Kartographer [02:09:14] !log krinkle@tin Synchronized php-1.27.0-wmf.15/resources/src/mediawiki.special/mediawiki.special.preferences.js: T122702 (duration: 02m 28s) [02:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:10:18] 6Operations, 10MobileFrontend, 10Traffic, 7Regression, 7user-notice: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2098016 (10Jdlrobson) I can replicate this locally n... [02:17:21] PROBLEM - puppetmaster https on labcontrol1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [02:21:12] RECOVERY - mediawiki-installation DSH group on mw2122 is OK: OK [02:33:14] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.15) (duration: 13m 50s) [02:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:38:45] andrewbogott: Did you intend to merge https://gerrit.wikimedia.org/r/#/c/275582/ ? [02:39:37] I did! 
But now I think I want to wait until tomorrow when I’m paying attention :) [02:49:47] (03PS3) 10Dzahn: ganglia: add unit file for systemd on jessie [puppet] - 10https://gerrit.wikimedia.org/r/275146 (https://phabricator.wikimedia.org/T123674) [03:05:27] !log Upgraded mw1017 to hhvm_3.12.1+dfsg-1 [03:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:09:37] (03PS6) 10BBlack: WIP: cache_app_route() w/ split [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) [03:10:45] (03CR) 10jenkins-bot: [V: 04-1] WIP: cache_app_route() w/ split [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [03:12:35] (03PS7) 10BBlack: WIP: cache_app_route() w/ split [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) [03:13:22] (03CR) 10jenkins-bot: [V: 04-1] WIP: cache_app_route() w/ split [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [03:16:15] (03PS8) 10BBlack: WIP: cache_app_route() w/ split [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) [03:26:06] (03PS4) 10Dzahn: ganglia: add unit file template for systemd [puppet] - 10https://gerrit.wikimedia.org/r/275146 (https://phabricator.wikimedia.org/T123674) [03:26:51] (03Abandoned) 10Ladsgroup: Use ensure_package instead of require_package in ORES [puppet] - 10https://gerrit.wikimedia.org/r/274912 (https://phabricator.wikimedia.org/T127975) (owner: 10Ladsgroup) [03:28:03] (03PS5) 10Dzahn: ganglia: add unit file template for systemd [puppet] - 10https://gerrit.wikimedia.org/r/275146 (https://phabricator.wikimedia.org/T123674) [03:45:01] 6Operations, 6Discovery, 10Kartographer, 10Maps, and 2 others: No tiles shown in Firefox - https://phabricator.wikimedia.org/T129187#2098123 (10Yurik) 5Open>3Resolved a:3Yurik [04:45:32] (03PS1) 10EBernhardson: 
Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 [05:17:42] PROBLEM - puppet last run on mw2140 is CRITICAL: CRITICAL: Puppet has 1 failures [05:43:11] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [100000000.0] [05:43:42] RECOVERY - puppet last run on mw2140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:26:44] akosiaris: pushed tags for patches except giella-core, I need to reupload package. [06:30:23] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:11] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:31] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:12] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:31] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:02] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:48:42] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [06:56:11] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:56:23] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:56:51] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:57:12] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:57:52] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:02] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[07:04:01] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [07:05:32] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [07:21:26] (03Abandoned) 10Giuseppe Lavagetto: scap: remove row A7 proxy [puppet] - 10https://gerrit.wikimedia.org/r/275517 (owner: 10Giuseppe Lavagetto) [07:27:35] (03PS1) 10Giuseppe Lavagetto: Remove decommissioned appservers [dns] - 10https://gerrit.wikimedia.org/r/275756 (https://phabricator.wikimedia.org/T126242) [07:33:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [07:33:21] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [07:47:41] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago [07:56:22] RECOVERY - MariaDB Slave SQL: s5 on db2038 is OK: OK slave_sql_state Slave_SQL_Running: Yes [07:56:31] RECOVERY - MariaDB Slave IO: s5 on db2038 is OK: OK slave_io_state Slave_IO_Running: Yes [07:59:41] s5 finished its schema change [08:00:08] !log starting schema change (s6 partitioning) on db2039 [08:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:04:47] ETA: 10 hours [08:18:37] (03PS3) 10Muehlenhoff: Enable base::firewall on eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/274715 (https://phabricator.wikimedia.org/T113343) [08:26:04] (03PS2) 10Muehlenhoff: Move dynamicproxy ferm rules into the novaproxy role [puppet] - 10https://gerrit.wikimedia.org/r/274962 [08:27:11] (03CR) 
10jenkins-bot: [V: 04-1] Move dynamicproxy ferm rules into the novaproxy role [puppet] - 10https://gerrit.wikimedia.org/r/274962 (owner: 10Muehlenhoff) [08:32:12] (03CR) 10Gehel: Build cirrus completion indices daily (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [08:38:10] 6Operations: OCG needs to migrate away from rdb1002 and get its own Redis instance - https://phabricator.wikimedia.org/T128491#2098256 (10elukey) Checked again this morning: ``` elukey@neodymium:~$ sudo -i salt -t 120 ocg100[123]* cmd.run 'curl -s http://localhost:8000/?command=health' ocg1003.eqiad.wmnet:... [08:39:58] (03PS1) 10Mobrovac: Introducing changeprop role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/275772 [08:40:00] (03PS1) 10Mobrovac: Assign changeprop service to scb cluster [puppet] - 10https://gerrit.wikimedia.org/r/275773 [08:40:02] (03PS1) 10Mobrovac: Setup LVS for changeprop service on scb cluster [puppet] - 10https://gerrit.wikimedia.org/r/275774 [08:40:55] (03CR) 10jenkins-bot: [V: 04-1] Assign changeprop service to scb cluster [puppet] - 10https://gerrit.wikimedia.org/r/275773 (owner: 10Mobrovac) [08:41:06] (03CR) 10jenkins-bot: [V: 04-1] Setup LVS for changeprop service on scb cluster [puppet] - 10https://gerrit.wikimedia.org/r/275774 (owner: 10Mobrovac) [08:43:14] !log Redis and Puppet stopped on rdb1002 (Job queue slave) as pre-step for Debian re-image [08:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:43:40] (03CR) 10Gehel: [C: 031] "Looks good to me. I have a question about using logrotate::conf (see inline comment) but I'm basically OK with merging this as-is." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268215 (owner: 10EBernhardson) [08:49:37] (03PS2) 10Mobrovac: Introducing changeprop role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/275772 [08:49:44] 6Operations, 10ops-codfw: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2098259 (10jcrespo) These are my comments after rechecking both the disks and the server roles: db2004 slot 8 - confirmed, but unused? Maybe pending decommission? I need to confirm it db2007 slot 0 - confi... [08:50:09] (03CR) 10Gehel: Add caching headers for nginx (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) (owner: 10Smalyshev) [09:04:12] 6Operations, 10ops-codfw: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2098265 (10jcrespo) The following servers respond to salt but do not have any known function: ``` db2001.codfw.wmnet: db2002.codfw.wmnet: db2003.codfw.wmnet: db2004.codfw.wmnet: db2005.codfw.wmnet: db2007... [09:05:45] 6Operations, 10ops-codfw: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2098267 (10jcrespo) Related T125827 [09:11:31] 6Operations, 10DBA: Investigate/decom db2001-db2008 - https://phabricator.wikimedia.org/T125827#2098270 (10jcrespo) @RobH I need a second confirmation that these servers are "assigned to me" administratively, and not to fundrising or someone else (as sometimes they reuse the same name). I made a first check an... 
[09:12:19] 6Operations, 10DBA: Investigate/decom db2001-db2008 - https://phabricator.wikimedia.org/T125827#2098272 (10jcrespo) a:3RobH [09:24:39] (03CR) 10Siebrand: Remove SVN admin and coder groups from mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224420 (https://phabricator.wikimedia.org/T105676) (owner: 10Alex Monk) [09:31:47] (03PS3) 10Jcrespo: Update s5 & s6 partitioning according to current row distribution [software] - 10https://gerrit.wikimedia.org/r/275454 (https://phabricator.wikimedia.org/T120513) [09:32:32] (03CR) 10Jcrespo: [C: 032 V: 032] Update s5 & s6 partitioning according to current row distribution [software] - 10https://gerrit.wikimedia.org/r/275454 (https://phabricator.wikimedia.org/T120513) (owner: 10Jcrespo) [09:53:10] (03PS1) 10Jcrespo: Revoke iron access; add salt-masters access for mysql management [puppet] - 10https://gerrit.wikimedia.org/r/275777 [09:58:01] (03CR) 10Gilles: [C: 031] rcstream: Add documentation link [puppet] - 10https://gerrit.wikimedia.org/r/271811 (owner: 10Krinkle) [09:59:29] (03CR) 10Gilles: [C: 031] mediawiki: Use [PT] instead of [L] for static.php rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/275582 (https://phabricator.wikimedia.org/T128747) (owner: 10Krinkle) [10:01:48] (03CR) 10Gilles: [C: 031] Report save timing by MediaWiki version [puppet] - 10https://gerrit.wikimedia.org/r/273990 (https://phabricator.wikimedia.org/T112557) (owner: 10Ori.livneh) [10:03:58] (03CR) 10Gilles: [C: 031] Added some jobqueue comments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275378 (owner: 10Aaron Schulz) [10:08:03] ACKNOWLEDGEMENT - puppet last run on ms-be2010 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi failed disk T129117 [10:08:51] (03PS2) 10Jcrespo: Revoke iron access; add salt-masters access for mysql management [puppet] - 10https://gerrit.wikimedia.org/r/275777 [10:08:55] (03PS1) 10Ema: Add basic support for varnishtest [puppet] - 
10https://gerrit.wikimedia.org/r/275779 [10:09:35] (03CR) 10Elukey: "Can we also add some comments for $wgJobQueueAggregator? It would be super useful, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275378 (owner: 10Aaron Schulz) [10:09:49] (03CR) 10jenkins-bot: [V: 04-1] Revoke iron access; add salt-masters access for mysql management [puppet] - 10https://gerrit.wikimedia.org/r/275777 (owner: 10Jcrespo) [10:12:29] (03PS3) 10Jcrespo: Revoke iron access; add salt-masters access for mysql management [puppet] - 10https://gerrit.wikimedia.org/r/275777 [10:13:01] (03CR) 10Gilles: [C: 031] Lowered $wgMaxUserDBWriteDuration to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (owner: 10Aaron Schulz) [10:13:40] (03CR) 10Gilles: [C: 031] Lower "max lag" and $wgAPIMaxLagThreshold to 8/6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275739 (owner: 10Aaron Schulz) [10:14:04] (03CR) 10Alex Monk: [C: 04-1] "please don't reformat those yaml files" [puppet] - 10https://gerrit.wikimedia.org/r/275773 (owner: 10Mobrovac) [10:14:12] RECOVERY - puppetmaster https on labcontrol1001 is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.063 second response time [10:14:47] (03CR) 10Gilles: [C: 031] Enable async swift writes for remaining backends [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272922 (owner: 10Aaron Schulz) [10:20:34] 6Operations, 10media-storage: Unable to undelete file - https://phabricator.wikimedia.org/T129212#2098361 (10Peachey88) [10:20:41] (03PS1) 10Muehlenhoff: Mention some more fixed CVE IDs in changelog [debs/linux44] - 10https://gerrit.wikimedia.org/r/275781 [10:21:39] (03CR) 10Muehlenhoff: [C: 032 V: 032] Provide new meta package for Linux 4.4 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/274727 (owner: 10Muehlenhoff) [10:24:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] Mention some more fixed CVE IDs in changelog [debs/linux44] - 10https://gerrit.wikimedia.org/r/275781 (owner: 10Muehlenhoff) [10:26:09] 
(03CR) 10Gilles: [C: 031] redis: declare /var/run/redis [puppet] - 10https://gerrit.wikimedia.org/r/268598 (owner: 10Ori.livneh) [10:28:18] (03CR) 10Giuseppe Lavagetto: "Thanks for the comments Alexandros, I will add the conftool data in a separate patch I'll merge before this one." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/275530 (https://phabricator.wikimedia.org/T129090) (owner: 10Giuseppe Lavagetto) [10:30:17] (03PS3) 10Giuseppe Lavagetto: lvs: add parsoid configuration in codfw [puppet] - 10https://gerrit.wikimedia.org/r/275530 (https://phabricator.wikimedia.org/T129090) [10:32:21] 6Operations, 10media-storage: Unable to undelete file - https://phabricator.wikimedia.org/T129212#2098391 (10fgiunchedi) p:5Triage>3High a:3fgiunchedi thanks, there was indeed an error with synchronizing `-deleted` containers, I've launched a replication now and will report back once it has finished [10:33:09] 6Operations: mw2212 unresponsive - https://phabricator.wikimedia.org/T129188#2098395 (10Peachey88) [10:33:11] 6Operations, 10ops-codfw: mw2122 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2098396 (10Peachey88) [10:33:35] (03PS2) 10Ema: Add basic support for varnishtest [puppet] - 10https://gerrit.wikimedia.org/r/275779 (https://phabricator.wikimedia.org/T128188) [10:33:50] 6Operations: mw2212 unresponsive - https://phabricator.wikimedia.org/T129188#2097531 (10Peachey88) 5duplicate>3Open Dup'ed wrong task. 
[10:37:11] (03PS1) 10Giuseppe Lavagetto: conftool: add pool data for parsoid in codfw [puppet] - 10https://gerrit.wikimedia.org/r/275784 (https://phabricator.wikimedia.org/T129090) [10:40:40] (03CR) 10ArielGlenn: "Having mysql db access move to the salt masters seems fine to me. I dislike a bit the way the grants are managed, because I am sure than ne" [puppet] - 10https://gerrit.wikimedia.org/r/275777 (owner: 10Jcrespo) [10:41:31] 6Operations, 10DBA: Investigate/decom db2001-db2008 - https://phabricator.wikimedia.org/T125827#2098420 (10jcrespo) @RobH @mark I think there is a mistake on the 5-year planning. I made a comment on the spreadsheet. Luckily, most of these do not need replacement. [10:49:21] (03PS2) 10Giuseppe Lavagetto: conftool: add pool data for parsoid in codfw [puppet] - 10https://gerrit.wikimedia.org/r/275784 (https://phabricator.wikimedia.org/T129090) [10:50:14] PROBLEM - puppetmaster https on labcontrol1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [10:50:29] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: add pool data for parsoid in codfw [puppet] - 10https://gerrit.wikimedia.org/r/275784 (https://phabricator.wikimedia.org/T129090) (owner: 10Giuseppe Lavagetto) [10:58:45] PROBLEM - puppet last run on db2064 is CRITICAL: CRITICAL: puppet fail [11:08:35] (03CR) 10Alexandros Kosiaris: [C: 031] lvs: add parsoid configuration in codfw [puppet] - 10https://gerrit.wikimedia.org/r/275530 (https://phabricator.wikimedia.org/T129090) (owner: 10Giuseppe Lavagetto) [11:13:36] (03CR) 10Jcrespo: "@Ariel, mysql grants do indeed require a refactoring (e.g. 
something similar to ssh keys or icinga management) plus avoiding them in plain" [puppet] - 10https://gerrit.wikimedia.org/r/275777 (owner: 10Jcrespo) [11:15:23] mariadb being mariadb: "Stage: 1 of 2 'copy to tmp table' 112% of stage done" [11:16:50] they went from 5.5 to 10, so of course their percentages go up to 112% [11:16:50] like rocket engines during launch [11:18:03] <_joe_> akosiaris: thank you sir [11:20:01] (03PS4) 10Giuseppe Lavagetto: lvs: add parsoid configuration in codfw [puppet] - 10https://gerrit.wikimedia.org/r/275530 (https://phabricator.wikimedia.org/T129090) [11:20:32] (03CR) 10Krinkle: [C: 04-1] "There are a few warnings from code that routinely exceeds the 6s limit. We need to fix and roll that out before lowering it further to avo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275734 (owner: 10Aaron Schulz) [11:22:24] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs: add parsoid configuration in codfw [puppet] - 10https://gerrit.wikimedia.org/r/275530 (https://phabricator.wikimedia.org/T129090) (owner: 10Giuseppe Lavagetto) [11:22:42] (03CR) 10Filippo Giunchedi: [C: 04-1] "IIRC we should delete only the eqiad.wmnet entries first, and once they are fully unracked the mgmt.eqiad.wmnet part (hostname + asset tag" [dns] - 10https://gerrit.wikimedia.org/r/275756 (https://phabricator.wikimedia.org/T126242) (owner: 10Giuseppe Lavagetto) [11:25:39] 6Operations, 10Datasets-General-or-Unknown: Some large files not being Rsync'd from stat1003 to datasets.wikimedia.org - https://phabricator.wikimedia.org/T127514#2098515 (10ArielGlenn) [Receiver] io timeout after 300 seconds -- exiting rsync error: timeout in data send/receive (code 30) at io.c(195) [Receiver... 
[11:25:58] (03PS1) 10ArielGlenn: make timeout in rsync header a parameter, increase value for stats rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/275787 (https://phabricator.wikimedia.org/T127514) [11:26:25] RECOVERY - puppet last run on db2064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:58] (03CR) 10jenkins-bot: [V: 04-1] make timeout in rsync header a parameter, increase value for stats rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/275787 (https://phabricator.wikimedia.org/T127514) (owner: 10ArielGlenn) [11:27:11] <_joe_> !log restarting pybal on lvs2003,6 to pick up the config change [11:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:28:33] (03PS2) 10ArielGlenn: make timeout in rsync header a parameter, increase value for stats rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/275787 (https://phabricator.wikimedia.org/T127514) [11:30:37] 6Operations, 10ops-codfw: mw2122 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2098523 (10Joe) 5declined>3Open p:5Triage>3Normal [11:31:05] 6Operations, 10MediaWiki-Uploading, 6Multimedia, 10Traffic, 10Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2098525 (10zhuyifei1999) A possible workaround is to use async chunked uploading, but pywikibot does not yet support so. [11:31:08] 6Operations, 10ops-codfw: mw2122 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2097723 (10Joe) the system is down again, and clearly needs hardware investigation. Removing it from the mediawiki-installation group. 
[11:32:23] <_joe_> It's pretty unbelievable I have to take care of this ^^ [11:34:20] (03PS1) 10Giuseppe Lavagetto: scap: temporarily remove mw2212 from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/275788 (https://phabricator.wikimedia.org/T129196) [11:34:26] (03PS3) 10Mforns: Increase log verbosity on reportupdater cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/275508 (https://phabricator.wikimedia.org/T126058) [11:34:49] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] scap: temporarily remove mw2212 from mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/275788 (https://phabricator.wikimedia.org/T129196) (owner: 10Giuseppe Lavagetto) [11:35:37] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2122 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2098534 (10Joe) Btw, icinga reports the system as being down since more than one day, so I'm unsure how it "booted". [11:38:58] 7Blocked-on-Operations, 10RESTBase: Long-term graphite aggregation for restbase.requests.varnish_requests API request metrics not working - https://phabricator.wikimedia.org/T121580#2098536 (10fgiunchedi) 5Open>3Resolved @gwicke I've changed xff on `varnish_requests` and data for last month now shows up, w... [11:40:23] (03CR) 10ArielGlenn: "The change can go as is, I understand it's not quickly fixable, I just want to flag it for the future." [puppet] - 10https://gerrit.wikimedia.org/r/275777 (owner: 10Jcrespo) [11:40:55] (03PS1) 10Giuseppe Lavagetto: conftool: fix parsoid servers fqdns [puppet] - 10https://gerrit.wikimedia.org/r/275790 [11:41:38] _joe_: no one looked at it last night? really? 
[11:42:08] <_joe_> apergos: they looked and failed to do anything, apparently [11:42:12] bah [11:42:28] it was 2 am for me so I didn't feel bad about leaving [11:42:38] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: fix parsoid servers fqdns [puppet] - 10https://gerrit.wikimedia.org/r/275790 (owner: 10Giuseppe Lavagetto) [11:44:17] _joe_: robh rebooted it and it apparently restarted after like ~5m [11:46:18] <_joe_> p858snake: yeah I read the tickets :) [11:48:10] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2122 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2097723 (10Peachey88) Original patchsets taking it out mw-installation that didn't get reported in here: https://gerrit.wikimedia.org/r/#/c/275730/ https://gerrit.wikimedia.org/r/#/c/275741... [11:49:34] (03PS1) 10Hashar: nodepool: lower task ratelimiting from 10 to 1 sec [puppet] - 10https://gerrit.wikimedia.org/r/275791 (https://phabricator.wikimedia.org/T113359) [11:50:24] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Setup LVS for parsoid in codfw - https://phabricator.wikimedia.org/T129090#2098572 (10Joe) 5Open>3Resolved [11:54:26] !log performing schema change (s7 partitioning) on db2040 [11:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:56:16] 7Puppet, 5Continuous-Integration-Scaling, 13Patch-For-Review: Hiera is not properly configured on Nodepool instances - https://phabricator.wikimedia.org/T129092#2094783 (10hashar) p:5Triage>3High [12:03:17] (03Abandoned) 10Faidon Liambotis: whitelisting equinix domain for spam assassin [puppet] - 10https://gerrit.wikimedia.org/r/274170 (https://phabricator.wikimedia.org/T128497) (owner: 10RobH) [12:04:42] PROBLEM - Host parsoid.svc.codfw.wmnet is DOWN: PING CRITICAL - Packet loss = 100% [12:04:54] _joe_: that you? 
[12:06:48] 6Operations, 10Monitoring, 7Graphite, 7HHVM, 13Patch-For-Review: check_graphite - "UNKNOWN: More than half of the datapoints are undefined " - https://phabricator.wikimedia.org/T105218#2098590 (10fgiunchedi) as expected the number of `UNKNOWN` dropped significantly to about 1/4th (including soft and hard... [12:07:14] (03CR) 10Hashar: "I might have fixed Hiera lookup on the Nodepool images. Did that by simply hardcoding $labsproject = 'contintcloud' which might be enough." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [12:07:19] 6Operations, 10Monitoring, 7Graphite, 7HHVM, 13Patch-For-Review: check_graphite - "UNKNOWN: More than half of the datapoints are undefined " - https://phabricator.wikimedia.org/T105218#2098591 (10fgiunchedi) a:3fgiunchedi [12:09:11] paravoid: ^ getting there... [12:12:32] <_joe_> godog: actually you had 206 today in 6 hours, which seems on par with the previous days except yesterday [12:12:54] _joe_: did you see the parsoid codfw page? [12:13:10] looks like wtp20xx boxes don't have the service IP bound on localhost [12:13:28] <_joe_> paravoid: uhm it was reconfigured [12:13:37] <_joe_> looking [12:13:46] _joe_: indeed, I'll check back tomorrow but yesterday is when I merged the change so I'm confident [12:13:46] I know, that's why I'm pinging you [12:13:58] <_joe_> and no I didn't get the page (still) [12:14:44] <_joe_> inet 10.2.1.28/32 scope global lo:LVS [12:15:02] <_joe_> so it's there... what's missing then [12:18:02] (03CR) 10Jcrespo: "+1 to everything you said!" [puppet] - 10https://gerrit.wikimedia.org/r/275777 (owner: 10Jcrespo) [12:19:29] not bound on lvs2003, not wtp20xx [12:19:30] my bad [12:19:42] <_joe_> paravoid: I was writing just that [12:19:56] <_joe_> so yeah I forgot to add the ip somewhere in our lvs classes :/ [12:20:09] s/classes/hiera maze/ [12:20:30] (03CR) 10Jcrespo: "maybe mariadb::client role is a bad name. 
Maybe it looks too light and could be confusing? Maybe mariadb::admin_client ?" [puppet] - 10https://gerrit.wikimedia.org/r/275777 (owner: 10Jcrespo) [12:20:53] <_joe_> paravoid: nope I am pretty sure it's everywhere in hiera [12:22:51] RECOVERY - Host parsoid.svc.codfw.wmnet is UP: PING OK - Packet loss = 0%, RTA = 37.06 ms [12:22:52] got it [12:22:54] fixing [12:23:05] <_joe_> ? [12:23:25] <_joe_> what did you do? [12:23:42] (03PS1) 10Faidon Liambotis: lvs: bind parsoid.svc to lvs2003/2006 [puppet] - 10https://gerrit.wikimedia.org/r/275797 [12:24:04] (03CR) 10Faidon Liambotis: [C: 032] lvs: bind parsoid.svc to lvs2003/2006 [puppet] - 10https://gerrit.wikimedia.org/r/275797 (owner: 10Faidon Liambotis) [12:24:05] <_joe_> yeah... [12:24:13] <_joe_> sigh [12:25:41] <_joe_> thanks paravoid :) [12:25:54] np :) [12:27:10] now, what was I doing :P [12:28:43] (03CR) 10Faidon Liambotis: "I realize this is still a WIP, but for the final version a better commit message would help tremendously :)" [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [12:32:22] (03PS3) 10Giuseppe Lavagetto: realm: add $::master_dc hash [puppet] - 10https://gerrit.wikimedia.org/r/275443 (https://phabricator.wikimedia.org/T125673) [12:32:54] <_joe_> paravoid: while you're reviewing patches, what do you think of ^^ [12:48:23] _joe_: looking [13:01:35] 6Operations, 10Monitoring, 10netops, 10Scap3 (scap3-adoption): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#2098628 (10faidon) [13:02:46] 6Operations, 10Monitoring, 10netops, 10Scap3 (scap3-adoption): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#2096359 (10faidon) Trebuchet was super broken for a while and the version we currently run is straight out of git — so I would be more than happy to switch to scap3 ASAP. Wh... 
[13:04:11] 6Operations: OCG needs to migrate away from rdb1002 and get its own Redis instance - https://phabricator.wikimedia.org/T128491#2098631 (10elukey) rdb1002 has been moved successfully to Debian, but it still needs to be placed on another location (Ganeti VM?) far from the rdb Redis Job queues. Keeping this task o... [13:10:07] (03PS1) 10ArielGlenn: fix up rsync args for datset1001 rsync to labs [puppet] - 10https://gerrit.wikimedia.org/r/275801 (https://phabricator.wikimedia.org/T128945) [13:12:56] (03CR) 10ArielGlenn: [C: 032] fix up rsync args for datset1001 rsync to labs [puppet] - 10https://gerrit.wikimedia.org/r/275801 (https://phabricator.wikimedia.org/T128945) (owner: 10ArielGlenn) [13:14:24] (03PS5) 10Hashar: contint: rsync server to hold jobs caches [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) [13:14:41] (03CR) 10Hashar: "Moved to modules/role/manifests/ci/castor/server.pp" [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) (owner: 10Hashar) [13:17:36] (03CR) 10Hashar: [C: 04-1] "@dduval comments by joe in previous patch needs to be taken in account. Then the patch has to be rebased since manifests/role/ci.pp is no" [puppet] - 10https://gerrit.wikimedia.org/r/208024 (owner: 10Dduvall) [13:18:09] (03CR) 10ArielGlenn: "I think either is fine tbh." 
[puppet] - 10https://gerrit.wikimedia.org/r/275777 (owner: 10Jcrespo) [13:18:11] (03PS3) 10Muehlenhoff: Move dynamicproxy ferm rules into the novaproxy role [puppet] - 10https://gerrit.wikimedia.org/r/274962 [13:19:28] 6Operations, 10Wikimedia-Mailing-lists: Password for FDC list and ownership - https://phabricator.wikimedia.org/T129165#2098661 (10Aklapper) [13:20:56] (03PS2) 10Hashar: contint: set pbuilder basepath to actual directory [puppet] - 10https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999) [13:21:19] (03CR) 10Hashar: "Moved to modules/role/manifests/ci/" [puppet] - 10https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999) (owner: 10Hashar) [13:24:28] (03PS2) 10Muehlenhoff: Install php5-readline on mediawiki maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/274931 (https://phabricator.wikimedia.org/T126262) [13:24:45] (03CR) 10Muehlenhoff: [C: 032 V: 032] Install php5-readline on mediawiki maintenance hosts [puppet] - 10https://gerrit.wikimedia.org/r/274931 (https://phabricator.wikimedia.org/T126262) (owner: 10Muehlenhoff) [13:25:48] (03PS6) 10Hashar: contint: Use slave-scripts/bin/php wrapper script [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [13:26:06] (03CR) 10Hashar: [C: 031] "Rebased, the role class got moved to modules/role/manifests/ci/slave/labs.pp" [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [13:27:04] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:28:45] (03PS16) 10Hashar: contint: Put mysql db on tmpfs for role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [13:29:02] (03CR) 10Hashar: "I have copy pasted Jan comment in the manifest." 
[puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [13:29:05] 6Operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=7 dev=sdh failed - https://phabricator.wikimedia.org/T127824#2098666 (10fgiunchedi) 5Open>3Resolved disk rebuilding, resolving [13:29:08] (03CR) 10jenkins-bot: [V: 04-1] contint: Put mysql db on tmpfs for role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [13:32:04] (03PS17) 10Hashar: contint: Put mysql db on tmpfs for role::ci::slave::labs [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [13:32:21] (03CR) 10Hashar: "Rebased and moved to modules/role/manifests/ci/slave/labs.pp" [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [13:37:46] (03CR) 10Hashar: [C: 031] "Cherry picked on integration puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/204528 (https://phabricator.wikimedia.org/T96230) (owner: 10coren) [13:37:51] (03CR) 10Hashar: [C: 031] "Cherry picked on integration puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/253322 (https://phabricator.wikimedia.org/T116017) (owner: 10Hashar) [13:37:55] (03CR) 10Hashar: [C: 031] "Cherry picked on integration puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/269103 (https://phabricator.wikimedia.org/T125999) (owner: 10Hashar) [13:38:03] (03CR) 10Hashar: "Cherry picked on integration puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/269370 (https://phabricator.wikimedia.org/T126211) (owner: 10Legoktm) [13:39:20] (03CR) 10Hashar: "This is no more on the integration puppet master since role classes got moved under modules/roles/" [puppet] - 10https://gerrit.wikimedia.org/r/208024 (owner: 10Dduvall) [13:44:10] godog: There was some discussion during the meeting yesterday about monitoring and improvements / changes to 
Graphite. [13:44:50] I'm missing the background there. How can I understand where we are going? [13:47:27] 6Operations, 6Services, 10hardware-requests: codfw: (2+2) sca & scb service clusters - https://phabricator.wikimedia.org/T128475#2098704 (10mark) Approved. [13:47:41] 6Operations, 6Services, 10hardware-requests: codfw: (2+2) sca & scb service clusters - https://phabricator.wikimedia.org/T128475#2098705 (10mark) a:5mark>3RobH [13:48:15] gehel: nice! tl;dr is that wrt graphite we're adding a second machine to expand disk space in https://phabricator.wikimedia.org/T126253 there's also a "graphite" tag and "monitoring" project with the relevant tasks [13:48:40] (03PS1) 10ArielGlenn: explicitly set perms on the empty directory for rsync deletes [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/275805 [13:49:07] godog: I was under the impression that we are looking to replace graphite with something else. Did I just misunderstand? [13:49:42] (03CR) 10ArielGlenn: [C: 032 V: 032] explicitly set perms on the empty directory for rsync deletes [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/275805 (owner: 10ArielGlenn) [13:52:18] (03PS1) 10ArielGlenn: pick up latest wmf labs rsync script which sets dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/275807 (https://phabricator.wikimedia.org/T128945) [13:53:01] gehel: not in the short term but yeah, OTOH having "distributed storage" for storing graphite data would make things better than they are now graphite-wise [13:53:10] 7Puppet, 5Continuous-Integration-Scaling, 13Patch-For-Review: Hiera is not properly configured on Nodepool instances - https://phabricator.wikimedia.org/T129092#2098713 (10hashar) a:3hashar [13:53:49] (03CR) 10ArielGlenn: [C: 032] pick up latest wmf labs rsync script which sets dir permissions [puppet] - 10https://gerrit.wikimedia.org/r/275807 (https://phabricator.wikimedia.org/T128945) (owner: 10ArielGlenn) [13:53:50] godog: what do you mean by "distributed storage" in this context? 
[13:54:25] godog: I suspect something more than consistent hashing by carbon-relay and multiple whisper backend... [13:57:23] gehel: yeah sth like cassandra, though if we can make it work operationally also consistent hashing with replication and multiple machines could work, I did outline some options a while ago at https://wikitech.wikimedia.org/wiki/Graphite/Scaling [13:58:16] godog: Thanks! [13:58:27] * gehel is adding a few more pages to his reading list [13:59:09] gehel: eheh, it is a bit outdated by now but let me know if you have any questions [13:59:29] godog: Don't worry! I am not done bothering you :P [13:59:47] hehe ok! [14:00:00] (03PS1) 10BBlack: cache_maps: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275808 [14:00:02] (03PS1) 10BBlack: cache_misc: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275809 [14:00:04] (03PS1) 10BBlack: cache_upload: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275810 [14:00:06] (03PS1) 10BBlack: cache_text: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275811 [14:00:08] (03PS1) 10BBlack: caches: default ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275812 [14:10:45] !log update reprepro with cassandra 2.1.13 T126629 [14:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:11:02] 6Operations, 10DBA: Implement mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#2098755 (10jcrespo) [14:13:08] (03Abandoned) 10Hashar: 0.1.1-wmf3: statsd and systemd support [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/224390 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [14:14:49] (03PS2) 10BBlack: cache_text: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275811 [14:14:51] (03PS2) 10BBlack: cache_upload: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275810 [14:14:53] (03PS2) 10BBlack: cache_misc: ulsfo->codfw T127492 [puppet] - 
10https://gerrit.wikimedia.org/r/275809 [14:14:55] (03PS2) 10BBlack: cache_maps: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275808 [14:14:57] (03PS2) 10BBlack: caches: default ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275812 [14:16:30] (03Abandoned) 10Hashar: beta: update hostname to have .deployment-prep. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265850 (owner: 10Hashar) [14:18:02] (03PS2) 10Giuseppe Lavagetto: mobileapps: point to $rb_primary, not to the local restbase cluster [puppet] - 10https://gerrit.wikimedia.org/r/275538 [14:18:04] (03PS2) 10Giuseppe Lavagetto: iegreview: use $parsoid_primary [puppet] - 10https://gerrit.wikimedia.org/r/275539 (https://phabricator.wikimedia.org/T125673) [14:18:06] (03PS2) 10Giuseppe Lavagetto: restbase: make restbase configuration $master_dc [puppet] - 10https://gerrit.wikimedia.org/r/275536 (https://phabricator.wikimedia.org/T126235) [14:18:08] (03PS2) 10Giuseppe Lavagetto: cxserver: use $rb_primary in configuring restbase urls [puppet] - 10https://gerrit.wikimedia.org/r/275537 (https://phabricator.wikimedia.org/T125065) [14:18:10] (03PS1) 10Giuseppe Lavagetto: parsoid::testing: use master_dc variables [puppet] - 10https://gerrit.wikimedia.org/r/275814 (https://phabricator.wikimedia.org/T124670) [14:21:02] 6Operations, 10media-storage: Unable to undelete file - https://phabricator.wikimedia.org/T129212#2098782 (10fgiunchedi) the sync has finished, @dmacks @Closedmouth could you try again? thanks! 
[14:23:28] (03PS4) 10Andrew Bogott: mediawiki: Use [PT] instead of [L] for static.php rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/275582 (https://phabricator.wikimedia.org/T128747) (owner: 10Krinkle) [14:24:53] 6Operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 3 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#2098800 (10jcrespo) [14:25:49] 6Operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases) - https://phabricator.wikimedia.org/T119642#2098803 (10jcrespo) [14:27:03] !log upgrade cassandra to 2.1.3 on restbase2001 [14:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:27:13] (03PS1) 10WMDE-leszek: Whitelist feeds included on Wikimedia Germany Engineering page on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) [14:28:55] (03CR) 10Faidon Liambotis: "Conceptually, this doesn't make sense — neither parsoid nor restbase have a "primary" like MW has (and even the long-term for MW is to get" [puppet] - 10https://gerrit.wikimedia.org/r/275443 (https://phabricator.wikimedia.org/T125673) (owner: 10Giuseppe Lavagetto) [14:29:11] _joe_: ^ [14:31:35] 6Operations, 10media-storage: Unable to undelete file - https://phabricator.wikimedia.org/T129212#2098814 (10Closedmouth) Yeah it worked for me, thanks. [14:32:31] (03PS1) 10Jcrespo: Update s7 partitioning [software] - 10https://gerrit.wikimedia.org/r/275816 [14:32:55] (03PS1) 10Andrew Bogott: Quiet down keystone logs for now. 
[puppet] - 10https://gerrit.wikimedia.org/r/275817 [14:33:46] (03CR) 10Jcrespo: [C: 032] Update s7 partitioning [software] - 10https://gerrit.wikimedia.org/r/275816 (owner: 10Jcrespo) [14:33:51] (03CR) 10Giuseppe Lavagetto: "It took me a day of grepping through the puppet codebase to find all the places where we are pointing to parsoid/restbase/the api, and I'm" [puppet] - 10https://gerrit.wikimedia.org/r/275443 (https://phabricator.wikimedia.org/T125673) (owner: 10Giuseppe Lavagetto) [14:33:54] (03CR) 10Jcrespo: [V: 032] Update s7 partitioning [software] - 10https://gerrit.wikimedia.org/r/275816 (owner: 10Jcrespo) [14:33:58] 6Operations, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2098815 (10Krenair) Which key did you add to which user? On wikitech-static I have a separate (different from prod, different from labs) key for my own account... [14:35:27] <_joe_> paravoid: I agree with your remarks in general, but in this specific case doing this will make our lives way easier [14:35:35] it's a hack [14:35:46] <_joe_> so what do you suggest? [14:36:08] <_joe_> to go around the code switching every '.eqiad' to '.codfw' just for the switchover? 
[14:36:13] I think I'd prefer slightly to just do s/eqiad/codfw/ on the manifests for now [14:36:24] and work on node/etcd integration [14:36:41] <_joe_> paravoid: I don't agree, that requires to maintain a much larger patch [14:37:10] it's a temporary patch though, right [14:37:18] the concept of a "master" parsoid or restbase doesn't exist [14:38:30] these are active-active setups -- and in the case of restbase they're even part of the same cluster [14:40:46] <_joe_> sorry I went offline again :/ [14:41:00] <_joe_> paravoid: ok so we might decide to have both active-active [14:41:04] <_joe_> that's an option [14:41:52] (03Abandoned) 10BBlack: caches: default ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275812 (owner: 10BBlack) [14:42:09] <_joe_> paravoid: the parametrization of the urls allows to do both things [14:42:51] <_joe_> I don't see an issue with using $parsoid_primary now that we actually want to switch traffic over, and using "$::site" after the switchover to keep parsoid active-active [14:42:55] 6Operations, 6Project-Admins, 3DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2098829 (10Aklapper) A #devops tag was created two days ago. No idea how that adds to the mix (and its description really does not help). 
[14:43:56] (03PS3) 10BBlack: cache_text: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275811 [14:43:58] (03PS3) 10BBlack: cache_upload: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275810 [14:44:00] (03PS3) 10BBlack: cache_misc: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275809 [14:44:02] (03PS3) 10BBlack: cache_maps: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275808 [14:44:04] (03PS1) 10BBlack: cache::route_table: no default [puppet] - 10https://gerrit.wikimedia.org/r/275818 (https://phabricator.wikimedia.org/T127481) [14:44:05] 6Operations, 6Project-Admins, 3DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2098833 (10faidon) >>! In T119944#2098829, @Aklapper wrote: > A #devops tag was created two days ago. No idea how that adds to the mix (and its description really d... [14:44:15] (03PS7) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [14:46:34] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:46:43] (03CR) 10Ottomata: [C: 031] "Happy to help babysit this, just ping me on IRC and we'll find some time." 
[puppet] - 10https://gerrit.wikimedia.org/r/274715 (https://phabricator.wikimedia.org/T113343) (owner: 10Muehlenhoff) [14:46:47] _joe_: I'm okay with it, if you're willing to make a bet with me that these primary variables are still going to be there in 6 months :P [14:46:49] 6Operations: Icinga disk space should also check inode usage - https://phabricator.wikimedia.org/T129222#2098836 (10MoritzMuehlenhoff) [14:46:59] 6Operations: Icinga disk space should also check inode usage - https://phabricator.wikimedia.org/T129222#2098848 (10MoritzMuehlenhoff) p:5Triage>3Normal [14:47:41] <_joe_> paravoid: that will mean that we're still not willing to run active/active for parsoid and restbase. make it 1 year? :P [14:47:46] (03CR) 10Ottomata: [C: 031] make timeout in rsync header a parameter, increase value for stats rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/275787 (https://phabricator.wikimedia.org/T127514) (owner: 10ArielGlenn) [14:47:53] 6Operations: Icinga disk space check should also check inode usage - https://phabricator.wikimedia.org/T129222#2098836 (10MoritzMuehlenhoff) [14:48:11] well sort of [14:48:24] I think puppet is fundamentally not the right place for switching datacenters anyway [14:48:31] these should be in either the service config or in etcd [14:48:32] <_joe_> paravoid: and on that, we agree [14:48:43] <_joe_> the service configs are in puppet [14:48:57] I know :/ [14:49:02] <_joe_> I know, it's sad, somehow [14:49:17] (03CR) 10jenkins-bot: [V: 04-1] Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [14:49:29] <_joe_> but I don't want to go around 25 software repos changing things next time we switch either [14:49:37] <_joe_> that's even worse than using puppet [14:50:00] so I still don't understand why aren't we just integrating this with etcd already [14:50:12] how difficult can it be, we've been discussing this for 3 weeks now [14:50:20] 
mobrovac: ^ [14:50:37] <_joe_> paravoid: it's pretty easy, it just takes someone having the time to do it I guess [14:51:09] paravoid: it's the same picture on the cache side: I'm putting switches in puppet hieradata. They'd be better off in etcd/confd, but that integration level is a lot harder. [14:51:23] !log upgrade cassandra to 2.1.3 on restbase2002 [14:51:27] mostly because of the whole mess of templating to confd to go templates and all that [14:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:52] it means rewriting a lot of templates from erb to erb.tpl and having confd generate them even for the first time, which races with other puppet things async [14:51:54] should be much easier for node services though [14:51:56] godog: s/2.1.3/2.1.13/ :) [14:52:23] they can just do an etcd lookup for their "backend" periodically [14:52:39] <_joe_> paravoid: which is an http call, yes [14:52:48] yes [14:52:58] <_joe_> cache that for 1 minute, was my original suggesion [14:53:08] (or ideally, watchers, I'm guessing -- but that's a premature optimization at this point) [14:53:13] why aren't we already working on _that_? [14:53:23] mobrovac: ? :) :) [14:53:40] because priorities and because my day has only 24h unfortunately [14:53:52] priorities? [14:53:54] <_joe_> pfft [14:53:59] (i won't mention the part that i'm paid for only 8 of those) :P [14:54:00] <_joe_> 24 hours [14:54:13] can you elaborate? [14:54:20] 6Operations: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2098875 (10Andrew) [14:54:25] the 24h part? 
:D [14:54:28] * mobrovac joking [14:54:41] paravoid: i'd be happy to schedule some time for that in the upcoming Qs [14:55:01] switching services between DCs is a quarterly goal this quarter, though [14:55:05] it's the top priority for the entire department [14:55:17] ok, but it's happening a week from now [14:55:23] two [14:55:29] with a plan B for Apr 18th [14:55:36] and we've been discussing it already for 3 weeks now? [14:55:42] (03PS8) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [14:56:20] paravoid: i distinctively remember gwicke suggesting DNS for this round, with _joe_ wanting to use puppet this time around [14:56:30] yes, i remember DNS too for this round [14:56:37] paravoid: i do agree that etcd / consul / etc would be "the way forward" though [14:57:27] so, DNS essentially means, we get to do the work and live with a solution that we don't like [14:57:34] and services does... what for this quarterly goal? [14:57:41] nothing, I guess? [14:57:52] s/svc.eqiad/svc/ on a couple of config files? :) [14:58:15] <_joe_> paravoid: I just did that ftr ;) [14:58:23] i'm not sure it's fair putting it that way paravoid ;) [14:58:49] what would be a fair assessment then? [14:58:50] 6Operations: on labcontrol1001, /var/cache/salt has too many files! 
- https://phabricator.wikimedia.org/T129224#2098891 (10Andrew) This is probably https://github.com/saltstack/salt/issues/10443 [14:59:46] paravoid: ofc we can help, but i don't think we are supposed to own part of your goal (or, at least, i dunno about it) [14:59:56] (03CR) 10BBlack: [C: 032] "compiler-verified no-op at all live clusters+sites" [puppet] - 10https://gerrit.wikimedia.org/r/275818 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [14:59:57] this is not an ops goal [15:00:01] this is a technology goal [15:00:04] it's *your* goal too [15:00:39] and the fact that even that fact is misunderstood is not very encouraging, honestly [15:00:44] hm, well, i guess it's good i got to know that before the Q was over [15:00:56] <_joe_> I think there is not much clarity about that, no [15:01:13] https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q3_Goals#Group_goal fwiw [15:03:22] 6Operations: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2098927 (10Andrew) neodymium is running the same version of salt, and yet it is purging its files properly [15:03:30] (03CR) 10BBlack: [C: 032] "compiler-verified no-op (maps has no caches in ulsfo/codfw presently)" [puppet] - 10https://gerrit.wikimedia.org/r/275808 (owner: 10BBlack) [15:04:13] paravoid: "Serve Swift, ElasticSearch and RESTBase, Parsoid services from codfw" -> so the issue comes down to "RB and Parsoid need to contact the right MW API" [15:04:48] and "servicewhatever need to contact the codfw RESTBase/Parsoid" [15:04:58] those two things, yes [15:05:14] RECOVERY - puppetmaster https on labcontrol1001 is OK: HTTP OK: Status line output matched 400 - 330 bytes in 0.071 second response time [15:05:36] paravoid: from there, we have only cxserver left, afaik, mobileapps config uses $mw_primary, and graphoid hits the public endpoint [15:05:52] for the first one, you mean? [15:06:15] first? 
[15:06:36] 6Operations: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2098950 (10Andrew) a:3ArielGlenn [15:06:40] "RB and Parsoid need to contact the right MW API" is (1) [15:06:49] "servicewhatever need to contact the codfw RESTBase/Parsoid" is (2) [15:07:11] (03PS1) 10Jcrespo: Repool db2038, db2039; depool db2035 for partitioning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275822 (https://phabricator.wikimedia.org/T120513) [15:07:12] 6Operations, 10Salt: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2098953 (10ArielGlenn) [15:07:17] (03PS5) 10Andrew Bogott: mediawiki: Use [PT] instead of [L] for static.php rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/275582 (https://phabricator.wikimedia.org/T128747) (owner: 10Krinkle) [15:07:29] also $mw_primary is clearly not the right solution, I think we've agreed on that [15:07:33] we can use it as a stopgap to meet our goal [15:08:04] 6Operations, 10Salt: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2098875 (10ArielGlenn) Never had this issue on palladium either. Do you know if they were in /var/cache/salt/master/jobs? 
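A rough sketch of the idea floated earlier in this discussion (node services doing a periodic etcd lookup for their "backend", as a plain HTTP call cached for one minute). Everything here is illustrative: the class name, the fetch helper and the URL in the comment are hypothetical, and the response parsing assumes the etcd v2 keys API shape. Nothing in the log says this is how any service actually does it.

```python
import json
import time
import urllib.request


def fetch_from_etcd(url):
    """Read one key over HTTP, assuming the etcd v2 keys API response shape."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return json.load(resp)["node"]["value"]


class CachedLookup:
    """Cache the result of fetch() for ttl seconds; serve the stale value on errors."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._value = None
        self._expires = None  # None means nothing has been fetched yet

    def get(self):
        now = self._clock()
        if self._expires is None or now >= self._expires:
            try:
                self._value = self._fetch()
            except OSError:
                # Keep serving the last known value if the lookup fails,
                # but re-raise when there is nothing cached at all.
                if self._expires is None:
                    raise
            # Whether refreshed or stale, do not retry before the next ttl.
            self._expires = now + self._ttl
        return self._value


# Hypothetical wiring (the host and key path are made up for illustration):
# backend = CachedLookup(
#     lambda: fetch_from_etcd("http://etcd.example:2379/v2/keys/hypothetical/mw_primary"))
# api_host = backend.get()
```

The fetch function and the clock are injectable so the caching logic can be exercised without a live etcd. Watchers, which are mentioned in the discussion as a possible later optimization, are deliberately left out.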
[15:08:09] however, I would very much like the team to contribute to the goal by making efforts for the right solution, rather than sit around waiting for some stopgaps to be implemented by ops [15:08:11] yes, i think we agree on that paravoid [15:08:39] (puppet being the stop-gap) [15:09:34] (03CR) 10Andrew Bogott: [C: 032] mediawiki: Use [PT] instead of [L] for static.php rewrite rules [puppet] - 10https://gerrit.wikimedia.org/r/275582 (https://phabricator.wikimedia.org/T128747) (owner: 10Krinkle) [15:09:37] (03CR) 10BBlack: [C: 032] cache_misc: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275809 (owner: 10BBlack) [15:09:42] !log switching misc-web cache routing: ulsfo->codfw [15:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:09:59] paravoid: for the first part, for RB should be an easy ops/puppet config change to point to mw_primary [15:10:19] (03PS4) 10BBlack: cache_misc: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275809 [15:10:20] paravoid: Parsoid is trickier, because it keeps the config in it own repo, so no mw_primary there [15:10:25] (03CR) 10BBlack: [V: 032] cache_misc: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275809 (owner: 10BBlack) [15:12:51] (03PS2) 10Andrew Bogott: nodepool: lower task ratelimiting from 10 to 1 sec [puppet] - 10https://gerrit.wikimedia.org/r/275791 (https://phabricator.wikimedia.org/T113359) (owner: 10Hashar) [15:12:58] bblack-mba:~ bblack$ curl -sv https://phabricator.wikimedia.org/ --resolve 'phabricator.wikimedia.org:443:198.35.26.120' 2>&1 >/dev/null |egrep -- '< HTTP/1|X-Cache' [15:13:01] < HTTP/1.1 200 OK [15:13:03] < X-Cache: cp1058 pass(0), cp2012 miss(0), cp4002 pass(0), cp4003 frontend miss(0) [15:13:13] (03CR) 10Jcrespo: [C: 032] Repool db2038, db2039; depool db2035 for partitioning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275822 (https://phabricator.wikimedia.org/T120513) (owner: 10Jcrespo) 
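The curl check above reads the routing change off the X-Cache header. A small sketch of doing the same programmatically, under two assumptions the log does not state outright: that the leading digit of a cp host number identifies its site (1 = eqiad, 2 = codfw, 4 = ulsfo), and that the header lists the hop closest to the application first. The function names are made up for illustration.

```python
import re

# Assumption: the first digit of the host number identifies the site.
SITE_BY_DIGIT = {"1": "eqiad", "2": "codfw", "4": "ulsfo"}

# One entry looks like "cp4003 frontend miss(0)" or "cp2012 miss(0)".
HOP_RE = re.compile(r"^(cp(\d)\d+)\s+(?:(frontend)\s+)?(\w+)\((\d+)\)$")


def parse_x_cache(value):
    """Split an X-Cache header value into one dict per cache hop."""
    hops = []
    for part in value.split(","):
        match = HOP_RE.match(part.strip())
        if match is None:
            raise ValueError(f"unrecognised X-Cache entry: {part!r}")
        host, digit, frontend, status, hits = match.groups()
        hops.append({
            "host": host,
            "site": SITE_BY_DIGIT.get(digit, "unknown"),
            "layer": "frontend" if frontend else "backend",
            "status": status,
            "hits": int(hits),
        })
    return hops


def site_path(value):
    """Return the sites traversed, nearest the client first, without repeats."""
    sites = []
    for hop in reversed(parse_x_cache(value)):
        if not sites or sites[-1] != hop["site"]:
            sites.append(hop["site"])
    return sites


header = "cp1058 pass(0), cp2012 miss(0), cp4002 pass(0), cp4003 frontend miss(0)"
print(site_path(header))
```

With the header value pasted from the log, the path comes out as ulsfo, then codfw, then eqiad, which matches the ulsfo->codfw routing switch being verified.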
[15:14:10] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2098962 (10Joe) since it seems clear to me that this system will not make it on time for the switchover, I'll temporarily re... [15:14:35] (03CR) 10Andrew Bogott: [C: 032] nodepool: lower task ratelimiting from 10 to 1 sec [puppet] - 10https://gerrit.wikimedia.org/r/275791 (https://phabricator.wikimedia.org/T113359) (owner: 10Hashar) [15:15:10] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2038, db2039; depool db2035 for partitioning (duration: 00m 40s) [15:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:24] PROBLEM - HTTPS on titanium is CRITICAL: SSL CRITICAL - Certificate archiva.wikimedia.org valid until 2016-04-07 15:16:02 +0000 (expires in 29 days) [15:18:45] bblack: \o/ \o/ [15:19:14] (03PS1) 10Alex Monk: Properly remove SVN Admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275824 (https://phabricator.wikimedia.org/T105676) [15:20:19] (03CR) 10DCausse: [C: 031] Build cirrus completion indices daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [15:24:37] 6Operations, 10Salt: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2098987 (10ArielGlenn) The fixes from the bug report mentioned above (10433) are both in our version of salt, in returners/local_cache.py, one for a missing function in Python 2.6 (which we... 
[15:24:46] !log upgrade cassandra to 2.1.13 on restbase2003 [15:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:26:54] (03CR) 10Gehel: Build cirrus completion indices daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [15:27:09] (03CR) 10Gehel: [C: 031] Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [15:29:00] !log upgrade cassandra to 2.1.13 on restbase2004 [15:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:31] !log upgrade cassandra to 2.1.13 on restbase2005 [15:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:37] \o/ [15:33:10] (03PS4) 10BBlack: cache_upload: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275810 [15:33:29] (03CR) 10BBlack: [C: 032 V: 032] cache_upload: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275810 (owner: 10BBlack) [15:34:14] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:34:45] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2122 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2099048 (10greg) 2122 or 2212? task title says 2122 but @joe removed 2212. I filed T129188 about 2212 :) [15:34:58] numbers are hard [15:35:22] (03CR) 10Physikerwelt: Services: introduce service::packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [15:36:03] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [15:36:34] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:36:41] euh? [15:36:45] godog: that you ^ [15:36:47] ? 
[15:36:49] !log switching upload cache routing: ulsfo->codfw [15:36:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:37:27] bblack-mba:~ bblack$ curl -sv 'https://upload.wikimedia.org/wikipedia/commons/thumb/f/fb/Idioma_occitano_dialectos.png/260px-Idioma_occitano_dialectos.png?x=y' --resolve 'upload.wikimedia.org:443:198.35.26.112' 2>&1 >/dev/null |egrep -- '< HTTP|X-Cache' [15:37:31] < HTTP/1.1 200 OK [15:37:33] < X-Cache: cp1064 miss(0), cp2020 miss(0), cp4005 miss(0), cp4005 frontend hit(2) [15:38:14] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [15:40:00] mobrovac: nope [15:40:24] bblack: nice! [15:40:34] !log upgrade cassandra to 2.1.13 on restbase2006 [15:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:39] 6Operations, 6Project-Admins, 3DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2099111 (10Aklapper) >>! In T119944#2098833, @faidon wrote: >>>! In T119944#2098829, @Aklapper wrote: >> A #devops tag was created two days ago. No idea how that ad... [15:48:07] 6Operations, 10media-storage: Unable to undelete file - https://phabricator.wikimedia.org/T129212#2099115 (10Aklapper) 5Open>3Resolved Thanks for the feedback! Closing per last comment. [15:48:20] 6Operations, 10Monitoring, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#1256586 (10fgiunchedi) also `ccics_vol_status` seems to be limited to hp dl380 gen8, on a dl380 gen9 e.g. `ms-be2020` it doesn't work... 
[15:49:20] (03PS1) 10Muehlenhoff: Add ferm rules for statsdlb [puppet] - 10https://gerrit.wikimedia.org/r/275829 [15:49:22] (03PS1) 10Muehlenhoff: Add ferm rules for carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/275830 [15:53:11] 6Operations, 10Salt: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2098875 (10chasemp) Just a note I think we do not check inode usage w/ our default check_disk params: > /usr/lib/nagios/plugins/check_disk -w 6% -c 3% -l -e -A -i "/srv/sd[a-b][1-3]" We co... [15:54:10] 6Operations, 6Services, 10hardware-requests: codfw: (2+2) sca & scb service clusters - https://phabricator.wikimedia.org/T128475#2099139 (10RobH) [15:56:17] 6Operations, 6Services, 10hardware-requests: codfw: (2+2) sca & scb service clusters - https://phabricator.wikimedia.org/T128475#2099173 (10RobH) The following systems have been allocated for this request: sca2001 : WMF6378 sca2002 : WMF6380 sca2003 : WMF6381 sca2004 : WMF6384 [15:56:32] 6Operations: Icinga disk space check should also check inode usage - https://phabricator.wikimedia.org/T129222#2098836 (10chasemp) notes https://phabricator.wikimedia.org/T129224#2099135 [15:59:36] (03PS1) 10Muehlenhoff: Add ferm rules for carbon (python) [puppet] - 10https://gerrit.wikimedia.org/r/275833 [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160308T1600). [16:00:04] James_F csteipp mafk schana: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:17] * mafk present [16:00:34] I'm here [16:00:59] I can SWAT. 
[16:01:36] I can't :) [16:02:43] o/ [16:03:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: 10MarcoAurelio) [16:03:33] * James_F is here too. [16:03:45] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275285 (https://phabricator.wikimedia.org/T128948) (owner: 10MarcoAurelio) [16:03:53] (03Merged) 10jenkins-bot: Modify throttle settings for frwiki and cawiki due to Workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275287 (https://phabricator.wikimedia.org/T129018) (owner: 10MarcoAurelio) [16:03:55] oh, hey everybody :) [16:04:26] (03Merged) 10jenkins-bot: Permissions configuration changes for gl.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275285 (https://phabricator.wikimedia.org/T128948) (owner: 10MarcoAurelio) [16:04:47] thcipriani@tin etc etc :D [16:06:44] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Permissions configuration changes for gl.wikipedia [[gerrit:275285]] and Modify throttle settings for frwiki and cawiki due to Workshop [[gerrit:275287]] (duration: 00m 28s) [16:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:50] ^ mafk check please [16:07:26] thcipriani: worksforme [16:07:34] mafk: cool, thanks for checking. 
[16:07:48] thank you for deploying [16:07:56] (03PS4) 10Thcipriani: Enable VisualEditor for new accounts on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271712 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester) [16:08:02] 6Operations, 6Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2099212 (10RobH) [16:08:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271712 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester) [16:08:25] (03PS1) 10RobH: setting sc[a-b]200[1-2] dns entries [dns] - 10https://gerrit.wikimedia.org/r/275835 [16:09:08] (03Merged) 10jenkins-bot: Enable VisualEditor for new accounts on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271712 (https://phabricator.wikimedia.org/T127881) (owner: 10Jforrester) [16:09:32] 6Operations, 10Salt: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2099229 (10Andrew) >>! In T129224#2098953, @ArielGlenn wrote: > Never had this issue on palladium either. Do you know if they were in /var/cache/salt/master/jobs? Yes, I think that's wher... 
[16:10:20] (03CR) 10RobH: [C: 032] setting sc[a-b]200[1-2] dns entries [dns] - 10https://gerrit.wikimedia.org/r/275835 (owner: 10RobH) [16:11:53] !log thcipriani@tin Synchronized dblists/visualeditor-default.dblist: SWAT: Enable VisualEditor for new accounts on the German Wikipedia PART I [[gerrit:271712]] (duration: 00m 32s) [16:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:10] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274795 (https://phabricator.wikimedia.org/T127445) (owner: 10CSteipp) [16:12:30] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor for new accounts on the German Wikipedia PART II [[gerrit:271712]] (duration: 00m 29s) [16:12:32] ^ James_F check please [16:12:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:35] !log upgrade cassandra to 2.1.13 on restbase1001 [16:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:13:07] (03PS2) 10Chad: Gerrit: install standard base on new server lead [puppet] - 10https://gerrit.wikimedia.org/r/274150 (https://phabricator.wikimedia.org/T126794) [16:13:46] thcipriani: Hmm. [16:13:46] 6Operations, 13Patch-For-Review: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2099245 (10elukey) Hi @aaron, adding some comments: >>! In T128730#2096841, @aaron wrote: > Why not either: > a) Just depool it totally (leavi... [16:14:39] thcipriani: It's being a bit odd. [16:14:42] (03Merged) 10jenkins-bot: Update pbkdf2 hash parameters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274795 (https://phabricator.wikimedia.org/T127445) (owner: 10CSteipp) [16:14:58] odd's not good. [16:15:08] No, not so much. [16:15:12] * James_F checks his patch. 
[16:15:39] (03CR) 10Ottomata: "Hmm, It'd be nice to be able to recover from something more recent than a week old. If the mysql host explodes, and our latest back up is" [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [16:15:43] PROBLEM - cassandra service on restbase1002 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [16:15:54] PROBLEM - cassandra service on restbase1003 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [16:15:55] PROBLEM - restbase endpoints health on restbase1002 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [16:15:58] checking ^ [16:16:03] PROBLEM - cassandra CQL 10.64.32.159:9042 on restbase1003 is CRITICAL: Connection refused [16:16:05] PROBLEM - restbase endpoints health on restbase1003 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [16:16:12] (03CR) 10Ottomata: "Oh, weekly backups + binary logs would be fine. Is that something we can do easily?" 
[puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [16:16:14] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp main page via mobile-sections-lead returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Barack Obama page via mobile-sections-lead) i [16:16:18] I broke it :( [16:16:23] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [16:16:41] thcipriani: It looks like just a cacheing blip. [16:16:43] PROBLEM - cassandra CQL 10.64.0.221:9042 on restbase1002 is CRITICAL: Connection refused [16:16:45] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [16:16:45] PROBLEM - restbase endpoints health on restbase1006 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [16:16:45] PROBLEM - restbase endpoints health on restbase1001 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [16:16:45] PROBLEM - restbase endpoints health on restbase1004 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [16:16:45] PROBLEM - restbase endpoints health on restbase1005 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get 
summary from storage returned the unexpected status 500 (expecting: 200) [16:16:49] !log restarting cassandra on restbase1* [16:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:03] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200) [16:17:04] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [16:17:05] (03CR) 10Ottomata: "The raw db size is 40G now, and I think we are about to change some settings that will make it smaller. Full daily backups keeping only a" [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [16:17:08] James_F: ah, ok, cool. Continuing then. [16:17:15] thcipriani: Thank you. [16:17:24] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp main page via mobile-sections-lead returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Barack Obama page via mobile-sections-lead) i [16:17:36] akosiaris: hiyaaa, if you get a sec, could you look at https://gerrit.wikimedia.org/r/#/c/273312/ today? 
[16:17:39] mostly about the bacula part [16:17:44] PROBLEM - cassandra service on restbase1006 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [16:17:53] PROBLEM - cassandra CQL 10.64.48.99:9042 on restbase1005 is CRITICAL: Connection refused [16:18:13] PROBLEM - cassandra service on restbase1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed [16:18:43] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [16:18:50] mobrovac urandom gwicke ^ I accidentally restarted cassandra in eqiad while doing the package upgrade, restarting now [16:19:03] PROBLEM - cassandra CQL 10.64.48.100:9042 on restbase1006 is CRITICAL: Connection refused [16:19:14] RECOVERY - cassandra service on restbase1002 is OK: OK - cassandra is active [16:19:14] !log thcipriani@tin Synchronized wmf-config: SWAT: Update pbkdf2 hash parameters [[gerrit:274795]] (duration: 00m 31s) [16:19:17] ^ csteipp check please [16:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:24] RECOVERY - cassandra service on restbase1003 is OK: OK - cassandra is active [16:19:33] RECOVERY - cassandra service on restbase1006 is OK: OK - cassandra is active [16:19:53] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:19:54] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [16:19:54] RECOVERY - cassandra service on restbase1005 is OK: OK - cassandra is active [16:19:58] thcipriani: Looks good, thanks! [16:20:05] RECOVERY - cassandra CQL 10.64.0.221:9042 on restbase1002 is OK: TCP OK - 0.004 second response time on port 9042 [16:20:10] csteipp: cool, thanks for checking. [16:20:23] 6Operations, 10Salt: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2099248 (10Andrew) Ori suggests tmpreaper::dir if we can't get salt to handle things itself. 
[16:20:23] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [16:20:23] RECOVERY - restbase endpoints health on restbase1001 is OK: All endpoints are healthy [16:20:23] RECOVERY - restbase endpoints health on restbase1005 is OK: All endpoints are healthy [16:20:23] RECOVERY - restbase endpoints health on restbase1004 is OK: All endpoints are healthy [16:20:23] RECOVERY - restbase endpoints health on restbase1006 is OK: All endpoints are healthy [16:20:35] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [16:20:45] RECOVERY - cassandra CQL 10.64.48.100:9042 on restbase1006 is OK: TCP OK - 0.001 second response time on port 9042 [16:20:46] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275669 (https://phabricator.wikimedia.org/T125946) (owner: 10Nschaaf) [16:21:03] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [16:21:14] RECOVERY - restbase endpoints health on restbase1002 is OK: All endpoints are healthy [16:21:14] RECOVERY - cassandra CQL 10.64.32.159:9042 on restbase1003 is OK: TCP OK - 0.004 second response time on port 9042 [16:21:23] RECOVERY - cassandra CQL 10.64.48.99:9042 on restbase1005 is OK: TCP OK - 0.005 second response time on port 9042 [16:21:24] RECOVERY - restbase endpoints health on restbase1003 is OK: All endpoints are healthy [16:21:50] thcipriani: sorry about the noise during SWAT :( [16:22:26] godog: no problem for me :) [16:23:52] (03Merged) 10jenkins-bot: Remove reader segmentation survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275669 (https://phabricator.wikimedia.org/T125946) (owner: 10Nschaaf) [16:25:25] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Remove reader segmentation survey [[gerrit:275669]] (duration: 00m 25s) [16:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:30] ^ schana check please 
[16:25:39] (03PS1) 10Giuseppe Lavagetto: mediawiki::maintenance: add codfw host, multidc support [puppet] - 10https://gerrit.wikimedia.org/r/275837 (https://phabricator.wikimedia.org/T126987) [16:25:50] <_joe_> paravoid: ^^ [16:26:05] thcipriani: looks good. thanks [16:26:17] schana: neat. Thanks for checking! [16:26:38] <_joe_> (and yeah, now I'm off for reals) [16:27:26] hi schana. checking. [16:27:44] (03CR) 10Paladox: [C: 031] Gerrit: install standard base on new server lead [puppet] - 10https://gerrit.wikimedia.org/r/274150 (https://phabricator.wikimedia.org/T126794) (owner: 10Chad) [16:30:18] 6Operations, 10Traffic: confctl: improve/upgrade --tags/--find - https://phabricator.wikimedia.org/T128199#2099271 (10Joe) p:5Triage>3Low [16:30:45] (03PS1) 10RobH: wrong asset tag for mgmt on scb2001 [dns] - 10https://gerrit.wikimedia.org/r/275840 [16:30:46] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2099272 (10Papaul) Recipe had a problem so installation didn't complete. Fixing the recipe now. 
[16:31:24] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:32:42] (03PS2) 10RobH: wrong asset tag for mgmt on scb2001 [dns] - 10https://gerrit.wikimedia.org/r/275840 [16:33:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:34:50] !log upgrade cassandra to 2.1.13 on restbase1007 [16:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:07] (03CR) 10RobH: [C: 032] wrong asset tag for mgmt on scb2001 [dns] - 10https://gerrit.wikimedia.org/r/275840 (owner: 10RobH) [16:36:31] 6Operations, 6Services, 10hardware-requests: codfw: (2+2) sca & scb service clusters - https://phabricator.wikimedia.org/T128475#2099291 (10RobH) [16:36:33] 6Operations, 6Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2099289 (10RobH) 5Open>3Resolved [16:36:58] eh? [16:37:18] why did a patchset resolve my task? [16:37:24] when it was just suppsoed to reference it? [16:39:06] likely the commit message "fix T..." [16:39:20] IOW no Bug: [16:39:33] 6Operations, 6Services, 10hardware-requests: codfw: (2+2) sca & scb service clusters - https://phabricator.wikimedia.org/T128475#2099314 (10RobH) [16:39:36] 6Operations, 6Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2099312 (10RobH) 5Resolved>3Open I have no idea why phabricator resolved this task from that patchset. That was not my intention, so re-opening. [16:39:44] ohhh [16:39:57] robh, you should also have put 'Bug: ' before the ticket ID [16:40:14] Krenair: i dont like doing that then it puts patch for review on task [16:40:21] when no bug: just the Task means it shows post merge [16:40:22] which is fine [16:40:49] I thought it was allowed in both ways, if its required I put big: then i can change my workflow. 
[16:40:53] bug even [16:41:09] robh: I loathe the useless "patch for review" spam too, but apparently people like it. :-( [16:41:26] if i was asking someone else to review i wouldn't dislike it as much [16:41:39] but i dont get reviews on the mac address and os install steps, its pointless [16:42:06] "useless" [16:42:09] phab being smart and parsing my phrases in commit message to resolve task was messing with me [16:42:24] 6Operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2099319 (10Joe) Recapitulating what I *understand* to be the best thing to do: Prep work that still needs to be completed: # Add two new servers to the codfw cluster... [16:45:15] (03PS1) 10Papaul: fix: system coulnd't read the recipe Adding the last line to eliminate swap dialogue box during installation Bug:T128796 [puppet] - 10https://gerrit.wikimedia.org/r/275841 (https://phabricator.wikimedia.org/T128796) [16:46:30] !log upgrade cassandra to 2.1.13 on restbase1008 [16:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:48:13] 6Operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2099341 (10jcrespo) > mediawiki goes read-only - this should ensure no new job gets enqueued, right? That is what I would expect, but I could not verify last time I d... [16:52:52] 6Operations, 6Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2099360 (10RobH) [16:53:08] 6Operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Improve documentation about database switchover - https://phabricator.wikimedia.org/T129236#2099362 (10jcrespo) [16:53:14] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2099377 (10Ottomata) I thought about this some more, and talked with the Analytics team. 
Let's go with the smaller spare (32G, 4 core). I think it will be a rare circumstance that we'd sa... [16:53:45] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2221 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2099380 (10Papaul) [16:55:57] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2099401 (10Ottomata) Hm, @robh, we could alternatively use WMF4541 for this, no? [16:56:54] (03CR) 10RobH: [C: 032] fix: system coulnd't read the recipe Adding the last line to eliminate swap dialogue box during installation Bug:T128796 [puppet] - 10https://gerrit.wikimedia.org/r/275841 (https://phabricator.wikimedia.org/T128796) (owner: 10Papaul) [16:57:29] !log upgrade cassandra to 2.1.13 on restbase1009 [16:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:58:55] (03PS1) 10Jcrespo: Update s2 partitioning for special slaves [software] - 10https://gerrit.wikimedia.org/r/275846 [16:59:46] (03CR) 10Jcrespo: [C: 032 V: 032] Update s2 partitioning for special slaves [software] - 10https://gerrit.wikimedia.org/r/275846 (owner: 10Jcrespo) [17:00:04] paravoid chasemp: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160308T1700). [17:00:04] twentyafterfour ostriches: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[17:00:42] * ostriches waves [17:00:53] (03PS9) 10Rush: Parameterize the git_server variable in global scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [17:00:54] (I'm about to join a meeting, so chasemp is going to be handling today's puppet swat) [17:00:55] (03PS3) 10Rush: Gerrit: install standard base on new server lead [puppet] - 10https://gerrit.wikimedia.org/r/274150 (https://phabricator.wikimedia.org/T126794) (owner: 10Chad) [17:00:58] ah there he is :) [17:01:43] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: Connection refused [17:02:06] me again ^ should be recovering shortly [17:02:23] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [17:02:24] bad godog, bad [17:02:29] :) [17:02:37] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: Connection refused Filippo Giunchedi 2.1.13 upgrade [17:02:37] ACKNOWLEDGEMENT - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed Filippo Giunchedi 2.1.13 upgrade [17:02:41] heh [17:03:43] (03CR) 10Rush: [C: 032 V: 032] "Based on my understanding that lead is a direct parallel to the existing ytterbium box this is not an escalation privilege and seems good " [puppet] - 10https://gerrit.wikimedia.org/r/274150 (https://phabricator.wikimedia.org/T126794) (owner: 10Chad) [17:05:54] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2221 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2099435 (10greg) now we have a third? :) [17:06:08] twentyafterfour: about? 
[17:06:22] (03PS5) 10Jforrester: Enable VisualEditor for IP users on the German Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271713 (https://phabricator.wikimedia.org/T127881) [17:06:45] chasemp: Lemme know when puppet's done on lead so I can try it out [17:06:56] ostriches: looks good [17:07:04] give it a whirl [17:08:02] ostriches: when you are done there can you answer a q about https://gerrit.wikimedia.org/r/#/c/272947/ [17:08:09] (03PS4) 10Jcrespo: Revoke iron access; add salt-masters access for mysql management [puppet] - 10https://gerrit.wikimedia.org/r/275777 [17:08:45] 6Operations, 10ops-codfw: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2099453 (10Papaul) @jcrespo thanks so i will be waiting on @RobH for final confirmation on T125827 [17:10:35] chasemp: lead looks good, was able to ssh and sudo. [17:11:11] (03CR) 10ArielGlenn: [C: 031] "feel free." [puppet] - 10https://gerrit.wikimedia.org/r/275777 (owner: 10Jcrespo) [17:11:28] chasemp: What's up with 272947?
[17:12:03] !log upgrade cassandra to 2.1.13 on restbase1010 [17:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on port 9042 [17:12:28] ostriches: ok so, I totally get intent I think but not necessarily execution, we are setting https://gerrit.wikimedia.org/r/#/c/272947/9/hieradata/common/scap.yaml which trickles down to https://gerrit.wikimedia.org/r/#/c/272947/9/modules/scap/templates/scap.cfg.erb [17:12:53] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [17:13:02] but that setting for prod seems not to be 1:1 for the setting the file and the setting in teh file was for deployment-prep only [17:13:20] and I don't see a deployment-prep hiera setting or mention of one on-wikitech [17:13:27] (03CR) 10Jcrespo: [C: 032] Revoke iron access; add salt-masters access for mysql management [puppet] - 10https://gerrit.wikimedia.org/r/275777 (owner: 10Jcrespo) [17:14:16] I'm confused on outcome here [17:14:23] 6Operations, 10Monitoring, 10netops, 10Scap3 (scap3-adoption): Deploy libreNMS with scap3 - https://phabricator.wikimedia.org/T129136#2099469 (10thcipriani) >>! In T129136#2098628, @faidon wrote: > What would be the next steps for this? Scap makes a few different assumptions than Trebuchet that you'd need... [17:14:24] I worry this breaks deployment-prep among other things [17:14:33] chasemp: It looks like it'd end up defaulting to the prod setting which obvs wouldn't work. [17:14:39] you know about the cherry-picks on deployment-puppetmaster? [17:14:39] right [17:14:42] It'll break deployment-prep, yeah [17:14:51] ok I'm going to wait then w/o twentyafterfour around [17:15:01] Yeah I think it needs amending one last time. 
[17:15:07] Otherwise it lgtm [17:16:20] the change I think is ok but it's missing associated newly parameterized things that go with it [17:16:31] and I just don't know deployment-prep well enough to roll w/ it and test [17:16:37] otherwise I would just do it [17:19:13] (03CR) 10Rush: [C: 04-1] "Looked to do this for puppet swat but AFAICT this would break deployment-prep, I confirmed w/ chad my thinking here. I'm not involved wit" [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [17:19:43] * robh_away is away for the next hour (just mentioning it since on duty) [17:23:57] (03PS2) 10Andrew Bogott: Quiet down keystone logs for now. [puppet] - 10https://gerrit.wikimedia.org/r/275817 [17:26:31] (03PS1) 10Filippo Giunchedi: cassandra: bootstrap restbase1011 instances [puppet] - 10https://gerrit.wikimedia.org/r/275852 [17:26:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: bootstrap restbase1011 instances [puppet] - 10https://gerrit.wikimedia.org/r/275852 (owner: 10Filippo Giunchedi) [17:32:00] (03PS3) 10Andrew Bogott: Quiet down keystone logs for now. [puppet] - 10https://gerrit.wikimedia.org/r/275817 [17:34:45] !log applying new grants to all database servers [17:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:35:12] (03CR) 10Andrew Bogott: [C: 032] Quiet down keystone logs for now.
[puppet] - 10https://gerrit.wikimedia.org/r/275817 (owner: 10Andrew Bogott) [17:35:23] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [17:36:03] (03PS1) 10Mholloway: Add Accept: header to RESTBase/Parsoid requests [puppet] - 10https://gerrit.wikimedia.org/r/275853 (https://phabricator.wikimedia.org/T128237) [17:36:50] !log sinistra signing puppet certs, salt-key, initial run [17:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:37:12] (03PS1) 10Andrew Bogott: Install designate-dashboard, attach and configure panels. [puppet] - 10https://gerrit.wikimedia.org/r/275854 [17:37:45] ^Error: Could not find any host matching 'restbase1011' [17:38:06] /etc/icinga/puppet_hostextinfo.cfg [17:38:24] (03CR) 10jenkins-bot: [V: 04-1] Install designate-dashboard, attach and configure panels. [puppet] - 10https://gerrit.wikimedia.org/r/275854 (owner: 10Andrew Bogott) [17:38:24] jynus: that'd be me, not sure why it isn't there yet [17:38:34] race condition, maybe? 
[17:38:44] I was checking because it may have be me [17:39:02] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2099558 (10Papaul) [17:39:03] yeah probably, I just signed the salt key for the host as puppet before couldn't complete, so perhaps it is that [17:39:20] run puppet on all hosts again, and then on neon [17:39:24] to be sure [17:39:31] (03PS1) 10Chad: phab_epipe.py: don't use lambda when it's not needed [puppet] - 10https://gerrit.wikimedia.org/r/275855 [17:39:44] on all *changed/added* hosts [17:40:07] jynus: yup, running now [17:44:42] (03PS4) 10BBlack: cache_text: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275811 [17:45:58] (03CR) 10BBlack: [C: 032 V: 032] cache_text: ulsfo->codfw T127492 [puppet] - 10https://gerrit.wikimedia.org/r/275811 (owner: 10BBlack) [17:46:40] !log switching text cache routing: ulsfo->codfw [17:46:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:49:49] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2099636 (10Papaul) [17:50:02] (03PS2) 10Andrew Bogott: Install designate-dashboard, attach and configure panels. [puppet] - 10https://gerrit.wikimedia.org/r/275854 [17:50:50] https://integration.wikimedia.org/zuul/ bot got stuck at a change [17:51:07] on gate-and-submit [17:52:19] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2099641 (10Papaul) I don't know who is responsible for the service implementation of this system. Please advance or just claim the ticket. The installation proc... [17:52:47] robh_away: when you get back, can you clarify https://phabricator.wikimedia.org/T129196 plz? 
kthx [17:53:21] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2099642 (10ori) [17:54:50] (03CR) 1020after4: "this would _fix_ not break deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [17:55:25] <_joe_> greg-g: it's 2212 [17:55:52] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2212 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2099661 (10Joe) [17:56:02] _joe_: :) [17:56:10] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2212 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2097723 (10Joe) no, we don't. It's just mw2212. [17:56:31] 6Operations: mw2212 unresponsive - https://phabricator.wikimedia.org/T129188#2099668 (10greg) [17:56:34] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2212 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2099669 (10greg) [17:56:49] (03CR) 10Chad: "Where is $scap::deployment_server set for deployment-prep then?" [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [17:57:02] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2212 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2097723 (10greg) Thanks @Joe :) Merged in my task from yesterday that was a collision with this one. [17:57:18] (03PS1) 10EBernhardson: Stop pushing ES updates to nobelium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275858 [17:57:34] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Log host for codfw (fluorine's equivalent) - https://phabricator.wikimedia.org/T126988#2029020 (10ori) @RoBH, I can take care of configuring this host; please assign to me once the server is racked / online / has a base install. 
[17:57:54] (03CR) 10jenkins-bot: [V: 04-1] Stop pushing ES updates to nobelium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275858 (owner: 10EBernhardson) [17:58:34] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [17:59:06] (03PS1) 10Cmjohnson: Adding dhcp entries for db1074-78 and labsdb1008. All are set to install jessie [puppet] - 10https://gerrit.wikimedia.org/r/275859 [17:59:15] robh_away: nvm, _joe_ clarified on task [18:00:04] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160308T1800). Please do the needful. [18:00:26] no deploys [18:00:30] *parsoid [18:00:52] 6Operations, 15User-mobrovac, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#1973660 (10mark) >>! In T125069#2080421, @GWicke wrote: > Currently, loc... [18:03:11] (03PS2) 10EBernhardson: Stop pushing ES updates to nobelium [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275858 [18:03:49] (03CR) 10Cmjohnson: [C: 032] Adding dhcp entries for db1074-78 and labsdb1008. All are set to install jessie [puppet] - 10https://gerrit.wikimedia.org/r/275859 (owner: 10Cmjohnson) [18:04:48] (03PS3) 10Andrew Bogott: Install designate-dashboard, attach and configure panels. 
[puppet] - 10https://gerrit.wikimedia.org/r/275854 [18:04:57] 6Operations, 10ops-eqiad: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2099697 (10Cmjohnson) [18:05:04] PROBLEM - Restbase root url on restbase1011 is CRITICAL: Connection refused [18:05:13] 6Operations, 10ops-eqiad: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2084820 (10Cmjohnson) fixed the mgmt issue for db1077 [18:05:32] PROBLEM - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is CRITICAL: Connection refused [18:07:13] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.113, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [18:07:42] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [18:07:57] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-67-If, Then, Else...?, and 2 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2099704 (10Jdlrobson) [18:10:42] (03PS4) 10Andrew Bogott: Install designate-dashboard, attach and configure panels. 
[puppet] - 10https://gerrit.wikimedia.org/r/275854 [18:11:13] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:12:17] (03CR) 1020after4: "chad: https://wikitech.wikimedia.org/wiki/Hiera:Deployment-prep" [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [18:13:29] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Ensure post-send handlers check and respect read-only-mode - https://phabricator.wikimedia.org/T129250#2099740 (10ori) [18:13:53] (03PS5) 10Andrew Bogott: Install designate-dashboard, attach and configure panels. [puppet] - 10https://gerrit.wikimedia.org/r/275854 [18:16:36] (03CR) 10Andrew Bogott: [C: 032] Install designate-dashboard, attach and configure panels. [puppet] - 10https://gerrit.wikimedia.org/r/275854 (owner: 10Andrew Bogott) [18:18:27] 6Operations, 10MobileFrontend, 10Traffic, 13Patch-For-Review, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2099795 (10Jdlrobson) [18:20:06] (03CR) 10Tim Landscheidt: [C: 031] "LGTM from the Puppet side (and I tested that to be no-op on toolsbeta-proxy-02); the commit message refers to nova::site (should be nginx:" [puppet] - 10https://gerrit.wikimedia.org/r/274962 (owner: 10Muehlenhoff) [18:21:54] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switch ulsfo to backend to codfw rather than eqiad - https://phabricator.wikimedia.org/T127492#2099821 (10BBlack) 5Open>3Resolved a:3BBlack [18:21:56] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#2099823 (10BBlack) [18:26:21] 6Operations, 10Traffic, 13Patch-For-Review, 
5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#2099839 (10BBlack) Status update: The only remaining work here ahead of the big switches of the applayer services... [18:28:29] (03PS1) 10Andrew Bogott: Horizon: the package is python-designateclient, not python-designate-client [puppet] - 10https://gerrit.wikimedia.org/r/275865 [18:30:16] 6Operations, 6Labs, 10Tool-Labs, 10Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2099871 (10Dzahn) Yea, it meant the request came via http and was 200. [18:31:01] (03CR) 10Andrew Bogott: [C: 032] Horizon: the package is python-designateclient, not python-designate-client [puppet] - 10https://gerrit.wikimedia.org/r/275865 (owner: 10Andrew Bogott) [18:33:40] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Ensure maintainers of long-running scripts on terbium expect downtime for switchover - https://phabricator.wikimedia.org/T129258#2099885 (10ori) [18:34:27] ACKNOWLEDGEMENT - Restbase root url on restbase1011 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [18:34:27] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [18:34:27] ACKNOWLEDGEMENT - restbase endpoints health on restbase1011 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.113, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Filippo Giunchedi bootstrapping [18:34:59] 6Operations, 6Performance-Team, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Ensure maintainers of long-running scripts on terbium expect downtime for switchover - https://phabricator.wikimedia.org/T129258#2099885 (10ori) [18:36:11] twentyafterfour: are you about? 
[18:36:30] chasemp: I am [18:37:07] from your comments I think I grok the https://gerrit.wikimedia.org/r/#/c/272947/ outcome now, I have a minute, I'll merge if you want to test w/ me? [18:37:33] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures [18:37:40] assuming the swat window was inclusive not exclusive for this [18:39:06] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2099926 (10Ironholds) [18:40:02] chasemp: sure, not a huge rush but if you have a minute :) [18:40:20] 7Blocked-on-Operations, 6Operations, 10Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#2099942 (10Krenair) 5Resolved>3Open ```krenair@bastion-01:~$ ldaplist -l group wmf | grep -i jminor krenair@bastion-01:~$ ldaplist -... [18:40:24] (03PS10) 10Rush: Parameterize the git_server variable in global scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [18:40:29] sure better to knock it out [18:40:51] (03PS1) 10Dzahn: remove VMs cygnus and technetium [puppet] - 10https://gerrit.wikimedia.org/r/275871 [18:40:53] (03PS1) 10Cmjohnson: Fixing typos on dhcp file for db1074-78 [puppet] - 10https://gerrit.wikimedia.org/r/275872 [18:41:41] (03PS2) 10Dzahn: remove VMs cygnus and technetium [puppet] - 10https://gerrit.wikimedia.org/r/275871 (https://phabricator.wikimedia.org/T118763) [18:41:50] (03PS2) 10Cmjohnson: Fixing typos on dhcp file for db1074-78 [puppet] - 10https://gerrit.wikimedia.org/r/275872 [18:43:06] 7Blocked-on-Operations, 6Operations, 10Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#2099945 (10Krenair) Ah - that second one is due to cn != uid... But the first one definitely appears missing? 
[18:43:25] (03PS1) 10Dzahn: admin: remove keys of akumar,mnoushad [puppet] - 10https://gerrit.wikimedia.org/r/275873 (https://phabricator.wikimedia.org/T126012) [18:43:29] (03CR) 1020after4: [C: 031] "it's more critical to verify that git_server is set correctly in production than deployment-prep but I can help test both" [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [18:44:40] (03PS1) 10Dzahn: admin: set akumar, mnoushad to absent [puppet] - 10https://gerrit.wikimedia.org/r/275874 (https://phabricator.wikimedia.org/T126012) [18:45:11] twentyafterfour: just waiting on jenkins...who is on lunch maybe :) [18:45:57] (03PS1) 10Dzahn: remove cygnus,technetium from hieradata, incl. admin groups [puppet] - 10https://gerrit.wikimedia.org/r/275877 (https://phabricator.wikimedia.org/T118763) [18:45:58] heh [18:47:06] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2099978 (10mpopov) [18:47:51] it's a rebase I can see has linted fine so I'm goign to roll w/ it [18:47:54] not sure where jenkins is [18:47:57] (03CR) 10Rush: [C: 032 V: 032] Parameterize the git_server variable in global scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [18:48:48] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2099926 (10Deskana) Presumably, this access request needs approval from the manager of Mikhail Popov and Oliver Keyes... which is me! Approved. 
[18:49:31] twentyafterfour: that blew up on tin [18:49:31] Error: Failed to apply catalog: Parameter source failed on File[/etc/scap.cfg]: Could not understand source ######### [18:50:04] hmm [18:50:24] you removed the source file by renaming [18:50:34] or not [18:50:38] nvmd ok what the heck [18:51:22] something is wrong w/ that template I guess [18:52:04] chasemp: it's supposed to be content [18:52:05] not source [18:52:09] * twentyafterfour slaps forehead [18:52:13] :) [18:52:18] right [18:52:22] I'll commit another patch real quick? [18:52:27] please [18:52:33] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: puppet fail [18:52:53] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: Puppet has 1 failures [18:52:54] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: Puppet has 1 failures [18:53:03] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: puppet fail [18:53:12] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [18:53:13] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: puppet fail [18:53:22] PROBLEM - puppet last run on mw1084 is CRITICAL: CRITICAL: Puppet has 1 failures [18:53:23] PROBLEM - puppet last run on mw2107 is CRITICAL: CRITICAL: puppet fail [18:53:24] PROBLEM - puppet last run on mw2133 is CRITICAL: CRITICAL: puppet fail [18:53:32] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Puppet has 1 failures [18:53:41] ^ twentyafterfour is that your bad template reference? 
[18:53:42] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: puppet fail [18:53:44] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Puppet has 1 failures [18:53:48] if it applied to all mw hosts I bet it is [18:53:52] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: puppet fail [18:53:53] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Puppet has 1 failures [18:53:53] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [18:53:54] PROBLEM - puppet last run on mw2174 is CRITICAL: CRITICAL: Puppet has 1 failures [18:53:59] (03CR) 10CSteipp: [C: 031] Whitelist feeds included on Wikimedia Germany Engineering page on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 10WMDE-leszek) [18:54:02] PROBLEM - puppet last run on mw2053 is CRITICAL: CRITICAL: puppet fail [18:54:10] (03PS1) 10Andrew Bogott: Redefine labs_horizon_host in hiera [puppet] - 10https://gerrit.wikimedia.org/r/275882 [18:54:13] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures [18:54:13] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Puppet has 1 failures [18:54:13] PROBLEM - puppet last run on mw2200 is CRITICAL: CRITICAL: Puppet has 1 failures [18:54:22] !log uploaded linux 4.4.2-3+wmf1/jessie-wikimedia (based on Linux 4.4.4) to carbon [18:54:22] PROBLEM - puppet last run on mw2048 is CRITICAL: CRITICAL: puppet fail [18:54:22] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: puppet fail [18:54:22] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: puppet fail [18:54:23] PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: Puppet has 1 failures [18:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:32] PROBLEM - puppet last run on mw1240 is CRITICAL: CRITICAL: puppet fail [18:54:33] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: puppet fail [18:54:42] 
PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures [18:54:43] PROBLEM - puppet last run on mw1080 is CRITICAL: CRITICAL: puppet fail [18:54:52] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: puppet fail [18:54:53] PROBLEM - puppet last run on mw2046 is CRITICAL: CRITICAL: puppet fail [18:54:53] PROBLEM - puppet last run on mw2027 is CRITICAL: CRITICAL: puppet fail [18:54:54] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: puppet fail [18:54:54] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: puppet fail [18:55:01] (03PS1) 1020after4: fix scap/init.pp source -> content [puppet] - 10https://gerrit.wikimedia.org/r/275883 (https://phabricator.wikimedia.org/T126259) [18:55:05] !log temp. stopped icinga-wm [18:55:08] https://gerrit.wikimedia.org/r/275883 [18:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:25] chasemp: yeah that's probably my bad [18:55:34] chasemp: see fix linked above [18:55:40] yes, that puppet error is the template thing [18:55:51] was on a random mw [18:55:59] * twentyafterfour hides [18:56:09] :( [18:56:33] no problem, i just stopped the bot for a moment, will restart it when most are recovered [18:56:43] that was a bush league [18:57:12] puppet compiler :-P [18:57:13] twentyafterfour: I don't think there is a space between 'content' and '=>'? not sure it will lint [18:57:32] :( wth [18:57:42] (03PS3) 10Cmjohnson: Fixing typos on dhcp file for db1074-78 [puppet] - 10https://gerrit.wikimedia.org/r/275872 [18:57:48] I have a commit hook for puppet lint in my repo, why didn't that catch it? 
[18:57:51] and what chase said [18:58:39] twentyafterfour: because it's disabled in .puppet-lint.rc :/ [18:58:48] (03PS2) 1020after4: fix scap/init.pp source -> content [puppet] - 10https://gerrit.wikimedia.org/r/275883 (https://phabricator.wikimedia.org/T126259) [18:58:49] if it's the one from ops/puppet repo [18:59:10] we need to fix the remaining ones, then we can enable it [19:00:13] puppet-lint is disabled for ops/puppet?! :-O [19:00:22] well I committed the fix to https://gerrit.wikimedia.org/r/#/c/275883/ [19:00:29] no, this one specific check about aligned arrows is [19:01:06] basically it goes like this: almost all checks were disabled, we fix one globally, we can enable it again.. on to the next one.. [19:01:27] (03CR) 10Rush: [C: 032] fix scap/init.pp source -> content [puppet] - 10https://gerrit.wikimedia.org/r/275883 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [19:02:03] I was very much relying on my commit hook to catch my stupid mistakes. I'll look into overriding some of that stuff locally [19:02:19] you can delete .puppet-lint.rc , then it runs all checks [19:02:50] (03CR) 10Andrew Bogott: [C: 032] Redefine labs_horizon_host in hiera [puppet] - 10https://gerrit.wikimedia.org/r/275882 (owner: 10Andrew Bogott) [19:02:56] (03PS4) 10Cmjohnson: Fixing typos on dhcp file for db1074-78 [puppet] - 10https://gerrit.wikimedia.org/r/275872 [19:03:11] on the day we can run it without errors using the defaults we can close https://phabricator.wikimedia.org/T93645 [19:03:12] (03PS2) 10Andrew Bogott: Redefine labs_horizon_host in hiera [puppet] - 10https://gerrit.wikimedia.org/r/275882 [19:03:42] why does it seem like I always have to rebase at least once before I can merge..frustrating! 
[19:03:44] twentyafterfour: it seems good, if you can make sure scap things are sane I think it's ok [19:04:32] cmjohnson1: because we switched the "submit type" of the repo to "Fast Forward Only", so yea, you have to rebase each time unless nobody else merged anything in between [19:04:38] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2100032 (10Gehel) The idea was to reuse the same code to expose puppet certificates, which implies some refactoring to k8s module. This see... [19:04:58] (03CR) 10Cmjohnson: [C: 032] Fixing typos on dhcp file for db1074-78 [puppet] - 10https://gerrit.wikimedia.org/r/275872 (owner: 10Cmjohnson) [19:06:20] 6Operations, 6Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2100040 (10kaldari) @fgiunchedi: Could you create a Phabricator ticket for that issue and add it as a blocker to this ticket? Are there any other known regressions? [19:06:28] (03PS3) 10Andrew Bogott: Redefine labs_horizon_host in hiera [puppet] - 10https://gerrit.wikimedia.org/r/275882 [19:06:32] chasemp: running puppet on deployment-tin [19:07:16] cmjohnson1: because you are fast enough to not have to rebase 2 times [19:07:32] apergos: hah..not that time [19:07:37] well. mostly [19:10:19] chasemp: all good on production and deployment-prep [19:16:41] 6Operations, 10Mail: fr-all fails with error 451 - https://phabricator.wikimedia.org/T129168#2100090 (10Dzahn) Hey, we just removed that again and sent a test mail and i could confirm in logs on mx1001 the error is gone and mail to fr-all@ was handled normally and given to Google.
Daniel [19:20:33] 6Operations, 10Mail: fr-all fails with error 451 - https://phabricator.wikimedia.org/T129168#2100114 (10Dzahn) 5Open>3Resolved [19:21:55] 6Operations, 10Mail: move fundraising group aliases to OIT - https://phabricator.wikimedia.org/T128647#2100121 (10Dzahn) Hi, the part about fr-all@ is also fixed now. Daniel [19:26:27] (03PS3) 10Mobrovac: Introducing changeprop role and puppet module [puppet] - 10https://gerrit.wikimedia.org/r/275772 (https://phabricator.wikimedia.org/T128463) [19:27:20] (03PS5) 10CSteipp: Password policies for advanced permission groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) [19:28:29] (03CR) 10Madhuvishy: "This is good to go now." [puppet] - 10https://gerrit.wikimedia.org/r/274286 (owner: 10Madhuvishy) [19:29:14] (03CR) 10CSteipp: Password policies for advanced permission groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [19:29:43] (03PS4) 10Ottomata: eventlogging: Change client side processor format string to ignore ClientIP [puppet] - 10https://gerrit.wikimedia.org/r/274286 (owner: 10Madhuvishy) [19:29:53] (03CR) 10Ottomata: [C: 032 V: 032] eventlogging: Change client side processor format string to ignore ClientIP [puppet] - 10https://gerrit.wikimedia.org/r/274286 (owner: 10Madhuvishy) [19:33:10] (03PS1) 10Andrew Bogott: Enable designate API v2. [puppet] - 10https://gerrit.wikimedia.org/r/275890 [19:34:36] (03CR) 10Andrew Bogott: [C: 032] Enable designate API v2. 
[puppet] - 10https://gerrit.wikimedia.org/r/275890 (owner: 10Andrew Bogott) [19:35:16] (03PS1) 10Mobrovac: Assign changeprop service to scb cluster [puppet] - 10https://gerrit.wikimedia.org/r/275891 (https://phabricator.wikimedia.org/T128463) [19:35:28] 6Operations, 13Patch-For-Review: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2100212 (10aaron) That procedure makes sense. As far as the slots go, having one disabled just means all the jobs in it get delayed for that l... [19:36:30] (03Abandoned) 10Mobrovac: Setup LVS for changeprop service on scb cluster [puppet] - 10https://gerrit.wikimedia.org/r/275774 (owner: 10Mobrovac) [19:37:27] (03Abandoned) 10Mobrovac: Assign changeprop service to scb cluster [puppet] - 10https://gerrit.wikimedia.org/r/275773 (owner: 10Mobrovac) [19:41:59] (03PS1) 10Ottomata: Remove clientIp from EventLogging varnishkafka format [puppet] - 10https://gerrit.wikimedia.org/r/275892 (https://phabricator.wikimedia.org/T128407) [19:43:42] (03PS2) 10EBernhardson: Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 [19:44:42] (03CR) 10jenkins-bot: [V: 04-1] Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [19:45:10] (03CR) 10EBernhardson: Build cirrus completion indices daily (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [19:45:17] (03CR) 10EBernhardson: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [19:46:00] (03PS3) 10EBernhardson: Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 [19:47:18] (03CR) 10jenkins-bot: [V: 04-1] Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [19:48:05] (03PS4) 10EBernhardson: Build cirrus completion indices daily [puppet] - 
10https://gerrit.wikimedia.org/r/275749 [19:50:59] (03CR) 10CSteipp: Password policies for advanced permission groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [19:52:18] (03PS2) 10EBernhardson: Don't create new log files for cirrus-suggest with logrotate [puppet] - 10https://gerrit.wikimedia.org/r/268215 [19:52:26] (03CR) 10EBernhardson: Don't create new log files for cirrus-suggest with logrotate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268215 (owner: 10EBernhardson) [19:52:47] (03CR) 10Gehel: [C: 031] Build cirrus completion indices daily [puppet] - 10https://gerrit.wikimedia.org/r/275749 (owner: 10EBernhardson) [19:52:58] (03PS1) 10Andrew Bogott: Make designate quotas fully public [puppet] - 10https://gerrit.wikimedia.org/r/275895 [19:53:17] 6Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: Dynamic backend selection via X-Wikimedia-Debug header - https://phabricator.wikimedia.org/T129000#2100288 (10ori) This has been implemented and rolled out. Keeping this open because docs on Wikitech need to be updated. [19:55:04] (03CR) 10Andrew Bogott: [C: 032] Make designate quotas fully public [puppet] - 10https://gerrit.wikimedia.org/r/275895 (owner: 10Andrew Bogott) [19:55:40] (03CR) 10Gehel: [C: 031] "Looks good to me (besides the annoying puppet-lint warning)" [puppet] - 10https://gerrit.wikimedia.org/r/268215 (owner: 10EBernhardson) [19:56:06] 6Operations, 10Traffic: Fix puppet on deployment-cache* hosts in beta labs - https://phabricator.wikimedia.org/T129270#2100309 (10Ottomata) [19:58:46] 6Operations, 6Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2100345 (10Tgr) [19:59:33] (03PS1) 10Andrew Bogott: Make designate records public. 
[puppet] - 10https://gerrit.wikimedia.org/r/275896 [20:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160308T2000). Please do the needful. [20:00:57] jouncebot: respected bot, doing the needful. [20:02:22] (03CR) 10Andrew Bogott: [C: 032] Make designate records public. [puppet] - 10https://gerrit.wikimedia.org/r/275896 (owner: 10Andrew Bogott) [20:02:27] 6Operations, 6Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2100371 (10Tgr) This is really blocked on having a sane way to test thumbnailing. @fgiunchedi do you think we could use a production cluster scaler for that? [20:11:24] bblack: hi! have a minute for more varnish questions? :) [20:13:09] SMalyshev: maybe! [20:14:13] bblack: so I'm trying to figure our how varnish handles multiple URL forms. If I have wiki/Duck it can also be /w/index.php?title=Duck and /w/index.php/Duck. So if I want to purge it from varnish, which URL I use? [20:14:42] is it three different URLs for varnish or there is a canonical form? [20:15:41] SMalyshev: IMHO, in an ideal world, every unique resource should have one canonical URL, and that's the only one that works for users, end of story. [20:16:02] in a slightly-less-ideal world, you at least redirect the non-canincal forms to the canonical one, and then you don't have to purge them. [20:16:13] in an unideal world, you have to purge them all [20:16:24] which world you live in depends on the application layer :) [20:17:10] in a future world, we'll support XKey or similar so that all three can be purged with one purge, but that's not really about solving that particular problem. It's about solving a broader meta-problem where there are distinct-but-related resources (e.g. mobile+desktop output for a single "article") [20:17:15] bblack: if I have wikidata, e.g. 
https://www.wikidata.org/wiki/Special:EntityData/Q4115189.ttl which form I'd use to remove it from cache? [20:17:29] bblack: and the same for https://www.wikidata.org/wiki/Special:EntityData/Q4115189.ttl?flavor=simple for example [20:17:45] you wouldn't, you'd rely on MediaWiki to do so [20:18:11] (which is currently rather flawed, but that's neither here nor here) [20:18:42] are we asking about wanting WDQS to emit its own purges of WD's content? [20:19:42] (03CR) 10Mobrovac: "I think it would be much easier if we are thinking about these services in the proper sense. Parsoid is fully stateless, while RB's statef" [puppet] - 10https://gerrit.wikimedia.org/r/275443 (https://phabricator.wikimedia.org/T125673) (owner: 10Giuseppe Lavagetto) [20:19:56] lacking the future world of XKey though: presumably flavor=simple modifies the content, so they're distinct objects with distinct purging [20:20:40] because it has no idea they exist [20:20:41] but I want to fix that... [20:20:54] is this a question about WDQS or WD itself? [20:21:07] bblack: WD itself. https://phabricator.wikimedia.org/T128667 [20:21:55] bblack: but has implications for wdqs since now because of WD messing up cache handling each WDQS instance has to request non-cached expot data. And it could use cached ones if they'd be handled properly [20:22:02] also https://phabricator.wikimedia.org/T128486 [20:22:12] 6Operations, 10Wikimedia-SVG-rendering: Install (currently non-existing) Debian packages for PT (paratype) font on image scalars - https://phabricator.wikimedia.org/T97181#2100436 (10Aklapper) >>! In T97181#1238735, @Dzahn wrote: > Faidon said 'no alien please'. So yea, needs "real" .deb packages. [[ http://c... [20:23:17] SMalyshev: the bottom line on that is that purging sucks right now, and it's going to be a while before it's fixed, and you may need to block on that if it's a blocker. Fixing it isn't easy, and it would be a negative to invoke ugly workarounds, too. 
[20:23:44] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2212 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2100438 (10Papaul) [20:23:54] bblack: well, there is now purging for wikidata URLs. But it's not complete. So I just want to add the missing part [20:24:26] it's not that simple, though.... [20:24:27] namely, not only purging the data for Special:EntityData/Q4115189.ttl but also for Special:EntityData/Q4115189.ttl?flavor=X [20:24:58] bblack: so where is the problem there? [20:25:15] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2212 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2097723 (10Papaul) Did a full hardware scan since this morning, the result came out with no issue found. The system is back up. [20:25:35] SMalyshev: I'm digging for links from e.g. mobile's current answer to this sort of problem [20:26:42] SMalyshev: https://github.com/wikimedia/mediawiki-extensions-MobileFrontend/commit/c39b43a72e5af84ae92ca463ba2e4c27a628dca8 [20:27:13] the similar problem with MobileFrontend is that with MFE, every article edit needs to purge e.g. de.wiki/wiki/Foo + de.m.wiki/wiki/Foo [20:27:26] so it hooked in there and multiplied the PURGE volume on every edit [20:27:41] bblack: aha, I see [20:27:50] page? [20:27:53] <_joe_> yes [20:27:56] <_joe_> just got it [20:28:02] SMalyshev: that's the only answer we really have today: find a logical way to hook up the list of all possible variants, and purge them all [20:28:06] bblack: so it's kind of weird that one url is http://en.wikipedia.org/wiki/PurgeTest but another is http://en.wikipedia.org/w/index.php?title=PurgeTest&action=history [20:28:18] SMalyshev: that's handled elsewhere too [20:28:34] in the standard wikipedia cases, MW emits PURGEs for quite a number of distinct URLs for one article edit [20:28:39] bblack: but those not all possible variants. 
There's http://en.wikipedia.org/PurgeTest?action=history for example [20:28:45] yup [20:29:06] bblack: so should those be included too or this is handled automatically? [20:29:10] it already purges a number of variants, enough that people aren't complaining like crazy.... [20:29:25] or it isn't handled but nobody cares? :) [20:29:50] some variants are handled, and some are not, and the ones that aren't don't seem to cause a lot of complaints right now [20:30:16] and in theory, you could hook in a bunch of WD variant-purging in a similar way and emit multiple purges per update, too [20:30:25] something is happening on db1066 [20:30:38] bblack: yeah I know, that's what I'm trying to so, I'm just trying to figure out which URLs I should use [20:30:47] but we're also facing a semi-related crisis of "Way Too Many PURGEs" right now, so I don't actually want anyone to fix those problems right now [20:31:08] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2212 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2100493 (10Papaul) Will keep an eye on this until tomorrow. [20:31:14] oh jynus there you are, good because while maybe a few bin logs could be tossed I'd really rather someone who knows something do it (i.e. you) [20:31:38] bblack: well, you won't get many more purges from this, maybe 3-4 extra URLs per wikidata edit [20:32:06] it is growing way to much, something is not normal [20:32:16] SMalyshev: in part due to legitimate variant increases, and in part due to some horrible malfunction of JobRunner stuff related to purging that may be issuing more duplicates, and perhaps in part due to intentional delayed duplication to "work around" PURGE issues.... 
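A minimal sketch of the "hook up the list of all possible variants, and purge them all" approach described above, applied to the Special:EntityData URLs from the discussion. The function name and the idea of passing in a list of flavors are illustrative assumptions, not MediaWiki or Wikibase code. Only flavor=simple is mentioned in the conversation, so any other flavor values would have to be supplied by the caller.

```python
from urllib.parse import urlencode


def entity_data_purge_urls(base, entity_id, fmt, flavors):
    """Return the plain export URL plus one URL per flavor.

    Hypothetical helper: each returned URL is a distinct cache object,
    so each one would need its own PURGE.
    """
    plain = f"{base}/wiki/Special:EntityData/{entity_id}.{fmt}"
    urls = [plain]
    for flavor in flavors:
        # flavor changes the output, so it is cached separately
        urls.append(f"{plain}?{urlencode({'flavor': flavor})}")
    return urls


urls = entity_data_purge_urls(
    "https://www.wikidata.org", "Q4115189", "ttl", ["simple"]
)
for url in urls:
    print(url)
```

Each extra flavor adds one more PURGE per edit, which is the trade-off against the purge-volume concerns raised in the same conversation.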
[20:32:59] SMalyshev: basically our PURGE volume is now insane, and nobody has an answer for getting it lower, or even a clear picture of exactly which events caused it to massively increase over the past few months (although WD is itself indirectly implicated, as causing purges of articles that pull from wikidata) [20:33:47] I am depooling it [20:33:49] https://phabricator.wikimedia.org/T124418 tracks that [20:33:52] bblack: I see. that looks like a problem, though not related to one I'm looking at [20:34:04] bblack: potentially we could fingerprint all purge requests emitted by MediaWiki / log them with stacktrace [20:34:42] jynus: SELECT /* ApiQueryAllRevisions::run */ in sorting result for 5900+ seconds [20:34:46] but yeah looks like URL variants are out of control and we don't have currently any solution for it [20:34:58] either way, there were existing design problems with how we do PURGEs at all (obviously), and there are ongoing efforts to eventually fix them [20:35:14] but in the meantime, adding more purge traffic for anything sounds scary [20:35:18] (03PS1) 10Jcrespo: Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275904 [20:35:28] we're already occasionally just dropping purge requests on the floor when they overflow local queues [20:35:35] that shouldnt be there, mediawiki kills queries after 300 seconds [20:36:14] (03PS1) 10Thcipriani: Use repo_path instead of repo for deploy-local [puppet] - 10https://gerrit.wikimedia.org/r/275905 [20:36:20] (03CR) 10Jcrespo: [C: 032 V: 032] Depool db1066 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275904 (owner: 10Jcrespo) [20:36:23] bblack: well, for T128667 it won't be more traffic since it's manual purge. 
For T128486 it might be, but we can hold on that if needed, I think T128667 has to be fixed anyway [20:36:37] because right now we're purging half of the data and it's just wrong [20:36:51] yup [20:37:06] but we're also spamming the PURGE queues, which is also wrong, and I don't have any solution yet for that either :) [20:37:29] right. I guess to solve it we'd need cache tags... [20:37:43] after depooling (if it doesn't get moved to the other server, we investigate) [20:38:47] the queries are from the same user [20:39:03] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1066 (duration: 01m 49s) [20:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:39:25] (03PS3) 10EBernhardson: Don't create new log files for cirrus-suggest with logrotate [puppet] - 10https://gerrit.wikimedia.org/r/268215 [20:40:09] !log killing long running queries on db1066 [20:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:40:56] running for 1:30hour + ... how did that happen ? [20:41:57] same on the other host [20:42:15] I will comment privatelly [20:42:28] (03PS6) 10CSteipp: Password policies for advanced permission groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) [20:43:43] (03CR) 10CSteipp: "PS6 includes all of the local wiki groups as well. So with this, the RFC should be fulfilled." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [20:45:28] (03PS2) 10Ottomata: Update cdh submodule with oozie purge change [puppet] - 10https://gerrit.wikimedia.org/r/274829 (https://phabricator.wikimedia.org/T127988) [20:45:33] !log Depooled mw1200 for HHVM update [20:45:35] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh submodule with oozie purge change [puppet] - 10https://gerrit.wikimedia.org/r/274829 (https://phabricator.wikimedia.org/T127988) (owner: 10Ottomata) [20:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:37] andrewbogott: am puppet merging 'Make designate records public.' [20:46:55] ottomata: thank you [20:49:03] !log Updated mw1200 to HHVM 3.12.1; repooling [20:49:06] (03PS1) 10Chad: WIP: Gerrit manifest cleanup [puppet] - 10https://gerrit.wikimedia.org/r/275911 [20:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:53:54] (03PS2) 10Rush: phab_epipe.py: don't use lambda when it's not needed [puppet] - 10https://gerrit.wikimedia.org/r/275855 (owner: 10Chad) [20:54:01] woo, hhvm updates [20:58:22] (03PS1) 10Jcrespo: Repool db1066 db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275912 [21:00:22] (03CR) 10Jcrespo: [C: 032] Repool db1066 db2040 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275912 (owner: 10Jcrespo) [21:01:11] 7Blocked-on-Operations, 6Operations, 10Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#2100627 (10akosiaris) 5Open>3Resolved Hmm, looking into my .bash_history, it seems indeed I never added @JMinor to the group. I did... [21:01:51] was there any api-related change deployed recently? 
[21:01:55] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1066 (duration: 00m 33s) [21:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:02:54] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2040 (duration: 00m 34s) [21:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:03:29] (03PS1) 10Andrew Bogott: Include python-designateclient in the normal list of openstack client packages. [puppet] - 10https://gerrit.wikimedia.org/r/275915 [21:04:06] (03PS1) 10BBlack: VCL: switch some 403s to 404 or 405 [puppet] - 10https://gerrit.wikimedia.org/r/275916 [21:05:33] (03PS2) 10Andrew Bogott: Include python-designateclient in the normal list of openstack client packages. [puppet] - 10https://gerrit.wikimedia.org/r/275915 [21:06:56] (03PS2) 10Chad: WIP: Gerrit manifest cleanup [puppet] - 10https://gerrit.wikimedia.org/r/275911 [21:07:15] 6Operations, 10Analytics, 6Analytics-Kanban, 13Patch-For-Review: Increase HADOOP_HEAPSIZE (-Xmx) for hive-server2 - https://phabricator.wikimedia.org/T76343#2100658 (10Nuria) [21:07:20] (03CR) 10Andrew Bogott: [C: 032] Include python-designateclient in the normal list of openstack client packages. 
[puppet] - 10https://gerrit.wikimedia.org/r/275915 (owner: 10Andrew Bogott) [21:11:39] (03PS1) 10Eevans: (temporarily) enable thrift rpc in staging [puppet] - 10https://gerrit.wikimedia.org/r/275917 (https://phabricator.wikimedia.org/T125906) [21:11:52] (03PS2) 10Rush: hiera_lookup: enhance help message [puppet] - 10https://gerrit.wikimedia.org/r/274917 (owner: 10Hashar) [21:13:48] (03PS1) 1020after4: Group0 to 1.27.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275929 [21:16:07] (03CR) 10Rush: [C: 032] hiera_lookup: enhance help message [puppet] - 10https://gerrit.wikimedia.org/r/274917 (owner: 10Hashar) [21:16:36] (03PS1) 10Catrope: Enable cross-wiki notifications beta feature on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275930 (https://phabricator.wikimedia.org/T124234) [21:17:31] (03CR) 10Catrope: [C: 04-2] "Not until Thursday" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275930 (https://phabricator.wikimedia.org/T124234) (owner: 10Catrope) [21:21:26] (03CR) 10Mobrovac: [C: 031] (temporarily) enable thrift rpc in staging [puppet] - 10https://gerrit.wikimedia.org/r/275917 (https://phabricator.wikimedia.org/T125906) (owner: 10Eevans) [21:22:28] (03CR) 10Mobrovac: [C: 031] "It would be good if you scheduled that for PuppetSWAT. It's like regular SWAT, only for Ops." [puppet] - 10https://gerrit.wikimedia.org/r/275853 (https://phabricator.wikimedia.org/T128237) (owner: 10Mholloway) [21:28:07] (03PS1) 1020after4: 1.27.0-wmf.16 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275933 [21:29:17] (03CR) 1020after4: [C: 032] 1.27.0-wmf.16 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275933 (owner: 1020after4) [21:34:01] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2100724 (10RobH) a:5Papaul>3ori >>! 
In T126988#2099672, @ori wrote: > @RoBH, I can take care of configuring this host; please assign to me once the server i... [21:34:43] twentyafterfour: I'd be happy to wait until next week, but this could be the first week without those symlinks if you let it. [21:34:52] At least I'd be very curious what (if anything) fails [21:35:18] Krinkle: ok [21:35:52] (03CR) 10GWicke: [C: 031] (temporarily) enable thrift rpc in staging [puppet] - 10https://gerrit.wikimedia.org/r/275917 (https://phabricator.wikimedia.org/T125906) (owner: 10Eevans) [21:36:05] (03CR) 10Krinkle: [C: 04-1] "Let's try without symlinks this wee and see if there's no links to /static/$branchName that we forgot to catch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275933 (owner: 1020after4) [21:40:11] (03PS1) 10Jcrespo: Dump of the events present on the databases [software] - 10https://gerrit.wikimedia.org/r/275951 [21:42:42] (03PS2) 10Jcrespo: Dump of the events present on the databases [software] - 10https://gerrit.wikimedia.org/r/275951 [21:43:28] (03CR) 10Jcrespo: [C: 032 V: 032] Dump of the events present on the databases [software] - 10https://gerrit.wikimedia.org/r/275951 (owner: 10Jcrespo) [21:48:20] (03CR) 10Mholloway: "Scheduled for Thursday 3/10 Puppet SWAT." 
[puppet] - 10https://gerrit.wikimedia.org/r/275853 (https://phabricator.wikimedia.org/T128237) (owner: 10Mholloway) [21:51:38] (03CR) 1020after4: [C: 04-1] "ok sounds good, deploying without this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275933 (owner: 1020after4) [21:53:34] !log twentyafterfour@tin Started scap: testwiki to php-1.27.0-wmf.16 and rebuild l10n cache [21:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:54:25] 6Operations, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2100854 (10Krenair) Am now in, after a bit of fiddling around with passwords and `sshd -d`, it turns out the issue was with file permissions on my authorized_k... [21:56:03] 6Operations, 10MobileFrontend, 10Traffic, 13Patch-For-Review, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2100862 (10Jdlrobson) If I can get https://gerr... [22:00:55] Hello, I think I found a security issue on gerrit (I found a memory leak on JGit). Does anonymous clone requests have a ram limit ? [22:01:22] Hello, I think I found a security issue on gerrit (I found a memory leak on JGit). Does anonymous clone requests have a (non physical) ram limit ? [22:01:37] ytrezq: since this is a public channel, I'd recommend you send any security issues / questions you have to security@wikimedia.org [22:02:11] yuvipanda: I won’t details the memory leak here. 
[22:02:47] nor I’m asking the memory limit number [22:02:51] I was here for the last report [22:03:02] it's best for things to go to that address, a number of us see them right away [22:03:10] and they get high priority [22:03:26] apergos: I found that leak today [22:03:53] ok but shhh because this is a public channel and it's logged, no point in giving the bad guys any heads up [22:04:20] !log recreating slave watchdog events on all servers [22:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:04:25] apergos: so I won’t talk about it here [22:04:45] yep, send the email and we can see if it affects us, take steps, get more info from you, pass it on, etc [22:04:52] !log Canary application servers (mw1017-mw1025) and canary API application servers (mw1114-mw1119) upgraded to HHVM 3.12.1 [22:04:54] and thank you for reporting [22:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:06:32] apergos: and yuvipanda: there’s no point in reporting it if there’s a per request ram limit. It’s better to let the security folks handle more serious issues if it is (which normally should be). [22:07:01] so does https clone requests support a ram limit ? [22:07:04] they'll check, make sure or fix [22:07:10] and then move on [22:07:19] it's really ok [22:08:05] (03PS1) 10RobH: setting sc[ab]200[1-2] install params [puppet] - 10https://gerrit.wikimedia.org/r/276008 [22:11:02] (03PS1) 10Ottomata: Set file.encoding=UTF-8 for all java processes in analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/276010 (https://phabricator.wikimedia.org/T128607) [22:11:38] 6Operations, 10Wikimedia-General-or-Unknown: Update Wikimedia Debug extensions for Chrome and Firefox for configurable backend selection - https://phabricator.wikimedia.org/T129283#2100924 (10ori) [22:11:42] bd808: ^ [22:12:04] fancy! 
[22:12:30] I'll have to cargo cult some more frifox plugin knowledge [22:12:30] ytrezq: it's far easier for us (and you!) to just email the security alias [22:12:32] (03PS1) 10Jcrespo: Update slave events; add master events [software] - 10https://gerrit.wikimedia.org/r/276011 [22:12:34] 6Operations, 10MobileFrontend, 10Traffic, 13Patch-For-Review, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2100943 (10Jdlrobson) SWAT scheduled for today... [22:12:43] oohhh good, I have the ff extension [22:12:45] ytrezq: than to figure out who could be online at a particular time (we are all around the world!) and engage in real time [22:13:00] ytrezq: so I kindly request you to just email the security alias rather than report them here [22:13:00] bd808: :) [22:13:53] no I just skip that site [22:13:54] ytrezq: you can also report a security issue using this form: https://phabricator.wikimedia.org/maniphest/task/edit/form/2/ [22:14:04] no I’ll just skip that site [22:14:39] (03PS6) 10Dzahn: ganglia: add unit file template for systemd [puppet] - 10https://gerrit.wikimedia.org/r/275146 (https://phabricator.wikimedia.org/T123674) [22:15:07] (03CR) 10RobH: [C: 032] setting sc[ab]200[1-2] install params [puppet] - 10https://gerrit.wikimedia.org/r/276008 (owner: 10RobH) [22:15:57] (03PS2) 10Jcrespo: Update slave events; add master events [software] - 10https://gerrit.wikimedia.org/r/276011 [22:15:59] (03PS7) 10Dzahn: ganglia: add unit file template for systemd [puppet] - 10https://gerrit.wikimedia.org/r/275146 (https://phabricator.wikimedia.org/T123674) [22:17:02] (03CR) 10Jcrespo: [C: 032 V: 032] Update slave events; add master events [software] - 10https://gerrit.wikimedia.org/r/276011 (owner: 10Jcrespo) [22:17:04] (03CR) 10Dzahn: [C: 032] "using an "@" in the service name turns the regular unit file into a template. 
using that we will be able to start multiple aggregators fro" [puppet] - 10https://gerrit.wikimedia.org/r/275146 (https://phabricator.wikimedia.org/T123674) (owner: 10Dzahn) [22:19:40] !log twentyafterfour@tin Finished scap: testwiki to php-1.27.0-wmf.16 and rebuild l10n cache (duration: 26m 06s) [22:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:20:03] 6Operations, 6Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2101008 (10RobH) [22:27:04] (03CR) 10Alexandros Kosiaris: [C: 04-1] "First, some inline comments." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [22:27:34] twentyafterfour: getting no response from https://test.wikipedia.org/wiki/Special:Version [22:28:11] twentyafterfour: oh nevermind, it was just much slower than ever before (28.56s) [22:28:44] ebernhardson: it's still jitting php-1.27.0-wmf.16 code [22:28:50] it's probably running it in the interpreter [22:31:55] (03CR) 10Ottomata: "Ok, yeah. I do want to keep the data off the host, but this host will have access to HDFS, which is a pretty good backup solution. I can" [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [22:33:05] (03CR) 10Smalyshev: "Due to T128813 this probably won't have any immediate effect but will lay the grounds for proper caching when T128813 is fixed." [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) (owner: 10Smalyshev) [22:36:00] akosiaris: yt? 
qs about backup stuff [22:36:19] (03PS1) 10RobH: sc[ab]200[1-2] partition update [puppet] - 10https://gerrit.wikimedia.org/r/276018 [22:38:00] (03PS2) 10RobH: sc[ab]200[1-2] partition update [puppet] - 10https://gerrit.wikimedia.org/r/276018 [22:38:13] (03PS1) 10Dzahn: ganglia-aggregator: don't set user in service template [puppet] - 10https://gerrit.wikimedia.org/r/276019 [22:38:29] (03PS2) 10Dzahn: ganglia-aggregator: don't set user in service template [puppet] - 10https://gerrit.wikimedia.org/r/276019 [22:38:48] (03CR) 10jenkins-bot: [V: 04-1] ganglia-aggregator: don't set user in service template [puppet] - 10https://gerrit.wikimedia.org/r/276019 (owner: 10Dzahn) [22:38:54] (03PS3) 10RobH: sc[ab]200[1-2] partition update [puppet] - 10https://gerrit.wikimedia.org/r/276018 [22:38:59] (03CR) 10Dzahn: [C: 032] "gmond will run as ganglia anyways" [puppet] - 10https://gerrit.wikimedia.org/r/276019 (owner: 10Dzahn) [22:40:44] (03PS4) 10RobH: sc[ab]200[1-2] partition update [puppet] - 10https://gerrit.wikimedia.org/r/276018 [22:44:30] twentyafterfour: all good, even without those symlinks? 
[22:44:34] (03CR) 10RobH: [C: 032] sc[ab]200[1-2] partition update [puppet] - 10https://gerrit.wikimedia.org/r/276018 (owner: 10RobH) [22:44:55] (03PS1) 10Dzahn: netboot: use same partman for all bastions [puppet] - 10https://gerrit.wikimedia.org/r/276022 (https://phabricator.wikimedia.org/T128899) [22:45:29] (03PS2) 10Dzahn: netboot: use same partman for all bastions [puppet] - 10https://gerrit.wikimedia.org/r/276022 (https://phabricator.wikimedia.org/T128899) [22:45:30] ottomata: yup, ask away [22:46:06] (03PS3) 10Dzahn: netboot: use same partman for all bastions [puppet] - 10https://gerrit.wikimedia.org/r/276022 (https://phabricator.wikimedia.org/T128899) [22:46:13] (03CR) 10Dzahn: [C: 032] netboot: use same partman for all bastions [puppet] - 10https://gerrit.wikimedia.org/r/276022 (https://phabricator.wikimedia.org/T128899) (owner: 10Dzahn) [22:46:41] akosiaris: so, if i want to do xtrabackup/innobackpex on my own then [22:46:51] should i try to adapt your bpipe-mysql-db.erb script? [22:46:56] or should I just write my own [22:47:00] ? [22:47:28] bpipe stands for bacula pipe. You probably don't want to use pipes if you write your own [22:47:38] ah ok [22:47:39] hm [22:47:45] and just ship the backup to a specific directory [22:47:48] aye [22:48:05] hm, ok, then, i guess i'll write some more generic wrapper around innobaxkupex that does incrementals [22:48:16] :q! [22:48:21] sigh.. 
wrong window [22:49:09] (03CR) 10BearND: [C: 031] Add Accept: header to RESTBase/Parsoid requests [puppet] - 10https://gerrit.wikimedia.org/r/275853 (https://phabricator.wikimedia.org/T128237) (owner: 10Mholloway) [22:49:11] and I don't see predump helping you either is you want xtrabackup [22:49:15] yeah [22:49:25] i mean, i don't care specificly, i just want whatever is fastest and least intrusive [22:49:48] its been a while since i've done mysql backups other than mysqldump...i used to use LVM snapshots + mylvmbackup [22:49:51] which worked pretty well [22:50:34] uhhhh should I use xbstream? [22:50:35] well, if you remove the --stream=xbstream from my script [22:50:44] then you get pretty much to where you want [22:51:10] but there are quite a few ifs/then/elses in that thing for what you want [22:51:41] yeah [22:51:44] i [22:51:48] i'm also just reading docs [22:51:51] in reality all you want is /usr/bin/innobackupex --parallel=<%= Integer(@processorcount)/2 > 1 ? Integer(@processorcount/2) --databases=$database /dev/null | $PIGZ [22:51:56] might be easier for me to jsut script something around it [22:52:09] why pigz? [22:52:14] oh reading [22:52:15] no, actually remove the | $PIGZ at the end [22:52:23] ok [22:52:23] c/p mistake [22:52:31] ok, that just makes the base backup, ja? [22:52:37] yup [22:52:38] i guess if it is fast enough I don't have to do incremental [22:52:58] rely on the binary logs you mean ? [22:53:02] uhhh [22:53:07] https://www.percona.com/doc/percona-xtrabackup/2.3/innobackupex/incremental_backups_innobackupex.html [22:53:50] !log bast2001 - powercycle, reinstall [22:53:52] ah, so yeah you could do that. Or rely on the binary logs [22:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:54:05] it's not a bad idea, though I 've never tested it [22:54:13] how often do you want to backup ? daily ? [22:54:23] losing a day's worth of data is OK ? or not ? 
[22:54:33] akosiaris: if it was just full dumps, i'd do daily, but if i could do unintrusive incrementals, then maybe more often [22:54:41] greg-g: I'm not sure [22:54:48] this isn't critical data, but, the more data we lose, the harder the cluster has to work when we restore [22:54:57] this is mostly metadata about oozie jobs and hive tables [22:55:32] Krinkle: can you confirm? [22:55:43] https://test.wikipedia.org/wiki/Main_Page looks ok [22:56:00] so incremental will just allow you to run the job more often keeping basically the innodb page write into a different directory [22:56:23] aye [22:56:52] but if the backup is fast enough, and say hourly is fine for you, you might not need it [22:57:04] I suppose experimentation is the only way to make sure [22:57:16] 40GB on SSDs though is not much [22:57:40] not on SSDs [22:57:43] but ja [22:57:49] gonna see how long it takes to do a full of that [22:58:09] so with 50MB/s it should be around 15mins [22:58:14] (Abandoned) 20after4: 1.27.0-wmf.16 symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/275933 (owner: 20after4) [22:58:23] it is starting to make sense to use incrementals I think [22:58:32] cause you might not get 50MB/s [22:59:01] i can probably have it write the backups to other drives [22:59:14] but ja, i don't want to slow down the usage of mysql [22:59:33] Operations: Port Ganglia aggregator setup to systemd - https://phabricator.wikimedia.org/T124197#2101149 (Dzahn) a:Dzahn [22:59:43] (CR) 20after4: [C: 2] Group0 to 1.27.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/275929 (owner: 20after4) [22:59:45] Operations, media-storage: Unable to undelete file - https://phabricator.wikimedia.org/T129212#2101150 (DMacks) Confirmed working for me also.
[23:00:11] (Merged) jenkins-bot: Group0 to 1.27.0-wmf.16 [mediawiki-config] - https://gerrit.wikimedia.org/r/275929 (owner: 20after4) [23:03:03] Operations: repalce non-lvm paritioning with lvm - https://phabricator.wikimedia.org/T129287#2101177 (RobH) [23:03:15] !log twentyafterfour@tin Synchronized w/static/: (no message) (duration: 00m 32s) [23:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:31] oops, i should have started icinga-wm earlier [23:11:08] !log neon: re-enable puppet, start icinga-wm [23:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:13:24] ACKNOWLEDGEMENT - HTTPS on titanium is CRITICAL: SSL CRITICAL - Certificate archiva.wikimedia.org valid until 2016-04-07 15:16:02 +0000 (expires in 29 days) daniel_zahn https://phabricator.wikimedia.org/T129273 [23:14:03] Operations, Deployment-Systems, Monitoring, scap, Scap3 (scap3-adoption): Deploy servermon with scap3 - https://phabricator.wikimedia.org/T129152#2101295 (greg) [23:14:37] Operations, Deployment-Systems, Monitoring, scap, Scap3 (scap3-adoption): Deploy servermon with scap3 - https://phabricator.wikimedia.org/T129152#2096629 (greg) [23:16:13] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [23:17:42] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[23:20:08] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.27.0-wmf.16 [23:20:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:41] !log bast2001 - install issues - extending downtime, bbiaw [23:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:33] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [23:22:40] (CR) Brian Wolff: [C: 1] Password policies for advanced permission groups [mediawiki-config] - https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: CSteipp) [23:24:13] akosiaris: fyi, took 18m37.964s to do full backup with --parallel=8 [23:41:53] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [23:52:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]