[00:00:00] (PS1) BBlack: rename cache_remote to cache_eqiad [puppet] - https://gerrit.wikimedia.org/r/274595 (https://phabricator.wikimedia.org/T127481)
[00:00:02] (PS1) BBlack: role::cache::instances: create, use for maps [puppet] - https://gerrit.wikimedia.org/r/274596 (https://phabricator.wikimedia.org/T127481)
[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160303T0000).
[00:00:04] hoo James_F MatmaRex: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:04] (PS1) BBlack: role::cache::instances: use for misc [puppet] - https://gerrit.wikimedia.org/r/274597 (https://phabricator.wikimedia.org/T127481)
[00:00:06] (PS1) BBlack: role::cache::instances: use for text [puppet] - https://gerrit.wikimedia.org/r/274598 (https://phabricator.wikimedia.org/T127481)
[00:00:08] (PS1) BBlack: role::cache::instances: use for upload [puppet] - https://gerrit.wikimedia.org/r/274599 (https://phabricator.wikimedia.org/T127481)
[00:00:10] (PS1) BBlack: role::cache::instances: all backends get all backends [puppet] - https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481)
[00:00:12] (PS1) BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481)
[00:00:14] (PS1) BBlack: VCL: cache_route_table replaces site_tier [puppet] - https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481)
[00:00:16] (PS1) BBlack: role::cache::instances: move cache route table to hiera [puppet] - https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481)
[00:00:19] I can deploy it myself
[00:00:31] think I'm the only one up for SWAT anyway
[00:00:49] !log restarting pdns on labservices1001
[00:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:01:28] (i'm here if james isn't)
[00:01:36] Ok, I'll do swat
[00:01:46] andrewbogott: I think I just did it before you and was about to log just fyi if you look through logs
[00:02:03] chasemp: oh, does that mean the alerts were /caused/ by you or fixed by you?
[00:02:03] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 797718 bytes in 6.620 second response time
[00:02:04] RECOVERY - Auth DNS for labs pdns on labs-ns2.wikimedia.org is OK: DNS OK: 0.107 seconds response time. nagiostest.eqiad.wmflabs returns
[00:02:16] andrewbogott: well fixed by me by a hair
[00:02:21] I wasn't doing anything when it started
[00:02:25] ok
[00:02:32] this one didn’t follow the timeline
[00:02:36] nope
[00:02:46] (CR) Hoo man: [C: 2] Set $wgWikimediaBadgesCommonsCategoryProperty to null on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/274506 (https://phabricator.wikimedia.org/T128661) (owner: Hoo man)
[00:02:48] hey
[00:02:56] sorry I'm late
[00:03:06] you comfortable with doing these hoo? or shall I take over after yours?
[00:03:32] second one will need a scap
[00:03:35] I'm fine with doing both
[00:03:39] James_F|Away: ^ you aware
[00:05:19] (CR) jenkins-bot: [V: -1] wmflib: add hash_select_re and hash_deselect_re [puppet] - https://gerrit.wikimedia.org/r/274400 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[00:05:49] Is zuul taking a nap?
[00:06:47] ops/puppet changes are clogged for some reason...
[00:07:20] hoo: i'm afk for ~15 minutes, will be here to verify my change later (if zuul unclogs itself ;) ). sorry
[00:07:32] MatmaRex: It'll need a full scap anyways
[00:07:44] (as it introduces a new message)
[00:08:10] (Merged) jenkins-bot: Set $wgWikimediaBadgesCommonsCategoryProperty to null on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/274506 (https://phabricator.wikimedia.org/T128661) (owner: Hoo man)
[00:08:15] ah, here we go
[00:08:23] blame brandon
[00:08:28] :)
[00:08:34] :P
[00:08:37] :(
[00:08:56] oops!
[00:09:10] sorry, I was in my own world, I didn't realize it would hold up swat jenkins stuff too
[00:09:12] meant bblack
[00:09:22] bblack: s'alright, it's going through
[00:09:37] (PS3) Dzahn: wikitech: open up dumps directory to public [puppet] - https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170)
[00:09:39] (that's what I get for trying to not ping bblack, btw)
[00:10:00] !log hoo@tin Synchronized wmf-config/Wikibase-production.php: Set $wgWikimediaBadgesCommonsCategoryProperty to null on commons (T128661) (duration: 01m 09s)
[00:10:01] T128661: Don't add other projects links to commons on commons - https://phabricator.wikimedia.org/T128661
[00:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:40] probably because nodepool wasn't ready for it
[00:10:53] Ok, mine looks good
[00:12:34] (PS4) Dzahn: wikitech: open up dumps directory to public [puppet] - https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170)
[00:12:49] will probably wait for MatmaRex with the other change
[00:12:54] although scap will take forever
[00:14:06] !log wikitech: delete /a/backup/public/foo and ./bar cruft
[00:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:16:26] (PS5) Dzahn: wikitech: open up dumps directory to public [puppet] - https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170)
[00:16:37] (CR) Dzahn: [C: 2] wikitech: open up dumps directory to public [puppet] - https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170) (owner: Dzahn)
[00:17:09] (CR) Dzahn: [V: 2] wikitech: open up dumps directory to public [puppet] - https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170) (owner: Dzahn)
[00:19:19] !log Restarted hhvm on mw1025 because of "Cannot access property on non-object in /srv/mediawiki/php-1.27.0-wmf.14/includes/filerepo/LocalRepo.php"
[00:19:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:20:42] (CR) jenkins-bot: [V: -1] role::cache::instances: create, use for maps [puppet] - https://gerrit.wikimedia.org/r/274596 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[00:20:43] Keeps happening, but only on that machine
[00:20:48] !log upgrade elastic1013.eqiad.wmnet to elasticsearch 1.7.5
[00:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:21:08] ugh
[00:21:11] it has the wrong file
[00:21:34] 98473a4a0fc926f44d7855042a4a1fb2 /srv/mediawiki/php-1.27.0-wmf.14/includes/filerepo/LocalRepo.php
[00:21:36] vs 81d8a9a39f2007c1ec7bcc60c585c998 /srv/mediawiki/php-1.27.0-wmf.14/includes/filerepo/LocalRepo.php
[00:22:05] (CR) jenkins-bot: [V: -1] role::cache::instances: use for misc [puppet] - https://gerrit.wikimedia.org/r/274597 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[00:22:29] hoo: i'm around now
[00:22:37] (CR) jenkins-bot: [V: -1] role::cache::instances: use for text [puppet] - https://gerrit.wikimedia.org/r/274598 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[00:22:42] MatmaRex: Awesome
[00:22:51] you're aware that we need to SWAT, I guess?
[00:22:59] * scap
[00:23:28] (CR) jenkins-bot: [V: -1] role::cache::instances: use for upload [puppet] - https://gerrit.wikimedia.org/r/274599 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[00:23:52] (CR) jenkins-bot: [V: -1] role::cache::instances: all backends get all backends [puppet] - https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[00:23:56] !log Ran sync-common on mw1025, because it apparently didn't pick up recent changes
[00:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:24:15] probably not in dsh
[00:24:25] it is
[00:24:39] PROBLEM - DPKG on elastic1013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[00:25:24] mutante: I’m about to go to dinner, but would appreciate a puppet fix for labtestweb2001 if you get to it
[00:25:34] MatmaRex: ^
[00:25:36] (PS1) Thcipriani: Update scap to v.3.0.3-1 [puppet] - https://gerrit.wikimedia.org/r/274608
[00:26:04] andrewbogott: i'm on silver and one change upcoming in a second. i did NOT realize there is any issue on labstestweb2001 but will certainly look
[00:26:27] bd808: thcipriani: Shall I file a scap bug or just whatever it?
[00:26:31] hoo: yeah
[00:26:35] a puppet error seems strange when not even editing a .pp
[00:26:54] (CR) jenkins-bot: [V: -1] VCL: remove tier-one conditionals on be applayer backend code [puppet] - https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[00:26:54] hoo: what's the scap problem?
[00:27:16] bd808: mw1025 had the wrong file, despite being in dsh
[00:27:20] see above
[00:27:20] hoo: "need"… it'd only affect mw.org at the moment as the only prod wiki, right?
[00:27:38] and l10nupdate would pick it up before the next deployment
[00:27:41] (CR) jenkins-bot: [V: -1] VCL: cache_route_table replaces site_tier [puppet] - https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[00:27:42] hopefully
[00:27:51] hoo: mw1025 being stale?
[00:27:55] MatmaRex: If that's ok with you
[00:28:05] bd808: Yeah, not sure it was stale or just corrupted
[00:28:17] you can probably easily verify with the md5 I posted above
[00:28:24] s/verify/find out/
[00:30:01] (PS1) Dzahn: wikitech-apache: enable (fancy) indexing for dumps dir [puppet] - https://gerrit.wikimedia.org/r/274609 (https://phabricator.wikimedia.org/T54170)
[00:30:27] (PS2) Dzahn: wikitech-apache: enable (fancy) indexing for dumps dir [puppet] - https://gerrit.wikimedia.org/r/274609 (https://phabricator.wikimedia.org/T54170)
[00:30:37] (CR) Dzahn: "also needs https://gerrit.wikimedia.org/r/#/c/274609/" [puppet] - https://gerrit.wikimedia.org/r/274405 (https://phabricator.wikimedia.org/T54170) (owner: Dzahn)
[00:30:45] !log Ran sync-common on mw1025
[00:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:30:49] (CR) jenkins-bot: [V: -1] role::cache::instances: move cache route table to hiera [puppet] - https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[00:30:51] bd808: Already did
[00:31:09] (CR) Dzahn: [C: 2 V: 2] wikitech-apache: enable (fancy) indexing for dumps dir [puppet] - https://gerrit.wikimedia.org/r/274609 (https://phabricator.wikimedia.org/T54170) (owner: Dzahn)
[00:31:15] hoo: the really important part is the backend changes
[00:31:17] I don't suppose you ran it with --verbose to see what was updated?
[00:31:37] bd808: No :/
[00:31:54] I suppose its more than a single rsync now?
[00:32:09] meh. it will be in sal now. if it acts up again we can look into it
[00:32:38] sync-common is rsync + cdb rebuilds
[00:32:48] andrewbogott: i see the issue.. which puppetmaster controls labstestweb2001
[00:33:08] i'll figure it out if you are at dinner
[00:33:34] silver is fine and all done
[00:33:48] (CR) Dzahn: "https://wikitech.wikimedia.org/dumps/" [puppet] - https://gerrit.wikimedia.org/r/274609 (https://phabricator.wikimedia.org/T54170) (owner: Dzahn)
[00:34:35] Operations, Dumps-Generation, Labs, wikitech.wikimedia.org, Patch-For-Review: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2082702 (Dzahn) @MZMcBride -> https://wikitech.wikimedia.org/dumps/
[00:35:00] Operations, Dumps-Generation, Labs, wikitech.wikimedia.org: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2082704 (Dzahn)
[00:36:59] !log hoo@tin Synchronized php-1.27.0-wmf.15/extensions/TemplateData/: Change default format to null instead of 'inline' (duration: 01m 02s)
[00:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:37:30] MatmaRex: please verify
[00:37:57] Operations, Datasets-General-or-Unknown, Dumps-Generation, Labs, wikitech.wikimedia.org: since dumps are public on wikitech, do we still want them on dumps.wm.org ? - https://phabricator.wikimedia.org/T128680#2082710 (Dzahn)
[00:37:59] bblack: http://i.imgur.com/FpFSb9C.png :P
[00:38:14] Operations, Datasets-General-or-Unknown, Dumps-Generation, Labs, wikitech.wikimedia.org: copy wikitech dumps to dumps server ? - https://phabricator.wikimedia.org/T128680#2082710 (Dzahn)
[00:38:18] Operations, Services, hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2082727 (GWicke) Would it make sense / be realistic to consider running zotero in containers, rather than dedicating hardware to it?
[00:38:32] Operations, Datasets-General-or-Unknown, Dumps-Generation, Labs, wikitech.wikimedia.org: copy wikitech dumps to dumps server ? - https://phabricator.wikimedia.org/T128680#2082710 (Dzahn) a:Dzahn>None
[00:38:55] hoo: thanks, looking
[00:38:57] greg-g: :)
[00:38:59] Operations, Dumps-Generation, Labs, wikitech.wikimedia.org: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#562758 (Dzahn) Open>Resolved
[00:40:40] greg-g: I hear phab's arc has the answer to this problem btw (of tiny stepwise commits for clearer review vs giant chunky "changes" that are better for merging, and all that)
[00:40:54] looking forward to finding out anyways
[00:42:36] bblack: 'tis true
[00:44:21] hoo: (not that easy to find a template this would affect on mw.org…)
[00:45:39] Puppet, Phabricator, Patch-For-Review: Create puppet role for Phabricator hosted repo testing - https://phabricator.wikimedia.org/T104827#2082735 (Negative24) Open>stalled
[00:46:02] Puppet, Phabricator, Patch-For-Review: Create puppet role for Phabricator hosted repo testing - https://phabricator.wikimedia.org/T104827#1428544 (Negative24) a:Negative24>None
[00:46:10] hoo: (and when you do find one, it's only used on Translate pages, making it impossible to test in VE)
[00:46:25] hoo: but the API part seems to work. i trust that Parsoid guys tested their stuff
[00:46:39] Ok, then
[00:46:42] so should be fine. thanks
[00:47:39] Operations, hardware-requests, codfw-rollout, codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2082739 (RobH) None of the spares in codfw have more than 32GB. I'll escalate the HP quote for 64GB for purchase review....
[00:49:12] (PS1) Dzahn: wikitech-apache: follow-up fix for labstestwiki config [puppet] - https://gerrit.wikimedia.org/r/274613
[00:49:34] (PS2) Dzahn: wikitech-apache: follow-up fix for labstestwiki config [puppet] - https://gerrit.wikimedia.org/r/274613
[00:49:51] (CR) Dzahn: [C: 2] wikitech-apache: follow-up fix for labstestwiki config [puppet] - https://gerrit.wikimedia.org/r/274613 (owner: Dzahn)
[00:50:06] (CR) Dzahn: [V: 2] wikitech-apache: follow-up fix for labstestwiki config [puppet] - https://gerrit.wikimedia.org/r/274613 (owner: Dzahn)
[00:50:19] !log Phabricator will be going down for maintenance around 01:00 UTC (Approximately 10 minutes from now)
[00:50:56] ^ I expect the maintenance to last for about 20 minutes. Icinga has been silenced for 01:00 to 02:00
[00:51:49] (scheduled service downtime for phabricator and puppet related alerts on iridium)
[00:51:56] (CR) Dzahn: "had to also do it for "labstest" or break labstestweb https://gerrit.wikimedia.org/r/#/c/274613/" [puppet] - https://gerrit.wikimedia.org/r/274402 (owner: Dzahn)
[00:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:52:22] andrewbogott: fixed
[00:52:33] ah, how did that happen :)
[00:52:48] Operations, hardware-requests, codfw-rollout, codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2082753 (RobH)
[00:54:18] PROBLEM - Disk space on labservices1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=76%)
[00:55:57] Operations, Labs, Labs-Infrastructure: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572#2082775 (Dzahn) Resolved>Open 16:58 < icinga-wm> PROBLEM - Disk space on labservices1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=76%)
[00:56:24] (PS3) 20after4: Move /srv/phab/repos to /srv/repos refs T125853 [puppet] - https://gerrit.wikimedia.org/r/274484 (https://phabricator.wikimedia.org/T125853)
[00:56:37] (PS13) 20after4: Clean up phabricator roles in puppet to remove tag pinning. [puppet] - https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851)
[00:57:09] (PS2) 20after4: Phabricator: support systemd as well as upstart. [puppet] - https://gerrit.wikimedia.org/r/274488
[00:57:18] CUSTOM - Disk space on labservices1001 is CRITICAL: DISK CRITICAL - free space: / 343 MB (3% inode=76%)
[00:57:59] RECOVERY - Disk space on labservices1001 is OK: DISK OK
[00:58:34] Operations, Labs, Labs-Infrastructure: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572#2082779 (Dzahn) Open>Resolved 17:02 < icinga-wm> RECOVERY - Disk space on labservices1001 is OK: DISK OK ?
[01:00:04] !log upgrade elastic1014.eqiad.wmnet to elasticsearch 1.7.5
[01:00:06] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160303T0100). Please do the needful.
[01:00:38] Operations, Traffic, netops: Office network using monkeybrains.net instead of connection to SFO pop site - https://phabricator.wikimedia.org/T128669#2082783 (Dzahn)
[01:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:01:02] Operations, Traffic, netops: Office network using monkeybrains.net instead of connection to SFO pop site - https://phabricator.wikimedia.org/T128669#2082348 (Dzahn) {F3510969}
[01:01:03] mutante: around?
[01:01:11] Operations, Traffic, netops: Office network using monkeybrains.net instead of connection to SFO pop site - https://phabricator.wikimedia.org/T128669#2082789 (Dzahn)
[01:01:51] twentyafterfour: yes
[01:02:22] ready to merge patches?
[01:02:44] yes, and stopped using phab now
[01:03:04] well, meta problem, the list of todo is on phab :)
[01:03:13] yes ...
[01:03:19] but i have it open in this tab
[01:03:21] I'll copy it to etherpad
[01:03:23] ok
[01:04:11] so, stopping services and then https://gerrit.wikimedia.org/r/#/c/274484
[01:04:19] https://etherpad.wikimedia.org/p/phab-2016-02-03
[01:04:41] mutante: yes, I've disabled puppet and icinga alerts
[01:04:58] cool, ok, who stops the services
[01:05:06] on iridium and puppetmaster
[01:05:18] puppetmaster needs to stop?
[01:05:27] I'm in iridium got phd stopped only need to stop apache
[01:05:28] no, bad phrasing. i am on those 2 machines
[01:05:34] oh ok
[01:05:35] ok, merging
[01:05:46] (CR) Dzahn: [C: 2] Move /srv/phab/repos to /srv/repos refs T125853 [puppet] - https://gerrit.wikimedia.org/r/274484 (https://phabricator.wikimedia.org/T125853) (owner: 20after4)
[01:06:25] ok, merged on master
[01:06:36] who's captain america? :)
[01:06:51] am i running puppet?
[01:07:10] mutante: I can, just need patches merged I've got the rest on iridium
[01:07:28] !log iridium - stop apache
[01:07:36] twentyafterfour: ok, go ahead
[01:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:08:37] phab looks unhappy
[01:08:44] gwicke: on purpose :)
[01:08:54] gwicke: downtime for maint
[01:09:04] ah, okay
[01:11:09] PROBLEM - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[01:12:17] ah, well, that icinga-wm message was separate and not covered by downtime because it's not on the host iridium itself
[01:12:21] turns that off
[01:12:30] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[01:12:54] mutante: ok
[01:13:15] damn sorry about missing that icinga alert
[01:13:20] ACKNOWLEDGEMENT - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down! daniel_zahn scheduled maintenance
[01:13:20] ACKNOWLEDGEMENT - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down! daniel_zahn scheduled maintenance
[01:13:20] ACKNOWLEDGEMENT - PyBal backends health check on lvs1012 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down! daniel_zahn scheduled maintenance
[01:13:25] mutante: ready for second patch I think
[01:13:30] don't worry, no pages
[01:13:34] ok!
[01:13:47] (CR) Dzahn: [C: 2] Clean up phabricator roles in puppet to remove tag pinning. [puppet] - https://gerrit.wikimedia.org/r/269561 (https://phabricator.wikimedia.org/T125851) (owner: 20after4)
[01:14:11] twentyafterfour: done
[01:16:02] !log testing puppet on iridium
[01:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:17:49] !log puppet says "Provider scap3 is not functional on this host"
[01:17:51] wtf
[01:18:27] !log elastic1013 "dpkg reports broken packages "
[01:18:50] scap package not installed :(
[01:18:53] eh... which other host do we know it's functional on?
[01:18:59] ugh
[01:19:06] can I apt-get install it and then submit patch to puppet?
[01:19:11] or do I need to fix it in puppet first
[01:19:40] I guess that dependency got dropped somewhere in all the rebasing
[01:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:19:46] usually always puppet but in this case, yes, apt-get it
[01:19:48] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[01:19:51] ok
[01:20:09] RECOVERY - PyBal backends health check on lvs1012 is OK: PYBAL OK - All pools are healthy
[01:21:05] sweet no other puppet errors, running it for real now
[01:21:46] !log manually installed scap package on iridium, will fix in puppet immediately after maintenance is finished
[01:22:31] wtf now puppet tried to install scap but the wrong version
[01:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:23:37] did we get the new build that unbreaks scap in apt yet?
[01:23:56] bd808: apparently but puppet is trying to install the old one
[01:24:15] for the right distro versions?
[01:24:17] probably pinned in the puppet manifest
[01:24:47] I hadn't heard from tyler that the latest build was live yet
[01:24:52] everything trusty?
[01:25:15] mutante: can you merge https://gerrit.wikimedia.org/r/#/c/274608/
[01:25:26] bd808: it's in apt. pinned in puppet though.
[01:25:45] why is the old version removed from apt, that's my question
[01:25:55] because puppet is trying to install a package that doesn't exist due to that fact
[01:26:00] eh, i can merge that, assuming it's ok for non-phab production deployment too, to update scap
[01:26:13] https://gerrit.wikimedia.org/r/#/c/274608/
[01:26:36] (PS2) Dzahn: Update scap to v.3.0.3-1 [puppet] - https://gerrit.wikimedia.org/r/274608 (owner: Thcipriani)
[01:26:44] * twentyafterfour believes it should be ok. thcipriani or bd808 will have to confirm :)
[01:27:01] I have it up for puppetswat tomorrow morning.
[01:27:09] thcipriani: is it ok if we do it now?
[01:27:16] yup, it should be fine
[01:27:20] mutante: ^
[01:27:28] twentyafterfour: our apt repo can only hold one version at a time :/
[01:27:38] (CR) Dzahn: [C: 2] "spontaneous puppet swat reschedule" [puppet] - https://gerrit.wikimedia.org/r/274608 (owner: Thcipriani)
[01:27:54] :-/ then pinning the version in puppet seems unwise
[01:28:06] * twentyafterfour tests again on iridium
[01:28:27] puppet-merged
[01:28:53] w/in 62
[01:28:56] pinning the version keeps puppet from randomly upgrading, but would cause issues with new provisioning
[01:30:48] RECOVERY - DPKG on elastic1013 is OK: All packages OK
[01:30:57] ensure => present instead of ensure => latest should also keep it from randomly upgrading
[01:31:20] phabricator returns error:
[01:31:22] PhutilTypeExtraParametersException
[01:31:23] Got unexpected parameters: claim
[01:31:33] trying to reach https://phabricator.wikimedia.org/T128550
[01:32:06] actually now it appears on any url
[01:32:09] Danny_B: known, it's a scheduled maintenance period
[01:32:17] phab is switching the deployment method
[01:32:33] to scap
[01:32:52] Could not resolve host: deploy.eqiad.wmnet
[01:32:59] sigh... why can't maintainers put the maintain notice when maintaining so one would get the info instead of error
[01:33:10] Danny_B: see status line in topic
[01:33:29] Danny_B: there is no easy way to throw up a maintenance notice on phabricator when phab is offline
[01:33:31] twentyafterfour: sigh, really? let's put it in /etc/resolv.conf but also puppetized
[01:33:40] mutante: ok
[01:33:45] mutante: not every single phab user is on irc...
[01:33:59] I don't get why it doesn't resolve, it's resolving for me from outside
[01:34:12] twentyafterfour: rewriterule .* /maintenance.html ?
[01:34:24] Danny_B: lets not screw up every tab someone has
[01:34:31] open for phab
[01:34:47] p858snake: ???
[01:34:48] fair discussion, but maybe after it's back
[01:35:13] I prefer something to error out and keep the correct url, compared to be redirected to an incorrect url (and can't get back in some cases)
[01:35:33] twentyafterfour: where does it work for you?
[01:39:09] sorry it was supposed to be deployment.eqiad.wmnet not deploy. fixing it in scap.cfg
[01:40:57] Danny_B: no apache running so no rewriterule
[01:41:12] 503 -> 500 -> 404.
[01:41:22] We've had all the errors now from Phab in the last 10 minutes :P
[01:41:46] puppet run finished, with one error but it's not critical
[01:41:53] I can patch that shortly
[01:41:55] :)
[01:42:52] 503 Service Unavailable, 500 PhutilTypeExtraParametersException (PhutilTypeExtraParametersException), 404 Not Found (Apache), 500 FilesystemException ('main-header.json' does not exist)
[01:42:57] last one is current
[01:43:45] Status: Phabricator issues known
[01:44:53] and phab is back
[01:44:57] !log phabricator is back online
[01:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:45:08] nice work
[01:45:35] mutante: thanks, your help is much appreciated
[01:45:40] I'll fix that one puppet error now
[01:45:45] you're welcome, ok
[01:46:15] Woohoo
[01:46:16] (got 4 tickets to file now)
[01:47:59] in under half of the scheduled window
[01:48:35] !log upgrade elastic1015.eqiad.wmnet to elasticsearch 1.7.5
[01:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:49:57] !log Logstash elasticsearch cluster not responsive; investigating
[01:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:51:58] !log New index not being created due to low disk watermark exceeded on logstash1006
[01:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:53:22] (PS1) 20after4: Remove git::install for phabricator/tools [puppet] - https://gerrit.wikimedia.org/r/274618 (https://phabricator.wikimedia.org/T125851)
[01:54:03] (CR) Dzahn: [C: 2] "we talked about it already, scap now" [puppet] - https://gerrit.wikimedia.org/r/274618 (https://phabricator.wikimedia.org/T125851) (owner: 20after4)
[01:54:11] !log Deleted logstash-2016.02.03 index to free disk space
[01:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:59:18] !log puppet ran on iridium, no errors. :)
[01:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:59:47] half the scheduled window but double my planned window ;)
[02:00:02] * twentyafterfour was probably optimistic to get it done in 30 minutes
[02:00:59] Operations: Remove rbraceysherman@ from fr-all list - https://phabricator.wikimedia.org/T128639#2082943 (Dzahn) @JGulingan ok, i added you and @bbogaert to that new ticket T128647. Should i make one in zendesk too or is phabricator fine?
[02:03:19] !log Events flowing into logstash elasticsearch cluster again after forcing allocation of missing shard replica
[02:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:03:58] event volume is still really low
[02:04:22] I'm going to kick all the logstash processes too
[02:04:39] twentyafterfour: ok, so any other similar solution... (i am not server expert...) basically at least simple plaintext notice "we are having maintenance, eta: dd. mm. yyyy. hh:mm" would be much more friendly than hitting various errors/timeouts etc...
[02:05:12] Danny_B: agreed, file a task about it? I'll look into what can be done
[02:06:00] will do
[02:06:24] (it's not only about phab actually, i've been discussing that with apergos re the dump outage as well)
[02:08:01] (PS7) 20after4: Parameterize the git_server variable in global scap.cfg [puppet] - https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259)
[02:10:00] PROBLEM - puppet last run on mw2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:12:07] Operations, Wikimedia-Mailing-lists: Reset mailing list admin password for wikimedia-dz - https://phabricator.wikimedia.org/T128512#2082980 (Dzahn)
[02:18:14] !log upgrade elastic1016.eqiad.wmnet to elasticsearch 1.7.5
[02:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:24] (PS1) 20after4: Redirect create-task form, with or without slash (fixes T127286) [puppet] - https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286)
[02:28:55] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.14) (duration: 13m 44s)
[02:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:37:09] RECOVERY - puppet last run on mw2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:00:18] RECOVERY - Last backup of the others filesystem on labstore1001 is OK: OK - Last run for unit replicate-others was successful
[03:03:45] !log upgrade elastic1017.eqiad.wmnet to elasticsearch 1.7.5
[03:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:04:45] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.15) (duration: 18m 46s)
[03:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:05:49] PROBLEM - Last backup of the others filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-others was exit-code
[03:13:24] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Mar 3 03:13:23 UTC 2016 (duration 8m 38s)
[03:13:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:20:04] looks like phabricator has remained stable.. :)
[03:20:24] I'm going afk for a bit, call/page me if something comes up with phab
[03:43:17] (PS2) BBlack: role::cache::instances: all backends get all backends [puppet] - https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481)
[03:43:19] (PS2) BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481)
[03:43:21] (PS2) BBlack: VCL: cache_route_table replaces site_tier [puppet] - https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481)
[03:43:23] (PS2) BBlack: role::cache::instances: move cache route table to hiera [puppet] - https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481)
[03:43:25] (PS3) BBlack: wmflib: add hash_select_re and hash_deselect_re [puppet] - https://gerrit.wikimedia.org/r/274400 (https://phabricator.wikimedia.org/T127481)
[03:43:27] (PS2) BBlack: role::cache::instances: create, use for maps [puppet] - https://gerrit.wikimedia.org/r/274596 (https://phabricator.wikimedia.org/T127481)
[03:43:29] (PS2) BBlack: role::cache::instances: use for misc [puppet] - https://gerrit.wikimedia.org/r/274597 (https://phabricator.wikimedia.org/T127481)
[03:43:31] (PS2) BBlack: role::cache::instances: use for text [puppet] - https://gerrit.wikimedia.org/r/274598 (https://phabricator.wikimedia.org/T127481)
[03:43:33] (PS2) BBlack: role::cache::instances: use for upload [puppet] - https://gerrit.wikimedia.org/r/274599 (https://phabricator.wikimedia.org/T127481)
[03:44:52] (PS2) BBlack: 2layer: remove outdated pybal weight junk [puppet] - https://gerrit.wikimedia.org/r/274584 (https://phabricator.wikimedia.org/T127481)
[03:44:59] (CR) BBlack: [C: 2 V: 2] 2layer: remove outdated pybal weight junk [puppet] - https://gerrit.wikimedia.org/r/274584 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[03:45:32] (PS2) BBlack: 2layer: move $mma out of storage block [puppet] - https://gerrit.wikimedia.org/r/274585 (https://phabricator.wikimedia.org/T127481)
[03:45:38] (CR) BBlack: [C: 2 V: 2] 2layer: move $mma out of storage block [puppet] - https://gerrit.wikimedia.org/r/274585 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[03:46:10] (PS2) BBlack: geoip.inc.vcl.erb: move to text extra_vcl [puppet] - https://gerrit.wikimedia.org/r/274586 (https://phabricator.wikimedia.org/T127481)
[03:46:17] (CR) BBlack: [C: 2 V: 2] geoip.inc.vcl.erb: move to text extra_vcl [puppet] - https://gerrit.wikimedia.org/r/274586 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[03:46:38] (CR) jenkins-bot: [V: -1] VCL: remove tier-one conditionals on be applayer backend code [puppet] - https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[03:46:48] (PS2) BBlack: v::c::directors: remove dead code/comments [puppet] - https://gerrit.wikimedia.org/r/274587 (https://phabricator.wikimedia.org/T127481)
[03:46:52] (CR) jenkins-bot: [V: -1] role::cache::instances: all backends get all backends [puppet] - https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[03:46:55] (CR) BBlack: [C: 2 V: 2] v::c::directors: remove dead code/comments [puppet] - https://gerrit.wikimedia.org/r/274587 (https://phabricator.wikimedia.org/T127481) (owner: BBlack)
[03:47:15] (PS2) BBlack: misc-backend: clean up elsif whitespace [puppet] - https://gerrit.wikimedia.org/r/274588 (https://phabricator.wikimedia.org/T127481)
[03:47:19] (CR) jenkins-bot: [V: -1] VCL: cache_route_table replaces site_tier [puppet] - https://gerrit.wikimedia.org/r/274602
(https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [03:47:22] (03CR) 10BBlack: [C: 032 V: 032] misc-backend: clean up elsif whitespace [puppet] - 10https://gerrit.wikimedia.org/r/274588 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [03:47:59] (03CR) 10jenkins-bot: [V: 04-1] role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [03:48:29] (03CR) 10jenkins-bot: [V: 04-1] wmflib: add hash_select_re and hash_deselect_re [puppet] - 10https://gerrit.wikimedia.org/r/274400 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [03:48:59] (03CR) 10jenkins-bot: [V: 04-1] role::cache::instances: create, use for maps [puppet] - 10https://gerrit.wikimedia.org/r/274596 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [03:49:25] (03CR) 10jenkins-bot: [V: 04-1] role::cache::instances: use for misc [puppet] - 10https://gerrit.wikimedia.org/r/274597 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [03:49:49] (03CR) 10jenkins-bot: [V: 04-1] role::cache::instances: use for text [puppet] - 10https://gerrit.wikimedia.org/r/274598 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [03:50:15] (03CR) 10jenkins-bot: [V: 04-1] role::cache::instances: use for upload [puppet] - 10https://gerrit.wikimedia.org/r/274599 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [03:51:13] !log upgrade elastic1018.eqiad.wmnet to elasticsearch 1.7.5 [03:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:54:49] (03PS3) 10BBlack: role::cache::instances: all backends get all backends [puppet] - 10https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) [03:54:51] (03PS3) 10BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - 10https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) [03:54:53] 
(03PS3) 10BBlack: VCL: cache_route_table replaces site_tier [puppet] - 10https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) [03:54:55] (03PS3) 10BBlack: role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) [03:54:57] (03PS4) 10BBlack: wmflib: add hash_select_re and hash_deselect_re [puppet] - 10https://gerrit.wikimedia.org/r/274400 (https://phabricator.wikimedia.org/T127481) [03:54:59] (03PS3) 10BBlack: role::cache::instances: create, use for maps [puppet] - 10https://gerrit.wikimedia.org/r/274596 (https://phabricator.wikimedia.org/T127481) [03:55:01] (03PS3) 10BBlack: role::cache::instances: use for misc [puppet] - 10https://gerrit.wikimedia.org/r/274597 (https://phabricator.wikimedia.org/T127481) [03:55:03] (03PS3) 10BBlack: role::cache::instances: use for text [puppet] - 10https://gerrit.wikimedia.org/r/274598 (https://phabricator.wikimedia.org/T127481) [03:55:05] (03PS3) 10BBlack: role::cache::instances: use for upload [puppet] - 10https://gerrit.wikimedia.org/r/274599 (https://phabricator.wikimedia.org/T127481) [03:55:07] (03PS2) 10BBlack: v::c::directors: remove defaulting of service/dc [puppet] - 10https://gerrit.wikimedia.org/r/274592 (https://phabricator.wikimedia.org/T127481) [03:55:09] (03PS2) 10BBlack: role::cache: undo fe_t[12]_opts complexity [puppet] - 10https://gerrit.wikimedia.org/r/274593 (https://phabricator.wikimedia.org/T127481) [03:55:11] (03PS2) 10BBlack: VCL: move layer from vcl_config to instance param [puppet] - 10https://gerrit.wikimedia.org/r/274594 (https://phabricator.wikimedia.org/T127481) [03:55:13] (03PS2) 10BBlack: rename cache_remote to cache_eqiad [puppet] - 10https://gerrit.wikimedia.org/r/274595 (https://phabricator.wikimedia.org/T127481) [03:55:15] (03PS2) 10BBlack: VCL: rename remaining "backend" cache backends [puppet] - 10https://gerrit.wikimedia.org/r/274591 
(https://phabricator.wikimedia.org/T127481) [03:55:17] (03PS2) 10BBlack: VCL: explicit applayer backend selection [puppet] - 10https://gerrit.wikimedia.org/r/274590 (https://phabricator.wikimedia.org/T127481) [03:55:19] (03PS2) 10BBlack: text-backend: clean up applayer backend logic [puppet] - 10https://gerrit.wikimedia.org/r/274589 (https://phabricator.wikimedia.org/T127481) [04:01:23] (03CR) 10BBlack: [C: 032 V: 032] text-backend: clean up applayer backend logic [puppet] - 10https://gerrit.wikimedia.org/r/274589 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [04:05:28] !log disabling puppet on caches for a bit, JIC [04:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:05:55] (03CR) 10BBlack: [C: 032 V: 032] VCL: explicit applayer backend selection [puppet] - 10https://gerrit.wikimedia.org/r/274590 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [04:07:01] (03CR) 10BBlack: [C: 032 V: 032] VCL: rename remaining "backend" cache backends [puppet] - 10https://gerrit.wikimedia.org/r/274591 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [04:08:45] (03CR) 10BBlack: [C: 032 V: 032] v::c::directors: remove defaulting of service/dc [puppet] - 10https://gerrit.wikimedia.org/r/274592 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [04:09:16] (03CR) 10BBlack: [C: 032 V: 032] role::cache: undo fe_t[12]_opts complexity [puppet] - 10https://gerrit.wikimedia.org/r/274593 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [04:11:48] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: puppet fail [04:12:08] figures it would be the go tpl stuff :P [04:12:58] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: puppet fail [04:13:00] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: puppet fail [04:14:55] (03PS1) 10BBlack: bugfix: missing } in d37a3550 [puppet] - 10https://gerrit.wikimedia.org/r/274628 (https://phabricator.wikimedia.org/T127481) 
[04:15:12] (03CR) 10BBlack: [C: 032 V: 032] bugfix: missing } in d37a3550 [puppet] - 10https://gerrit.wikimedia.org/r/274628 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [04:18:09] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: puppet fail [04:18:19] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: puppet fail [04:18:40] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [04:19:38] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 2 failures [04:19:58] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [04:20:00] RECOVERY - puppet last run on cp1044 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [04:20:41] (03PS3) 10BBlack: VCL: move layer from vcl_config to instance param [puppet] - 10https://gerrit.wikimedia.org/r/274594 (https://phabricator.wikimedia.org/T127481) [04:21:19] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [04:21:47] (03CR) 10BBlack: [C: 032 V: 032] VCL: move layer from vcl_config to instance param [puppet] - 10https://gerrit.wikimedia.org/r/274594 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [04:22:26] (03PS3) 10BBlack: rename cache_remote to cache_eqiad [puppet] - 10https://gerrit.wikimedia.org/r/274595 (https://phabricator.wikimedia.org/T127481) [04:22:33] (03CR) 10BBlack: [C: 032 V: 032] rename cache_remote to cache_eqiad [puppet] - 10https://gerrit.wikimedia.org/r/274595 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [04:29:59] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 2 failures [04:30:18] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 2 failures [04:30:40] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 2 failures [04:30:40] PROBLEM - puppet last run 
on cp3007 is CRITICAL: CRITICAL: Puppet has 2 failures [04:30:47] bleh, it's a race condition [04:31:18] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 2 failures [04:31:19] several will fail like that, then recover shortly. nothing critical to see here. [04:31:19] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Puppet has 2 failures [04:31:19] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 2 failures [04:31:28] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 2 failures [04:31:28] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 2 failures [04:31:28] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Puppet has 2 failures [04:31:49] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: Puppet has 2 failures [04:31:49] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Puppet has 2 failures [04:32:29] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [04:33:38] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [04:33:39] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [04:33:39] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:33:50] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:34:19] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:34:49] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:34:50] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:34:50] RECOVERY - puppet last run on cp3036 is OK: OK: 
Puppet is currently enabled, last run 1 minute ago with 0 failures [04:34:58] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [04:34:59] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:34:59] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [04:35:42] (03PS4) 10BBlack: role::cache::instances: create, use for maps [puppet] - 10https://gerrit.wikimedia.org/r/274596 (https://phabricator.wikimedia.org/T127481) [04:36:38] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 10.34% of data above the critical threshold [100000000.0] [04:37:05] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2083214 (10EBernhardson) I looked back at your email and you are totally correct. I'm double checking with @tfinc but i'm pretty sure we will do the full 16 server refresh t... 
[04:38:01] (03PS4) 10BBlack: role::cache::instances: use for misc [puppet] - 10https://gerrit.wikimedia.org/r/274597 (https://phabricator.wikimedia.org/T127481) [04:38:14] (03PS4) 10BBlack: role::cache::instances: use for text [puppet] - 10https://gerrit.wikimedia.org/r/274598 (https://phabricator.wikimedia.org/T127481) [04:38:22] (03PS4) 10BBlack: role::cache::instances: use for upload [puppet] - 10https://gerrit.wikimedia.org/r/274599 (https://phabricator.wikimedia.org/T127481) [05:00:45] (03PS4) 10BBlack: role::cache::instances: all backends get all backends [puppet] - 10https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) [05:00:47] (03PS4) 10BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - 10https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) [05:00:49] (03PS4) 10BBlack: VCL: cache_route_table replaces site_tier [puppet] - 10https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) [05:00:51] (03PS4) 10BBlack: role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) [05:00:53] (03PS5) 10BBlack: wmflib: add hash_select_re and hash_deselect_re [puppet] - 10https://gerrit.wikimedia.org/r/274400 (https://phabricator.wikimedia.org/T127481) [05:00:55] (03PS5) 10BBlack: role::cache::instances: create and use for maps [puppet] - 10https://gerrit.wikimedia.org/r/274596 (https://phabricator.wikimedia.org/T127481) [05:00:57] (03PS5) 10BBlack: role::cache::instances: use for misc cluster [puppet] - 10https://gerrit.wikimedia.org/r/274597 (https://phabricator.wikimedia.org/T127481) [05:00:59] (03PS5) 10BBlack: role::cache::instances: use for text cluster [puppet] - 10https://gerrit.wikimedia.org/r/274598 (https://phabricator.wikimedia.org/T127481) [05:01:01] (03PS5) 10BBlack: role::cache::instances: use for upload cluster [puppet] - 
10https://gerrit.wikimedia.org/r/274599 (https://phabricator.wikimedia.org/T127481) [05:02:24] !log upgrade elastic1019.eqiad.wmnet to elasticsearch 1.7.5 [05:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:07:04] 6Operations, 6Labs, 10Labs-Infrastructure: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2083262 (10EBernhardson) I've been running some more relevance lab tests against nobelium, this time with a more representative query set. Specificall... [05:10:04] (03PS5) 10BBlack: role::cache::instances: all backends get all backends [puppet] - 10https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) [05:10:06] (03PS5) 10BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - 10https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) [05:10:08] (03PS5) 10BBlack: VCL: cache_route_table replaces site_tier [puppet] - 10https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) [05:10:10] (03PS5) 10BBlack: role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) [05:10:12] (03PS6) 10BBlack: wmflib: add hash_select_re and hash_deselect_re [puppet] - 10https://gerrit.wikimedia.org/r/274400 (https://phabricator.wikimedia.org/T127481) [05:10:14] (03PS6) 10BBlack: role::cache::instances: create and use for maps [puppet] - 10https://gerrit.wikimedia.org/r/274596 (https://phabricator.wikimedia.org/T127481) [05:10:16] (03PS6) 10BBlack: role::cache::instances: use for misc cluster [puppet] - 10https://gerrit.wikimedia.org/r/274597 (https://phabricator.wikimedia.org/T127481) [05:10:18] (03PS6) 10BBlack: role::cache::instances: use for text cluster [puppet] - 10https://gerrit.wikimedia.org/r/274598 (https://phabricator.wikimedia.org/T127481) [05:10:20] (03PS6) 10BBlack: 
role::cache::instances: use for upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/274599 (https://phabricator.wikimedia.org/T127481) [05:13:37] (03PS6) 10BBlack: role::cache::instances: all backends get all backends [puppet] - 10https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) [05:13:39] (03PS6) 10BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - 10https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) [05:13:41] (03PS6) 10BBlack: VCL: cache_route_table replaces site_tier [puppet] - 10https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) [05:13:43] (03PS6) 10BBlack: role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) [05:13:45] (03PS7) 10BBlack: wmflib: add hash_select_re and hash_deselect_re [puppet] - 10https://gerrit.wikimedia.org/r/274400 (https://phabricator.wikimedia.org/T127481) [05:13:47] (03PS7) 10BBlack: role::cache::instances: create and use for maps [puppet] - 10https://gerrit.wikimedia.org/r/274596 (https://phabricator.wikimedia.org/T127481) [05:13:49] (03PS7) 10BBlack: role::cache::instances: use for misc cluster [puppet] - 10https://gerrit.wikimedia.org/r/274597 (https://phabricator.wikimedia.org/T127481) [05:13:51] (03PS7) 10BBlack: role::cache::instances: use for text cluster [puppet] - 10https://gerrit.wikimedia.org/r/274598 (https://phabricator.wikimedia.org/T127481) [05:13:53] (03PS7) 10BBlack: role::cache::instances: use for upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/274599 (https://phabricator.wikimedia.org/T127481) [05:15:50] (03CR) 10jenkins-bot: [V: 04-1] role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [05:19:10] (03CR) 10BBlack: [C: 032 V: 032] role::cache::instances: create and use 
for maps [puppet] - 10https://gerrit.wikimedia.org/r/274596 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [05:19:17] (03CR) 10BBlack: [C: 032 V: 032] role::cache::instances: use for misc cluster [puppet] - 10https://gerrit.wikimedia.org/r/274597 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [05:19:27] (03CR) 10BBlack: [C: 032 V: 032] role::cache::instances: use for text cluster [puppet] - 10https://gerrit.wikimedia.org/r/274598 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [05:19:33] (03CR) 10BBlack: [C: 032 V: 032] role::cache::instances: use for upload cluster [puppet] - 10https://gerrit.wikimedia.org/r/274599 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [05:21:06] 6Operations, 6Labs: revise/fix labstore replicate backup jobs - https://phabricator.wikimedia.org/T127567#2083269 (10Dzahn) just my 2 cents from the merged task. that output line "Last run result for unit replicate-tools was exit-code " really looked as if there was just a typo where it should be 'was $exit-co... [05:21:49] (03CR) 10BBlack: [C: 032 V: 032] wmflib: add hash_select_re and hash_deselect_re [puppet] - 10https://gerrit.wikimedia.org/r/274400 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [05:24:41] 6Operations, 10Traffic, 6WMF-Communications, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2083273 (10Dzahn) I'm interested and happy to be a CC: but Brandon Black should definitely be the TO: [05:29:41] 5 # We really like nesting classes: [05:29:41] 6 --no-nested_classes_or_defines-check [05:29:46] do we really? [05:31:56] well, they are certainly rife in certain parts of the repo [05:32:20] i think they're horrible, personally. puppet's scoping is bad enough without them. 
[05:32:43] at some point it will be a question if we keep that one check disabled forever or not [05:32:49] to resolve https://phabricator.wikimedia.org/T93645 [05:33:26] so it's against the official style [05:33:41] it depends on whether we think someone will volunteer to correct existing cases [05:34:17] it would have to be done quite carefully, for the same reason -- i.e., tricky variable scoping [05:35:11] a happy middle-ground might be to add exemptions for all existing cases, even if they are many [05:35:28] yea, the chances are relatively good though. we have found volunteers for a couple other things like this. other puppet-lint exceptions that could then later be removed.. and reduced the number of checks that get skipped [05:35:33] it would be an achievement to prevent the pattern from spreading further [05:35:54] also, we can add these special comments around them to make lint skip some existing cases [05:36:06] while jenkins still won't encourage adding new ones [05:36:26] so remove the global exception but add special comments [05:36:39] yea [05:38:58] thanks for your comments. 
also see: [05:39:00] git log -p .puppet-lint.rc [05:44:02] thanks for doing this [05:46:43] (03PS1) 10BBlack: wmflib: bugfix for hash_deselect_re tests [puppet] - 10https://gerrit.wikimedia.org/r/274635 (https://phabricator.wikimedia.org/T127481) [05:46:45] (03PS1) 10BBlack: wmflib: document hash_(de)select_re for ori [puppet] - 10https://gerrit.wikimedia.org/r/274636 [05:46:49] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [05:46:56] aww :) [05:47:08] (03CR) 10BBlack: [C: 032 V: 032] wmflib: bugfix for hash_deselect_re tests [puppet] - 10https://gerrit.wikimedia.org/r/274635 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [05:47:23] (03CR) 10BBlack: [C: 032 V: 032] wmflib: document hash_(de)select_re for ori [puppet] - 10https://gerrit.wikimedia.org/r/274636 (owner: 10BBlack) [05:47:31] (03CR) 10Ori.livneh: "Thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/274636 (owner: 10BBlack) [05:47:58] !log upgrade elastic1020.eqiad.wmnet to elasticsearch 1.7.5 [05:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [05:51:37] 7Puppet, 10Beta-Cluster-Infrastructure, 13Patch-For-Review, 7Tracking: Remove all ::beta roles in puppet - https://phabricator.wikimedia.org/T86644#973295 (10Dzahn) @Joe your opinion on https://gerrit.wikimedia.org/r/#/c/247324/ would be appreciated [05:53:00] PROBLEM - puppet last run on cp2012 is CRITICAL: CRITICAL: Puppet has 2 failures [05:53:26] ^ apparently I left some of my race condition laying around, it's playing out again now as I enable them... 
[05:53:29] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: Puppet has 2 failures [05:53:59] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 2 failures [05:54:00] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 2 failures [05:54:38] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 2 failures [05:54:38] PROBLEM - puppet last run on cp2018 is CRITICAL: CRITICAL: Puppet has 2 failures [05:55:00] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 2 failures [05:55:40] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Puppet has 2 failures [05:55:40] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Puppet has 2 failures [05:55:40] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 2 failures [05:55:48] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: Puppet has 2 failures [05:55:48] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Puppet has 2 failures [05:55:49] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 2 failures [05:56:18] PROBLEM - puppet last run on cp2026 is CRITICAL: CRITICAL: Puppet has 2 failures [05:56:40] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 2 failures [05:56:49] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: Puppet has 2 failures [05:56:49] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 2 failures [05:56:50] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Puppet has 2 failures [05:57:38] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [05:57:58] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 2 failures [05:58:09] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 2 failures [05:58:09] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 2 failures [05:58:10] PROBLEM - puppet 
last run on cp2006 is CRITICAL: CRITICAL: Puppet has 2 failures [05:58:20] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 2 failures [05:58:20] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Puppet has 2 failures [05:58:39] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 2 failures [05:58:39] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 2 failures [05:58:49] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Puppet has 2 failures [05:59:09] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 2 failures [05:59:10] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 2 failures [05:59:19] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 2 failures [05:59:19] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 2 failures [05:59:20] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: Puppet has 2 failures [05:59:20] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 2 failures [05:59:20] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 2 failures [05:59:29] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: Puppet has 2 failures [05:59:59] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 2 failures [06:00:18] (03PS4) 10BBlack: Add a cluster_be_recv_pre_purge handler & normalize paths [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [06:00:59] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: Puppet has 2 failures [06:01:09] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 2 failures [06:01:09] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures [06:01:10] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 2 failures [06:01:49] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet 
has 2 failures [06:01:49] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:02:00] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:02:38] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 2 failures [06:02:38] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 2 failures [06:02:49] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:03:49] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:03:58] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Puppet has 2 failures [06:04:00] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 2 failures [06:04:00] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [06:04:08] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [06:04:20] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:04:22] bleh [06:05:24] (03CR) 10BBlack: [C: 031] "LGTM, and rebased onto the evening's refactors too. Waiting for the morning to merge though..." [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [06:05:30] RECOVERY - puppet last run on cp2012 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:05:49] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [06:05:49] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[06:05:59] RECOVERY - puppet last run on cp2020 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:06:20] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:06:20] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:06:29] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:06:29] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:06:49] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:06:59] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:06:59] RECOVERY - puppet last run on cp2018 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:07:29] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [06:08:00] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:08:10] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:08:10] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:08:10] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:08:10] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:08:10] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:08:10] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently 
enabled, last run 14 seconds ago with 0 failures [06:08:11] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:08:23] (03CR) 10BBlack: [C: 031] Allow maps for test and test2 [puppet] - 10https://gerrit.wikimedia.org/r/274475 (owner: 10Yurik) [06:08:39] RECOVERY - puppet last run on cp2026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:08:49] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:08:49] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:09:08] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:09:18] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:09:18] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:09:19] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:09:58] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:09:59] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:10:30] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:10:50] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:10:59] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:10:59] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:10:59] RECOVERY - puppet 
last run on cp3008 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:11:10] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:11:20] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:11:38] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:11:39] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:11:40] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:11:49] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:12:18] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:30:29] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:38] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail [06:31:38] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:49] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:49] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:49] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:19] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:09] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:09] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:19] PROBLEM - puppet last run on mw2043 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:19] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 
failures [06:33:19] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:38] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:58] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:26] !log upgrade elastic1021.eqiad.wmnet to elasticsearch 1.7.5 [06:39:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:56:00] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:56:10] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:28] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:56:28] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:29] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:56:40] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [06:56:58] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:59] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:49] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:59] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [06:57:59] RECOVERY - puppet last run on mw2043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:59] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[06:58:00] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:58:18] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:07:46] (03PS2) 10Muehlenhoff: Include yhsm-yubikey-ksm in yubiauth role [puppet] - 10https://gerrit.wikimedia.org/r/274359 [07:12:05] 6Operations, 10Ops-Access-Requests: Access to deployment group for user madhuvishy - https://phabricator.wikimedia.org/T128666#2083329 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [07:17:12] 6Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation, 6Labs, 10wikitech.wikimedia.org: copy wikitech dumps to dumps server ? - https://phabricator.wikimedia.org/T128680#2083348 (10ArielGlenn) p:5Lowest>3Low a:3ArielGlenn [07:19:31] 6Operations, 10Ops-Access-Requests: Access to deployment group for user madhuvishy - https://phabricator.wikimedia.org/T128666#2083353 (10MoritzMuehlenhoff) 5Open>3Resolved Password has been reset and sent to the list admin. [07:19:59] 6Operations, 10Ops-Access-Requests: Access to deployment group for user madhuvishy - https://phabricator.wikimedia.org/T128666#2083356 (10MoritzMuehlenhoff) 5Resolved>3Open Sorry, wrong tab, reopening. [07:20:26] 6Operations, 10Wikimedia-Mailing-lists: Reset mailing list admin password for wikimedia-dz - https://phabricator.wikimedia.org/T128512#2083360 (10MoritzMuehlenhoff) 5Open>3Resolved Password has been reset and sent to the list admin. 
[07:20:45] (03PS1) 10Jcrespo: Enable pt-heartbeat on all misc master (except m1) [puppet] - 10https://gerrit.wikimedia.org/r/274640 (https://phabricator.wikimedia.org/T114752) [07:22:14] (03PS2) 10Jcrespo: Enable pt-heartbeat on all misc masters (except m1) [puppet] - 10https://gerrit.wikimedia.org/r/274640 (https://phabricator.wikimedia.org/T114752) [07:22:30] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [07:25:58] (03CR) 10Muehlenhoff: "That was announced in Tech/News:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) (owner: 10Faidon Liambotis) [07:26:00] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:31:48] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 204, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave] [07:33:10] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave] [07:34:05] ? 
[07:36:40] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [07:36:48] text caches eqiad traffic is a bit spikey http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Text+caches+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [07:37:08] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 206, down: 0, dormant: 0, excluded: 0, unused: 0 [07:37:45] <_joe_> ori: strangely, it's the only cluster that's spiky [07:38:03] <_joe_> but if you look at http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Text+caches+esams&m=cpu_report&s=by+name&mc=2&g=network_report you see the scale difference [07:38:14] 6Operations, 10Ops-Access-Requests: Access to deployment group for user madhuvishy - https://phabricator.wikimedia.org/T128666#2083392 (10MoritzMuehlenhoff) Yes, this needs manager approval by Nuria. Also @greg needs to confirm as the project lead for the deployment group (I saw the call for help on the Engi... [07:43:23] 6Operations, 10Analytics, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2083401 (10elukey) Finally we should have a good picture of what changes between 3.0 and 4.0 api from a general overview: * https://www.varnish-cache.... [07:52:46] 6Operations, 10Wikimedia-Mailing-lists: Reset mailing list admin password for wikimedia-dz - https://phabricator.wikimedia.org/T128512#2083408 (10Vikoula5) all is good thank you [07:59:49] 6Operations: Identify servers with h310 controllers - https://phabricator.wikimedia.org/T84356#2083412 (10MoritzMuehlenhoff) a:3Cmjohnson [07:59:55] 6Operations: Identify servers with h310 controllers - https://phabricator.wikimedia.org/T84356#926194 (10MoritzMuehlenhoff) That ticket is from 2014, is that still an issue, have the controllers been replaced? Or can the ticket simply be closed? 
[08:00:46] 6Operations: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#2083415 (10MoritzMuehlenhoff) a:3Ottomata [08:11:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] Include yhsm-yubikey-ksm in yubiauth role [puppet] - 10https://gerrit.wikimedia.org/r/274359 (owner: 10Muehlenhoff) [08:15:38] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 xe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave] [08:16:08] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 204, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave] [08:23:14] (03PS3) 10Jcrespo: Enable pt-heartbeat on all misc masters (except m1) [puppet] - 10https://gerrit.wikimedia.org/r/274640 (https://phabricator.wikimedia.org/T114752) [08:25:03] (03CR) 10Jcrespo: [C: 032] Enable pt-heartbeat on all misc masters (except m1) [puppet] - 10https://gerrit.wikimedia.org/r/274640 (https://phabricator.wikimedia.org/T114752) (owner: 10Jcrespo) [08:27:25] !log altering heartbeat table on all production servers [08:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:30:55] 6Operations, 6Performance-Team, 10Traffic: Segment Navigation Timing data by continent - https://phabricator.wikimedia.org/T128709#2083445 (10ori) [08:32:01] (03CR) 10Giuseppe Lavagetto: role::memcached: create cross-dc replication for sessions (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [08:32:55] (03CR) 10Giuseppe Lavagetto: "Puppet compiler seems to show that the patch DTRT" [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [08:33:01] (03PS13) 10Giuseppe 
Lavagetto: role::memcached: create cross-dc replication for sessions [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) [08:33:19] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:33:36] !log depooling scb1001 for nodejs upgrade [08:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:33:50] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 206, down: 0, dormant: 0, excluded: 0, unused: 0 [08:35:28] <_joe_> !log disabled puppet across the main redises fleet in order to merge https://gerrit.wikimedia.org/r/271261 safely [08:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:36:07] (03CR) 10Giuseppe Lavagetto: [C: 032] "LGTM, I'll apply puppet with caution" [puppet] - 10https://gerrit.wikimedia.org/r/271261 (https://phabricator.wikimedia.org/T126470) (owner: 10Giuseppe Lavagetto) [08:43:29] PROBLEM - MariaDB Slave SQL: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1060, Errmsg: Error Duplicate column name shard on query. Default database: heartbeat. Query: ALTER TABLE heartbeat ADD COLUMN shard varbinary(10) DEFAULT NULL, ENGINE=MyISAM [08:43:41] I am going to purposely break replication on labs and dbstore. The reason is that the schema change is safer done like that for production (it will not work with multi-source replication). 
[08:47:00] RECOVERY - MariaDB Slave SQL: s1 on dbstore1002 is OK: OK slave_sql_state Slave_SQL_Running: Yes [08:49:00] sadly, this will happen 11 times, (and again tomorrow for the delayed slaves) [08:49:25] !log repooled scb1001, depooling scb1002 for nodejs upgrade [08:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:54:02] (03CR) 10Mobrovac: Add a cluster_be_recv_pre_purge handler & normalize paths (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [08:54:37] db1047 will break, too (it is multisource, too) [08:59:02] !log repooled scb1002 [08:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:11:18] 6Operations: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2083523 (10MoritzMuehlenhoff) a:3elukey [09:13:02] 6Operations, 10CirrusSearch, 6Discovery, 13Patch-For-Review: Only use newer (elastic10{16..31}) servers as master capable elasticsearch nodes - https://phabricator.wikimedia.org/T112556#2083525 (10MoritzMuehlenhoff) a:3Gehel [09:13:58] 6Operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=7 dev=sdh failed - https://phabricator.wikimedia.org/T127824#2083540 (10MoritzMuehlenhoff) a:3Papaul [09:15:24] 6Operations, 10ops-codfw: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2083544 (10MoritzMuehlenhoff) a:3Papaul [09:21:06] <_joe_> !log puppet re-enabled everywhere, now troubleshooting ipsec issues [09:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:45:51] 6Operations: OCG needs to migrate away from rdb1002 and get its own Redis instance - https://phabricator.wikimedia.org/T128491#2083600 (10Aklapper) //Please move off-topic "how to (not) use patch-for-review" discussions to T95309, T104413, T61831, or whatever. (Likely same for "tracking".) 
Thanks.// [09:45:58] PROBLEM - IPsec on mc1003 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2003_v4 [09:46:01] !log schema change finished on all hosts (except delayed slaves) [09:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:47:49] PROBLEM - IPsec on mc2003 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1003_v4 [09:50:31] 6Operations, 10Traffic: 3 Varnish cache_upload servers crashed in a short time window - https://phabricator.wikimedia.org/T125401#2083609 (10MoritzMuehlenhoff) > IMHO, 4.4.x is getting close anyways, we may as well see if this problem just goes away after the switch to it. Agreed, I'm mostly done with 4.4.3 e... [09:52:39] (03CR) 10Filippo Giunchedi: [C: 031] Parameterize the git_server variable in global scap.cfg [puppet] - 10https://gerrit.wikimedia.org/r/272947 (https://phabricator.wikimedia.org/T126259) (owner: 1020after4) [09:55:02] 6Operations, 7Graphite, 13Patch-For-Review: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#2083620 (10fgiunchedi) [09:56:22] (03PS2) 10Muehlenhoff: Include rsyncd ferm service in the statistics role [puppet] - 10https://gerrit.wikimedia.org/r/274391 [09:57:01] !log Added es2014,es2016,es2018 to tendril [ T127330 ] [09:57:02] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [09:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:57:07] (03CR) 10Muehlenhoff: [C: 032 V: 032] Include rsyncd ferm service in the statistics role [puppet] - 10https://gerrit.wikimedia.org/r/274391 (owner: 10Muehlenhoff) [10:05:42] (03PS3) 10Filippo Giunchedi: swiftrepl: fix destination container listing limit [software] - 10https://gerrit.wikimedia.org/r/272455 (https://phabricator.wikimedia.org/T125791) [10:05:47] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swiftrepl: fix destination container listing limit [software] - 
10https://gerrit.wikimedia.org/r/272455 (https://phabricator.wikimedia.org/T125791) (owner: 10Filippo Giunchedi) [10:05:59] PROBLEM - IPsec on mc1014 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2014_v4 [10:06:37] <_joe_> grrr [10:06:49] <_joe_> IPSEC issues, sigh [10:07:48] !log elastic1022.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [10:07:48] PROBLEM - IPsec on mc2014 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1014_v4 [10:07:50] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [10:07:50] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [10:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:12:20] PROBLEM - IPsec on mc1011 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2011_v4 [10:13:20] PROBLEM - IPsec on mc2011 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1011_v4 [10:14:48] PROBLEM - IPsec on mc1006 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2006_v4 [10:16:54] <_joe_> ok so this is a general problem, I see [10:16:58] PROBLEM - IPsec on mc2006 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1006_v4 [10:17:08] RECOVERY - IPsec on mc2011 is OK: Strongswan OK - 1 ESP OK [10:17:49] RECOVERY - IPsec on mc1011 is OK: Strongswan OK - 1 ESP OK [10:17:49] <_joe_> !log restarted strongswan on mc1011 [10:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:18:01] <_joe_> so that solves the issue, apparently [10:21:52] <_joe_> !log rolling restart of strongswan on eqiad failing servers [10:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:22:09] RECOVERY - IPsec on mc1003 is OK: Strongswan OK - 1 ESP OK [10:22:19] RECOVERY - IPsec on mc2003 is OK: Strongswan OK - 1 ESP OK [10:23:49] RECOVERY - IPsec on mc1006 is OK: 
Strongswan OK - 1 ESP OK [10:23:53] 6Operations, 10Traffic, 10netops: Office network using monkeybrains.net instead of connection to SFO pop site - https://phabricator.wikimedia.org/T128669#2083645 (10faidon) 5Open>3Invalid Our BGP sessions with the office are indeed down, but this is an OIT matter, you should ask them about it. If the pro... [10:24:09] RECOVERY - IPsec on mc1014 is OK: Strongswan OK - 1 ESP OK [10:24:09] RECOVERY - IPsec on mc2006 is OK: Strongswan OK - 1 ESP OK [10:24:10] RECOVERY - IPsec on mc2014 is OK: Strongswan OK - 1 ESP OK [10:26:40] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#2083664 (10faidon) >>! In T124671#2080234, @Elitre wrote: > (Possibly silly question alert: in case things went wrong, would this be refle... [10:26:46] /win 23 [10:28:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "minor nitpicks, but it's ok to merge otherwise" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274411 (https://phabricator.wikimedia.org/T123675) (owner: 10Elukey) [10:32:53] (03PS2) 10Elukey: Remove rdb1003 from the job queue pool for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274411 (https://phabricator.wikimedia.org/T123675) [10:34:01] (03PS3) 10Elukey: Remove rdb1003 from the job queue pool for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274411 (https://phabricator.wikimedia.org/T123675) [10:36:09] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable Redis cross-dc replication - https://phabricator.wikimedia.org/T126470#2083676 (10Joe) Replication of the session redises is up and running. Some ipsec hiccups, I will honestly evaluate if it's stable enough or if I should... 
[10:36:59] !log Changing local replica topology for shard es2 in codfw for T127330 [10:37:00] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [10:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:44:29] 6Operations, 6Labs, 10Tool-Labs: Add other Tools administrators to the Icinga notification group - https://phabricator.wikimedia.org/T128715#2083683 (10scfc) [10:45:07] 6Operations, 6Labs, 10Tool-Labs: Make icinga-wm report Tools homepage check at #wikimedia-labs, too - https://phabricator.wikimedia.org/T128716#2083696 (10scfc) [10:46:09] (03PS3) 10Muehlenhoff: Backport a change from 1.0.2g-1 to add the new exported symbols SRP_VBASE_get1_by_user and SRP_user_pwd_free [debs/openssl] - 10https://gerrit.wikimedia.org/r/274413 [10:46:19] PROBLEM - IPsec on mc1004 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2004_v4 [10:46:27] <_joe_> grrr [10:46:39] 6Operations, 6Discovery, 10MediaWiki-Vendor, 10Wikimedia-Logstash, and 2 others: Upgrade ruflin/elastica to 2.3.1 - https://phabricator.wikimedia.org/T127831#2083709 (10dcausse) Looks like it's a breaking change... https://github.com/ruflin/Elastica/commit/c51800f26426076d782b6cc2cb37345b64a6e615#diff-e42... 
[10:48:08] RECOVERY - IPsec on mc1004 is OK: Strongswan OK - 1 ESP OK [10:48:57] !log Changing local replica topology for shard es3 in codfw for T127330 [10:48:58] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [10:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:53:47] (03CR) 10Muehlenhoff: [C: 032 V: 032] Backport a change from 1.0.2g-1 to add the new exported symbols SRP_VBASE_get1_by_user and SRP_user_pwd_free [debs/openssl] - 10https://gerrit.wikimedia.org/r/274413 (owner: 10Muehlenhoff) [10:54:49] !log replicate swift unsharded -deleted containers eqiad -> codfw T128096 [10:54:50] T128096: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096 [10:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:55:21] 6Operations, 10media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2083732 (10fgiunchedi) I've reviewed the errors from channel `FileOperation` and most s... [10:59:00] <_joe_> /win 20 [10:59:28] (03PS2) 10Faidon Liambotis: Kill misc::limn and the limn module [puppet] - 10https://gerrit.wikimedia.org/r/231144 [11:00:26] (03CR) 10Faidon Liambotis: [C: 032] Kill misc::limn and the limn module [puppet] - 10https://gerrit.wikimedia.org/r/231144 (owner: 10Faidon Liambotis) [11:02:34] (03CR) 10Giuseppe Lavagetto: [C: 031] Remove rdb1003 from the job queue pool for maintenance. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/274411 (https://phabricator.wikimedia.org/T123675) (owner: 10Elukey) [11:04:49] PROBLEM - IPsec on mc1015 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2015_v4 [11:05:30] (03PS1) 10Jcrespo: Enable the new pt-heartbeat on core production hosts [puppet] - 10https://gerrit.wikimedia.org/r/274670 (https://phabricator.wikimedia.org/T114752) [11:06:58] PROBLEM - IPsec on mc2015 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1015_v4 [11:09:09] PROBLEM - IPsec on mc1008 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2008_v4 [11:09:40] PROBLEM - IPsec on mc2008 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1008_v4 [11:10:38] (03PS1) 10Volans: Depooled es200{6,8} to migrate data to es201{5,6} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274671 (https://phabricator.wikimedia.org/T127330) [11:10:40] PROBLEM - IPsec on mc2001 is CRITICAL: Strongswan CRITICAL - ok: 1 connecting: mc1017_v4 [11:11:59] PROBLEM - IPsec on mc1018 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2016_v4 [11:11:59] PROBLEM - IPsec on mc1017 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2001_v4 [11:12:19] <_joe_> sigh, this is happening reliably now [11:12:30] PROBLEM - IPsec on mc2016 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1016_v4, mc1018_v4 [11:12:40] (03PS2) 10Volans: Depool es200{6,8} to migrate data to es201{5,6} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274671 (https://phabricator.wikimedia.org/T127330) [11:13:29] PROBLEM - IPsec on mc1006 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2006_v4 [11:14:19] PROBLEM - IPsec on mc1016 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2016_v4 [11:14:47] (03PS3) 10Volans: Depool es200{6,8} to migrate data to es201{5,6} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274671 (https://phabricator.wikimedia.org/T127330) [11:15:49] PROBLEM - IPsec on mc2006 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: 
mc1006_v4 [11:16:23] (03PS4) 10Volans: Depool es200{6,8} to migrate data to es201{5,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274671 (https://phabricator.wikimedia.org/T127330) [11:16:40] PROBLEM - IPsec on mc1007 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2007_v4 [11:17:15] (03CR) 10Jcrespo: [C: 031] Depool es200{6,8} to migrate data to es201{5,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274671 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [11:17:50] (03CR) 10Volans: [C: 032] Depool es200{6,8} to migrate data to es201{5,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274671 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [11:18:37] (03Merged) 10jenkins-bot: Depool es200{6,8} to migrate data to es201{5,7} [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274671 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [11:19:39] PROBLEM - IPsec on mc2007 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1007_v4 [11:20:41] (03PS7) 10Faidon Liambotis: Do not rewrite https -> http for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (https://phabricator.wikimedia.org/T122933) [11:21:10] PROBLEM - IPsec on mc1002 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2002_v4 [11:21:11] !log volans@tin Synchronized wmf-config/db-codfw.php: Depool es2005,es2008 to migrate data to es2015,es2017 T127330 (duration: 00m 53s) [11:21:12] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [11:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:22:02] <_joe_> I am looking at the ipsec issues [11:22:18] <_joe_> well, let's call them "alerts", I'm doubting it's a real issue [11:23:19] PROBLEM - IPsec on mc2002 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1002_v4 [11:23:58] PROBLEM - IPsec on mc1012 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2012_v4 
[11:25:39] PROBLEM - IPsec on mc2012 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1012_v4 [11:26:19] PROBLEM - IPsec on mc1009 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2009_v4 [11:28:28] PROBLEM - IPsec on mc2009 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1009_v4 [11:36:50] PROBLEM - IPsec on mc1005 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2005_v4 [11:37:48] PROBLEM - IPsec on mc1004 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2004_v4 [11:37:56] <_joe_> it is a problem indeed [11:39:09] PROBLEM - IPsec on mc2005 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1005_v4 [11:39:10] (03PS1) 10Mobrovac: Services: introduce service::packages [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) [11:39:24] (03CR) 10Tim Landscheidt: "@Dzahn: Ping." [puppet] - 10https://gerrit.wikimedia.org/r/270025 (owner: 10Tim Landscheidt) [11:39:39] RECOVERY - IPsec on mc1017 is OK: Strongswan OK - 1 ESP OK [11:40:19] PROBLEM - IPsec on mc2004 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1004_v4 [11:40:28] !log upgrading cp1008 to openssl 1.0.2g [11:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:41:00] (03PS2) 10Muehlenhoff: Define eventbus ferm service in the role [puppet] - 10https://gerrit.wikimedia.org/r/274389 [11:43:49] RECOVERY - IPsec on mc2001 is OK: Strongswan OK - 2 ESP OK [11:44:58] PROBLEM - IPsec on mc1010 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2010_v4 [11:46:02] (03PS1) 10Giuseppe Lavagetto: redis::multidc::instances: disable replication [puppet] - 10https://gerrit.wikimedia.org/r/274676 [11:46:10] PROBLEM - IPsec on mc1013 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2013_v4 [11:47:30] PROBLEM - IPsec on mc2010 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1010_v4 [11:48:12] 6Operations: Queires in Hue always return an empty result set - 
https://phabricator.wikimedia.org/T128039#2083854 (10MoritzMuehlenhoff) a:3Ottomata [11:48:19] PROBLEM - IPsec on mc2013 is CRITICAL: Strongswan CRITICAL - ok: 0 connecting: mc1013_v4 [11:48:57] (03CR) 10Mobrovac: "PCC run: https://puppet-compiler.wmflabs.org/1927/" [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [11:51:31] (03PS2) 10Giuseppe Lavagetto: redis::multidc::instances: disable replication [puppet] - 10https://gerrit.wikimedia.org/r/274676 [11:52:03] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "We don't want this traffic to go through unencrypted" [puppet] - 10https://gerrit.wikimedia.org/r/274676 (owner: 10Giuseppe Lavagetto) [11:52:59] !log Migrating data es2006->es2015 and es2008->es2017->es2019 T127330 [11:53:00] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [11:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:54:44] (03PS1) 10ArielGlenn: start a directory of tiny tools I use for cleanup, so they don't get lost [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/274677 [11:55:35] <_joe_> !log disabled notifications from the redises IPSEC checks while replication is disabled [11:55:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:57:09] (03CR) 10ArielGlenn: [C: 032] "might fold this into dumpadmin if I need it enough, but for now I needed it and wrote it today, so stashing it here to avoid losing it." [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/274677 (owner: 10ArielGlenn) [11:57:23] (03CR) 10ArielGlenn: [V: 032] start a directory of tiny tools I use for cleanup, so they don't get lost [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/274677 (owner: 10ArielGlenn) [12:00:13] (03PS4) 10Elukey: Remove rdb1003 from the job queue pool for maintenance. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/274411 (https://phabricator.wikimedia.org/T123675) [12:01:42] (03CR) 10Elukey: [C: 032] Remove rdb1003 from the job queue pool for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274411 (https://phabricator.wikimedia.org/T123675) (owner: 10Elukey) [12:01:51] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Apart from a minor comment looks good to me;" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [12:02:06] !log elastic1023.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [12:02:07] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [12:02:07] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [12:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:05:55] (03CR) 10Mobrovac: "@_joe_, do you see a case where multiple services would be installed on the same machines, but some need the dev pkgs and some don't? We u" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [12:06:56] (03CR) 10Mobrovac: "So, IMO, it's a really a per-realm question, not per-host." 
[puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [12:07:02] !log elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: Remove rdb1003 from the Redis JobQueue pool for maintenance (duration: 00m 32s) [12:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:08:20] ---^ hhvm/mediawiki didn't like it for the moment https://logstash.wikimedia.org/#/dashboard/elasticsearch/hhvm [12:08:45] (03PS1) 10ArielGlenn: write dumpruninfo per job, combining after each write [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/274678 [12:09:52] (03PS2) 10Mobrovac: Services: introduce service::packages [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) [12:10:01] _joe_ a lot of notices for Notice: JobQueueGroup::__destruct: 1 buffered job(s) of type(s) RefreshLinksJob never inserted. in /srv/mediawiki/php-1.27.0-wmf.15/includes/jobqueue/JobQueueGroup.php on line 421 [12:10:17] <_joe_> elukey: revert now [12:10:18] seem consistent [12:10:29] <_joe_> Warning: Cannot modify header information - headers already sent in /srv/mediawiki/php-1.27.0-wmf.14/includes/WebResponse.php on line 42 [12:10:36] (03PS1) 10Elukey: Revert "Remove rdb1003 from the job queue pool for maintenance." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/274679 [12:10:37] <_joe_> this has something to do with your change [12:10:41] (03CR) 10Mobrovac: Services: introduce service::packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [12:11:08] PROBLEM - wikidata.org dispatch lag is higher than 300s on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1138 bytes in 0.056 second response time [12:11:11] <_joe_> elukey: somehow your patch had some whitespace where it shouldn't have [12:11:18] (03CR) 10Elukey: [C: 032] Revert "Remove rdb1003 from the job queue pool for maintenance." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274679 (owner: 10Elukey) [12:11:24] doing it now [12:12:08] <_joe_> elukey: need help? [12:12:51] !log elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: Revert - Remove rdb1003 from the Redis JobQueue pool for maintenance (duration: 00m 28s) [12:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:13:08] _joe_ should be reverted now [12:13:14] <_joe_> elukey: next time, just remove the lines from the damn file :P [12:13:40] I knooowww I didn't want to make a mess, it took me 1 minute [12:13:42] <_joe_> (and revert your change afterwards) [12:14:07] so whitespaces?? 
[12:14:18] <_joe_> probably, I don't have time to look now, sorry [12:15:29] sure I'll try to take a look and see what was the major issue [12:16:31] (03CR) 10Giuseppe Lavagetto: "@mobrovac: agreed; I was just checking :)" [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [12:16:50] RECOVERY - wikidata.org dispatch lag is higher than 300s on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1657 bytes in 0.152 second response time [12:17:07] <_joe_> elukey: this ^^ was a consequence too [12:17:34] <_joe_> which might be more serious than just your error and is related to the error you linked earlier [12:17:40] PROBLEM - Swift HTTP frontend on ms-fe1004 is CRITICAL: Connection refused [12:17:48] PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: Connection refused [12:17:55] <_joe_> elukey: describe what you saw on the ticket AND ping aaron [12:17:55] that's me ^ not downtimed heh [12:18:06] <_joe_> godog: ah ok I was like "WTF" [12:18:29] _joe_ yep I am going to do it, collecting info from logstash [12:18:31] heheh [12:19:22] (03CR) 10Giuseppe Lavagetto: [C: 031] Services: introduce service::packages [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [12:21:28] RECOVERY - Swift HTTP frontend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.010 second response time [12:21:29] RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.011 second response time [12:22:29] (03PS4) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [12:22:45] !log temporary repool ms-fe1004, apply https://gerrit.wikimedia.org/r/#/c/273431 to test T128081 [12:22:45] T128081: UnicodeDecodeError invalid continuation byte on ms-fe1004 - https://phabricator.wikimedia.org/T128081 [12:22:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:23:14] (03PS1) 10Jcrespo: [WIP]New check for pt-heartbeat-wikimedia including the shards [puppet] - 10https://gerrit.wikimedia.org/r/274680 [12:23:30] (03PS5) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/273254 (https://phabricator.wikimedia.org/T124444) [12:32:04] 6Operations, 6Services, 10hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2084031 (10mobrovac) Thank you @RobH for this detailed run-down of specs! It made me realise I made a mistake in the initial request. Namely, we will in fact need to have the spe... [12:37:36] 6Operations, 10Traffic, 6WMF-Communications, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2084041 (10Florian) I would also like to get notified about the progress, but I'm probably the last person to think about t... [12:38:30] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2029006 (10mark) @RobH: why are we buying a single server here instead of topping up our spare pool if (apparently) we're ru... 
[12:40:48] 6Operations: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2084049 (10elukey) [12:41:30] 6Operations: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2084062 (10elukey) [12:41:43] 6Operations: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2084049 (10elukey) p:5Triage>3High [12:42:58] (03PS1) 10ArielGlenn: add facter fact to pull out nginx version [puppet/nginx] - 10https://gerrit.wikimedia.org/r/274683 [12:45:35] (03PS3) 10Filippo Giunchedi: swift: return 400 on UnicodeDecodeErrors [puppet] - 10https://gerrit.wikimedia.org/r/273431 (https://phabricator.wikimedia.org/T128081) [12:47:34] (03PS4) 10Filippo Giunchedi: swift: return 400 on UnicodeDecodeErrors [puppet] - 10https://gerrit.wikimedia.org/r/273431 (https://phabricator.wikimedia.org/T128081) [12:48:56] !log elastic1024.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [12:48:57] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. 
- https://phabricator.wikimedia.org/T109101 [12:48:57] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [12:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:53:50] 6Operations, 6Labs, 10Monitoring, 10Tool-Labs: Make icinga-wm report Tools homepage check at #wikimedia-labs, too - https://phabricator.wikimedia.org/T128716#2084074 (10Peachey88) [12:54:13] 6Operations, 6Labs, 10Monitoring, 10Tool-Labs: Add other Tools administrators to the Icinga notification group - https://phabricator.wikimedia.org/T128715#2084075 (10Peachey88) [12:59:58] (03PS1) 10Giuseppe Lavagetto: ferm: add global class for allowing ipsec traffic [puppet] - 10https://gerrit.wikimedia.org/r/274686 [13:00:06] <_joe_> bblack: ^^ [13:00:13] <_joe_> no touching role::ipsec [13:04:03] (03PS3) 10Muehlenhoff: Define eventbus ferm service in the role [puppet] - 10https://gerrit.wikimedia.org/r/274389 [13:04:57] (03CR) 10Muehlenhoff: [C: 032 V: 032] Define eventbus ferm service in the role [puppet] - 10https://gerrit.wikimedia.org/r/274389 (owner: 10Muehlenhoff) [13:07:43] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/1929/" [puppet] - 10https://gerrit.wikimedia.org/r/274686 (owner: 10Giuseppe Lavagetto) [13:15:19] (03PS1) 10Muehlenhoff: Initial role for 2fa bastion host [puppet] - 10https://gerrit.wikimedia.org/r/274692 [13:19:10] (03CR) 10BBlack: [C: 031] ferm: add global class for allowing ipsec traffic [puppet] - 10https://gerrit.wikimedia.org/r/274686 (owner: 10Giuseppe Lavagetto) [13:21:04] _joe_: ^ [13:23:11] (03PS2) 10BBlack: Allow maps for test and test2 [puppet] - 10https://gerrit.wikimedia.org/r/274475 (owner: 10Yurik) [13:23:33] (03CR) 10BBlack: [C: 032] Allow maps for test and test2 [puppet] - 10https://gerrit.wikimedia.org/r/274475 (owner: 10Yurik) [13:23:41] (03CR) 10BBlack: [V: 032] Allow maps for test and test2 [puppet] - 10https://gerrit.wikimedia.org/r/274475 (owner: 10Yurik) [13:24:49] 
(03CR) 10BBlack: "Won't this break when it runs before the nginx package is installed on first puppet run?" [puppet/nginx] - 10https://gerrit.wikimedia.org/r/274683 (owner: 10ArielGlenn) [13:25:18] (03CR) 10BBlack: "(also, can't we assume nginx >X for whatever the value of X is on at least precise, if not trusty?)" [puppet/nginx] - 10https://gerrit.wikimedia.org/r/274683 (owner: 10ArielGlenn) [13:25:42] <_joe_> bblack: yeah thanks, will merge later :) [13:30:23] (03PS5) 10BBlack: Add a cluster_be_recv_pre_purge handler & normalize paths [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [13:31:27] (03CR) 10BBlack: Add a cluster_be_recv_pre_purge handler & normalize paths (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [13:31:54] (03CR) 10BBlack: [C: 031] "Yeah, comments were due to rebase mistakes, it was late. Should be ok now." [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [13:32:51] (03PS6) 10BBlack: Add a cluster_be_recv_pre_purge handler & normalize paths [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [13:33:55] 6Operations, 10Analytics, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2084136 (10elukey) Summary after today's brainstorming with @ema: * We progressed a bit the remaining unclear points about the parse* functions, some... 
[13:34:41] 6Operations, 6Office-IT, 10Traffic, 10netops: Office network using monkeybrains.net instead of connection to SFO pop site - https://phabricator.wikimedia.org/T128669#2084138 (10Peachey88) [13:44:49] (03PS7) 10BBlack: role::cache::instances: all backends get all backends [puppet] - 10https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) [13:44:51] (03PS7) 10BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - 10https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) [13:44:53] (03PS7) 10BBlack: VCL: cache_route_table replaces site_tier [puppet] - 10https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) [13:44:56] (03PS7) 10BBlack: role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) [13:44:57] (03PS1) 10BBlack: text VCL: re-refactor backend selection [puppet] - 10https://gerrit.wikimedia.org/r/274695 (https://phabricator.wikimedia.org/T127481) [13:47:18] !log elastic1025.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [13:47:19] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [13:47:19] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [13:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:51:38] (03PS1) 10Muehlenhoff: Add systemd unit for logstash [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) [13:52:08] (03CR) 10Muehlenhoff: [C: 04-2] "Totally untested for now" [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) (owner: 10Muehlenhoff) [13:55:54] (03CR) 10Mobrovac: [C: 031] "Looking good! Thnx a lot Brandon for your efforts!" 
[puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [13:59:21] (03PS3) 10ArielGlenn: dumps mirroring tool, don't assume dest is local filesystem [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/270001 [14:00:04] (03PS4) 10ArielGlenn: dumps mirroring tool, don't assume dest is local filesystem [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/270001 [14:00:50] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps mirroring tool, don't assume dest is local filesystem [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/270001 (owner: 10ArielGlenn) [14:09:30] (03PS1) 10ArielGlenn: rsync dumps and other datasets to labs without using nfs [puppet] - 10https://gerrit.wikimedia.org/r/274701 [14:10:41] \o/ [14:11:03] the words "without" and "NFS" together always improve my happiness [14:11:47] (03CR) 10jenkins-bot: [V: 04-1] rsync dumps and other datasets to labs without using nfs [puppet] - 10https://gerrit.wikimedia.org/r/274701 (owner: 10ArielGlenn) [14:12:03] unless there is a "we couldn't leave" before :-P [14:13:48] (03CR) 10BBlack: [C: 032] Add a cluster_be_recv_pre_purge handler & normalize paths [puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [14:16:16] (03PS2) 10BBlack: text VCL: re-refactor backend selection [puppet] - 10https://gerrit.wikimedia.org/r/274695 (https://phabricator.wikimedia.org/T127481) [14:16:18] (03PS8) 10BBlack: role::cache::instances: all backends get all backends [puppet] - 10https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) [14:16:20] (03PS8) 10BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - 10https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) [14:16:22] (03PS8) 10BBlack: VCL: cache_route_table replaces site_tier [puppet] - 10https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) [14:16:24] (03PS8) 
10BBlack: role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) [14:16:47] (03CR) 10BBlack: [C: 032 V: 032] text VCL: re-refactor backend selection [puppet] - 10https://gerrit.wikimedia.org/r/274695 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [14:17:35] (03PS5) 10Filippo Giunchedi: swift: return 400 on UnicodeDecodeErrors [puppet] - 10https://gerrit.wikimedia.org/r/273431 (https://phabricator.wikimedia.org/T128081) [14:17:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: return 400 on UnicodeDecodeErrors [puppet] - 10https://gerrit.wikimedia.org/r/273431 (https://phabricator.wikimedia.org/T128081) (owner: 10Filippo Giunchedi) [14:18:31] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: VCL source-DC switching: add "direct" capability - https://phabricator.wikimedia.org/T127483#2084248 (10BBlack) [14:18:33] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Refactor VCL for 3-tier capabilities and source-DC switching - https://phabricator.wikimedia.org/T127481#2084247 (10BBlack) [14:19:17] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: VCL source-DC switching: add "direct" capability - https://phabricator.wikimedia.org/T127483#2044873 (10BBlack) [14:19:19] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Refactor VCL for 3-tier capabilities and source-DC switching - https://phabricator.wikimedia.org/T127481#2044829 (10BBlack) [14:19:55] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable VCL source-DC switching via confd - https://phabricator.wikimedia.org/T127482#2084252 (10BBlack) [14:19:57] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Refactor VCL for 3-tier capabilities and source-DC switching - https://phabricator.wikimedia.org/T127481#2044829 (10BBlack) 
[14:20:17] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable VCL source-DC switching via confd - https://phabricator.wikimedia.org/T127482#2044847 (10BBlack) [14:20:19] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Refactor VCL for 3-tier capabilities and source-DC switching - https://phabricator.wikimedia.org/T127481#2044829 (10BBlack) [14:20:51] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Refactor VCL for applayer datacenter-switching - https://phabricator.wikimedia.org/T127484#2084262 (10BBlack) [14:20:53] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable VCL applayer datacenter-switch via confd - https://phabricator.wikimedia.org/T127485#2084263 (10BBlack) [14:21:07] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Refactor VCL for applayer datacenter-switching - https://phabricator.wikimedia.org/T127484#2044926 (10BBlack) [14:21:09] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable VCL applayer datacenter-switch via confd - https://phabricator.wikimedia.org/T127485#2044954 (10BBlack) [14:21:55] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable VCL source-DC switching via confd - https://phabricator.wikimedia.org/T127482#2084268 (10BBlack) [14:21:57] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switch ulsfo to backend to codfw rather than eqiad - https://phabricator.wikimedia.org/T127492#2084267 (10BBlack) [14:30:22] (03PS1) 10ArielGlenn: dumps rsync to a mirror: remove local fs related cruft [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/274708 [14:30:31] 6Operations: Reinstall bast1001 with jessie - https://phabricator.wikimedia.org/T123721#2084290 (10MoritzMuehlenhoff) a:5MoritzMuehlenhoff>3None [14:32:49] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps rsync to a mirror: remove local fs related cruft [dumps] (ariel) - 
10https://gerrit.wikimedia.org/r/274708 (owner: 10ArielGlenn) [14:34:38] !log elastic1026.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [14:34:40] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [14:34:40] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [14:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:24] (03PS1) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [14:37:42] (03PS1) 10ArielGlenn: dumps rsync to mirrors: pep8 fixes [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/274712 [14:39:26] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps rsync to mirrors: pep8 fixes [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/274712 (owner: 10ArielGlenn) [14:40:00] (03PS2) 10ArielGlenn: rsync dumps and other datasets to labs without using nfs [puppet] - 10https://gerrit.wikimedia.org/r/274701 [14:40:17] (03PS2) 10Giuseppe Lavagetto: ferm: add global class for allowing ipsec traffic [puppet] - 10https://gerrit.wikimedia.org/r/274686 [14:41:29] (03CR) 10Giuseppe Lavagetto: [C: 032] ferm: add global class for allowing ipsec traffic [puppet] - 10https://gerrit.wikimedia.org/r/274686 (owner: 10Giuseppe Lavagetto) [14:44:52] (03CR) 10ArielGlenn: "Hm, yes it will break. How can I work around that?" 
[puppet/nginx] - 10https://gerrit.wikimedia.org/r/274683 (owner: 10ArielGlenn) [14:45:18] (03PS3) 10ArielGlenn: rsync dumps and other datasets to labs without using nfs [puppet] - 10https://gerrit.wikimedia.org/r/274701 [14:47:51] (03PS1) 10Muehlenhoff: Enable base::firewall on eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/274715 (https://phabricator.wikimedia.org/T113343) [14:49:47] 6Operations, 10Analytics, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2084318 (10Ottomata) > Would it be feasible to just reuse varnishncsa and pipe its output (apache like log format) to a piece of software that just par... [14:55:00] 6Operations, 10ops-codfw: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2084327 (10Papaul) The system is out of warranty, will open a task for procurement of new disk. [14:55:52] (03PS1) 10Filippo Giunchedi: graphite: switch carbon-c-relay to carbon_ch hash [puppet] - 10https://gerrit.wikimedia.org/r/274716 (https://phabricator.wikimedia.org/T105218) [14:56:19] 6Operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=7 dev=sdh failed - https://phabricator.wikimedia.org/T127824#2084331 (10Papaul) The system is out of warranty, will open a task for procurement of new disk. 
[14:56:31] 6Operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=7 dev=sdh failed - https://phabricator.wikimedia.org/T127824#2084332 (10Papaul) p:5Triage>3Normal [14:56:58] (03PS9) 10BBlack: role::cache::instances: all backends get all backends [puppet] - 10https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) [14:57:00] (03PS9) 10BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - 10https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) [14:57:02] (03PS9) 10BBlack: VCL: cache_route_table replaces site_tier [puppet] - 10https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) [14:57:04] (03PS9) 10BBlack: role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) [14:57:06] (03PS1) 10BBlack: cache_maps: remove probe [puppet] - 10https://gerrit.wikimedia.org/r/274718 (https://phabricator.wikimedia.org/T127481) [14:57:14] (03PS1) 10Giuseppe Lavagetto: Revert "redis::multidc::instances: disable replication" [puppet] - 10https://gerrit.wikimedia.org/r/274719 [14:57:20] (03PS2) 10Giuseppe Lavagetto: Revert "redis::multidc::instances: disable replication" [puppet] - 10https://gerrit.wikimedia.org/r/274719 [14:57:24] 6Operations, 10ops-codfw: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2084336 (10Papaul) p:5Triage>3Normal [14:57:48] 6Operations, 10ops-codfw: es2009 degraded RAID - https://phabricator.wikimedia.org/T125442#2084337 (10Papaul) p:5Triage>3Low [14:57:57] (03CR) 10BBlack: [C: 032] cache_maps: remove probe [puppet] - 10https://gerrit.wikimedia.org/r/274718 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [14:58:05] (03CR) 10BBlack: [V: 032] cache_maps: remove probe [puppet] - 10https://gerrit.wikimedia.org/r/274718 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [14:58:07] 6Operations, 10ops-codfw, 
10DBA, 13Patch-For-Review: es2010 controller issue - https://phabricator.wikimedia.org/T127769#2084338 (10Papaul) p:5Triage>3Low [14:58:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Revert "redis::multidc::instances: disable replication" [puppet] - 10https://gerrit.wikimedia.org/r/274719 (owner: 10Giuseppe Lavagetto) [14:58:32] (03PS3) 10Giuseppe Lavagetto: Revert "redis::multidc::instances: disable replication" [puppet] - 10https://gerrit.wikimedia.org/r/274719 [14:58:39] <_joe_> grrr [14:58:47] (03PS2) 10Filippo Giunchedi: graphite: switch carbon-c-relay to carbon_ch hash [puppet] - 10https://gerrit.wikimedia.org/r/274716 (https://phabricator.wikimedia.org/T105218) [14:58:55] (03CR) 10Giuseppe Lavagetto: [V: 032] Revert "redis::multidc::instances: disable replication" [puppet] - 10https://gerrit.wikimedia.org/r/274719 (owner: 10Giuseppe Lavagetto) [14:59:41] (03PS1) 10Joal: Add oozie-ste.xml configuration to handle SLAs [puppet/cdh] - 10https://gerrit.wikimedia.org/r/274720 [15:00:15] (03PS2) 10Jcrespo: [WIP]New check for pt-heartbeat-wikimedia including the shards [puppet] - 10https://gerrit.wikimedia.org/r/274680 [15:03:17] !log uploaded openssl 1.0.2g for jessie-wikimedia [15:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:04:04] (03PS3) 10Filippo Giunchedi: graphite: switch carbon-c-relay to carbon_ch hash [puppet] - 10https://gerrit.wikimedia.org/r/274716 (https://phabricator.wikimedia.org/T105218) [15:06:15] 6Operations, 10Analytics, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2084349 (10elukey) @Ottomata the code is really optimized as you were saying to map VSL structures to log tags in apache format, but it is essentiall... 
[15:06:36] (03PS1) 10Filippo Giunchedi: swift: log full request on exception in rewrite.py [puppet] - 10https://gerrit.wikimedia.org/r/274721 [15:06:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift: log full request on exception in rewrite.py [puppet] - 10https://gerrit.wikimedia.org/r/274721 (owner: 10Filippo Giunchedi) [15:06:59] (03PS3) 10Jcrespo: [WIP]New check for pt-heartbeat-wikimedia including the shards [puppet] - 10https://gerrit.wikimedia.org/r/274680 [15:08:29] (03PS2) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [15:12:49] !log cassandra throttle 1001, 1002, 1007-a, 1007-b, and 1010-a to 30mbps T95253 [15:12:50] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [15:12:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:15] (03PS10) 10BBlack: role::cache::instances: all backends get all backends [puppet] - 10https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) [15:13:17] (03PS10) 10BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - 10https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) [15:13:19] (03PS10) 10BBlack: VCL: cache_route_table replaces site_tier [puppet] - 10https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) [15:13:21] (03PS10) 10BBlack: role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) [15:14:24] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1010-b instance [puppet] - 10https://gerrit.wikimedia.org/r/274724 (https://phabricator.wikimedia.org/T128107) [15:15:44] urandom: ^ [15:15:48] (03PS1) 10Jcrespo: Add the shard name when checking lag (for pt-heartbeat) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274726 
[15:17:00] (03CR) 10Jcrespo: [C: 032] Add the shard name when checking lag (for pt-heartbeat) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/274726 (owner: 10Jcrespo) [15:17:19] godog: \o/ [15:17:27] 6Operations, 13Patch-For-Review: Decom/reclaim berkelium/curium - https://phabricator.wikimedia.org/T125962#2084373 (10Cmjohnson) p:5Triage>3Normal These servers were part of the original varnish for eqiad. They are set for decommission and removal. The ssds will be removed and destroyed. [15:17:52] (03CR) 10Eevans: [C: 031] cassandra: add restbase1010-b instance [puppet] - 10https://gerrit.wikimedia.org/r/274724 (https://phabricator.wikimedia.org/T128107) (owner: 10Filippo Giunchedi) [15:17:55] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1010-b instance [puppet] - 10https://gerrit.wikimedia.org/r/274724 (https://phabricator.wikimedia.org/T128107) (owner: 10Filippo Giunchedi) [15:18:01] (03PS4) 10Jcrespo: New check for pt-heartbeat-wikimedia including the shards [puppet] - 10https://gerrit.wikimedia.org/r/274680 [15:18:23] godog: you going to throttle the other nodes in the rack? do you want me to? [15:18:56] urandom: yeah I did that beforehand but will keep an eye [15:19:04] godog: great; thanks! [15:19:18] godog: we should definitely raise it to see what happens [15:19:37] moar is better (if it doesn't impact) [15:19:52] (03PS1) 10Muehlenhoff: Provide new meta package for Linux 4.4 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/274727 [15:20:24] jynus: I (finally) documented one of the labs db things that I’ve muttered to you about. Comment at your convenience. https://phabricator.wikimedia.org/T128737 [15:22:21] no need for downtime [15:22:49] well, maybe a bit, if the client is as slow to failover as mediawiki [15:23:27] I can do some of it, but I will need you for questions, like "space", etc. [15:23:39] jynus: yeah, the downtime is because the labs resolvers are stupid about choosing between the two servers. 
[15:24:05] ‘space’ meaning, ‘will mysql fit on those servers’? [15:24:38] I mean, I can blindly apply the role to a random server, but I want you to understand what will happen and approve it [15:25:28] I know my "physical" servers. not sure what it is there ( I assume it will have to share memory and other resources) [15:25:30] ah, yes, that’s sensible :) [15:25:46] "how much memory is ok to use", etc [15:32:06] (03PS11) 10BBlack: role::cache::instances: all backends get all backends [puppet] - 10https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) [15:32:08] (03PS11) 10BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - 10https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) [15:32:10] (03PS11) 10BBlack: VCL: cache_route_table replaces site_tier [puppet] - 10https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) [15:32:12] (03PS11) 10BBlack: role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) [15:32:14] (03PS1) 10BBlack: v::c::directors: filter keyspaces on dynamic==yes [puppet] - 10https://gerrit.wikimedia.org/r/274729 (https://phabricator.wikimedia.org/T127481) [15:33:04] (03PS1) 10ArielGlenn: turn on checkpointing for eswiki, itwiki, ruwiki, wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/274730 [15:36:22] (03CR) 10BBlack: [C: 032] v::c::directors: filter keyspaces on dynamic==yes [puppet] - 10https://gerrit.wikimedia.org/r/274729 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [15:36:39] PROBLEM - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: Connection refused [15:36:41] (03PS2) 10ArielGlenn: turn on checkpointing for eswiki, itwiki, ruwiki, wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/274730 [15:36:43] (03CR) 10BBlack: [C: 032] role::cache::instances: all backends get all backends [puppet] - 
10https://gerrit.wikimedia.org/r/274600 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [15:37:23] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [15:37:23] (03CR) 10Physikerwelt: [C: 031] Services: introduce service::packages [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [15:37:27] 6Operations, 10Analytics, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2084445 (10Ottomata) > the code is really optimized as you were saying to map VSL structures to log tags in apache format, but it is essentially what n... [15:38:09] (03PS3) 10ArielGlenn: turn on checkpointing for eswiki, itwiki, ruwiki, wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/274730 [15:40:50] yay for compiler-tests that still fail on real hosts :P [15:42:04] (03PS2) 10Andrew Bogott: Re-enable instance manipulation in horizon. [puppet] - 10https://gerrit.wikimedia.org/r/274440 [15:42:43] !log elastic1027.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [15:42:44] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. 
- https://phabricator.wikimedia.org/T109101 [15:42:44] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [15:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:47] (03CR) 10ArielGlenn: [C: 032] turn on checkpointing for eswiki, itwiki, ruwiki, wikidatawiki [puppet] - 10https://gerrit.wikimedia.org/r/274730 (owner: 10ArielGlenn) [15:43:01] (03CR) 10BryanDavis: Add systemd unit for logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) (owner: 10Muehlenhoff) [15:43:24] jouncebot: help [15:43:40] jouncebot: next [15:43:40] In 0 hour(s) and 16 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160303T1600) [15:44:19] (03PS3) 10Andrew Bogott: Re-enable instance manipulation in horizon. [puppet] - 10https://gerrit.wikimedia.org/r/274440 [15:46:13] (03CR) 10Andrew Bogott: [C: 032] Re-enable instance manipulation in horizon. [puppet] - 10https://gerrit.wikimedia.org/r/274440 (owner: 10Andrew Bogott) [15:48:41] I'm looking for a review of https://gerrit.wikimedia.org/r/#/c/274382/4 (refactoring of exposing puppet certificates). I'd like to get it out of the way as soon as possible... [15:48:57] Anyone interested? [15:49:59] <_joe_> gehel: what do you offer in exchange? [15:50:11] _joe_: chocolate ? [15:50:19] <_joe_> gehel: cool, swiss chocolate [15:50:23] <_joe_> you bought me out [15:50:27] I'm in :) [15:50:33] Send me your address and I'll mail it ... [15:50:56] There is a rule in gmail against chocolate as attachment [15:51:37] volans: you already have an intrinsic interest in having this patch merged, no chocolate for you this time! [15:51:51] I tried... :( [15:52:26] volans: you'll find another occasion... 
[15:56:19] (03PS1) 10BBlack: wmflib: fix hash_(de)select_re for ruby 1.8 compat [puppet] - 10https://gerrit.wikimedia.org/r/274738 (https://phabricator.wikimedia.org/T127481) [15:57:42] (03CR) 10BBlack: [C: 032 V: 032] wmflib: fix hash_(de)select_re for ruby 1.8 compat [puppet] - 10https://gerrit.wikimedia.org/r/274738 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [15:58:36] 6Operations, 10Analytics, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2084490 (10faidon) >>! In T124278#2084445, @Ottomata wrote: >> the code is really optimized as you were saying to map VSL structures to log tags in apa... [15:58:52] 6Operations, 10Analytics, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2084491 (10Ottomata) Ja! Am happy to help too! I’ve poked around in that code a bit too, so I might be able to offer some help. [15:58:58] (03PS2) 10Jcrespo: Enable the new pt-heartbeat on core production hosts [puppet] - 10https://gerrit.wikimedia.org/r/274670 (https://phabricator.wikimedia.org/T114752) [16:00:00] 6Operations, 10media-storage: Unable to delete, restore/undelete, move or upload new versions of files on several wikis ("inconsistent state within the internal storage backends") - https://phabricator.wikimedia.org/T128096#2084492 (10fgiunchedi) replication of `-deleted` containers has almost finished, though... [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160303T1600). Please do the needful. [16:00:04] dcausse andrewbogott: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:21] o/ [16:00:25] if nobody minds I'd like to take over this swat [16:00:36] greg-g: ^^? 
[16:01:40] jzerebecki: go for it! [16:01:46] (03PS1) 10Andrew Bogott: Disable 'shelve instance' in the horizon webui. [puppet] - 10https://gerrit.wikimedia.org/r/274739 [16:01:56] * andrewbogott is here [16:01:57] (03CR) 10JanZerebecki: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272963 (https://phabricator.wikimedia.org/T127943) (owner: 10DCausse) [16:02:16] (03CR) 10JanZerebecki: "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274457 (owner: 10Andrew Bogott) [16:02:25] (03CR) 10JanZerebecki: [C: 032] Change the wikitech favicon and logo to the actual wikitech logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274457 (owner: 10Andrew Bogott) [16:02:43] (03Merged) 10jenkins-bot: Enable 'popqual' (quality+pageviews) scoring method for the completion suggester [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272963 (https://phabricator.wikimedia.org/T127943) (owner: 10DCausse) [16:03:07] (03Merged) 10jenkins-bot: Change the wikitech favicon and logo to the actual wikitech logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274457 (owner: 10Andrew Bogott) [16:03:11] (03CR) 10Andrew Bogott: [C: 032] Disable 'shelve instance' in the horizon webui. [puppet] - 10https://gerrit.wikimedia.org/r/274739 (owner: 10Andrew Bogott) [16:03:46] (03PS1) 10BBlack: v::c::directors: fix select compat for ruby 1.8 here as well... [puppet] - 10https://gerrit.wikimedia.org/r/274740 (https://phabricator.wikimedia.org/T127481) [16:04:07] (03PS2) 10BBlack: v::c::directors: fix select compat for ruby 1.8 here as well... [puppet] - 10https://gerrit.wikimedia.org/r/274740 (https://phabricator.wikimedia.org/T127481) [16:04:38] (03PS1) 10Krinkle: gerrit: Whitelist PNG as safe to render in change sets [puppet] - 10https://gerrit.wikimedia.org/r/274741 [16:04:47] (03CR) 10Volans: [C: 031] "LGTM, as discussed I still don't like the coupling with the master role but given our current tooling is acceptable. 
We just need to docum" [puppet] - 10https://gerrit.wikimedia.org/r/274670 (https://phabricator.wikimedia.org/T114752) (owner: 10Jcrespo) [16:05:05] (03PS5) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [16:05:44] (03PS3) 10Jcrespo: Enable the new pt-heartbeat on core production hosts [puppet] - 10https://gerrit.wikimedia.org/r/274670 (https://phabricator.wikimedia.org/T114752) [16:06:06] (03CR) 10BBlack: [C: 032] v::c::directors: fix select compat for ruby 1.8 here as well... [puppet] - 10https://gerrit.wikimedia.org/r/274740 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [16:06:24] !log jzerebecki@tin Synchronized wmf-config/CirrusSearch-production.php: CirrusSearch: Enable popqual (quality+pageviews) scoring method for the completion suggester T127943 (duration: 00m 37s) [16:06:25] T127943: Enable pageviews scoring for completion suggester in prod - https://phabricator.wikimedia.org/T127943 [16:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:33] dcausse: please check [16:06:33] (03PS4) 10Jcrespo: Enable the new pt-heartbeat on core production hosts [puppet] - 10https://gerrit.wikimedia.org/r/274670 (https://phabricator.wikimedia.org/T114752) [16:07:31] jzerebecki: sounds good, I can see the config change (the code that uses it will run on tuesday), thanks! [16:07:40] yw [16:07:59] jzerebecki: is my change merged and sync’d as well? [16:08:13] andrewbogott: syncing... 
[16:08:19] ‘k [16:08:27] !log jzerebecki@tin Synchronized w/static/favicon/wikitech.ico: Change the wikitech favicon and logo to the actual wikitech logo a29196d359b9924719b9166dca98a474ad9a6a2b 1 of 2 (duration: 00m 30s) [16:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:08:35] (03PS6) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [16:09:11] !log jzerebecki@tin Synchronized w/static/images/project-logos/wikitech.png: Change the wikitech favicon and logo to the actual wikitech logo a29196d359b9924719b9166dca98a474ad9a6a2b 2 of 2 (duration: 00m 29s) [16:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:20] andrewbogott: please check [16:09:22] expect possible a few minor cp* CRITICAL for puppet compile fail. it's a race condition again, but hopefully double-puppet-agent in the salt command minimizes it [16:09:49] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: puppet fail [16:09:59] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: puppet fail [16:10:09] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Puppet has 1 failures [16:10:20] jzerebecki: it half worked, it must need a followup [16:10:29] but, not urgent, nothing is broken in the meantime [16:10:30] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures [16:10:36] !log downtime on all mariadb replication lag checks in preparation to changing its check [16:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:02] andrewbogott: do you want to do it this swat? 
[16:11:11] jzerebecki: um… sure, let me look [16:11:19] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: puppet fail [16:11:50] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:11:59] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:12:10] (03CR) 10Volans: [C: 031] "Logic looks ok but don't trust my perl reviews." [puppet] - 10https://gerrit.wikimedia.org/r/274680 (owner: 10Jcrespo) [16:12:20] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Puppet has 1 failures [16:12:28] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:12:41] jzerebecki: in InitialiseSettings.php, the favicon section... [16:12:49] the keys in that table are db names, right? [16:12:59] (03PS5) 10Jcrespo: Enable the new pt-heartbeat on core production hosts [puppet] - 10https://gerrit.wikimedia.org/r/274670 (https://phabricator.wikimedia.org/T114752) [16:13:05] I ask because there’s a wikitech-> entry [16:13:09] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: puppet fail [16:13:09] RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:13:11] but the dbname for wikitech is ‘labswiki' [16:14:10] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:14:38] (03PS1) 10Andrew Bogott: The wikitech db is called 'labswiki,' not 'wikitech' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274743 [16:14:46] jzerebecki: in which case ^ would be correct [16:15:08] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures [16:15:34] (03CR) 10Jcrespo: [C: 032] Enable the new pt-heartbeat on core production hosts [puppet] - 10https://gerrit.wikimedia.org/r/274670 (https://phabricator.wikimedia.org/T114752) (owner: 
10Jcrespo) [16:15:42] andrewbogott, wikitech.dblist should include labswiki though [16:15:52] Krenair: ah, ok [16:15:59] so maybe we’re already done and the favicon is just cached? [16:16:18] Krenair: context is https://gerrit.wikimedia.org/r/#/c/274457/ [16:16:28] did you purge the file jzerebecki? [16:16:43] Krenair: only file-sync [16:16:47] okay [16:16:56] when you change static files, you should purge them from varnish [16:16:59] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:17:23] jzerebecki, andrewbogott: although actually.. wikitech isn't behind varnish [16:17:28] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: puppet fail [16:17:38] yeah [16:17:55] Krenair: can you do some kind of firebug/inspection magic to see what the filename is of the favicon that wikitech serves up? [16:18:00] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [16:18:33] !log restbase1001 bump stream throughput to 60mbps on restbase1001 [16:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:18:38] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures [16:19:18] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [16:19:33] hm, I tried to check w/safari but does safari even use favicons? [16:19:37] the source contains [16:19:43] which shows the new one [16:19:51] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:19:55] aha [16:19:57] now it's started showing [16:19:58] jzerebecki, Krenair, confirmed, with a new browser the icon is correct [16:20:04] so, we’re done [16:20:07] thank you jzerebecki! 
[16:20:28] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:20:29] (03Abandoned) 10Andrew Bogott: The wikitech db is called 'labswiki,' not 'wikitech' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274743 (owner: 10Andrew Bogott) [16:20:39] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Puppet has 1 failures [16:20:49] yw. SWAT done. [16:21:19] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 1 failures [16:21:21] man, that logo does not really work as a favicon [16:21:24] well, maybe it will grow on me [16:22:38] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 1 failures [16:22:38] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:22:39] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 1 failures [16:22:44] unicorns are way cooler [16:22:49] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures [16:23:09] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:23:35] +1 unicorns [16:24:29] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [16:24:38] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:24:48] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:25:16] !log krinkle@terbium mwscript deleteEqualMessages.php --wiki frwiki [16:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:48] * mafk likes deleteEqualMessages.php [16:25:50] !log elastic1028.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [16:25:52] T109101: Currently elasticsearch logs do not leave nodes. 
We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [16:25:52] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [16:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:27:54] (03PS1) 10BBlack: r::c::instances - hack around cache_maps eqiad-only [puppet] - 10https://gerrit.wikimedia.org/r/274746 (https://phabricator.wikimedia.org/T109162) [16:28:29] (03CR) 10JanZerebecki: [C: 031] gerrit: Whitelist PNG as safe to render in change sets [puppet] - 10https://gerrit.wikimedia.org/r/274741 (owner: 10Krinkle) [16:29:37] (03CR) 10BBlack: [C: 032] r::c::instances - hack around cache_maps eqiad-only [puppet] - 10https://gerrit.wikimedia.org/r/274746 (https://phabricator.wikimedia.org/T109162) (owner: 10BBlack) [16:30:59] (03PS7) 10Alex Monk: labs dnsrecursor IP aliasing: work on all projects, not just some arbitrary ones [puppet] - 10https://gerrit.wikimedia.org/r/268921 [16:32:13] (03PS12) 10BBlack: VCL: remove tier-one conditionals on be applayer backend code [puppet] - 10https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) [16:32:15] (03PS12) 10BBlack: VCL: cache_route_table replaces site_tier [puppet] - 10https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) [16:32:17] (03PS12) 10BBlack: role::cache::instances: move cache route table to hiera [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) [16:32:47] RECOVERY - puppet last run on cp1044 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:32:50] jouncebot, next [16:32:50] In 0 hour(s) and 27 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160303T1700) [16:32:58] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:43:54] (03CR) 10BBlack: [C: 032] 
"compiler-tested: functional no-op at all clusters/tiers" [puppet] - 10https://gerrit.wikimedia.org/r/274601 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [16:44:14] (03CR) 10BBlack: [C: 032] "compiler-tested: functional no-op at all clusters/tiers" [puppet] - 10https://gerrit.wikimedia.org/r/274602 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [16:44:36] (03CR) 10BBlack: [C: 032] "compiler-tested: functional no-op at all clusters/tiers" [puppet] - 10https://gerrit.wikimedia.org/r/274603 (https://phabricator.wikimedia.org/T127481) (owner: 10BBlack) [16:47:26] 6Operations: Queries in Hue always return an empty result set - https://phabricator.wikimedia.org/T128039#2084660 (10Aklapper) [16:50:38] wikibugs: ping [16:50:51] <_joe_> ostriches: still importing in phab? [16:50:58] <_joe_> it's mostly unavailable now [16:51:05] Final batch. [16:51:11] It should be done in the next ~10m [16:51:14] <_joe_> ok [16:51:19] <_joe_> just letting you know [16:53:01] <_joe_> ostriches: also, please log when you're doing so, I was pinged about phab and couldn't find any reference [16:53:33] it seems to work intermittently for me, reload reload -> win [16:53:59] F5 F5 F5 [16:55:48] PROBLEM - MySQL Replication Heartbeat on db1023 is CRITICAL: CRIT replication delay 418 seconds [16:56:36] Hi it seems that the redirections script in phabricator for gerrits redirection is not working. [16:56:42] Such as https://phabricator.wikimedia.org/r/revision/phabricator/extensions;502260881b2648a81815227f2df9899b074e9f77 [16:57:39] I got 99 problems today but redirectors ain't one [16:58:28] hahaha [16:58:37] now there's a parody that needs to be written [16:59:46] RECOVERY - MySQL Replication Heartbeat on db1023 is OK: OK replication delay -0 seconds [17:00:04] _joe_ jynus: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160303T1700). Please do the needful. 
[17:00:04] Krenair: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:38] Krenair, please wait a second while I check the replication status [17:01:45] <_joe_> jynus: let me know if you need me [17:02:23] it is ok, no outages, but I was changing the checks [17:02:32] and want to be sure there are no 200 pages [17:03:43] (03PS1) 10BryanDavis: OCG: limit log events sent to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/274753 [17:05:01] ok, let's see [17:06:14] (03CR) 10Ori.livneh: [C: 032] "sigh. It's a shame that you have to be the one dealing with this." [puppet] - 10https://gerrit.wikimedia.org/r/274753 (owner: 10BryanDavis) [17:07:36] (03PS5) 10Alex Monk: Add public-wiki-rewrites to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) [17:07:58] (03CR) 10jenkins-bot: [V: 04-1] Add public-wiki-rewrites to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) (owner: 10Alex Monk) [17:08:36] ugh [17:08:36] what [17:09:55] !log elastic1029.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [17:09:57] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. 
- https://phabricator.wikimedia.org/T109101 [17:09:57] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [17:09:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:03] (03CR) 10Volans: "I understand that the keep of the directory structure was done to not break the previous module that was using it (kerberos), but seems a " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [17:11:42] (03PS2) 10Dzahn: gerrit: Whitelist PNG as safe to render in change sets [puppet] - 10https://gerrit.wikimedia.org/r/274741 (owner: 10Krinkle) [17:11:58] (03PS6) 10Alex Monk: Add public-wiki-rewrites to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) [17:12:08] jynus: You'll be happy to know that the final 2 repositories have been indexed. Should be the last of those shenanigans. [17:12:24] why is gerrit's built in rebasing so broken? [17:12:51] you mean why does it whine when you can chery pick to HEAD just fine ? [17:12:53] dunno [17:13:17] Krenair: jgit. [17:13:18] ostriches, can you give a general look at krenair's patches, they are phabricator related? [17:13:55] jynus: Which patches are these? [17:14:10] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160303T1700 [17:14:16] (03CR) 10Gehel: "I'd prefer not to change the structure, unless yuvi or joe push for it (as I understand it, they are the original authors of the k8s modul" [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [17:14:18] we usually require the +1 from the owners [17:14:22] (03PS2) 10Alex Monk: phabricator: Add NE flag to old task creation URL redirect [puppet] - 10https://gerrit.wikimedia.org/r/272426 (https://phabricator.wikimedia.org/T127286) [17:14:31] jynus, uhh... you do? [17:14:35] who is' [17:14:42] who is 'the owners'? 
[17:15:15] well, I would say releng owns phab, at least the service, but maybe they disagree [17:15:16] (03CR) 10Chad: [C: 031] phabricator: Add NE flag to old task creation URL redirect [puppet] - 10https://gerrit.wikimedia.org/r/272426 (https://phabricator.wikimedia.org/T127286) (owner: 10Alex Monk) [17:15:40] (03CR) 10Chad: [C: 031] phabricator: Send weekly mail every week instead of on certain monthdays [puppet] - 10https://gerrit.wikimedia.org/r/274033 (owner: 10Alex Monk) [17:15:43] (03CR) 10Chad: [C: 031] gerrit: Whitelist PNG as safe to render in change sets [puppet] - 10https://gerrit.wikimedia.org/r/274741 (owner: 10Krinkle) [17:15:53] I will check the correctness, I do not know if the logic is wanted [17:16:02] they good [17:16:13] (03CR) 10Volans: "I just wanted to point out if we want to simplify it that it's surely simpler now that in the future." [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [17:16:33] 6Operations, 6Services, 10hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2084754 (10RobH) I've noted the correction (match SCA not SCB) and requested updated quotes linked off T128671. While all initial specification requests for hardware are on this... 
[17:17:12] (03CR) 10Dzahn: [C: 04-1] "in discussion with csteipp" [puppet] - 10https://gerrit.wikimedia.org/r/274741 (owner: 10Krinkle) [17:17:29] (03PS7) 10Alex Monk: Add public-wiki-rewrites to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) [17:17:31] (03PS2) 10Alex Monk: Merge labtestwikitech apache config with wikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/273411 [17:17:49] (03CR) 10Jcrespo: [C: 032] phabricator: Add NE flag to old task creation URL redirect [puppet] - 10https://gerrit.wikimedia.org/r/272426 (https://phabricator.wikimedia.org/T127286) (owner: 10Alex Monk) [17:17:57] ostriches: MediaWiki is not redirecting properly to the correct patch. Please see https://gerrit.wikimedia.org/r/#/c/274582/ [17:18:49] jynus, are you seriously suggesting that not only ops +2 is required, but a +1 from someone else is required to get something merged? [17:18:53] this is not okay [17:19:10] (03CR) 10jenkins-bot: [V: 04-1] Merge labtestwikitech apache config with wikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/273411 (owner: 10Alex Monk) [17:19:11] it's difficult enough to get people to merge stuff in this repository as it is [17:19:27] 6Operations, 6Services, 10hardware-requests: Hardware request for SCA and SCB in codfw - https://phabricator.wikimedia.org/T128475#2084761 (10faidon) Same as the terbium replacement concern here: these boxes aren't very special — can't we just expand our spares pool instead and allocate from there? [17:19:27] Krenair, please read https://wikitech.wikimedia.org/wiki/PuppetSWAT [17:19:28] Krenair: I'll totes +1 your stuff [17:19:40] (03PS4) 10ArielGlenn: rsync dumps and other datasets to labs without using nfs [puppet] - 10https://gerrit.wikimedia.org/r/274701 [17:20:17] jynus, I have read it [17:20:32] you do not agree with it? 
[17:20:45] I do agree with it [17:21:11] (03CR) 10ArielGlenn: [C: 032] rsync dumps and other datasets to labs without using nfs [puppet] - 10https://gerrit.wikimedia.org/r/274701 (owner: 10ArielGlenn) [17:21:37] you do not think it appropriate to ask people with phabricator knowledge if it's ok for phabricator patches? [17:21:57] I do think it is appropriate to ask for that [17:22:13] (03PS2) 10Alex Monk: phabricator: Send weekly mail every week instead of on certain monthdays [puppet] - 10https://gerrit.wikimedia.org/r/274033 [17:23:33] apache has been reloaded, you can test the first one [17:24:14] AaronSchulz: I'm not seeing reoccurrence of https://phabricator.wikimedia.org/T128096 since 12:06 UTC but let me know what you think re: my last update! [17:24:42] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable Redis cross-dc replication - https://phabricator.wikimedia.org/T126470#2084777 (10Joe) After introducing ferm rules specifically allowing ipsec, the ipsec connections seem to be way more stable. Sessions are now replicate... [17:24:52] (03Abandoned) 10Andrew Bogott: Add an additional puppet config to use with minimal runs. [puppet] - 10https://gerrit.wikimedia.org/r/221563 (owner: 10Andrew Bogott) [17:25:39] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable Redis cross-dc replication - https://phabricator.wikimedia.org/T126470#2084781 (10Joe) I will close this ticket as this was adding and testing that encrypted replication is indeed possible. 
[17:25:56] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Enable Redis cross-dc replication - https://phabricator.wikimedia.org/T126470#2084782 (10Joe) 5Open>3Resolved [17:25:58] 6Operations, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Check the redis (jobqueue) configuration in codfw - https://phabricator.wikimedia.org/T124672#2084783 (10Joe) [17:26:11] jynus, looks good [17:26:17] (03CR) 10Andrew Bogott: "I object to this but also don't understand it at all, so removing myself as reviewer to preserve brainspace" [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) (owner: 10Alex Monk) [17:26:21] (03PS3) 10Alex Monk: Merge labtestwikitech apache config with wikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/273411 [17:26:44] (03PS2) 10Dzahn: Initial role for 2fa bastion host [puppet] - 10https://gerrit.wikimedia.org/r/274692 (owner: 10Muehlenhoff) [17:26:56] (03CR) 10Dzahn: [C: 032] Initial role for 2fa bastion host [puppet] - 10https://gerrit.wikimedia.org/r/274692 (owner: 10Muehlenhoff) [17:27:06] (03CR) 10Alex Monk: "Andrew: Uhh... Details of your objection would be very much appreciated." [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) (owner: 10Alex Monk) [17:28:07] (03CR) 10Andrew Bogott: [C: 031] "This is great! Let me know what I can do to help test." [puppet] - 10https://gerrit.wikimedia.org/r/268921 (owner: 10Alex Monk) [17:28:48] (03CR) 10Dzahn: [C: 031] "recheck - can the jenkins please stand up" [puppet] - 10https://gerrit.wikimedia.org/r/274692 (owner: 10Muehlenhoff) [17:28:51] (03CR) 10Andrew Bogott: "Typo! I meant to say "I do not object to this" :)" [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) (owner: 10Alex Monk) [17:28:58] hey ostriches, does anything new need to be documented with regards to repo creation? [17:29:08] the phab import? 
[17:29:08] code review plot twist up there [17:29:17] are there jenkins issues? [17:29:29] (03CR) 10Gehel: "I definitely agree with you. The structure is overkill. I just have a case of "if it's not broken, don't fix it" and "done is better than " [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [17:29:33] jynus: feels like yes [17:30:05] jynus: nah, just a little delayed [17:30:20] Krenair: ready for me to merge https://gerrit.wikimedia.org/r/#/c/273411 ? [17:30:25] (03CR) 10Dzahn: [C: 032] Initial role for 2fa bastion host [puppet] - 10https://gerrit.wikimedia.org/r/274692 (owner: 10Muehlenhoff) [17:30:34] worked now [17:30:38] puppet compiler is failing to connect to gerrit, though [17:30:46] let me recheck [17:30:47] andrewbogott, once the parent commit is through, I guess so [17:31:02] andrewbogott, did you check it in puppet-compiler? [17:31:14] Krenair: does it actually depend on the public-rewrites thing or is that just an artifact of your local git? [17:31:15] 6Operations, 10ops-eqiad: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2084820 (10Cmjohnson) [17:31:19] ostriches: Ive reported it here https://phabricator.wikimedia.org/T128751 [17:31:23] be back later. gotta afk [17:31:24] Krenair: Yeah, it should be documented. [17:31:47] (03CR) 10Dduvall: [C: 04-1] "Something is off here but I'm not sure what exactly. (see comment)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274579 (https://phabricator.wikimedia.org/T105967) (owner: 10Dduvall) [17:31:51] andrewbogott, it doesn't actually depend, I just happened to be working on this after the parent commit [17:32:02] ok, I’ll rearrange [17:32:21] (hm, openstack gerrit has a gui for changing the rebase, um, base.) [17:32:35] andrewbogott: It's also several versions ahead of us ;-) [17:32:58] ostriches: and so ugly I can’t even believe it. 
Please don’t do whatever it is they did [17:33:07] those OSM commits could also do with review [17:33:24] though I should be able to find a MW dev to deal with those [17:33:59] 6Operations: Identify servers with h310 controllers - https://phabricator.wikimedia.org/T84356#2084876 (10Cmjohnson) 5Open>3Resolved I don't think it's much of an issue but more for informational purposes. We can close this task and add the information to the inventory tracking sheeting in gdocs [17:34:14] andrewbogott: seriously the UI escalated terribleness [17:34:27] ostriches, are you going to write that in now, should I leave a task for it, or..? [17:34:38] need to make sure it doesn't get forgotten [17:35:07] Leave a task for it, I won't get to doc-writing today [17:35:16] so, I have problems with the compiler, and do not have access, I am going to go with the simpler ones first [17:35:20] “What should we do with this bucket full of widgets?” “I dunno, just dump it all over the user’s head, I guess?" [17:36:11] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "1) I agree the dir structure is overkill; it was done when this seemed an "una-tantum" and not given too much thought" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [17:36:16] there was a period in the early 00's where "web master" was still a thing and "boxes" was a design strategy [17:36:16] (03PS3) 10Jcrespo: phabricator: Send weekly mail every week instead of on certain monthdays [puppet] - 10https://gerrit.wikimedia.org/r/274033 (owner: 10Alex Monk) [17:36:20] this reminds me of it so much [17:36:55] ostriches, -> https://phabricator.wikimedia.org/T128757 [17:37:15] ty [17:39:02] (03PS4) 10Andrew Bogott: Merge labtestwikitech apache config with wikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/273411 (owner: 10Alex Monk) [17:39:22] (03CR) 10Jcrespo: [C: 032] phabricator: Send weekly mail every week instead of on certain monthdays [puppet] 
- 10https://gerrit.wikimedia.org/r/274033 (owner: 10Alex Monk) [17:39:40] jynus, would you mind if I also added another related patch to this swat? [17:39:45] the puppet compiler it was my fault [17:39:50] go on [17:39:57] I'll need to rebase it [17:40:15] rebases are free [17:40:34] :-) [17:41:53] we cannot "test", cron, but looks alright, I will put you in charge or check that mails continue getting sent [17:42:05] 6Operations, 10ops-eqiad: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2085004 (10Cmjohnson) @jcrespo I can fit all of these into rack. They're 1u instead of 2u. [17:42:13] (03PS2) 10Alex Monk: Redirect create-task form, with or without slash (fixes T127286) [puppet] - 10https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 1020after4) [17:42:42] jynus, I'll let you know if there are issues when it's next supposed to run [17:42:52] actually, I am not sure about that [17:43:01] (03PS3) 10Alex Monk: phabricator: Redirect create-task form, with or without slash (fixes T127286) [puppet] - 10https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 1020after4) [17:43:01] 0 0 1,8,15,22,29 * 1 /usr/local/bin/project_changes.sh [17:43:08] (03CR) 10Alex Monk: [C: 031] phabricator: Redirect create-task form, with or without slash (fixes T127286) [puppet] - 10https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 1020after4) [17:43:14] ^is that the intended schedule? [17:43:14] jynus, ^ that's the one [17:43:26] jynus, that looks like the existing schedule [17:43:40] I think [17:43:40] no, that is the one, after [17:43:48] see the 1 [17:44:14] I will revert, and just ping me with a proper one, ok? outside of puppet swat [17:44:16] oh... it's got the old monthday part, with the new weekday part? 
[17:44:25] interesting [17:44:59] (03PS1) 10Jcrespo: Revert "phabricator: Send weekly mail every week instead of on certain monthdays" [puppet] - 10https://gerrit.wikimedia.org/r/274760 [17:45:10] (03CR) 10Jcrespo: [C: 032] Revert "phabricator: Send weekly mail every week instead of on certain monthdays" [puppet] - 10https://gerrit.wikimedia.org/r/274760 (owner: 10Jcrespo) [17:45:49] I wouldn't discard puppet weirdness [17:46:00] but still, this is so simple it can be done at any time [17:47:03] (03CR) 10Andrew Bogott: [C: 031] "Puppet compiler confirms this is a no-op on Silver, and makes harmless changes on labtestweb." [puppet] - 10https://gerrit.wikimedia.org/r/273411 (owner: 10Alex Monk) [17:47:23] yeah, will look later [17:47:25] thanks [17:47:41] let me check that it reverts correctly [17:48:22] it didn't [17:49:38] so, aparently, all are mandatory [17:49:47] so, I will edit it manually [17:49:52] and run phabe once [17:49:58] *puppet once more [17:50:53] so the change (and the revert) do not change non-used params [17:50:59] that is the issue [17:51:40] ok? 
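(Note on the cron line quoted at 17:43:01, `0 0 1,8,15,22,29 * 1`: in standard Vixie-style cron, when both the day-of-month and day-of-week fields are restricted, the job runs on days matching either field, not both. The sketch below is a minimal illustration of that rule only. The helper names `parse_field` and `day_matches` are hypothetical, it handles only `*` and comma-separated numbers (no ranges or steps), and it is not how cron or Puppet is implemented.)

```python
from datetime import date


def parse_field(field):
    """Return None for '*', otherwise a set of ints from a comma-separated list.

    Hypothetical helper for illustration; ranges and steps are not supported.
    """
    if field == "*":
        return None
    return {int(part) for part in field.split(",")}


def day_matches(d, dom_field, dow_field):
    """Decide whether date d satisfies the two day fields of a crontab entry.

    If both fields are restricted, cron fires when EITHER one matches.
    If only one is restricted, that one alone decides.
    """
    dom = parse_field(dom_field)
    dow = parse_field(dow_field)
    if dow is not None:
        # cron accepts both 0 and 7 for Sunday; normalise to 0..6
        dow = {value % 7 for value in dow}
    cron_dow = d.isoweekday() % 7  # Sunday -> 0, Monday -> 1, ... Saturday -> 6

    if dom is not None and dow is not None:
        return d.day in dom or cron_dow in dow
    dom_ok = dom is None or d.day in dom
    dow_ok = dow is None or cron_dow in dow
    return dom_ok and dow_ok


# The merged schedule: old monthday list plus the new weekday value.
monday = date(2016, 3, 7)       # a Monday, not in the monthday list
tuesday_8th = date(2016, 3, 8)  # a Tuesday, but day 8 is in the list
print(day_matches(monday, "1,8,15,22,29", "1"))
print(day_matches(tuesday_8th, "1,8,15,22,29", "1"))
# The presumably intended schedule: weekday only.
print(day_matches(tuesday_8th, "*", "1"))
```

(Under that rule, the combined entry would fire every Monday and also on the 1st, 8th, 15th, 22nd and 29th, which fits the observation above that the old monthday value was left in place next to the new weekday value.)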
[17:51:48] as I said, puppet weirdness [17:52:03] even the simplest of patches can have unintended consequences [17:52:15] hence the carefulness [17:54:40] andrewbogott, I was about to merge https://gerrit.wikimedia.org/r/#/c/273411/4 [17:55:12] oh, Krenair, I see that you may want to re-review [17:55:20] tell me what to do [17:55:24] jynus: it’s probably not needed, go ahead and merge [17:55:29] it’s simple and I tested it on the compiler [17:55:39] sounds ok [17:55:41] I have one other, I can wait [17:55:47] ok, going with it, then [17:55:59] will look at that patch [17:56:28] (03PS2) 10Ori.livneh: Report save timing by MediaWiki version [puppet] - 10https://gerrit.wikimedia.org/r/273990 (https://phabricator.wikimedia.org/T112557) [17:56:40] (03CR) 10Alex Monk: [C: 031] Merge labtestwikitech apache config with wikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/273411 (owner: 10Alex Monk) [17:57:07] thank you for that one, btw [17:57:16] (03PS5) 10Jcrespo: Merge labtestwikitech apache config with wikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/273411 (owner: 10Alex Monk) [17:58:04] !log increasing stream throughput for restbase1010-b.eqiad.wmnet boostrap by 25mbps (5x5) : T128107 T95253 [17:58:05] T128107: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107 [17:58:05] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [17:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:58:26] (03CR) 10Jcrespo: [C: 032] Merge labtestwikitech apache config with wikitech apache config [puppet] - 10https://gerrit.wikimedia.org/r/273411 (owner: 10Alex Monk) [17:58:34] (03CR) 10Gehel: "Ok, I'm taking the points and will do some spring cleanup..." 
[puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [17:59:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [18:00:04] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160303T1800). Please do the needful. [18:00:16] nope, not doing anything today [18:00:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [18:00:17] too lazy [18:00:33] nothing today. [18:02:20] (03PS1) 10Andrew Bogott: Turn off sqlalchemy logging for designate services. [puppet] - 10https://gerrit.wikimedia.org/r/274763 (https://phabricator.wikimedia.org/T126572) [18:03:03] (03PS4) 10John Vandenberg: phabricator: Redirect create-task form, with or without slash (fixes T127286) [puppet] - 10https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 1020after4) [18:03:23] (03CR) 10Muehlenhoff: Add systemd unit for logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) (owner: 10Muehlenhoff) [18:03:28] (03CR) 10John Vandenberg: [C: 031] phabricator: Redirect create-task form, with or without slash (fixes T127286) [puppet] - 10https://gerrit.wikimedia.org/r/274621 (https://phabricator.wikimedia.org/T127286) (owner: 1020after4) [18:03:54] (03CR) 10Yuvipanda: "There's a big clusterfuck around the location of the CA based on wether it is self hosted puppetmaster or not. 
https://phabricator.wikimed" [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [18:04:22] 6Operations, 10ops-codfw: labstore2003-labstore2004 onsite setup taks - https://phabricator.wikimedia.org/T128764#2085105 (10Krenair) [18:04:44] (03PS7) 10ArielGlenn: send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) [18:04:50] we seem to have had a real 503 spike briefly at eqiad, with some belated fallout at the remote DCs [18:05:13] (03PS2) 10Muehlenhoff: Add systemd unit for logstash [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) [18:05:16] centered on about 17:50 (15 mins back) [18:05:20] in eqiad anyways [18:06:10] (03CR) 10Andrew Bogott: [C: 032] Turn off sqlalchemy logging for designate services. [puppet] - 10https://gerrit.wikimedia.org/r/274763 (https://phabricator.wikimedia.org/T126572) (owner: 10Andrew Bogott) [18:07:07] 6Operations, 6Labs, 10Labs-Infrastructure, 13Patch-For-Review: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572#2085112 (10Andrew) 5Resolved>3Open The logs were still gigantic due to a setting I overlooked. Attached patch should help a lot. [18:07:13] (03CR) 10ArielGlenn: "Backtracked completely on how I was setting this up, it now looks like most of the other classes. They all need some cleanup to be sure, " [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [18:07:26] (03CR) 10ArielGlenn: send web server logs from dataset hosts to stat1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [18:08:32] Krenair, could you rebase manually 273410 ? 
[18:09:09] I am ok with it, but gerrit/puppet compiler fails to do it automatically [18:09:56] 6Operations, 6Analytics-Kanban, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2085127 (10Milimetric) [18:11:52] (03PS8) 10Alex Monk: Add public-wiki-rewrites to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) [18:12:01] 6Operations, 10Dumps-Generation, 13Patch-For-Review: Migrate dataset1001 and ms1001 to jessie - https://phabricator.wikimedia.org/T123724#2085135 (10ArielGlenn) Fallback plan: chris is cabling up the 1gb nic, setting up a port for it, I'll install to that and then move to the 10gb nic once the upgrade is don... [18:12:19] thank you [18:12:32] 6Operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#2085143 (10thcipriani) [18:13:18] 7Blocked-on-Operations, 6Operations, 10RESTBase, 10RESTBase-Cassandra, 13Patch-For-Review: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2085147 (10Eevans) >>! In T95253#2080114, @Eevans wrote: > On bootstrap timings: > > The numbers prese... 
[18:13:22] applying as soon as the compiler check all syntax [18:13:49] for the one reverted, ping me at any time and I will gladly apply it [18:14:07] (do not wait for next swat) [18:14:26] thanks jynus [18:14:27] !log lowering outbound stream throughput limit on restbase1010-a.eqiad.wmnet to 25mbps : T128107 T95253 [18:14:29] T128107: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107 [18:14:29] T95253: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253 [18:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:15:06] (03CR) 10Jcrespo: [C: 032] Add public-wiki-rewrites to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/273410 (https://phabricator.wikimedia.org/T99096) (owner: 10Alex Monk) [18:15:39] please understand that we have to be extremely careful, specially on topics I personaly do not handle every day [18:16:22] (03PS20) 10BBlack: Maps VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/269466 (https://phabricator.wikimedia.org/T124279) (owner: 10Ema) [18:17:52] apache refreshed, test one of the redirects that weren't there before, and we are good to go [18:19:47] seems fine from here [18:21:08] (03PS1) 10Papaul: Add mgmt DNS entries for labstore2003 and labstore2004 Bug:T128764 [dns] - 10https://gerrit.wikimedia.org/r/274767 (https://phabricator.wikimedia.org/T128764) [18:21:14] (03PS2) 10Dduvall: labs: Deployer access for programdashboard [puppet] - 10https://gerrit.wikimedia.org/r/274579 (https://phabricator.wikimedia.org/T105967) [18:22:11] no log errors, either, but wikitech has low traffic, so you never know [18:22:53] I think that is all [18:23:16] (03CR) 10Dduvall: [C: 031] "The default access.conf settings were prohibiting SSH access by the programdashboard user. 
I've added a `security::access::config` resourc" [puppet] - 10https://gerrit.wikimedia.org/r/274579 (https://phabricator.wikimedia.org/T105967) (owner: 10Dduvall) [18:23:42] 6Operations, 10ops-codfw: labstore2003-labstore2004 onsite setup taks - https://phabricator.wikimedia.org/T128764#2085238 (10Papaul) [18:23:44] jynus, well... it didn't completely fix the bug I had in mind, but progress was made [18:23:51] the apache side now works [18:23:57] :) [18:24:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:24:34] well, the icon just changed [18:24:47] or looks like it does anyway [18:24:49] the favicon? [18:24:56] and logo? [18:24:57] the site one [18:25:02] those changed earlier, was cached [18:25:08] ah [18:25:41] the favicon is new [18:25:48] same as the logo [18:25:57] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:27:07] I think these are actually the old logos [18:27:15] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2085247 (10Gehel) Discussion with @EBernhardson: * Pool counter probably needs to be adapted as well (https://github.com/wikimedia/operati... [18:27:20] like from the original wikitech, before labsconsole [18:33:40] 6Operations, 10ops-codfw: labstore2003-labstore2004 onsite setup taks - https://phabricator.wikimedia.org/T128764#2085323 (10chasemp) thanks papaul [18:39:20] (03PS1) 10GWicke: Start to ramp up cache TTL for purged RESTBase end points [puppet] - 10https://gerrit.wikimedia.org/r/274769 [18:40:49] (03CR) 10GWicke: "Thank you, sir!" 
[puppet] - 10https://gerrit.wikimedia.org/r/274458 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [18:40:51] (03CR) 10Ppchelko: [C: 031] Start to ramp up cache TTL for purged RESTBase end points [puppet] - 10https://gerrit.wikimedia.org/r/274769 (owner: 10GWicke) [18:41:56] !log elastic1030.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [18:41:58] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [18:41:58] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [18:42:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:43:12] (03PS2) 10GWicke: Start to ramp up cache TTL for purged RESTBase end points [puppet] - 10https://gerrit.wikimedia.org/r/274769 (https://phabricator.wikimedia.org/T127387) [18:46:26] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [5000000.0] [18:46:36] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 65.00% of data above the critical threshold [5000000.0] [18:48:37] PROBLEM - Disk space on labstore2001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=94%) [18:49:55] I'm refreshing myself on "How to deploy code", and I see a strange command: ssh deployment.codfw.wmnet [18:50:02] I thought the main cluster was on eqiad? [18:50:10] 6Operations, 10ops-eqiad: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2085395 (10jcrespo) **Important!** One of those will be a replacement for labsdb1002 (I do not know if we need to name it labsdb1008, or just substitute it transparently). For the rest, at least 2 rows is re... 
[18:51:10] chrisjohnson, important before you work needlessly: https://phabricator.wikimedia.org/T128753 [18:51:28] no space on / on labstore2001 [18:52:07] (03PS1) 10CSteipp: Update hash parameters in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274772 (https://phabricator.wikimedia.org/T127445) [18:52:17] jynus: -rw-r--r-- 1 root root 15G Mar 3 18:45 rep.patch [18:52:23] root's home [18:52:35] awight: maybe it was set when tin was switched to mira for maintenance? [18:52:59] elukey: It's here, https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Step_2:_get_the_code_on_the_deployment_host [18:53:06] I'll comment on the Talk page... [18:53:28] volans, I wouldn't rush there, I am not sure that is 100% active [18:53:47] elukey: Could you lmk which deployment host I should be using? tin.eqiad.wmnet still? [18:53:52] ok but root full, it cannot continue working... there is an scp that holds that file... [18:54:36] * volans going back to complete the db migration :) [18:55:56] RECOVERY - Disk space on labstore2001 is OK: DISK OK [18:56:58] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: puppet fail [19:00:28] volans, thanks for reporting [19:00:49] (03CR) 10CSteipp: [C: 032] Update hash parameters in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274772 (https://phabricator.wikimedia.org/T127445) (owner: 10CSteipp) [19:00:53] yw [19:01:08] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [19:01:17] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [19:01:38] (03Merged) 10jenkins-bot: Update hash parameters in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274772 (https://phabricator.wikimedia.org/T127445) (owner: 10CSteipp) [19:04:31] (03CR) 10Ottomata: [C: 04-1] send web server logs from dataset hosts to stat1002 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/268129 
(https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [19:05:01] (03CR) 10Ottomata: send web server logs from dataset hosts to stat1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [19:05:28] !log create temp 3T lv on labstore2001 to store test backup deltas [19:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:03] 6Operations, 6Office-IT, 10Traffic, 10netops: Office network using monkeybrains.net instead of connection to SFO pop site - https://phabricator.wikimedia.org/T128669#2085479 (10bbogaert) Thanks Faidon. I'll troubleshoot, and let you know what I find. -Byron [19:08:06] !log csteipp@tin Synchronized wmf-config/CommonSettings-labs.php: (no message) (duration: 00m 40s) [19:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:40] (03PS8) 10ArielGlenn: send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) [19:09:57] (03CR) 10jenkins-bot: [V: 04-1] send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [19:11:05] (03PS1) 10Ottomata: Remove deployment-prep analytics cluster configs from ops/puppet hiera [puppet] - 10https://gerrit.wikimedia.org/r/274776 [19:12:35] (03CR) 10ArielGlenn: send web server logs from dataset hosts to stat1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [19:12:59] 6Operations, 6Office-IT, 10Traffic, 10netops: Office network using monkeybrains.net instead of connection to SFO pop site - https://phabricator.wikimedia.org/T128669#2085543 (10ori) >>! 
In T128669#2083645, @faidon wrote: > Our BGP sessions with the office are indeed down, but this is an OIT matter, you sho... [19:14:11] apergos: why the if /else conditional al all? [19:14:13] at all? [19:14:18] uh [19:14:18] you've already conditionally set $ensure [19:14:23] * apergos looks at the code again [19:14:23] (03PS1) 10EBernhardson: Set default completion suggester scoring for beta and prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274777 [19:14:38] all you need is the main cron, and the ensure param [19:14:40] if ensure is absent [19:14:42] it'll delete it [19:14:55] you could even make $ensure a param [19:14:58] instead of $enable [19:15:09] and just have folks set that [19:15:42] I won't do the param here (yet) cause I need to do that to a bunch of these classes at once (yes it needs fixed) [19:15:49] but lemme get rid of the else [19:15:50] good grief [19:16:54] (03PS10) 10ArielGlenn: send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) [19:19:37] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 63.64% of data above the critical threshold [5000000.0] [19:21:07] apergos: why /nginx on the dest [19:21:08] ? [19:22:15] back [19:22:23] well it's possible you might want other logs someday [19:22:25] from dumps [19:22:27] or other files [19:22:27] so [19:22:34] other webrequest logs? [19:22:36] from dumps? [19:22:44] this is already in /srv/log/webrequest/archive... [19:22:56] what if you change webservers? [19:22:57] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [19:23:05] what if dumps get put behind varnish? 
[19:23:07] they'll do in lightty :-P [19:23:10] *go [19:23:12] noOooo [19:23:14] they are just access logs [19:23:15] hahaha [19:23:23] that's why we had varnish logs in squid for so long [19:23:28] I was specifically thinkg though of other logs of some sort [19:23:36] then they'd go to a different dir [19:23:41] okey dokey [19:23:45] not in /srv/log/webrequest [19:24:04] apergos: aside from that i think its fine [19:24:05] will +1 it [19:24:45] (03PS11) 10ArielGlenn: send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) [19:24:46] ok [19:25:02] thanks for your patience, it seems if you had written it you would have it in service long ago, sorry :-( [19:25:15] anyways I'll send it tomorrow most likely and try a manual run [19:25:28] (tonight I'm winding down but did want to get this in and reviewed on your watch) [19:26:04] (03CR) 10Ottomata: [C: 031] send web server logs from dataset hosts to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/268129 (https://phabricator.wikimedia.org/T118739) (owner: 10ArielGlenn) [19:26:13] np, thanks! [19:26:57] thank you! [19:27:29] !log Completed migration of data from es200[68] to es201[579], added es201[579] to tendril. T127330 [19:27:30] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [19:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:25] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2085616 (10RobH) >>! In T126987#2084046, @mark wrote: > @RobH: why are we buying a single server here instead of topping up... 
[19:34:08] 6Operations, 6Performance-Team, 10Traffic: Segment Navigation Timing data by continent - https://phabricator.wikimedia.org/T128709#2085635 (10ori) [19:34:28] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [19:38:12] (03PS2) 10Dereckson: Site name configuration on wuu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274196 (https://phabricator.wikimedia.org/T128354) [19:44:23] !log elastic1031.eqiad.wmnet: upgrading to 1.7.5, shipping logs to logstash (T122697, T109101) [19:44:25] T109101: Currently elasticsearch logs do not leave nodes. We use logstash for this across the cluster generally. - https://phabricator.wikimedia.org/T109101 [19:44:25] T122697: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697 [19:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:01] woo, last one! [19:47:21] (03PS1) 10Ottomata: Change topic namespace for revision_create from mediawiki to wmf [puppet] - 10https://gerrit.wikimedia.org/r/274783 [19:47:47] (03PS3) 10BBlack: Start to ramp up cache TTL for purged RESTBase end points [puppet] - 10https://gerrit.wikimedia.org/r/274769 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [19:48:04] (03CR) 10BBlack: [C: 032 V: 032] Start to ramp up cache TTL for purged RESTBase end points [puppet] - 10https://gerrit.wikimedia.org/r/274769 (https://phabricator.wikimedia.org/T127387) (owner: 10GWicke) [19:48:06] (03Abandoned) 10Ottomata: Change topic namespace for revision_create from mediawiki to wmf [puppet] - 10https://gerrit.wikimedia.org/r/274783 (owner: 10Ottomata) [19:49:21] (03PS1) 10Ottomata: Add topic config for wmf.resource_change [puppet] - 10https://gerrit.wikimedia.org/r/274785 (https://phabricator.wikimedia.org/T126687) [19:53:59] Hallo. [19:54:05] Is the train running today? 
[19:54:07] (03PS1) 10Alex Monk: Revert "Revert "phabricator: Send weekly mail every week instead of on certain monthdays"" [puppet] - 10https://gerrit.wikimedia.org/r/274788 [19:55:51] (03PS1) 10Chad: Group2 to wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274791 [19:55:55] ori: https://gerrit.wikimedia.org/r/274790 [19:58:13] !log rolling restart of restbase in staging to apply config change : T127387 [19:58:15] T127387: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387 [19:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:04] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160303T2000). [20:00:11] aharoni: ^ [20:01:02] ori: thanks. is this the usual hour? [20:01:27] (03CR) 10Chad: [C: 032] Group2 to wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274791 (owner: 10Chad) [20:01:43] (03CR) 10CSteipp: [C: 031] "Wanted some time to look at this again before giving the go-ahead." [puppet] - 10https://gerrit.wikimedia.org/r/274741 (owner: 10Krinkle) [20:01:57] eek [20:02:03] (03Merged) 10jenkins-bot: Group2 to wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274791 (owner: 10Chad) [20:02:19] ostriches: Can I sneak in a cherry-pick to avoid a semi-embarassing thing hitting enwiki? 
[20:03:07] s/hitting/stop appearing on/ because it's apparently hit already [20:03:35] !log rolling restart of restbase staging complete : T127387 [20:03:36] T127387: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387 [20:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:56] RoanKattouw: Mehhhh ok :'( [20:04:45] ostriches: It's https://gerrit.wikimedia.org/r/#/c/274794/ , just waiting for Jenkins now [20:05:10] I'm gonna blame you in the sync-dir :p [20:05:16] That's OK [20:05:22] I have already passed the blame onto Ori :P [20:08:43] *twiddles thumbs* [20:10:53] (03PS1) 10CSteipp: Update pbkdf2 hash parameters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274795 (https://phabricator.wikimedia.org/T127445) [20:12:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [20:14:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [20:15:50] is anyone looking at these? (I"m eating so I would prefer not to. Yes I know this makes me a slakcer) [20:18:16] (03CR) 10Luke081515: [C: 031] Site name configuration on wuu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274196 (https://phabricator.wikimedia.org/T128354) (owner: 10Dereckson) [20:18:46] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [20:19:47] (03CR) 10Brian Wolff: [C: 04-1] "I think by 'suppress' you meant 'oversight'." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [20:20:22] ostriches: It finally merged. Shall I sync it or do you want to do it? 
[20:20:39] I got it [20:20:41] !log rolling restart of restbase production to apply config change : T127387 [20:20:42] T127387: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387 [20:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:00] !log demon@tin Synchronized php-1.27.0-wmf.15/extensions/Echo: Roan made me do it (duration: 00m 53s) [20:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:12] RoanKattouw: ^^ [20:22:16] Thanks ostriches [20:22:21] yw [20:22:26] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [20:22:36] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message) [20:22:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:23:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:24:34] (03PS10) 10Ottomata: Replace limn::data::generate by reportupdater [puppet] - 10https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) (owner: 10Mforns) [20:25:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:26:39] !log rolling restart of restbase production complete : T127387 [20:26:42] T127387: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387 [20:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:27:30] (03CR) 10CSteipp: Password policies for advanced permission groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [20:29:16] 6Operations, 10hardware-requests, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: MediaWiki maintenance host for codfw (terbium's 
equivalent) - https://phabricator.wikimedia.org/T126987#2085883 (10RobH) I've escalated the pricing task T127372 for @mark's review. [20:29:26] 6Operations, 6Office-IT, 10Traffic, 10netops: Office network using monkeybrains.net instead of connection to SFO pop site - https://phabricator.wikimedia.org/T128669#2085884 (10bbogaert) Hi, What I have found out so far: Routes are being advertised from Monkey Brains: router1# show ip bgp neighbors 208.9... [20:31:37] (03CR) 10Brian Wolff: Password policies for advanced permission groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [20:35:57] why did the wikitech logo change away from the unicorn? :( [20:36:27] it's andrewbogott's fault! [20:36:38] Long live the original wikitech logo! [20:36:59] he’s right, it is my fault [20:37:17] but if you look at the front page of wikitech… there’s a reasonable explanation :) [20:38:08] 6Operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2085945 (10ori) @bblack, @ema: can you confirm that we are on track for having HTTP/2 support on or by May 15th? Also, can you recommend a way for us to project the impact of this change ahead of the actual s... [20:38:25] the logo lives on, here: https://horizon.wikimedia.org [20:38:53] 6Operations, 10RESTBase, 6Services, 3Mobile-Content-Service, 7Varnish: Enable caching for the Mobile Content Service's RESTBase public endpoints - https://phabricator.wikimedia.org/T113591#2085960 (10Pchelolo) 5Open>3Resolved a:3Pchelolo Mobile endpoints are cached in varnish and actively purged no... [20:39:33] andrewbogott: ewwww, rasterized type! 
it looks pixelated [20:41:11] yeah, I asked design to make me a logo for horizon and waited a few months and then made a shitty one myself [20:41:19] (03PS1) 10GWicke: RESTBase: Increase purged cache TTL to 12 hours [puppet] - 10https://gerrit.wikimedia.org/r/274799 [20:41:28] patches welcome :) [20:44:42] (03CR) 10Ppchelko: [C: 031] RESTBase: Increase purged cache TTL to 12 hours [puppet] - 10https://gerrit.wikimedia.org/r/274799 (owner: 10GWicke) [20:45:22] andrewbogott: fine fine, you make sense :) [20:48:44] (03CR) 10Dzahn: "sorry, i dont know about security::access::config and the labs setup, so i can't really tell." [puppet] - 10https://gerrit.wikimedia.org/r/274579 (https://phabricator.wikimedia.org/T105967) (owner: 10Dduvall) [20:48:57] 6Operations, 10Traffic: Port varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T128788#2085985 (10ema) [20:49:25] (03PS11) 10Milimetric: Replace limn::data::generate by reportupdater [puppet] - 10https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) (owner: 10Mforns) [20:49:46] ori: please let me know if I forgot anything in T128788 [20:49:46] T128788: Port varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T128788 [20:51:05] 6Operations, 10Traffic: Port varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T128788#2086002 (10ori) [20:51:07] 6Operations, 10Traffic, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2086001 (10ori) [20:51:18] 6Operations, 10Traffic: Port varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T128788#2085985 (10ori) [20:51:20] 6Operations, 10Traffic, 13Patch-For-Review: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#2086003 (10ori) [20:51:36] ema: lgtm [20:52:01] (03PS12) 10Ottomata: Replace limn::data::generate by reportupdater [puppet] - 10https://gerrit.wikimedia.org/r/273487 
(https://phabricator.wikimedia.org/T127327) (owner: 10Mforns) [20:53:36] (03PS13) 10Ottomata: Replace limn::data::generate by reportupdater [puppet] - 10https://gerrit.wikimedia.org/r/273487 (https://phabricator.wikimedia.org/T127327) (owner: 10Mforns) [20:54:11] (03CR) 10Dzahn: "Giuseppe and Yuvipanda are the best reviewers for this, i think." [puppet] - 10https://gerrit.wikimedia.org/r/274566 (owner: 10Dduvall) [20:55:23] mutante: sorry for spamming you on review. i will spam others :) [20:56:22] (03CR) 10Dzahn: [C: 032] "yep, also checked in racktables" [dns] - 10https://gerrit.wikimedia.org/r/274767 (https://phabricator.wikimedia.org/T128764) (owner: 10Papaul) [20:56:35] thanks for looking nonetheless. i wrestled with that SSH access problem for a while yesterday and finally fixed it this morning with debugging help from thcipriani [20:56:40] marxarelli: oh, it's no problem. i just wanted to be honest and rather say i dont know enough than just making you wait [20:57:43] puppet itself is ok, just dont know too much about the labs access and hiera questions [20:58:10] (03PS1) 10Ppchelko: Removed page_edit topic. [puppet] - 10https://gerrit.wikimedia.org/r/274805 (https://phabricator.wikimedia.org/T126220) [20:59:00] mutante: fwiw, i did verify that the security::access::config resource resolves the issue but, yeah, would be good to verify that it's a sane solution [20:59:10] andrewbogott: aww, thanks for that [20:59:14] (re: logo) [21:00:55] ori: It’s probably somewhat possible for me to add actual text to that screen rather than a picture of text… I’ll have a look. [21:01:31] andrewbogott: nah, don't let that distract you. I'm sure you have bigger things to worry about. [21:01:42] 6Operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2086020 (10BBlack) >>! In T96848#2085945, @ori wrote: > @bblack, @ema: can you confirm that we are on track for having HTTP/2 support on or by May 15th? 
Also, can you recommend a way for us to project the imp... [21:03:46] 6Operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2086028 (10ori) Thanks for the update, @BBlack, and for your work in this space. I think we're on the same page. Let's check in again in April. [21:18:26] (03PS2) 10Dzahn: vagrant_lxc: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270025 (owner: 10Tim Landscheidt) [21:25:51] 6Operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2086147 (10BBlack) I took another quick 5-minute sample just now in the EU like last time, but different time of day, and currently it's: | Protocol | Percentage | | --- | --- | | h2 only | 3.0% | | spdy3 o... [21:28:26] (03CR) 10BBlack: [C: 032 V: 032] RESTBase: Increase purged cache TTL to 12 hours [puppet] - 10https://gerrit.wikimedia.org/r/274799 (owner: 10GWicke) [21:33:56] (03PS1) 10Jforrester: Enable VisualEditor Single Edit Tab on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274814 [21:34:48] (03CR) 10Legoktm: [C: 031] Enable VisualEditor Single Edit Tab on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274814 (owner: 10Jforrester) [21:36:04] legoktm: If greg-g is OK with it you can just banzai-deploy that now. 
:-) [21:36:21] * greg-g looks the other way [21:36:24] (yeah, that's fine) [21:36:25] hahahaha [21:36:31] it's all in who you know [21:36:34] (03CR) 10Legoktm: [C: 032] Enable VisualEditor Single Edit Tab on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274814 (owner: 10Jforrester) [21:36:59] (03Merged) 10jenkins-bot: Enable VisualEditor Single Edit Tab on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274814 (owner: 10Jforrester) [21:37:08] thanks greg-g and James_F :) [21:37:11] apergos: sadly [21:38:18] !log legoktm@tin Synchronized wmf-config/InitialiseSettings.php: Enable VisualEditor Single Edit Tab on officewiki https://gerrit.wikimedia.org/r/#/c/274814/ (duration: 00m 33s) [21:38:34] apergos: You know it. :-) [21:38:41] :-D [21:38:49] lucky I know you guys then! [21:40:09] !log rolling restart of restbase staging (config change) : T127387 [21:40:10] T127387: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387 [21:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:41:30] (03CR) 10Alex Monk: Password policies for advanced permission groups (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272660 (https://phabricator.wikimedia.org/T119100) (owner: 10CSteipp) [21:43:34] (03CR) 10Ottomata: "Cool! Earlier today I still saw events rolling in on this topic, but now it is silent. 
Let's just wait a couple of days before we merge" [puppet] - 10https://gerrit.wikimedia.org/r/274805 (https://phabricator.wikimedia.org/T126220) (owner: 10Ppchelko) [21:45:16] !log rolling restart of restbase staging complete : T127387 [21:45:17] T127387: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387 [21:46:31] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2086189 (10Ottomata) Bump! I suppose we are still waiting for these nodes though? [21:46:33] 6Operations, 10EventBus, 10hardware-requests: 3 conf200x servers in codfw for zookeeper (and etcd?) - https://phabricator.wikimedia.org/T121882#2086190 (10Ottomata) Indeed, either we order new weaker ones, or we use the referred spares. Whatever if fine with me. [21:52:26] bblack: ping? [21:53:01] SMalyshev: hi [21:53:17] bblack: hi! Have some time to talk on varnish things? [21:53:38] SMalyshev: maybe! 
[21:53:52] !log rolling restart of restbase production (config change) : T127387 [21:53:52] T127387: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387 [21:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:12] (03PS1) 10BBlack: strongswan: do not rely on $site_tier for dpdaction [puppet] - 10https://gerrit.wikimedia.org/r/274822 (https://phabricator.wikimedia.org/T127481) [21:55:14] (03PS1) 10BBlack: remove $site_tier, no longer used [puppet] - 10https://gerrit.wikimedia.org/r/274823 (https://phabricator.wikimedia.org/T127481) [21:55:16] (03PS1) 10BBlack: remove cache ipsec-specific nodelists [puppet] - 10https://gerrit.wikimedia.org/r/274824 (https://phabricator.wikimedia.org/T127481) [21:55:18] (03PS1) 10BBlack: re-arrange cache ipsec for codfw as a backend [puppet] - 10https://gerrit.wikimedia.org/r/274825 [21:55:30] bblack: so I made local varnish setup for wdqs setup on test machine, and it seems to be happy to cache response, even if it's returned as chunked. So is not caching chunked something that we configure? [21:55:41] bblack: or there's some part I'm missing here? [21:55:43] 6Operations, 10MobileFrontend, 10Traffic, 5MW-1.27-release, and 6 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2086258 (10Jdlrobson) >>! In T124356#2080027, @Jd... [21:55:48] (03PS2) 10BBlack: re-arrange cache ipsec for codfw as a backend [puppet] - 10https://gerrit.wikimedia.org/r/274825 (https://phabricator.wikimedia.org/T127481) [21:56:17] SMalyshev: it may be a problem specific to our package of varnish, the one in the wikimedia repos [21:56:42] we only recently discovered the issue, when for the first time we had any significant traffic source sending chunked in the first place [21:57:21] bblack: aha :) ok. 
So are we planning to fix that? [21:57:52] no, we likely can't reasonably fix it. but if it's specific to our package (as opposed to generic upstream varnish3), there's better hope that varnish4 fixes it. [21:59:15] SMalyshev: what's your local varnish setup on a test machine? (as in, it it in labs? what OS? etc) [21:59:29] you could maybe just install our jessie package to compare [21:59:29] bblack: labs, debian, apt-get install varnish pretty much [21:59:43] so it is our jessie package? [21:59:58] i.e.: [21:59:58] dpkg-query -W varnish [21:59:59] varnish 3.0.6plus-wm8 [22:00:01] bblack: I don't know if it's our or not, is there a way to fid out? [22:00:11] varnish 3.0.6plus~x-wm3 [22:00:21] that's what I have [22:00:23] well, it's at least an older version of our varnish package [22:00:38] I could update it and see what happens [22:00:38] sounds like maybe your instance is ubuntu not debian? [22:00:47] bblack: yeah sorry, ubuntu [22:00:56] Ubuntu 14.04.1 LTS \n \l [22:01:01] on a debian jessie instance, you'll get the newer package. we don't package it for ubuntu anymore [22:01:16] but, for these purposes, what you have should be substantially similar [22:01:16] bblack: I could maybe spin up one with jessie and see [22:01:47] if chunked can cache for you with a blank-ish config, the next-most likely candidate is that our clusters' use of do_gzip is related to our problems with chunked [22:01:54] !log rolling restart of restbase production complete : T127387 [22:01:55] T127387: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387 [22:02:03] * apergos peeks in [22:02:13] I should have do_gzip as a stalkword :-P [22:02:34] SMalyshev: but at this point, you're debugging our chunked problem for us, which is awesome. But I don't suspect the outcome will lead to any realistic fix we can do on the prod clusters other than "wait for varnish4 to roll out" [22:02:52] bblack: when varnish 4 is planned? 
[22:03:40] SMalyshev: for the misc cluster that WDQS is on, likely pretty soon (as in like April), but there are still some loose ends that we don't have firm timelines on, related to analytics data output [22:03:59] ah, if it's in April then no problem, I can wait until April [22:04:50] I'll still try to run jessie fresh varnish to see what happens, just to know it, but otherwise I'll wait till varnish 4 then [22:06:27] SMalyshev: you might try turning on do_gzip too just to see [22:06:35] bblack: I will [22:06:59] basically: sub vcl_fetch { set beresp.do_gzip = true; } [22:07:34] bblack: I think nginx for me already has gzip enabled, so I wonder if we aren't double-gzipping here [22:07:34] (03PS1) 10Ottomata: Set oozie.service.PurgeService.purge.old.coord.action to true [puppet/cdh] - 10https://gerrit.wikimedia.org/r/274827 (https://phabricator.wikimedia.org/T127988) [22:07:45] SMalyshev: RB responses are chunked, and are definitely cached [22:07:57] (03PS2) 10Ottomata: Set oozie.service.PurgeService.purge.old.coord.action to true [puppet/cdh] - 10https://gerrit.wikimedia.org/r/274827 (https://phabricator.wikimedia.org/T127988) [22:08:01] gwicke: interesting... [22:08:01] SMalyshev: it won't double-gzip, but you might try disabling nginx gzip in your test, if do_gzip alone doesn't break anything [22:08:29] gwicke: can I see which headers your backend is sending? [22:08:30] gwicke: really? maybe we're wrong that it always fails then [22:08:45] gwicke: are RB responses always gzipped if client accepts it? 
[22:08:46] maybe it's something missing in headers [22:08:51] bblack: yes [22:08:59] let me get you a URL [22:09:06] then it really could be "chunked is only broken if do_gzip is on and the backend doesn't gzip either" [22:09:10] http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/page/summary/Cat [22:10:11] gwicke: I don't see chunked there, I see content-length [22:10:38] oh nevermind, with gzip on I do finally [22:11:11] the problem we ran into (recently) with that other service though, was I think also a case where the backend was gzipping and sending chunked, though [22:11:46] it's never been investigated deeply. we killed gzip output for that service (at the backend) and that stopped it being chunked too, and the problem went away [22:11:56] the cache-control header was okay? [22:12:17] we used s-maxage: something in the past [22:12:24] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2086341 (10RobH) [22:12:28] I donno, I'm trying to find the old ticket now, assuming we even had one [22:12:33] it was when wikipedia15 hit [22:12:34] muscle memory, or something like that [22:12:40] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2086372 (10RobH) [22:12:51] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2086341 (10RobH) [22:13:02] should be s-maxage=3600 (equals rather than colon) [22:13:20] hmm... interesting. I'll try to experiment with cache-control and gzip then... maybe it'll change something [22:13:30] gwicke: how I ask for the same Cat via varnish? 
[22:13:52] https://en.wikipedia.org/api/rest_v1/page/html/Cat [22:13:53] (03PS3) 10Ottomata: Set oozie.service.PurgeService.purge.old.coord.action to true and parameterize purge_jobs_older_than_days [puppet/cdh] - 10https://gerrit.wikimedia.org/r/274827 (https://phabricator.wikimedia.org/T127988) [22:14:38] (03CR) 10Ottomata: [C: 032] Set oozie.service.PurgeService.purge.old.coord.action to true and parameterize purge_jobs_older_than_days [puppet/cdh] - 10https://gerrit.wikimedia.org/r/274827 (https://phabricator.wikimedia.org/T127988) (owner: 10Ottomata) [22:15:14] (03PS1) 10Ottomata: Update cdh submodule with oozie purge change [puppet] - 10https://gerrit.wikimedia.org/r/274829 (https://phabricator.wikimedia.org/T127988) [22:15:34] interesting... I feel I need to experiment more then [22:15:51] there's a number of headers we aren't sending, so maybe they will help [22:16:17] SMalyshev: are you setting vary? [22:16:25] gwicke: no [22:16:33] kk [22:17:08] gwicke: though we probably should send vary: Accept [22:17:37] gwicke: would Vary get in a way? 
[22:18:00] you don't need to vary on Accept-Encoding, as Varnish will transparently handle that for yoyu [22:18:20] it'll decompress a gzipped response if the client doesn't support gzip [22:18:40] (and vice-versa, for compressible types) [22:18:47] (03PS2) 10Alex Monk: Revert "Revert "phabricator: Send weekly mail every week instead of on certain monthdays"" [puppet] - 10https://gerrit.wikimedia.org/r/274788 [22:18:49] as for Accept, that would fragment the cache a lot unless you normalize the values heavily [22:19:05] every browser sends some kind of accept, all differing slightly [22:19:45] this would only make sense if you actually perform content negotiation in the backend [22:20:19] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2086435 (10RobH) p:5Triage>3Normal [22:20:22] I still can't find a phab link, we may not have made a ticket. it was when wp15 went live, we started getting hammered by uncached traffic, and at the time we ended up blaming the chunked+gzip responses and disabled apache's compression on the wp15 backend [22:20:49] it may be size-related too, though [22:21:00] the only other thing I can think of is whether an etag is supplied [22:21:10] I think the response in question was under cache_misc's do_stream limit, but maybe I'm wrong there and it was do_stream as well [22:21:47] RB always returns an etag header [22:22:25] 6Operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2086452 (10EBernhardson) @mark We have solidified our budget planning for next year (good thing, because it's due tomorrow!). This is our plan: 4x maps backend servers fr... 
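Editor's note: the remark above that "Vary: Accept" fragments the cache unless the values are heavily normalized can be sketched as follows. This is a minimal illustration in Python rather than VCL. The function name normalize_accept and the list of media types are assumptions made for the example; they are not taken from any configuration discussed in this log.

```python
# Hypothetical list of representations a backend could produce (assumption).
SUPPORTED = (
    "application/sparql-results+json",
    "application/sparql-results+xml",
)
DEFAULT = "application/sparql-results+xml"


def normalize_accept(accept_header):
    """Collapse an arbitrary Accept header to one of a few canonical values.

    Caching on the normalized value instead of the raw header keeps the
    number of cache variants equal to len(SUPPORTED) at most.
    Simplification: q-values are ignored; the first supported type wins.
    """
    if not accept_header:
        return DEFAULT
    # Drop parameters such as ";q=0.9" and compare bare media types only.
    offered = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    for media_type in offered:
        if media_type in SUPPORTED:
            return media_type
    return DEFAULT
```

With this kind of normalization, the many slightly different browser Accept strings all map to the same variant, so only requests that explicitly name a supported type produce a second cache entry.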
[22:24:11] SMalyshev: in any case, if it turns out your chunked responses aren't a problem for the current misc cluster config, I think the main other thing they were missing was just Cache-Control [22:24:25] it's a varnish-apache/mod_proxy_* thing [22:25:04] if apache is using mod_deflate, it will start compressing and sending data as soon as it gets it from the backend [22:25:09] bblack: ok, I'll try to add that and see if it works. Thanks for helping with it! [22:25:16] it can't know the full response size in advance, so it doesn't send a content-length header [22:25:25] varnish doesn't like that, iirc [22:25:34] 6Operations, 6Office-IT, 10Traffic, 10netops: Office network using monkeybrains.net instead of connection to SFO pop site - https://phabricator.wikimedia.org/T128669#2086474 (10bbogaert) Found problem to be ports for uplinks to ulsfo were in err-disable. These have been recovered (thanks cajoel), and the l... [22:25:54] ori: yeah that's what I remember being the analysis at the time (in general mod_deflate always has a size limit over which it will do chunked with no content-length, though, regardless of mod_proxy) [22:26:07] gwicke: we only have 2 acceptable values for Accept so it should not be a big deal [22:26:20] right; I remember now [22:26:21] ori: but SMalyshev is testing with our varnish package and says chunked responses cache fine, and now gwicke is pointing out that RB's responses are gzip+chunked too and also cache fine [22:27:06] weird [22:27:54] so, wp15 was misc-cluster I believe, and so it was also facing this code, which is not on the text cluster where RB lives: [22:27:57] // Stream objects >= 1MB in size [22:27:59] if (std.integer(beresp.http.Content-Length, 1048576) >= 1048576 || beresp.http.Content-Length ~ "^[0-9]{8}") { [22:28:02] set beresp.do_stream = true; [22:28:05] // hit_for_pass on objects >= 10MB in size (no effect on backends that always (pass) anyways) [22:28:08] if (std.integer(beresp.http.Content-Length, 10485760) >= 
10485760 || beresp.http.Content-Length ~ "^[0-9]{9}") { [22:28:11] return (hit_for_pass); [22:28:13] } [22:28:16] } [22:28:29] I seem to remember that wp15 response was large, but I don't think it was large enough for the hit_for_pass check [22:28:41] but then again it didn't have a content-length at all [22:28:46] I'm not sure if this is relevant as we don't have content-length... or beresp is already aggregated? [22:29:03] so it could be just that chunked trips up with do_stream, or that it somehow invoked the hit-for-pass there anyways [22:30:07] I don't think RB's responses are gzip [22:30:30] content-encoding: gzip [22:30:45] try ab -v2 -H'Accept-Encoding: gzip' https://en.wikipedia.org/api/rest_v1/page/summary/Cat [22:31:00] does that go through varnish, tho? [22:32:16] * Krinkle sounds related to https://phabricator.wikimedia.org/T126015 [22:32:17] sorry, ab -H'Accept-Encoding: gzip' -v2 http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/page/summary/Cat [22:32:43] ori: afaik gzip is enabled in the Varnish backend, so it shouldn't matter [22:32:50] all Varnish requests should signal gzip support [22:32:53] (03CR) 10Alex Monk: "Run the script somewhere on the prod side (e.g. silver - so admin credentials etc. can be used) with the updated config to generate the al" [puppet] - 10https://gerrit.wikimedia.org/r/268921 (owner: 10Alex Monk) [22:32:56] do they? [22:33:39] I don't have an easy way to check [22:33:57] they do [22:34:10] well, it's tricky, but basically they do [22:34:26] technically, sadly, it depends on the client triggering the fetch [22:34:31] I think [22:34:36] how does that work? [22:34:41] but most clients will [22:34:53] https://www.varnish-cache.org/docs/3.0/tutorial/compression.html [22:35:40] ah ha [22:35:51] my interpretation of that link is that if the client request which triggers the backend fetch doesn't support gzip, varnish won't ask for gzip. if the client supports it, it will ask for it. 
[22:35:55] * apergos wonders if there are times it's better not to override that header [22:36:05] that's whatit sounds like [22:36:17] bblack: And then it caches that object and uses it for both gzip and non-gzip future clients. [22:36:21] (using its own gzip compressor) [22:36:29] right [22:36:33] I investigated this for the Parsoid caches, but am not sure if I had to do anything explicit to always enable gzip in the backend request or not [22:36:38] If the client does not support gzip the Accept-Encoding header is left alone and we'll end up serving whatever we get from the backend server. [22:36:42] so yeah [22:36:50] !log applying hotfix for T128751 (link ../../../extensions/Gerrit* to /srv/phab/phabricator/src/extensions/Gerrit*) - will submit puppet patch to make this permanent. [22:36:51] T128751: Phabricator redirection script for gerrit is broken - https://phabricator.wikimedia.org/T128751 [22:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:37:14] so ideally we'd force the backend to use gzip, but we'll need to make sure that Varnish will decode it if the client that triggered the fetch didn't want gzip [22:37:18] we always want to store gzipped text content [22:37:38] varnish always decodes for non-gzip clients, and that's not known to be buggy [22:37:45] * apergos goes on to read https://www.varnish-cache.org/docs/3.0/phk/gzip.html#phk-gzip [22:37:52] bblack: even if we start messing with the Accept header for the backend fetch? 
[22:38:02] the do_gzip we turn on in html is for varnish to compress backend responses which weren't already gzip, which is a more-buggy area of functionality [22:38:04] coudl easily trip its internal assumption [22:38:16] (03PS3) 10Dzahn: vagrant_lxc: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270025 (owner: 10Tim Landscheidt) [22:38:56] (03CR) 10Dzahn: [C: 032] vagrant_lxc: Move role class to module role [puppet] - 10https://gerrit.wikimedia.org/r/270025 (owner: 10Tim Landscheidt) [22:39:00] yeah I donno [22:39:08] could be tested! :) [22:39:48] this processing only affects cache hit/miss requests [22:39:49] ok [22:40:10] passing through the client's gzip support to the backend doesn't really make much sense in our case [22:40:14] Varnish4 is Coming though, and in general we know we have unpatched do_gzip bugs, and hacky interactions that have previously tripped bugs related to do_gzip and/or do_stream with our custom -plus patches in our varnish3 [22:40:21] And clients which do not support gzip gets their Accept-Encoding header removed [22:40:23] huh! [22:40:28] just removed completely [22:40:32] I mean all of it's a mess, and IMHO not worth digging too deep if we're getting by today [22:41:42] (03PS1) 10EBernhardson: Update CirrusSearch PoolCounter for cross-dc search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/274834 (https://phabricator.wikimedia.org/T128761) [22:41:53] so it means when a the first request to RB content is from a non-gzip client, that url will not benefit from the better gzip compression that RB offers over varnish's own. [22:42:30] it's notable that the one known upstream varnish3 bugfix for do_gzip that we haven't applied here, we didn't apply because it has a nasty merge conflict with our -plus stream/range code, which is what we suspect interacts badly with do_gzip for us in the first place.... 
[22:42:47] it's almost certain there are still related bugs lurking in the code we run today, for some conditions [22:45:06] (03PS7) 10Dzahn: RT: add role to krypton [puppet] - 10https://gerrit.wikimedia.org/r/250047 (https://phabricator.wikimedia.org/T119112) [22:46:06] (03PS8) 10Dzahn: RT: add role to krypton [puppet] - 10https://gerrit.wikimedia.org/r/250047 (https://phabricator.wikimedia.org/T119112) [22:46:32] (03CR) 10Dzahn: [C: 032] "i'm certain it won't work but i wanna know _how much_ it breaks :)" [puppet] - 10https://gerrit.wikimedia.org/r/250047 (https://phabricator.wikimedia.org/T119112) (owner: 10Dzahn) [22:48:06] (03PS1) 10Andrew Bogott: Prevent keypair creation in horizon. [puppet] - 10https://gerrit.wikimedia.org/r/274835 [22:49:14] (03PS3) 10Yuvipanda: labs: Remove k8s-eval project NFS [puppet] - 10https://gerrit.wikimedia.org/r/272068 [22:49:57] (03PS1) 10ArielGlenn: turn off dumps cron job in prep for second try dataset1001 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/274836 [22:50:02] (03PS2) 10Andrew Bogott: Prevent keypair creation in horizon. [puppet] - 10https://gerrit.wikimedia.org/r/274835 [22:50:28] (03CR) 10Yuvipanda: [C: 032] labs: Remove k8s-eval project NFS [puppet] - 10https://gerrit.wikimedia.org/r/272068 (owner: 10Yuvipanda) [22:50:57] bblack: so I installed clean varnish on labs, and it seems to be happily caching wdqs responses. So it looks like our configs have something to do with it... maybe gzip [22:51:08] (03PS2) 10ArielGlenn: turn off dumps cron job in prep for second try dataset1001 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/274836 [22:51:24] (03PS1) 10Yuvipanda: labs: Kill NFS from the analytics labs project [puppet] - 10https://gerrit.wikimedia.org/r/274837 (https://phabricator.wikimedia.org/T128804) [22:51:56] (03PS3) 10Andrew Bogott: Prevent keypair creation in horizon. 
[puppet] - 10https://gerrit.wikimedia.org/r/274835 [22:52:27] (03CR) 10ArielGlenn: [C: 032] turn off dumps cron job in prep for second try dataset1001 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/274836 (owner: 10ArielGlenn) [22:53:12] merged you too, yuvipanda [22:53:18] apergos: thanks! [22:53:21] yw [22:53:31] enabling do_gzip doesn't seem to matter [22:57:12] (03PS4) 10Andrew Bogott: Prevent keypair creation in horizon. [puppet] - 10https://gerrit.wikimedia.org/r/274835 [22:57:37] (03CR) 10Madhuvishy: [C: 031] "Kill it Kill it Kill it." [puppet] - 10https://gerrit.wikimedia.org/r/274837 (https://phabricator.wikimedia.org/T128804) (owner: 10Yuvipanda) [22:59:05] (03CR) 10Andrew Bogott: [C: 032] Prevent keypair creation in horizon. [puppet] - 10https://gerrit.wikimedia.org/r/274835 (owner: 10Andrew Bogott) [22:59:21] SMalyshev: hey! got a minute? [22:59:29] yuvipanda: sure [22:59:46] SMalyshev: let me just walk over [23:00:01] SMalyshev: try do_stream? basically the same as do_gzip in terms of where and how in VCL [23:00:47] 6Operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#2086627 (10RobH) 5Open>3declined >>! In T116090#2032982, @ssastry wrote: > We now have a... [23:01:07] 6Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2086629 (10Ottomata) a:5JAllemandou>3RobH [23:01:11] 6Operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#2086633 (10RobH) 5Open>3Resolved a:3RobH Since all specs have been updated and orders planned or already done, I'm resolving this task. [23:01:23] SMalyshev: the other thing that may make a difference is the size of the object, although I donno really. 
A very large response may test differently. [23:01:40] SMalyshev: but I wouldn't think so in a clean config without our various size limit conditionals? [23:02:26] bblack: no it's pretty small responses [23:03:05] 6Operations, 10hardware-requests: eqiad out of warranty spares to decommission - approval request - https://phabricator.wikimedia.org/T120679#2086644 (10RobH) [23:04:04] bblack: I see three varnishes there: cp1045 miss(0), cp4001 pass(0), cp4004 frontend pass(0) - so which ones they are? one frontend, one backend, and the third? [23:04:32] 6Operations, 10hardware-requests: 3 new nodes for Druid - https://phabricator.wikimedia.org/T128807#2086645 (10Ottomata) [23:04:43] (03CR) 10Milimetric: [C: 031] labs: Kill NFS from the analytics labs project [puppet] - 10https://gerrit.wikimedia.org/r/274837 (https://phabricator.wikimedia.org/T128804) (owner: 10Yuvipanda) [23:04:55] SMalyshev: one frontend and two backends [23:05:02] 6Operations, 6Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2086662 (10RobH) [23:05:08] 6Operations, 10hardware-requests: 3 new nodes for Druid - https://phabricator.wikimedia.org/T128807#2086663 (10Ottomata) [23:05:20] cp1045 is the one fetching from wdqs, cp4001 fetches from it, and cp4004 fetches from cp4001 for the client [23:06:23] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2086665 (10Ottomata) [23:06:57] SMalyshev: also keep in mind if any of them show a "hit", the rest of X-Cache behind that is part of the cached object [23:07:17] bblack: yeah but for me it's aways pass [23:07:20] so e.g. you can see "miss miss hit", and it's really a cache object in all 3 [23:07:48] (the miss miss coming from when it was fetched into those layer, but the FE never had to go back and ask them again after) [23:08:24] SMalyshev: can you give me a test URL to try? 
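Editor's note: the X-Cache reading described above (one entry per cache layer, with everything behind a "hit" being part of the stored object) can be sketched in Python. The header format is inferred from the single example quoted in the log ("cp1045 miss(0), cp4001 pass(0), cp4004 frontend pass(0)"); the function names and the "hit(N)" form are assumptions for illustration only.

```python
import re

# One entry looks like "cp4004 frontend pass(0)" or "cp1045 miss(0)".
_ENTRY = re.compile(r"^(\S+)\s+(?:(frontend)\s+)?(\w+)\((\d+)\)$")


def parse_x_cache(header):
    """Split an X-Cache value into per-layer dicts, backend-most layer first."""
    entries = []
    for part in header.split(","):
        match = _ENTRY.match(part.strip())
        if match is None:
            raise ValueError("unrecognized X-Cache entry: %r" % part)
        host, frontend, status, count = match.groups()
        entries.append({
            "host": host,
            "frontend": frontend is not None,
            "status": status,
            "count": int(count),
        })
    return entries


def live_layers(entries):
    """Return the layers that actually handled this request.

    Walking from the client-facing layer (last entry) backwards, the first
    "hit" ends the chain: entries further back were recorded when the object
    was originally fetched and stored.
    """
    live = []
    for entry in reversed(entries):
        live.append(entry)
        if entry["status"] == "hit":
            break
    return list(reversed(live))
```

For the "miss miss hit" case mentioned above, live_layers returns only the frontend entry, which matches the explanation that the two misses are left over from the original fetch.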
[23:08:39] I don't even know what a valid one is :) [23:08:49] bblack: sure, https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=SELECT%20*%0AWHERE%20%0A%7B%0A%09%3Fx%20%3Fz%20%3Fy%0A%7D%20LIMIT%2010 [23:08:59] oh dammit my client decoded URLs [23:09:12] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2086685 (10Ottomata) Perhaps one of the old RESTbases that are being replaced will do? [23:09:36] bblack: here: https://phabricator.wikimedia.org/P2707 [23:10:25] bblack: interestingly enough, sometimes it's pass and sometimes it's miss... why? [23:13:25] SMalyshev: because it's deciding on pass after fetching it, and then caching that decision for 120s [23:13:54] first it's a miss, so it fetches, then something it sees about the response makes it decide to store a hit-for-pass cache object. the on the next fetch in that 120s, it's a pass. [23:14:16] I've got varnishlog debug on a test request now that sees the same thing [23:14:43] 12 Hash c query.wikidata.org 12 VCL_return c hash 12 VCL_call c miss fetch 12 Backend c 50 wdqs be_wdqs1002 12 TTL c 2172515835 RFC 120 -1 -1 1457046755 0 1457046754 0 0 12 VCL_call c fetch 12 TTL c 2172515835 VCL 120 3600 -1 1457046755 -0 12 VCL_return c hit_for_pass [23:14:48] eh bad paste [23:14:49] bblack: I see, that makes it clearer [23:14:57] so it's hit_for_pass [23:15:12] 12 Hash c query.wikidata.org [23:15:12] 12 VCL_return c hash [23:15:12] 12 VCL_call c miss fetch [23:15:12] 12 Backend c 50 wdqs be_wdqs1002 [23:15:12] 12 TTL c 2172515835 RFC 120 -1 -1 1457046755 0 1457046754 0 0 [23:15:15] 12 VCL_call c fetch [23:15:18] 12 TTL c 2172515835 VCL 120 3600 -1 1457046755 -0 [23:15:20] 12 VCL_return c hit_for_pass [23:16:11] SMalyshev: the response from https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=SELECT%20*%0AWHERE%20%0A%7B%0A%09%3Fx%20%3Fz%20%3Fy%0A%7D%20LIMIT%2010 doesn't seem to contain Cache-Control headers [23:16:25] gwicke: no 
it doesn't [23:16:34] gwicke: but clean varnish caches it anyway [23:16:47] well, for 120s [23:16:53] (default default_ttl) [23:16:59] what misc uses too in the absence of CC [23:17:01] 6Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, and 4 others: Move oldwikisource on www.wikisource.org to mul.wikisource.org - https://phabricator.wikimedia.org/T64717#2086727 (10TTO) [23:17:15] the question is why the response triggers hit-for-pass in our specific config [23:17:33] gwicke: check out: http://wdqs-test.wmflabs.org/#SELECT%20*%0AWHERE%20%0A%7B%0A%09%3Fx%20%3Fz%20%3Fy%0A%7D%20LIMIT%20100 - this URL runs wdqs on labs via default config varnish, and cache works [23:17:57] the only conditional in misc-specific VCL that does hit-for-pass is if (std.integer(beresp.http.Content-Length, 10485760) >= 10485760 || beresp.http.Content-Length ~ "^[0-9]{9}") { [23:18:01] return (hit_for_pass); [23:18:07] but then there's also probably default VCL invoking in places too [23:18:15] gwicke: oops wrong url... [23:18:27] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail [23:19:28] the default vcl_fetch will also hit-for-pass if it sees Set-Cookie or Vary:* [23:19:41] http://wdqs-test.wmflabs.org/sparql?query=SELECT%20*%0AWHERE%20%0A%7B%0A%09%3Fx%20%3Fz%20%3Fy%0A%7D%20LIMIT%20100 - this caches via varnish [23:19:43] but I don't see those in the response [23:19:48] 6Operations, 10hardware-requests: 3 new nodes for Druid - https://phabricator.wikimedia.org/T128807#2086743 (10RobH) a:5RobH>3Ottomata The hardware currently in analytics1015, analytics1017, and analytics1021: * Dell PowerEdge R720xd * RT 3646 * Dual CPU Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz / 6 cores... 
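The alternating miss/pass pattern explained earlier (a miss triggers a fetch, the fetch stores a hit-for-pass marker, and requests during the next 120s are passed) can be modelled with a toy Python class. This is a simplified sketch and not Varnish's implementation: `TinyCache` and its method are hypothetical names, and it assumes every fetched response is judged uncacheable.

```python
HFP_TTL = 120  # seconds; the default_ttl mentioned in the discussion


class TinyCache:
    """Toy model of one cache layer that only ever stores hit-for-pass markers."""

    def __init__(self):
        # url -> time at which the hit-for-pass marker expires
        self.hfp_until = {}

    def request(self, url, now):
        if self.hfp_until.get(url, 0) > now:
            # A live marker exists: go straight to the backend.
            return "pass"
        # No live marker: report a miss, fetch, and (by assumption) decide
        # the response is uncacheable, remembering that decision for HFP_TTL.
        self.hfp_until[url] = now + HFP_TTL
        return "miss"


cache = TinyCache()
results = [cache.request("/sparql", t) for t in (0, 60, 121)]
```

The three requests at t=0, t=60 and t=121 come back as miss, pass, miss: the second falls inside the 120s window and the third arrives after the marker has expired.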
[23:20:05] bblack: no, there's no cookies or vary [23:20:19] Server: nginx/1.9.3 [23:20:19] Date: Thu, 03 Mar 2016 23:16:11 GMT [23:20:19] Content-Type: application/sparql-results+xml [23:20:19] Transfer-Encoding: chunked [23:20:20] Connection: keep-alive [23:20:21] X-Served-By: wdqs1001 [23:20:22] Access-Control-Allow-Origin: * [23:20:26] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2086750 (10RobH) p:5Triage>3Normal [23:20:28] this is the response from nginx [23:20:44] so the only other thing could be that the stream/hfp conditionals are being tripped by chunked, I think [23:20:52] I can kinda test that [23:21:08] but even then, how does it not cache at the next layer up? [23:21:31] to me, behavior without Cache-Control is kind of like C undefined behavior [23:22:28] also, interestingly enough, Transfer-Encoding: chunked survives all 3 layers of caches [23:23:13] AND the varnish which does cache also sends Transfer-Encoding: chunked [23:24:10] yeah [23:24:26] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2060818 (10RobH) @EBernhardson: I'm a bit confused by a section of the request: > Ideally we should match specs with elastic10{17..31}.eqiad.wmnet. However, @mark mention... [23:24:32] that's pretty strong evidence alone that not all TE:chunked break varnish heh [23:24:45] but we still don't understand why it seemed to before, and why we're not getting caching now [23:25:49] ok, confirmed, it's the cache_misc size-based code, somehow [23:26:17] (probably the case for the wp15 thing as well) [23:26:57] I've manually tested without those conditionals in place for this query, and it caches [23:27:33] bblack: so which conditionals are those? misc_fetch_large_objects? 
[23:27:44] yes [23:27:51] and now I understand [23:28:12] it's funny how much easier it gets to understand a code problem when you're convinced it's wrong instead of convinced it must be ok because it worked before :P [23:28:23] std.integer(beresp.http.Content-Length, 10485760) - doesn't that return 10M for no header? [23:28:30] yup! that's the problem [23:28:45] those conditionals cause stream+pass for anything that doesn't have a content-length at all, too [23:29:01] ok, that seems to be it then. It will hit_for_pass everything without content-length... but how restbase works then [23:29:02] ? [23:29:10] do they have content-length? [23:29:15] RB goes through a different cluster, which doesn't have that hack [23:29:21] ahhhh hehe :) [23:30:20] ok, so then I probably need a patch for WDQS still to add vary and cache-control... but then we could look into enabling the cache? [23:30:28] what vary? [23:30:49] bblack: We may need Vary: Accept because you can specify format by Accept header [23:30:58] oh that [23:31:01] (not Accept-Encoding, but Accept) [23:31:04] is tricky! [23:31:15] there are two formats - json and xml (xml being the default) [23:31:24] like gwicke said, Accept varies wildly [23:31:34] we'd have to normalize Vary on input to the caches first [23:31:41] (for this particular case) [23:31:45] bblack: not for WDQS :) for WDQS only two options would work [23:31:58] 6Operations, 10RESTBase, 6Services, 10Traffic, and 2 others: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#2086805 (10GWicke) Summary of the latest status: - Path normalization is applied to regular & PURGE requests. - We verified purgin... [23:32:05] yeah but that doesn't matter.
Vary generically means "every Accept header clients send is a unique cache slot" [23:32:33] it will still cache your 2x possible outputs in 200x different slots for various browser Accept differences by default [23:32:35] bblack: GUI only sends one kind, other clients would send one of the two [23:33:02] bblack: we could of course still normalize but unless somebody is trying to be weird there's no reason to put into Accept anything but 2 options [23:33:31] e.g. I just saw a live Firefox/44 request send: Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2 [23:33:35] (to query.wikidata.org) [23:33:52] bblack: no, that's probably for static part... I'm talking only about SPARQL part [23:34:04] bblack: you don't go to SPARQL part with the browser usually [23:34:23] I don't think that's going to hold, right? [23:34:32] bblack: though I see your point, you *can* do it... [23:34:39] we're talking about graphs embedded in wiki pages, whose javascript asks the browser to fetch sparql [23:34:39] then we'd have to normalize it probably [23:35:17] bblack: but there are only two *legit* options [23:35:39] well anyways [23:36:09] SMalyshev: thanks for digging into all of this and expending time testing. You've uncovered general bugs in our misc-web VCL that have bitten us before that we totally didn't understand :) [23:36:32] bblack: np :) I'll make some patches next and put it for review [23:36:38] step 1 for me is "go fix those things". and then yes, we can normalize Accept for the specific URLs you're doing Vary:Accept on. That part isn't very hard. [23:36:51] RoanKattouw_away: ostriches: Krenair: (greg-g) I'm happy to do the SWAT deployment today. This would be my first, although I've done plenty of scapping mostly for CentralNotice things.
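The Vary: Accept concern raised above (two possible outputs spread across many cache slots unless Accept is normalized first) can be illustrated with a short Python sketch. It is not the cache_misc VCL: the function names and the substring-matching rule are assumptions, and the JSON media type is the standard SPARQL results type rather than something quoted in the log.

```python
# Media types for the two formats; the XML one appears in the quoted response.
XML = "application/sparql-results+xml"
JSON = "application/sparql-results+json"  # assumed counterpart for JSON


def normalize_accept(accept):
    """Collapse an arbitrary Accept header to one of two canonical values."""
    if accept and "json" in accept.lower():
        return JSON
    return XML  # xml is the default, per the discussion


def cache_slots(url, accept_headers, normalize):
    """Count distinct (url, Accept) cache slots under Vary: Accept."""
    keys = {
        (url, normalize_accept(a) if normalize else a)
        for a in accept_headers
    }
    return len(keys)


headers = [
    "text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2",  # from the log
    "*/*",
    "application/xml, text/xml",
    "application/sparql-results+json",
]
raw = cache_slots("/sparql?query=x", headers, normalize=False)
normalized = cache_slots("/sparql?query=x", headers, normalize=True)
```

Without normalization each distinct header gets its own slot (four here); with it, the same requests fall into two slots, one per output format.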
[23:37:12] 6Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2086825 (10Ottomata) > If we can fit more cores per system, is there any benefit to lowering this cluster size from 3 to 2? Will let @milimetric chime in on this one. [23:38:18] digging a bit deeper on misc_fetch_large_objects - that code comes from long ago. the misc-web cluster used to be a very tiny cluster with only a very small memory cache, hence the limitation to flip to hit-for-pass on large objects. [23:38:35] I'm not sure about the do_stream part, maybe that can live on [23:39:17] some of the current misc cluster still has small memory, but we can make it conditional to frontend only and at least you'll still get a local backend hit [23:40:36] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2086829 (10EBernhardson) >>! In T128000#2086773, @RobH wrote: > @EBernhardson: > > I'm a bit confused by a section of the request: > >> Ideally we should match specs with... [23:40:54] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2086831 (10RobH) a:5RobH>3Ottomata The upcoming restbase systems are 128GB of ram, and thus are overprovisioned for this request. However, we do have spare systems that are the followi... [23:41:59] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2086665 (10GWicke) @RobH, restbase1001-1006 have 64G RAM. [23:43:37] bblack: do you want to take over https://phabricator.wikimedia.org/T128656 or create a new bug and I'll close that one? 
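The misc_fetch_large_objects problem identified earlier (the fallback of `std.integer` equals the size threshold, so a response with no Content-Length is treated as large) can be reproduced with a Python stand-in. This is a sketch only: `std_integer` roughly imitates the VCL function, and the "fixed" variant with a fallback of 0 is one hypothetical remedy, not necessarily the change that was made.

```python
import re

LIMIT = 10485760  # 10 MiB, the threshold quoted in the discussion


def std_integer(value, fallback):
    """Rough stand-in for VCL std.integer(): parse, or return the fallback."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return fallback


def has_nine_digits(content_length):
    """Mirror of the quoted regex check: value starts with nine digits."""
    return bool(re.match(r"[0-9]{9}", content_length or ""))


def buggy_is_large(content_length):
    # Fallback equals the limit, so a missing header compares as "large".
    return std_integer(content_length, LIMIT) >= LIMIT or has_nine_digits(content_length)


def fixed_is_large(content_length):
    # Hypothetical fix: a missing header falls back to 0, i.e. "not large".
    return std_integer(content_length, 0) >= LIMIT or has_nine_digits(content_length)


# A chunked response carries no Content-Length, modelled here as None.
chunked_buggy = buggy_is_large(None)
chunked_fixed = fixed_is_large(None)
```

Under the original condition the chunked WDQS response is classified as large and becomes hit-for-pass; with the alternative fallback only responses that declare 10 MiB or more are.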
[23:44:06] (03PS9) 10Dzahn: RT: add role to krypton [puppet] - 10https://gerrit.wikimedia.org/r/250047 (https://phabricator.wikimedia.org/T119112) [23:44:28] 6Operations, 10Mail: move travel related aliases to OIT - https://phabricator.wikimedia.org/T127549#2086843 (10bbogaert) Hi Daniel, I'm working on this again and have recreated the google group and ldap entry. Can you remove it so we may test again? Thanks, Byron [23:45:02] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2086844 (10EBernhardson) something doesn't add up with the cpu specs, but i'm sure you can figure out what it's supposed to be. elastic2001 reports 32 hardware threads (16 c... [23:46:06] SMalyshev: let me just walk over/window page_up [23:46:08] SMalyshev: let me just walk over/window page_up [23:46:08] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:46:12] bah [23:46:21] heh [23:47:07] 6Operations, 10Traffic: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2086850 (10BBlack) [23:47:52] SMalyshev: I made a new one, sorry, wasn't looking here while typing ^ [23:48:04] feel free to add that as a blocker on your related stuff [23:48:04] bblack: no problem, I'll close the old one [23:48:33] 6Operations, 10Traffic: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2086850 (10Smalyshev) [23:49:19] (03PS2) 10Yuvipanda: labs: Kill NFS from the analytics labs project [puppet] - 10https://gerrit.wikimedia.org/r/274837 (https://phabricator.wikimedia.org/T128804) [23:49:26] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Kill NFS from the analytics labs project [puppet] - 10https://gerrit.wikimedia.org/r/274837 (https://phabricator.wikimedia.org/T128804) (owner: 10Yuvipanda) [23:49:39] 6Operations, 6Discovery, 10hardware-requests: Refresh 
elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2086873 (10RobH) The Intel(R) Xeon(R) CPU E5-2640 v3 is an 8 core CPU. You linked the [[http://ark.intel.com/products/64591/Intel-Xeon-Processor-E5-2640-15M-Cache-2_50-GHz... [23:51:06] awight: Krenair RoanKattouw_away: up to you all, really [23:51:22] no objection, though Roan has a patch in it [23:51:44] I'll ambush him if he shows up... [23:51:55] If he doesn't show up I'll let you do it [23:52:04] If he does, he decides [23:52:05] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2086875 (10RobH) Damn, I've made that mistake twice now and I shouldn't. Thank you @GWicke! Either way, the restbase1001-1006 already have about half spoken for & not quite back into avai... [23:52:13] jdlrobson: did you have a second patch for SWAT? [23:52:21] Krenair: the gauntlet! thx [23:52:55] 6Operations, 10Analytics-Cluster, 10hardware-requests: eqiad: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#2086884 (10RobH) restbase1001-1006 are not yet available. Though getting @mark's approval for the allocation of one for this in advance would elimina... [23:57:22] Krenair: Do we normally create the release branch merges, e.g. for https://gerrit.wikimedia.org/r/#/c/273579/ ? [23:57:35] I thought that was on the developer. [23:58:23] Convention is that the person requesting SWAT presses the cherry-pick button [23:58:47] it's often required to know *which* branches to backport to [23:59:00] I might be nice today, cos the other stuff is fast and safe. [23:59:01] right [23:59:02] although usually not terribly hard to figure out [23:59:30] actually, in this case, from the comments: [23:59:38] Florianschmidtwelzow Feb 27 11:50 PM [23:59:38] Patch Set 2: Cherry Picked [23:59:38] This patchset was cherry picked to change: I59dbbe02f25bf123ebe3cae947289c611a23b9dd [23:59:47] I see, thank you