[00:00:04] <jouncebot>	 RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160129T0000). Please do the needful.
[00:00:05] <jouncebot>	 ebernhardson yurik Jdlrobson bmansurov Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:11] <Krenair>	 delaying swat
[00:00:13] <grrrit-wm>	 (Merged) jenkins-bot: Revert "wgRCWatchCategoryMembership true on wikipedias & commons" [mediawiki-config] - https://gerrit.wikimedia.org/r/267189 (owner: Alex Monk)
[00:00:23] <wikibugs>	 operations, Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1980344 (Yurik) @bblack, the 4 test servers have been performing admirably, so if possible, it would be good to keep them as production and match them in another DC for redundancy.
[00:00:37] <greg-g>	 yes, SWAT is delayed until further notice, sorry for the convenience
[00:01:15] <yurik>	 greg-g, thanks, that is very convenient, yes :D 
[00:01:46] <grrrit-wm>	 (PS1) Subramanya Sastry: parsoid-vd-client & diffservice: Use uprightdiff for diffing images [puppet] - https://gerrit.wikimedia.org/r/267190
[00:02:01] <logmsgbot>	 !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/267189/2 (duration: 01m 11s)
[00:02:03] <greg-g>	 yurik: the convenience is for our users who will be happier when we fix this UBN! first :)
[00:02:05] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:02:23] <greg-g>	 also, a lovely joke from the late great Mitch Hedberg
[00:02:39] <wikibugs>	 operations, Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1980347 (Tfinc) This will likely fit under the strategic budget so we'll need a brief narrative about going default on Wikipedia and any other projects to explain the increase of machines.
[00:03:00] <yurik>	 greg-g, what's a UBN?
[00:03:10] <greg-g>	 Unbreak Now
[00:03:33] <yurik>	 ah yes, they usually are kinda nasty, aren't they
[00:03:49] <grrrit-wm>	 (PS9) Andrew Bogott: Keystone: Adopt a multi-domain model [puppet] - https://gerrit.wikimedia.org/r/244350
[00:03:51] <grrrit-wm>	 (PS1) Andrew Bogott: Define wgOpenStackManagerProject [puppet] - https://gerrit.wikimedia.org/r/267192 (https://phabricator.wikimedia.org/T115029)
[00:04:12] <greg-g>	 yurik: too many of them this week
[00:04:16] <wikibugs>	 operations: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1980354 (Dzahn)
[00:04:18] <wikibugs>	 operations, Patch-For-Review: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1980352 (Dzahn) Open>Resolved maybe some day round 2, adding a CSS with the mediawiki.org style and a logo?
[00:04:18] <greg-g>	 hence my high blood pressure :/
[00:04:33] <wikibugs>	 operations: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1980355 (Dzahn)
[00:05:50] <yurik>	 greg-g, i heard herbal tea somehow brings down the blood pressure as well as causes drinker not to distance oneself from the world's ills
[00:06:04] <jdlrobson>	 greg-g: unbreak now is what we do best
[00:06:08] <jdlrobson>	 ;-)
[00:06:08] <grrrit-wm>	 (PS1) Rush: diamond: nfsiostat as a collector [puppet] - https://gerrit.wikimedia.org/r/267193
[00:06:09] <yurik>	 without the "not"
[00:06:13] <greg-g>	 jdlrobson: :)
[00:06:20] <jdlrobson>	 no need for high blood pressure
[00:06:23] <jdlrobson>	 we 0wn at fixing those
[00:06:27] <grrrit-wm>	 (PS2) Rush: diamond: nfsiostat as a collector [puppet] - https://gerrit.wikimedia.org/r/267193
[00:07:25] <greg-g>	 ok, SWAT is back on the menu
[00:07:29] <jdlrobson>	 WOOO
[00:07:33] * greg-g is tired and making too many cultural references
[00:07:33] <Krenair>	 okay
[00:09:30] <yurik>	 greg-g, you should talk to oliver - he loves to make all sorts of weird references ... that only he himself gets
[00:09:36] <yurik>	 :-P
[00:10:59] <grrrit-wm>	 (PS1) Dereckson: Enable SandboxLink on or.wikipedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/267194 (https://phabricator.wikimedia.org/T124614)
[00:12:22] <grrrit-wm>	 (CR) Bmansurov: "Nope, swatters suggested that I used the window." [mediawiki-config] - https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: Bmansurov)
[00:12:37] <greg-g>	 yurik: yeah, he's too extreme for even me
[00:12:38] <greg-g>	 :)
[00:13:04] <Krenair>	 9 patches?
[00:13:20] <Krenair>	 ebernhardson, one of yours is V-1'd
[00:13:43] <Dereckson>	 Hi. Krenair: I added the 9th, it's a throttle rule, so we need it.
[00:15:13] <Krenair>	 yurik first
[00:15:21] <grrrit-wm>	 (PS3) Alex Monk: Update graph settings - should be noop [mediawiki-config] - https://gerrit.wikimedia.org/r/267060 (owner: Yurik)
[00:15:26] <yurik>	 Krenair, naturally!
[00:15:27] <grrrit-wm>	 (CR) Alex Monk: [C: 2] Update graph settings - should be noop [mediawiki-config] - https://gerrit.wikimedia.org/r/267060 (owner: Yurik)
[00:15:50] <grrrit-wm>	 (PS1) Dereckson: Enable WikidataPageBanner on es.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000)
[00:16:02] <grrrit-wm>	 (Merged) jenkins-bot: Update graph settings - should be noop [mediawiki-config] - https://gerrit.wikimedia.org/r/267060 (owner: Yurik)
[00:16:09] * yurik hides
[00:17:31] <ebernhardson>	 Krenair: checking
[00:17:46] <logmsgbot>	 !log krenair@mira Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/267060/ (duration: 01m 12s)
[00:17:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:17:51] <Krenair>	 yurik, ^
[00:18:38] <grrrit-wm>	 (PS2) EBernhardson: Point CirrusSearch queries to local datacenter [mediawiki-config] - https://gerrit.wikimedia.org/r/267053
[00:18:45] <ebernhardson>	 was just a unit test i forgot to update...
[00:18:46] <yurik>	 Krenair, seems to be ok
[00:18:52] <Krenair>	 yurik, this second patch is not reviewed by someone else?
[00:19:19] <grrrit-wm>	 (CR) jenkins-bot: [V: -1] Point CirrusSearch queries to local datacenter [mediawiki-config] - https://gerrit.wikimedia.org/r/267053 (owner: EBernhardson)
[00:19:25] <yurik>	 Krenair, no, but i can very quickly get it reviewed - it's a one-liner
[00:19:31] <Krenair>	 ok, please do so
[00:20:22] <ebernhardson>	 :S
[00:20:58] <Krenair>	 jdlrobson, bmansurov: okay, there's some dependencies thing here
[00:21:08] <yurik>	 Krenair, max just merged it
[00:21:23] <jdlrobson>	 Krenair: what do you need?
[00:21:35] <Krenair>	 jdlrobson proposes https://gerrit.wikimedia.org/r/#/c/267025/ which depends on https://gerrit.wikimedia.org/r/#/c/264909/
[00:21:47] <Krenair>	 which is not on either prod branch yet
[00:22:13] <jdlrobson>	 Krenair: that's fine. It's harmless and will be riding the train next week.
[00:22:15] <Krenair>	 but I suppose we can do the config early
[00:22:18] <Krenair>	 ok
[00:22:32] <grrrit-wm>	 (PS3) EBernhardson: Point CirrusSearch queries to local datacenter [mediawiki-config] - https://gerrit.wikimedia.org/r/267053
[00:22:34] <Krenair>	 yurik, he merged it to the deployment branch directly? sigh...
[00:22:48] <yurik>	 Krenair, yeah, i guess he didn't realize it was not on master
[00:23:05] <Krenair>	 oh well
[00:23:06] <yurik>	 not a biggie, right? :)
[00:23:21] <yurik>	 since you wanted to deploy it anyway :D
[00:24:16] <Krenair>	 not a big deal because I'm looking at it as part of this swat, it's not taking me completely by surprise
[00:25:01] <logmsgbot>	 !log krenair@mira Synchronized php-1.27.0-wmf.11/extensions/Graph/modules/graph2.js: https://gerrit.wikimedia.org/r/#/c/267065/ (duration: 01m 11s)
[00:25:01] <Krenair>	 I actually have someone present in the channel who knows what the patch is, which is better than most other times people send me surprises via the deployment branches
[00:25:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:25:05] <Krenair>	 yurik, ^
[00:25:25] <yurik>	 checking...
[00:25:50] <grrrit-wm>	 (CR) Mobrovac: "/usr/local/bin/uprightdiff is somehow magically present on the node?" [puppet] - https://gerrit.wikimedia.org/r/267190 (owner: Subramanya Sastry)
[00:26:00] <wikibugs>	 operations: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1980389 (yuvipanda) NEW
[00:26:10] <YuviPanda>	 bblack: ^
[00:26:50] <wikibugs>	 operations, Wikimedia-DNS: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1980396 (Krenair)
[00:27:20] <wikibugs>	 operations, Wikimedia-DNS: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1980408 (yuvipanda) local testing with dnsmasq (for example) returns ::1 for localhost.
[00:30:27] <Krenair>	 everything okay, yurik?
[00:30:37] <yurik>	 Krenair, yep, all's good, thx!
[00:30:56] <yurik>	 my graphoid service is acting up, but that's to be expected i guess :)
[00:31:05] <yurik>	 to be fixed on monday
[00:31:23] <grrrit-wm>	 (PS3) Alex Monk: Add sampling rates for mobile web language switcher in production [mediawiki-config] - https://gerrit.wikimedia.org/r/267025 (https://phabricator.wikimedia.org/T123932) (owner: Bmansurov)
[00:31:28] <grrrit-wm>	 (CR) Alex Monk: [C: 2] Add sampling rates for mobile web language switcher in production [mediawiki-config] - https://gerrit.wikimedia.org/r/267025 (https://phabricator.wikimedia.org/T123932) (owner: Bmansurov)
[00:31:47] <greg-g>	 yurik: your skills at inspiring confidence could use some updating :P
[00:31:57] <grrrit-wm>	 (Merged) jenkins-bot: Add sampling rates for mobile web language switcher in production [mediawiki-config] - https://gerrit.wikimedia.org/r/267025 (https://phabricator.wikimedia.org/T123932) (owner: Bmansurov)
[00:31:57] <wikibugs>	 operations, Labs: Manual creation of labs account - https://phabricator.wikimedia.org/T125172#1980416 (Cobi) NEW
[00:32:20] <yurik>	 greg-g, being pessimist ensures that life is full of positives ;)
[00:32:51] <ori>	 you are assuming that things turn out better than the pessimist expected, which is an optimistic thing to do
[00:32:52] <yurik>	 and when there are no positives, it was expected )
[00:32:52] <Krenair>	 heh
[00:33:02] <Krenair>	 I wonder if anyone remembers how to let an old SVN account into labs
[00:33:20] <YuviPanda>	 what have I become? why am I reading nginx source on a thursday evening...
[00:33:42] <ebernhardson>	 YuviPanda: could be worse, i got max to read hhvm source on a thursday evening ;)
[00:33:47] <YuviPanda>	 haha
[00:33:48] * yurik googles how to create a firmware virus in qbasic
[00:34:05] <YuviPanda>	 so far I've run into: bugs in nginx, a bug in docker, a bug in my code
[00:34:19] <yurik>	 the holy trinity!
[00:34:25] <Dereckson>	 Krenair: https://phabricator.wikimedia.org/T55793
[00:35:09] <Krenair>	 an RT reference, lovely
[00:35:09] <logmsgbot>	 !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/267025/ (duration: 01m 12s)
[00:35:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:35:24] <Krenair>	  -> https://phabricator.wikimedia.org/T83042
[00:35:49] <Krenair>	 thanks Dereckson 
[00:35:53] <Dereckson>	 You're welcome.
[00:36:23] <Krenair>	 bmansurov, ^
[00:36:44] <bmansurov>	 ok thanks
[00:36:52] <Krenair>	 and jdlrobson ^
[00:37:27] <Krenair>	 second patch is in jenkins
[00:37:32] <jdlrobson>	 thanks Krenair 
[00:38:40] <Krenair>	 while that's going, ebernhardson 
[00:38:52] * jdlrobson waits for jenkins
[00:39:47] <grrrit-wm>	 (CR) MarcoAurelio: "Looks OK except for the minor alphabetical issue in PS3. Deployer should check and run optiPNG in the logo to ensure it displays OK in the" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: MtDu)
[00:39:49] <grrrit-wm>	 (PS4) Alex Monk: Point CirrusSearch queries to local datacenter [mediawiki-config] - https://gerrit.wikimedia.org/r/267053 (owner: EBernhardson)
[00:40:01] <grrrit-wm>	 (CR) Alex Monk: [C: 2] Point CirrusSearch queries to local datacenter [mediawiki-config] - https://gerrit.wikimedia.org/r/267053 (owner: EBernhardson)
[00:40:27] <Krenair>	 ebernhardson, you are still here, right?
[00:40:31] <grrrit-wm>	 (Merged) jenkins-bot: Point CirrusSearch queries to local datacenter [mediawiki-config] - https://gerrit.wikimedia.org/r/267053 (owner: EBernhardson)
[00:40:44] <ebernhardson>	 Krenair: yup
[00:40:48] <Krenair>	 ok, just checking :)
[00:41:35] <Krenair>	 looks like CirrusSearch-common has to go before InitialiseSettings
[00:42:21] <logmsgbot>	 !log krenair@mira Synchronized tests/cirrusTest.php: https://gerrit.wikimedia.org/r/#/c/267053/ (duration: 01m 11s)
[00:42:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:42:29] <ebernhardson>	 Krenair: yes
[00:42:41] <yurik>	 i just looked at the graphoid service - seems like for some strange reason the deployed version is different from the tip of the graphoid deploy. If greg-g is ok with it, I would like to git deploy sync graphoid again. It's not a huge issue, but a number of romanian graphs are not drawing correctly
[00:43:07] <Krenair>	 is graphoid deployed by trebuchet?
[00:43:12] <yurik>	 Krenair, correct
[00:43:27] <greg-g>	 yurik: finnnne
[00:43:36] * yurik gives greg-g a flower
[00:43:47] <logmsgbot>	 !log krenair@mira Synchronized wmf-config/CirrusSearch-common.php: https://gerrit.wikimedia.org/r/#/c/267053/ (duration: 01m 10s)
[00:43:50] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:43:56] <greg-g>	 yurik: I'll turn it into tea
[00:44:28] <Krenair>	 ebernhardson, syncing InitialiseSettings now
[00:44:33] <yurik>	 are you sure i didn't pick a mildly poisonous one?  just enough to make you sleepy?  and in your absence do all sorts of nasty deployments?
[00:44:51] <greg-g>	 yurik: I'll take the rest :)
[00:45:24] <yurik>	 greg-g, and you think that it is me who is a risk taker?!?
[00:45:28] <logmsgbot>	 !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/267053/ (duration: 01m 10s)
[00:45:31] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:45:56] <Krenair>	 well, search still works
[00:46:13] <ebernhardson>	 yea, morelike still works as well
[00:46:18] <ebernhardson>	 (the one switched to codfw)
[00:46:24] <yurik>	 Krenair, should i git deploy now, or wait for you?
[00:46:42] <Krenair>	 I don't think git deploys affect me
[00:46:52] <greg-g>	 yurik: go ahead
[00:47:06] <greg-g>	 let's get things wrapped up before it's EOD, ideally
[00:48:59] <ebernhardson>	 greg-g: technically, i think it's about 10 hours past EOD for yuri (3:45am :P)
[00:49:09] <logmsgbot>	 !log krenair@mira Synchronized php-1.27.0-wmf.11/extensions/MobileFrontend/resources/skins.minerva.editor/init.js: https://gerrit.wikimedia.org/r/#/c/267168/ (duration: 01m 12s)
[00:49:20] <yurik>	 ebernhardson is spying on me!
[00:49:24] <greg-g>	 ebernhardson: my timezone is the only one that matters, I guess
[00:49:27] <ebernhardson>	 :)
[00:49:30] <grrrit-wm>	 (PS2) Alex Monk: Return more like search queries to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/266995 (owner: EBernhardson)
[00:49:39] <grrrit-wm>	 (CR) Alex Monk: [C: 2] Return more like search queries to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/266995 (owner: EBernhardson)
[00:50:07] <grrrit-wm>	 (Merged) jenkins-bot: Return more like search queries to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/266995 (owner: EBernhardson)
[00:50:28] <yurik>	 !log synced latest graphoid
[00:50:31] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:50:31] <yurik>	 all's good
[00:50:33] <yurik>	 thanks!
[00:50:36] <Krenair>	 jdlrobson, syncing
[00:50:38] <Krenair>	 jdlrobson, ^
[00:50:42] <yurik>	 greg-g, ^^^
[00:51:01] <yurik>	 greg-g, and that is why you are moving to east coast :-P
[00:51:22] <Krenair>	 jdlrobson, can you confirm please?
[00:51:31] <yurik>	 to be closer to the proletariat
[00:51:32] <jdlrobson>	 Krenair: on it
[00:51:48] <jdlrobson>	 RTL fixed! yay!
[00:51:52] <Krenair>	 ebernhardson, syncing
[00:52:54] <ebernhardson>	 Krenair: kk
[00:53:00] <logmsgbot>	 !log krenair@mira Synchronized wmf-config/CirrusSearch-production.php: https://gerrit.wikimedia.org/r/#/c/266995/ (duration: 01m 11s)
[00:53:03] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:53:04] <Krenair>	 ebernhardson, ^
[00:54:11] <ebernhardson>	 Krenair: queries look to be working, no log explosion so probably good (also i tested this before)
[00:54:15] <ebernhardson>	 but i'll keep an eye on my dashboards
[00:54:21] <Krenair>	 k
[00:55:39] <grrrit-wm>	 (PS2) Alex Monk: Bump up the QuickSurveys sampling rates for es and fa wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/267071 (https://phabricator.wikimedia.org/T123770) (owner: Bmansurov)
[00:55:43] <Krenair>	 bmansurov, ping
[00:55:46] <bmansurov>	 yes
[00:55:48] <grrrit-wm>	 (CR) Alex Monk: [C: 2] Bump up the QuickSurveys sampling rates for es and fa wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/267071 (https://phabricator.wikimedia.org/T123770) (owner: Bmansurov)
[00:56:13] <grrrit-wm>	 (Merged) jenkins-bot: Bump up the QuickSurveys sampling rates for es and fa wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/267071 (https://phabricator.wikimedia.org/T123770) (owner: Bmansurov)
[00:57:44] <logmsgbot>	 !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/267071/ (duration: 01m 11s)
[00:57:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:58:04] <Krenair>	 bmansurov, ^
[00:58:18] <bmansurov>	 Krenair: looks good
[00:59:14] <grrrit-wm>	 (PS6) Alex Monk: Add sampling rates for mobile web language switcher on beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: Bmansurov)
[00:59:24] <grrrit-wm>	 (CR) Alex Monk: [C: 2] Add sampling rates for mobile web language switcher on beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: Bmansurov)
[00:59:50] <grrrit-wm>	 (Merged) jenkins-bot: Add sampling rates for mobile web language switcher on beta labs [mediawiki-config] - https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: Bmansurov)
[01:01:19] <grrrit-wm>	 (PS3) Alex Monk: Santiago Editatón throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/267186 (https://phabricator.wikimedia.org/T125081) (owner: Dereckson)
[01:01:22] <logmsgbot>	 !log krenair@mira Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/265292/ (duration: 01m 14s)
[01:01:25] <grrrit-wm>	 (CR) Alex Monk: [C: 2] Santiago Editatón throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/267186 (https://phabricator.wikimedia.org/T125081) (owner: Dereckson)
[01:01:26] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:01:29] <Krenair>	 bmansurov, ^
[01:01:39] <bmansurov>	 Krenair: thanks!
[01:01:54] <grrrit-wm>	 (Merged) jenkins-bot: Santiago Editatón throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/267186 (https://phabricator.wikimedia.org/T125081) (owner: Dereckson)
[01:02:57] <bmansurov>	 Krenair: does it take time before I see the change?
[01:03:09] <Krenair>	 bmansurov, yes, beta doesn't receiving our syncs
[01:03:14] <Krenair>	 receive*
[01:03:18] <logmsgbot>	 !log krenair@mira Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/267186/ (duration: 01m 09s)
[01:03:21] <Krenair>	 it automatically updates every so often
[01:03:23] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:03:23] <Krenair>	 Dereckson, fyi ^
[01:03:27] <Krenair>	 greg-g, think that's it
[01:03:31] <bmansurov>	 Krenair: ok thanks
[01:03:43] <Krenair>	 bmansurov, it should happen soon (tm)
[01:03:54] <bmansurov>	 fingers crossed
[01:03:57] <greg-g>	 Krenair: thank you muchly
[01:04:36] <Dereckson>	 Thanks for the deploy Krenair.
[01:05:27] <greg-g>	 alright, I'm going afk for the evening, thanks all
[01:06:01] <grrrit-wm>	 (PS1) MaxSem: Reduce Kafka timeouts [mediawiki-config] - https://gerrit.wikimedia.org/r/267200 (https://phabricator.wikimedia.org/T125084)
[01:07:33] <grrrit-wm>	 (PS1) Ori.livneh: Revert "Revert "Autopromotion: remove deprecated onView event, fix INGROUPS"" [mediawiki-config] - https://gerrit.wikimedia.org/r/267201
[01:07:40] <grrrit-wm>	 (CR) Ori.livneh: [C: 2] Revert "Revert "Autopromotion: remove deprecated onView event, fix INGROUPS"" [mediawiki-config] - https://gerrit.wikimedia.org/r/267201 (owner: Ori.livneh)
[01:08:25] <tto>	 Something weird's going on. Page categorization events are showing up on enwiki watchlists, but $wgRCWatchCategoryMembership is still set to false on enwiki
[01:08:30] <tto>	 Any idea what's up there?
[01:08:39] <grrrit-wm>	 (Merged) jenkins-bot: Revert "Revert "Autopromotion: remove deprecated onView event, fix INGROUPS"" [mediawiki-config] - https://gerrit.wikimedia.org/r/267201 (owner: Ori.livneh)
[01:08:58] <wikibugs>	 operations, Labs: Manual creation of labs account - https://phabricator.wikimedia.org/T125172#1980554 (Krenair) Instructions in T83042, LDAP admins CC'd
[01:09:14] <Krenair>	 tto, it was enabled and disabled
[01:09:38] <tto>	 So the events are still in the watchlist table, then. Right
[01:09:53] <tto>	 (or recentchanges or whatever you call it)
[01:10:01] <Krenair>	 that's not quite how the watchlist works, but sure
[01:10:06] <Krenair>	 yes
[01:21:15] <grrrit-wm>	 (CR) MtDu: "I ran optipng on the logo before I pushed the patch. Is that enough or what else do I need to do?" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: MtDu)
[01:24:33] <icinga-wm>	 PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[01:25:43] <icinga-wm>	 PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/).
[01:26:58] <grrrit-wm>	 (PS2) Ori.livneh: Enable persistent redis connections for job runners [mediawiki-config] - https://gerrit.wikimedia.org/r/261306 (owner: Aaron Schulz)
[01:27:23] <grrrit-wm>	 (CR) Ori.livneh: [C: 2] "“Oh well," McWatt sang, "what the hell.”" [mediawiki-config] - https://gerrit.wikimedia.org/r/261306 (owner: Aaron Schulz)
[01:27:47] <grrrit-wm>	 (Merged) jenkins-bot: Enable persistent redis connections for job runners [mediawiki-config] - https://gerrit.wikimedia.org/r/261306 (owner: Aaron Schulz)
[01:29:22] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[01:29:53] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[01:31:56] <logmsgbot>	 !log ori@mira Synchronized wmf-config: I83da57cf: Enable persistent redis connections for job runners (duration: 01m 11s)
[01:31:59] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:02:59] <Krenair>	 mkdir: cannot create directory ‘/sys/fs/cgroup/memory/mediawiki/job/13186’: File exists
[02:03:00] <Krenair>	 limit.sh: failed to create the cgroup.
[02:03:00] <Krenair>	 sigh
[02:03:03] <Krenair>	 didn't this get fixed once
[02:03:06] <Krenair>	 (silver)
[02:11:03] <icinga-wm>	 PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[02:14:33] <icinga-wm>	 RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:15:22] <grrrit-wm>	 (CR) Andrew Bogott: [C: 2] Define wgOpenStackManagerProject [puppet] - https://gerrit.wikimedia.org/r/267192 (https://phabricator.wikimedia.org/T115029) (owner: Andrew Bogott)
[02:17:35] <grrrit-wm>	 (PS3) Rush: diamond: nfsiostat as a collector [puppet] - https://gerrit.wikimedia.org/r/267193
[02:20:45] <grrrit-wm>	 (PS4) Rush: diamond: nfsiostat as a collector [puppet] - https://gerrit.wikimedia.org/r/267193
[02:23:22] <grrrit-wm>	 (CR) Rush: [C: 2] diamond: nfsiostat as a collector [puppet] - https://gerrit.wikimedia.org/r/267193 (owner: Rush)
[02:25:29] <logmsgbot>	 !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 10m 40s)
[02:25:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:26:01] <grrrit-wm>	 (PS1) Rush: diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - https://gerrit.wikimedia.org/r/267204
[02:26:36] <grrrit-wm>	 (PS2) Rush: diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - https://gerrit.wikimedia.org/r/267204
[02:32:56] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jan 29 02:32:56 UTC 2016 (duration 7m 28s)
[02:33:00] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:47] <grrrit-wm>	 (PS3) Rush: diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - https://gerrit.wikimedia.org/r/267204
[02:35:37] <grrrit-wm>	 (PS4) Rush: diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - https://gerrit.wikimedia.org/r/267204
[02:35:48] <grrrit-wm>	 (PS5) Rush: diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - https://gerrit.wikimedia.org/r/267204
[02:35:54] <icinga-wm>	 PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0]
[02:38:10] <grrrit-wm>	 (CR) Rush: [C: 2] diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - https://gerrit.wikimedia.org/r/267204 (owner: Rush)
[02:42:54] <icinga-wm>	 RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:57:07] <grrrit-wm>	 (PS2) Rush: diamond: monitor nscd behavior for ldap clients [puppet] - https://gerrit.wikimedia.org/r/265847
[03:17:19] <wikibugs>	 operations, OTRS, HTTPS: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1980721 (Matthewrbowker)
[04:13:42] <icinga-wm>	 PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0]
[04:28:41] <grrrit-wm>	 (PS1) BBlack: dnsrecursor: add localhost data [puppet] - https://gerrit.wikimedia.org/r/267208
[04:36:13] <grrrit-wm>	 (CR) Subramanya Sastry: "Right now, it exists because Tim built it on ruthenium using the /srv/uprightdiff repo that has been checked out via puppet. In the future" [puppet] - https://gerrit.wikimedia.org/r/267190 (owner: Subramanya Sastry)
[04:41:53] <icinga-wm>	 RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[04:49:58] <grrrit-wm>	 (CR) Mobrovac: [C: 1] "I see. This is not the happiest solution, but it'll work for the time being." [puppet] - https://gerrit.wikimedia.org/r/267190 (owner: Subramanya Sastry)
[05:08:31] <grrrit-wm>	 (PS2) Yuvipanda: dnsrecursor: add localhost data [puppet] - https://gerrit.wikimedia.org/r/267208 (https://phabricator.wikimedia.org/T125170) (owner: BBlack)
[05:49:10] <wctaiwan>	 I'm getting logged out within the same browser session again. It seemed to go away for a few days after the first day it happened, but it's returned.
[05:53:21] <p858snake|_>	 repeatedly? all user sessions were killed the other day for security reasons
[05:59:12] <wctaiwan>	 the session is limited to today.
[06:00:23] <icinga-wm>	 PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50430 bytes in 0.133 second response time
[06:00:42] <icinga-wm>	 PROBLEM - HHVM rendering on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50430 bytes in 0.101 second response time
[06:07:24] <icinga-wm>	 RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.058 second response time
[06:07:43] <icinga-wm>	 RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 70151 bytes in 0.160 second response time
[06:30:13] <icinga-wm>	 PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:13] <icinga-wm>	 PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:33] <icinga-wm>	 PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:34] <icinga-wm>	 PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:52] <icinga-wm>	 PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:31:33] <icinga-wm>	 PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:42] <icinga-wm>	 PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:12] <icinga-wm>	 PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:39:28] <bd808>	 wctaiwan: can you get us some details of the sessions this is happening on?
[06:39:48] <grrrit-wm>	 (CR) Luke081515: [C: 1] Enable WikidataPageBanner on es.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000) (owner: Dereckson)
[06:40:08] <wctaiwan>	 what kind of details? I'm using Firefox (latest stable); I accept session cookies, but not third party ones. Cookies are not kept beyond the current session.
[06:40:43] <wctaiwan>	 Indication that it's happening is that I would be logged out, but when I go to login I'd see my username filled in, which it wouldn't be had I not been logged in (unless I'd logged out, but I generally don't bother to).
[06:42:23] <wctaiwan>	 it seems to happen after a period of inactivity? I don't recall ever being logged in and navigating to another page to find that I'd been logged out. But I'm not 100% sure that's not a coincidence.
[06:43:02] <bd808>	 are you moving across wikis? Logging in to a particular wiki?
[06:43:46] <bd808>	 we are having some vaguely similar reports here -- https://phabricator.wikimedia.org/T124252#1979688
[06:43:48] <wctaiwan>	 hmm, that's a good point, actually. I might have logged in on meta and not enwiki. In which case it's PEBKAC.
[06:43:54] <bd808>	 but no actionable details yet
[06:44:51] <wctaiwan>	 okay, I think that's unlikely, since the username wouldn't be pre-filled on enwiki if I logged in on meta (I just tested).
[06:45:15] <bd808>	 logging in on meta and then being logged in on enwiki should generally work for sure
[06:45:54] <bd808>	 assuming that you got all the interactions with loginwiki either via the 3rd party cookie + javascript or by the 1x1 images
[06:46:01] <wctaiwan>	 well, not for me, since meta wouldn't be able to set a cookie for *.wikipedia.org
[06:46:11] <wctaiwan>	 yeah, I wouldn't have, because I block third-party cookies.
[06:46:37] <bd808>	 right. that's the scenario that the images are meant to work with
[06:46:51] <bd808>	 I block 3rd party cookies too
[06:46:57] <wctaiwan>	 I think Firefox catches those. Otherwise it'd be trivial to work around its tracking protection.
[06:47:37] <bd808>	 wctaiwan: are you running incognito too?
[06:47:41] <wctaiwan>	 yes.
[06:47:50] <wctaiwan>	 http://i.imgur.com/8pZHgIS.png are my privacy settings in firefox
[06:47:56] <bd808>	 ah. that may certainly play into this
[06:48:45] <wctaiwan>	 Yeah, it could be related. But this is difficult to pin down because steps for reproduction would be "log in, stop looking at wikipedia, and wait for a few hours and then remember to check"
[06:49:01] <wctaiwan>	 I'm not even sure at this point I'm just logging out and forgetting I did.
[06:49:21] <wctaiwan>	 s/I'm just/if I'm not just/
[06:49:59] <bd808>	 wctaiwan: I think it's worth filing a bug about with the description you have given thus far
[06:50:20] <wctaiwan>	 sure, I can do that. Anything I should look for next time I suspect it's happening?
[06:50:22] <icinga-wm>	 PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:51:17] <bd808>	 getting the cookies that you have on when you suspect you've been logged out would be good. Having the cookies form before that as well would be even better
[06:51:54] <bd808>	 s/form/from/
[06:52:39] <wctaiwan>	 Hmm, I don't think Firefox shows any cookies when you're using private browsing :/
[06:54:38] <bd808>	 they should show in the developer tools
[06:55:00] <wctaiwan>	 nope
[06:55:00] <wctaiwan>	 https://bugzilla.mozilla.org/show_bug.cgi?id=823941
[06:55:12] <icinga-wm>	 RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:19] <wctaiwan>	 anyway, I'll file the bug. Thanks.
[06:56:29] <bd808>	 wctaiwan: I'm looking at cookies attached to a GET of enwiki in an incognito FF 44 session right now
[06:56:42] <icinga-wm>	 RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:56:54] <icinga-wm>	 RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:56:54] <bd808>	 but I'm looking specifically at the response in the network tab
[06:56:57] <wctaiwan>	 ohh
[06:57:03] <wctaiwan>	 I was looking in the storage tab
[06:57:14] <icinga-wm>	 RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:57:35] <wctaiwan>	 okay, I'll try to get that then.
[06:57:54] <icinga-wm>	 RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:57] <bd808>	 cool. thanks for reporting and being willing to help debug a bit
[06:58:05] <icinga-wm>	 RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:09] <wctaiwan>	 np. thanks for looking into it.
[06:58:23] <icinga-wm>	 RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:32] <bd808>	 wctaiwan: please cc me on the bug you file
[06:58:33] <icinga-wm>	 RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:37] <wctaiwan>	 will do
[07:16:52] <icinga-wm>	 RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[08:24:54] <wikibugs>	 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM, 5Patch-For-Review: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1980984 (10Joe) I think we should just backport this patch to our current package while we are confident releasing a new one. This...
[08:45:03] <icinga-wm>	 PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:46:42] <icinga-wm>	 RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 70151 bytes in 0.307 second response time
[09:07:23] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Update debian-targets patch for 1.0.2f [debs/openssl] - 10https://gerrit.wikimedia.org/r/267218 
[09:07:47] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Update debian-targets patch for 1.0.2f [debs/openssl] - 10https://gerrit.wikimedia.org/r/267218 (owner: 10Muehlenhoff)
[09:20:55] <grrrit-wm>	 (03Abandoned) 10Ema: eqiad: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266503 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema)
[09:26:11] <addshore>	 Krenair: still awake? 
[09:26:40] <addshore>	 *highly doubts it*
[09:32:05] <addshore>	 or greg-g (I'm trying to hunt down the reason for https://gerrit.wikimedia.org/r/#/c/267189/)
[09:38:53] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0]
[09:48:44] <grrrit-wm>	 (03CR) 10Addshore: "At a guess the revert is due to https://phabricator.wikimedia.org/T125147" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267189 (owner: 10Alex Monk)
[09:49:23] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[09:59:05] <grrrit-wm>	 (03CR) 10Addshore: [C: 04-1] "Per https://phabricator.wikimedia.org/T125147" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 (owner: 10Addshore)
[09:59:11] <grrrit-wm>	 (03CR) 10Addshore: [C: 04-1] "Per https://phabricator.wikimedia.org/T125147" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264734 (owner: 10Addshore)
[10:01:54] <wikibugs>	 6operations, 10DBA: upgrade db servers to jessie - https://phabricator.wikimedia.org/T125028#1981051 (10jcrespo) 5Open>3Invalid a:3jcrespo As per Ops Meeting.
[10:01:56] <wikibugs>	 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1981054 (10jcrespo)
[10:02:53] <icinga-wm>	 RECOVERY - Disk space on ms-be2015 is OK: DISK OK
[10:05:21] <wikibugs>	 6operations: upgrade iron to jessie (or get rid of it) - https://phabricator.wikimedia.org/T125025#1981064 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff
[10:07:08] <wikibugs>	 6operations: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1981067 (10Legoktm) >>! In T125164#1980139, @Dzahn wrote: > - add a custom header file, so we display "Wikimedia Software Releases" >   instead of just "Index of /" >   (https://httpd.apache.org/docs/2.4/mod/m...
[10:15:13] <icinga-wm>	 RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:17:17] <moritzm>	 !log rolling restart of swift in esams
[10:17:19] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:20:15] <wikibugs>	 6operations, 10ops-codfw: ms-be2015.codfw.wmnet: slot=8 dev=sdi failed - https://phabricator.wikimedia.org/T124056#1981088 (10fgiunchedi) 5Open>3Resolved disk rebuilding, resolving
[10:23:10] <wikibugs>	 6operations: Move bacula director and storage daemon off helium? - https://phabricator.wikimedia.org/T123723#1981097 (10akosiaris) The storage daemon must be on hardware as it needs access to the disk shelf. The big hurdle to having the storage daemon to a VM is access to a lot of disk space which right now is s...
[10:25:26] <wikibugs>	 6operations, 10MediaWiki-General-or-Unknown, 5Patch-For-Review: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1981104 (10akosiaris) I think we should merge T32452 in this one
[10:31:13] <icinga-wm>	 PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet last ran 21 hours ago
[10:31:43] <icinga-wm>	 RECOVERY - salt-minion processes on mw1120 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:32:10] <wikibugs>	 6operations, 10ops-eqiad: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1981110 (10elukey) 3NEW
[10:33:03] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: Add support for float timeouts in socket streams [debs/hhvm] - 10https://gerrit.wikimedia.org/r/267228 (https://phabricator.wikimedia.org/T125084) 
[10:35:12] <icinga-wm>	 PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 845
[10:35:22] <icinga-wm>	 RECOVERY - RAID on ms-be2003 is OK: OK: optimal, 13 logical, 13 physical
[10:35:27] <wikibugs>	 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1981122 (10akosiaris) Just sure of something. Currently the maps cache cluster is 2 boxes and is performing quite well (with minimal load). 4 does not sound bad to me, but do we have any numbers...
[10:36:14] <wikibugs>	 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1981123 (10fgiunchedi) 3NEW
[10:36:42] <icinga-wm>	 RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[10:37:13] <wikibugs>	 6operations, 7Swift: swift: puppetized mkfs/parted fails on ms-be2003, ms-be2015 / disk error - https://phabricator.wikimedia.org/T125013#1981129 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi re: ms-be2003 is {T125200} and ms-be2015 was {T124056} resolving this one, thanks @dzahn !
[10:38:01] <grrrit-wm>	 (03PS2) 10BBlack: eqiad: remove most mobile frontends from cache_mobile [puppet] - 10https://gerrit.wikimedia.org/r/267160 (https://phabricator.wikimedia.org/T109286) 
[10:38:03] <grrrit-wm>	 (03PS1) 10BBlack: eqiad: remove last cache_mobile frontend [puppet] - 10https://gerrit.wikimedia.org/r/267230 (https://phabricator.wikimedia.org/T122651) 
[10:40:12] <icinga-wm>	 RECOVERY - check_mysql on db1008 is OK: Uptime: 846116 Threads: 2 Questions: 5842269 Slow queries: 5686 Opens: 2405 Flush tables: 2 Open tables: 429 Queries per second avg: 6.904 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[10:41:05] <grrrit-wm>	 (03PS3) 10BBlack: eqiad: remove most mobile frontends from cache_mobile [puppet] - 10https://gerrit.wikimedia.org/r/267160 (https://phabricator.wikimedia.org/T109286) 
[10:41:30] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] eqiad: remove most mobile frontends from cache_mobile [puppet] - 10https://gerrit.wikimedia.org/r/267160 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack)
[10:42:49] <grrrit-wm>	 (03CR) 10BBlack: [C: 04-1] "Needs to wait for ok from analytics, probably Monday" [puppet] - 10https://gerrit.wikimedia.org/r/267230 (https://phabricator.wikimedia.org/T122651) (owner: 10BBlack)
[10:43:51] <grrrit-wm>	 (03PS4) 10BBlack: cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) 
[10:44:00] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack)
[10:44:15] <grrrit-wm>	 (03PS4) 10BBlack: cache_parsoid: remove restbase pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266748 (https://phabricator.wikimedia.org/T110475) 
[10:44:25] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] cache_parsoid: remove restbase pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266748 (https://phabricator.wikimedia.org/T110475) (owner: 10BBlack)
[10:46:29] <wikibugs>	 6operations, 10RESTBase, 6Services, 10Traffic, 5Patch-For-Review: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1981161 (10BBlack) 5Open>3Resolved a:3BBlack
[10:46:33] <wikibugs>	 6operations, 6Services, 10Traffic: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1981163 (10BBlack)
[10:46:37] <wikibugs>	 6operations, 6Services, 10Traffic: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1578453 (10BBlack)
[10:47:16] <wikibugs>	 6operations, 5Patch-For-Review, 7Swift: swift upgrade plans - https://phabricator.wikimedia.org/T117972#1981172 (10fgiunchedi) FWIW swift 2.6 has been released 4 days ago, https://github.com/openstack/swift/blob/master/CHANGELOG
[10:50:57] <wikibugs>	 6operations: upgrade swift servers from precise to jessie - https://phabricator.wikimedia.org/T125024#1981174 (10fgiunchedi) note these might get upgraded to trusty first, see also related {T117972}
[10:54:31] <grrrit-wm>	 (03PS1) 10Jcrespo: Add mysql grants for racktables [puppet] - 10https://gerrit.wikimedia.org/r/267232 
[10:54:36] <wikibugs>	 6operations: Move bacula director and storage daemon off helium? - https://phabricator.wikimedia.org/T123723#1981176 (10akosiaris) 5Open>3declined a:3akosiaris Declining per IRC OK from @MoritzMuehlenhoff
[10:55:16] <grrrit-wm>	 (03Abandoned) 10Jcrespo: Racktables user placeholder [puppet] - 10https://gerrit.wikimedia.org/r/254288 (owner: 10Jcrespo)
[10:56:25] <grrrit-wm>	 (03PS2) 10Jcrespo: Add mysql grants for racktables [puppet] - 10https://gerrit.wikimedia.org/r/267232 
[10:57:39] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Add mysql grants for racktables [puppet] - 10https://gerrit.wikimedia.org/r/267232 (owner: 10Jcrespo)
[11:10:41] <grrrit-wm>	 (03PS1) 10BBlack: MW parsoid URLs: s/parsoidcache/parsoid/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267234 (https://phabricator.wikimedia.org/T110472) 
[11:12:27] <wikibugs>	 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1981202 (10Yurik) The 4x4 varnishes + 4x2 backends was initially suggested by @bblack as the minimal platform to serve all of Wikipedias.  I tried stress-testing maps from multiple labs instances...
[11:12:39] <Steinsplitter>	 robh about?
[11:13:08] <Steinsplitter>	 jynus?
[11:14:08] <Steinsplitter>	 it is urgent.
[11:14:19] <elukey>	 !log disabled puppet on analytics1027 due to issues with Camus and HDFS
[11:14:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:14:51] <Steinsplitter>	 BBlack 
[11:15:53] <jynus>	 Steinsplitter, yes?
[11:16:19] <_joe_>	 Steinsplitter: if something is urgent, I guess there is an UBN! ticket on phabricator, right?
[11:16:38] <Steinsplitter>	 _joe_ can you please check how many mass messages are queued on commons?
[11:16:46] <Steinsplitter>	 and if there are RIP ones.
[11:17:46] <_joe_>	 I have no internal knowledge of that extension, it would take quite some time and I'm already working on an UBN! ticket right now
[11:18:05] <_joe_>	 I can take a look at the jobqueue, if that's what that extension uses
[11:18:13] <Steinsplitter>	 maybe jynus knows how
[11:18:28] <Steinsplitter>	 yes, they are in the jobqueue
[11:18:49] <_joe_>	 the jobqueues are in good health atm
[11:18:56] <_joe_>	 https://grafana.wikimedia.org/dashboard/db/job-queue-health
[11:19:25] <_joe_>	 also https://grafana.wikimedia.org/dashboard/db/job-queue-rate
[11:19:32] <Steinsplitter>	 strange :/
[11:19:40] <Steinsplitter>	 will file a bug on phab
[11:19:50] <Steinsplitter>	 thx
[11:20:22] <_joe_>	 Steinsplitter: but this is by no means a measure of that extension working correctly
[11:22:13] <moritzm>	 !log rolling restart of swift in codfw
[11:22:16] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:26:11] <wikibugs>	 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1981220 (10BBlack) Here's a better list:  cxserver deploy: https://github.com/wikimedia/mediawiki-services-cxser...
[11:32:18] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi broken disk, https://phabricator.wikimedia.org/T125200
[11:35:11] <wikibugs>	 6operations, 5Patch-For-Review: move racktables to a VM - https://phabricator.wikimedia.org/T105555#1981235 (10jcrespo)
[11:35:13] <wikibugs>	 6operations, 10DBA: mysql privs: restrict access to racktables to krypton - https://phabricator.wikimedia.org/T118816#1981233 (10jcrespo) 5Open>3Resolved Done on https://gerrit.wikimedia.org/r/#/c/267232/
[11:38:40] <wikibugs>	 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Search-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1981236 (10akosiaris) >>! In T120281#1971185, @EBernhardson wrote: > Another option for analytics<->codfw that me and @S...
[11:40:13] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1228 is CRITICAL: Connection refused by host
[11:40:33] <icinga-wm>	 PROBLEM - Disk space on mw1228 is CRITICAL: Connection refused by host
[11:40:42] <icinga-wm>	 PROBLEM - DPKG on mw1228 is CRITICAL: Connection refused by host
[11:40:53] <icinga-wm>	 PROBLEM - salt-minion processes on mw1228 is CRITICAL: Connection refused by host
[11:41:12] <icinga-wm>	 PROBLEM - NTP on mw1228 is CRITICAL: NTP CRITICAL: No response from NTP server
[11:41:13] <icinga-wm>	 PROBLEM - configured eth on mw1228 is CRITICAL: Connection refused by host
[11:41:14] <icinga-wm>	 PROBLEM - dhclient process on mw1228 is CRITICAL: Connection refused by host
[11:41:24] <icinga-wm>	 PROBLEM - RAID on mw1228 is CRITICAL: Connection refused by host
[11:41:34] <icinga-wm>	 PROBLEM - nutcracker process on mw1228 is CRITICAL: Connection refused by host
[11:41:53] <icinga-wm>	 PROBLEM - nutcracker port on mw1228 is CRITICAL: Connection refused by host
[11:43:17] <_joe_>	 !log uploaded hhvm_3.6.5+dfsg1-1+wm8 to trusty-wikimedia
[11:43:20] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:49:06] <wikibugs>	 6operations, 6Collaboration-Team-Backlog, 10Flow: Flow messages are not editable and new topics can't be posted (API outage) - https://phabricator.wikimedia.org/T125080#1981245 (10mark) 5Open>3Resolved a:3mark We have no indication that anything is wrong other than some brief effects shortly after the...
[11:50:57] <grrrit-wm>	 (03PS1) 10KartikMistry: CX: Remove ContentTranslationCorpora setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267236 
[11:53:31] <wikibugs>	 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1981249 (10BBlack) I don't think I've ever made recommendations about the backend service, just the 4x4 cache/termination layer.  That part isn't really a "suggestion", it's an operational minimu...
[11:56:07] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: package_builder: Improve README.md networking part [puppet] - 10https://gerrit.wikimedia.org/r/267237 
[11:58:58] <wikibugs>	 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1981251 (10akosiaris) OK, that answers my question. Thanks!
[12:02:00] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: package_builder: Improve README.md networking part [puppet] - 10https://gerrit.wikimedia.org/r/267237 
[12:02:18] <wikibugs>	 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM, 5Patch-For-Review: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1981252 (10Joe) Patch applied and new package built.  The package was installed on labs and my test of fsockopen now shows that th...
[12:03:56] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: Improve README.md networking part [puppet] - 10https://gerrit.wikimedia.org/r/267237 (owner: 10Alexandros Kosiaris)
[12:18:23] <icinga-wm>	 PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:18:53] <icinga-wm>	 PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:20:02] <icinga-wm>	 RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.176 second response time
[12:21:35] <wikibugs>	 6operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 2 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#1981274 (10BBlack) 3NEW
[12:22:23] <icinga-wm>	 RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 70567 bytes in 0.961 second response time
[12:22:37] <icinga-wm>	 PROBLEM - Corp OIT LDAP Mirror on pollux is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:23:11] <paravoid>	 hey
[12:23:12] <jynus>	 ldap down
[12:23:21] <jynus>	 ?
[12:23:25] <paravoid>	 anyone investigating it?
[12:23:31] <jynus>	 I will now
[12:23:42] <icinga-wm>	 PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.523 second response time
[12:23:46] <paravoid>	 we're at lunch, but can open a laptop if needed
[12:24:12] <icinga-wm>	 PROBLEM - salt-minion processes on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:24:13] <jynus>	 I think the whole server is down, I can handle for now
[12:24:23] <icinga-wm>	 PROBLEM - RAID on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:24:33] <icinga-wm>	 PROBLEM - dhclient process on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:25:23] <icinga-wm>	 RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.046 second response time
[12:25:48] <jynus>	 oh, it is a virtual server
[12:26:03] <icinga-wm>	 RECOVERY - RAID on pollux is OK: OK: no RAID installed
[12:28:04] <paravoid>	 should I open my laptop?
[12:28:11] <jynus>	 yes, probably
[12:29:09] <paravoid>	 here now
[12:29:48] <jynus>	 I do not have the dns list downloaded
[12:30:40] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Add base::firewall to jobrunners mw1161-mw1169 (reprovisioned app servers) [puppet] - 10https://gerrit.wikimedia.org/r/267238 
[12:30:57] <paravoid>	 !log force-rebooting pollux
[12:31:00] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:33:12] <icinga-wm>	 RECOVERY - salt-minion processes on pollux is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[12:33:28] <icinga-wm>	 RECOVERY - Corp OIT LDAP Mirror on pollux is OK: LDAP OK - 0.118 seconds response time
[12:33:29] <icinga-wm>	 RECOVERY - dhclient process on pollux is OK: PROCS OK: 0 processes with command name dhclient
[12:33:38] <paravoid>	 ok
[12:33:41] <paravoid>	 back to lunch now :)
[12:33:46] <paravoid>	 ttyl!
[12:34:15] <moritzm>	 paravoid: enjoy your lunch! pollux is one of the ganeti vms which doesn't have the aio workaround yet, it was probably that: https://etherpad.wikimedia.org/p/disk_aio_setting
[12:37:22] <paravoid>	 yeah I figured
[12:37:47] <paravoid>	 i know about the aio issue, I've spoken with the debian maint and qemu upstreams btw
[12:38:05] <paravoid>	 the recommendation was to upgrade to latest qemu
[12:42:19] <wikibugs>	 6operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 2 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#1981316 (10BBlack) For the record, salt on all hosts (which says it hit 1210 machines) gives this list for machines with 4-digit+ kern.log alerts presently:  ``` {'a...
[13:16:45] <wikibugs>	 6operations, 6Services, 10Traffic, 5Patch-For-Review: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1981369 (10BBlack) FWIW, in a 1 hour snapshot of all traffic to parsoidcache (regardless of internal vs external IPs), when varnish/pybal monitoring checks are excluded, we're left w...
[13:18:18] <grrrit-wm>	 (03PS1) 10Aude: Update WikidataBuildResources git source (github -> gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) 
[13:21:01] <grrrit-wm>	 (03PS1) 10Mforns: Remove kafka1012 from EventLogging brokers array [puppet] - 10https://gerrit.wikimedia.org/r/267243 (https://phabricator.wikimedia.org/T125199) 
[13:21:39] <wikibugs>	 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1981389 (10Yurik) @bblack, thanks for the link.  The purpose of this task is exactly that - to have enough hardware for this service to gain a full production status.
[13:28:23] <icinga-wm>	 PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail
[13:43:46] <_joe_>	 !log installing the new HHVM package to the canary appservers (main and api)
[13:43:48] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:51:44] <grrrit-wm>	 (03CR) 10Alex Monk: "no, T125209" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267189 (owner: 10Alex Monk)
[13:52:40] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 04-1] "T125209" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 (owner: 10Addshore)
[13:52:56] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 04-1] "T125209" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264734 (owner: 10Addshore)
[13:54:43] <icinga-wm>	 RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[14:03:59] <_joe_>	 !log installing the new hhvm package on all the codfw appservers
[14:04:02] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:04:44] <grrrit-wm>	 (03CR) 10Luke081515: [C: 04-1] Configure default Echo subscriptions user options on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: 10Dereckson)
[14:05:33] <yurik>	 bblack, is there anything i can help with for the mobile->desktop?
[14:13:45] <bblack>	 yurik: no, we're pretty close to done, but we need to wait on analytics, and they're busy with other issues
[14:14:22] <yurik>	 bblack, thanks for pushing it forward :)
[14:24:32] <wikibugs>	 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1981540 (10BBlack) Status update: We're pretty much done with the cache traffic migration, but there's still 1x eqiad mobile cache (cp1060) pooled with low weight to keep mobile...
[14:24:46] <grrrit-wm>	 (03PS1) 10Mark Bergsma: Add BGP MED support [debs/pybal] - 10https://gerrit.wikimedia.org/r/267251 
[14:24:46] <moritzm>	 !log rebooting bohrium for kernel update
[14:24:49] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:25:37] <grrrit-wm>	 (03CR) 10Mark Bergsma: [C: 032] Add BGP MED support [debs/pybal] - 10https://gerrit.wikimedia.org/r/267251 (owner: 10Mark Bergsma)
[14:26:01] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add support for float timeouts in socket streams [debs/hhvm] - 10https://gerrit.wikimedia.org/r/267228 (https://phabricator.wikimedia.org/T125084) (owner: 10Giuseppe Lavagetto)
[14:27:39] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032] Add IPv6 support to all monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/267008 (owner: 10Mark Bergsma)
[14:35:06] <wikibugs>	 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1981548 (10Ottomata) Ok great!  We’re having some issues with jobs right now due to some Kafka problems, and we’ll want to make sure everything is fine before we try to move on t...
[14:36:30] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: scap: re-add servers to mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/267252 (https://phabricator.wikimedia.org/T124642) 
[14:37:02] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] scap: re-add servers to mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/267252 (https://phabricator.wikimedia.org/T124642) (owner: 10Giuseppe Lavagetto)
[14:39:29] <elukey>	 !log stopped kafka (service) on kafka1012 (the host that caused the outage)
[14:39:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:41:46] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1981554 (10elukey) Kafka stopped on the node, no more services actively running on it.
[14:42:30] <icinga-wm>	 PROBLEM - Kafka Broker Server on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties
[14:43:16] <elukey>	 ----^ sorry it's me, turning down icinga
[14:43:28] <_joe_>	 schedule downtime :)
[14:43:37] <jynus>	 well, kafka didn't cause the outage, mediawiki did :-)
[14:43:53] <_joe_>	 jynus: hhvm did
[14:43:57] <_joe_>	 not mediawiki
[14:43:57] <jynus>	 :-)
[14:45:03] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1981558 (10Ottomata) @cmjohnson are you in the DC today?  Can we get this disk swapped asap? Thanks!
[14:45:10] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1981560 (10Ottomata) p:5Triage>3High
[14:46:26] <icinga-wm>	 ACKNOWLEDGEMENT - Host ms-be2003 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff Host didnt come up after reboot, stuck in BIOS
[14:47:22] <jynus>	 _joe_, do you want me to take T124642 or do you want to do it yourself?
[14:47:42] <_joe_>	 jynus: I'm almost done
[14:47:51] <_joe_>	 see my change up there :)
[14:48:17] <_joe_>	 but if you want to practice pooling a server, be my guest :))
[14:48:51] <jynus>	 let me at least resync them for you
[14:49:04] <_joe_>	 I am almost done with that too
[14:49:12] <jynus>	 you are too fast
[14:49:30] <_joe_>	 we just need to repool them
[14:50:02] <_joe_>	 which means a) readding them to conftool-data
[14:50:41] <wikibugs>	 6operations, 10Traffic: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#1981574 (10ema)
[14:50:43] <wikibugs>	 6operations, 10Traffic: Forward-port Varnish 3 patches to Varnish 4 - https://phabricator.wikimedia.org/T124277#1981572 (10ema) 5Open>3Resolved Some of the patches have to been tackled in https://phabricator.wikimedia.org/T124281. Some other patches are not needed anymore. The remaining ones have been forw...
[14:52:39] <grrrit-wm>	 (03PS1) 10Giuseppe Lavagetto: conftool: re-pool recovered servers [puppet] - 10https://gerrit.wikimedia.org/r/267256 (https://phabricator.wikimedia.org/T124642) 
[14:53:25] <jynus>	 is that live now?
[14:53:36] <_joe_>	 what?
[14:53:43] <_joe_>	 the servers are still depooled
[14:53:45] <jynus>	 conftool
[14:53:53] <_joe_>	 yes, see ops@ emails
[14:53:53] <jynus>	 not the patch, the tool for eqiad
[14:54:04] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 68.00% of data above the critical threshold [5000000.0]
[14:54:14] <icinga-wm>	 PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 69.57% of data above the critical threshold [5000000.0]
[14:54:51] <grrrit-wm>	 (03PS2) 10Giuseppe Lavagetto: conftool: re-pool recovered servers [puppet] - 10https://gerrit.wikimedia.org/r/267256 (https://phabricator.wikimedia.org/T124642) 
[14:55:24] <jynus>	 I can help, then, by correcting documentation
[14:56:05] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] conftool: re-pool recovered servers [puppet] - 10https://gerrit.wikimedia.org/r/267256 (https://phabricator.wikimedia.org/T124642) (owner: 10Giuseppe Lavagetto)
[14:56:26] <jynus>	 (I know you did the main thing, but there are many other places)
[14:57:01] <_joe_>	 jynus: well, check the docs, it should be ok now, I updated it this morning :)
[14:57:22] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 65.22% of data above the critical threshold [10.0]
[14:57:52] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [10.0]
[14:57:53] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [10.0]
[14:57:53] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 59.09% of data above the critical threshold [10.0]
[14:58:31] <elukey>	 ---^ kafka is not happy about the loss of 1012 :)
[14:59:14] <_joe_>	 elukey: expected I guess?
[14:59:30] <ottomata>	 kafka is ok with it, just a little cranky
[15:00:17] <elukey>	 _joe_: yep sorry
[15:00:47] <elukey>	 ottomata: we could think about reducing the false positives
[15:01:12] <jynus>	 it is not the only place, there are links with high visibility such as https://wikitech.wikimedia.org/wiki/Depooling_servers
[15:01:20] <ottomata>	 false positives?
[15:01:22] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 77.27% of data above the critical threshold [10.0]
[15:01:22] <elukey>	 !log re-enabled puppet on analytics1027
[15:01:26] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:01:52] <elukey>	 these alarms are legitimate ones but we are basically dropping them
[15:02:05] <ottomata>	 ah, yeah, if we did dependencies in icinga somehow properly
[15:02:11] <ottomata>	 it could know not to alert about them
[15:02:13] <ottomata>	 but, dunno
[15:02:16] <ottomata>	 sounds messy :)
[15:02:22] <ottomata>	 we could just schedule downtime for those services
[15:02:43] <icinga-wm>	 ACKNOWLEDGEMENT - Host ms-be2007 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff The reboot for the kernel update triggered a reimage, needs to be sorted out
[15:02:55] <elukey>	 but then we'd hide errors on the remaining kafka brokers :(
[15:04:32] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[15:04:34] <ottomata>	 not the whole servers
[15:04:49] <ottomata>	 just things like under replicated partitions
[15:04:59] <ottomata>	 which we know are going to be there while 1012 is down
[15:05:12] <icinga-wm>	 RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[15:05:14] <elukey>	 taking notes, I'll check it :)
[15:07:00] <wikibugs>	 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1981605 (10Papaul) Note: This is one of the other server that is out of warranty. Once i get the drives from Chris I will replace the bad drive.   Thanks
[15:08:19] <akosiaris>	 !log powering off nas1001-b.eqiad.wmnet. https://phabricator.wikimedia.org/T124156
[15:08:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:11:30] <akosiaris>	 !log powering off nas1001-a.eqiad.wmnet. https://phabricator.wikimedia.org/T124156
[15:11:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:12:15] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1981619 (10akosiaris)
[15:12:40] <wikibugs>	 6operations, 6Commons, 10MassMessage: Not all mass messages sent out. - https://phabricator.wikimedia.org/T125214#1981621 (10Steinsplitter) 3NEW
[15:13:02] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 206, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/3/2: down - nas1001-b {#2993} [10Gbps DF]BR
[15:13:27] <wikibugs>	 6operations, 10DBA: Prepare db1018 for s2 master failover - https://phabricator.wikimedia.org/T125215#1981628 (10jcrespo) 3NEW
[15:14:05] <wikibugs>	 6operations, 10DBA: Prepare db1018 for s2 master failover - https://phabricator.wikimedia.org/T125215#1981638 (10jcrespo)
[15:14:07] <wikibugs>	 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1981637 (10jcrespo)
[15:15:11] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 227, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/3/2: down - nas1001-a {#2994} [10Gbps DF]BR
[15:18:24] <grrrit-wm>	 (03PS1) 10Muehlenhoff: Enable base::firewall on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/267260 
[15:18:54] <wikibugs>	 6operations, 10DBA: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1981644 (10jcrespo)
[15:19:16] <grrrit-wm>	 (03PS1) 10Jcrespo: Depool db1018 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267261 (https://phabricator.wikimedia.org/T125215) 
[15:20:28] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Depool db1018 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267261 (https://phabricator.wikimedia.org/T125215) (owner: 10Jcrespo)
[15:21:04] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: reprepro: add HP's MCP repository to updates [puppet] - 10https://gerrit.wikimedia.org/r/267262 (https://phabricator.wikimedia.org/T97998) 
[15:21:13] <bblack>	 !log upgrading packages (incl kernel) on esams cache hosts (cp3xxx) (codfw, ulsfo already done)
[15:21:16] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:22:51] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1217 is OK: OK
[15:23:04] <wikibugs>	 6operations, 7Monitoring, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#1981658 (10faidon) >>! In T97998#1979159, @jcrespo wrote: > AFAIK, hpcacucli is non-free. This is the basic, free, debian-included option...
[15:23:42] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: mw1172, mw1178,mw1217,  mw1257 are unresponsive, mgmt interface unreachable - https://phabricator.wikimedia.org/T124642#1981659 (10Joe) 5Open>3Resolved
[15:25:19] <grrrit-wm>	 (03CR) 10Tim Landscheidt: [C: 04-1] "I missed If56f5be90411db7895e8dbd34b8cadea95ff510b, so the rationale in the commit message is not true." [puppet] - 10https://gerrit.wikimedia.org/r/267039 (https://phabricator.wikimedia.org/T123271) (owner: 10Tim Landscheidt)
[15:31:02] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1172 is OK: OK
[15:32:20] <akosiaris>	 !log remove all networking configuration from asw-b-eqiad switch for nas1001-a, nas1001-b. Leave just descriptions
[15:32:23] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:33:21] <grrrit-wm>	 (03CR) 10Eevans: [C: 031] [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans)
[15:36:13] <icinga-wm>	 PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Puppet has 3 failures
[15:37:24] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1981686 (10akosiaris)
[15:37:41] <logmsgbot>	 !log jynus@mira Synchronized wmf-config/db-eqiad.php: Depool db1018 for maintenance (duration: 01m 49s)
[15:37:44] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:38:37] <wikibugs>	 6operations, 10ops-eqiad, 5Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1947460 (10akosiaris) I' ve powered off nas1001-a and nas1001-b.   I 've also removed most configuration for nas1001-a, nas1001-b from asw-b-eqiad. I 've left only t...
[15:39:49] <cmjohnson1>	 papaul: i don't have that size disk....i thought you needed 600GB SAS not 2TB....i need one for kafka1012 as well. Gonna have to order it (robh)
[15:39:51] <icinga-wm>	 RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[15:41:50] <wikibugs>	 6operations, 6Labs: evaluate possibility for nscd use with useldap - https://phabricator.wikimedia.org/T124991#1981692 (10mark)
[15:42:02] <papaul>	 cmjohnson1: ok
[15:42:05] <wikibugs>	 6operations, 6Labs: evaluate possibility for nscd use with useldap - https://phabricator.wikimedia.org/T124991#1971715 (10mark)
[15:42:29] <papaul>	 cmjohnson1: will open a task for that thanks
[15:45:01] <bblack>	 !log upgrade packages (incl kernel) on eqiad cache hosts (cp1xxx)
[15:45:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:47:12] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1257 is OK: OK
[15:51:09] <subbu>	 robh, could you restart parsoid-vd and parsoid-vd-client on ruthenium? i'll then ask you for access to logs after a bit to see what is going on with them. thanks.
[15:53:21] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1178 is OK: OK
[16:09:50] <grrrit-wm>	 (03PS2) 10Mforns: Remove kafka1012 from EventLogging brokers array [puppet] - 10https://gerrit.wikimedia.org/r/267243 (https://phabricator.wikimedia.org/T125199) 
[16:10:39] <grrrit-wm>	 (03PS3) 10Ottomata: Remove kafka1012 from EventLogging brokers array [puppet] - 10https://gerrit.wikimedia.org/r/267243 (https://phabricator.wikimedia.org/T125199) (owner: 10Mforns)
[16:11:58] <grrrit-wm>	 (03PS2) 10Subramanya Sastry: parsoid-vd-client & diffservice: Use uprightdiff for diffing images [puppet] - 10https://gerrit.wikimedia.org/r/267190 
[16:12:00] <grrrit-wm>	 (03PS1) 10Subramanya Sastry: T110474: Point iegreview to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267269 
[16:12:02] <grrrit-wm>	 (03PS1) 10Subramanya Sastry: T110474: Point restbase to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267270 
[16:14:34] <wikibugs>	 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1981719 (10ssastry) >>! In T110474#1981220, @BBlack wrote: > integration-visualdiff: > https://github.com/wikime...
[16:15:36] <grrrit-wm>	 (03CR) 10GWicke: [C: 031] "This is overridden in hiera to use parsoid.svc & a specific instance in labs, so this change should not affect any running instances." [puppet] - 10https://gerrit.wikimedia.org/r/267270 (owner: 10Subramanya Sastry)
[16:15:38] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Remove kafka1012 from EventLogging brokers array [puppet] - 10https://gerrit.wikimedia.org/r/267243 (https://phabricator.wikimedia.org/T125199) (owner: 10Mforns)
[16:17:07] <subbu>	 robh, sorry i now see you are away .. i was just operating off the clinic duty topic line in my irc client. :)
[16:18:34] <subbu>	 _joe_, could you restart parsoid-vd and parsoid-vd-client services on ruthenium?
[16:19:05] <subbu>	 or whichever root is around.
[16:19:29] <grrrit-wm>	 (03PS2) 10BBlack: MW parsoid URLs: s/parsoidcache/parsoid/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267234 (https://phabricator.wikimedia.org/T110472) 
[16:21:35] <grrrit-wm>	 (03PS1) 10Ottomata: Unpuppetize impala in Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/267271 (https://phabricator.wikimedia.org/T125141) 
[16:27:16] <grrrit-wm>	 (03CR) 10GWicke: [C: 031] MW parsoid URLs: s/parsoidcache/parsoid/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267234 (https://phabricator.wikimedia.org/T110472) (owner: 10BBlack)
[16:33:28] <ottomata>	 !log uninstalling impala in analytics cluster
[16:33:31] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:35:26] <wikibugs>	 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1981816 (10greg) @legoktm: Update, please?
[16:38:52] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032 V: 032] Unpuppetize impala in Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/267271 (https://phabricator.wikimedia.org/T125141) (owner: 10Ottomata)
[16:46:35] <robh>	 subbu: did anyone help ya out?
[16:46:47] <subbu>	 not yet. :)
[16:47:05] <robh>	 will do now
[16:47:19] <subbu>	 thanks.
[16:47:35] <wikibugs>	 6operations, 10ops-codfw: Codfw: ms-be2003 2TB order request  - https://phabricator.wikimedia.org/T125223#1981842 (10Papaul) 3NEW a:3RobH
[16:47:48] <robh>	 !log restarting parsoid-vd & parsoid-vd-client on ruthenium
[16:47:51] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:48:23] <robh>	 subbu: so want the log output in like 5 minutes from now?
[16:48:28] <robh>	 I can just set a timer to remind me
[16:48:32] <subbu>	 yes, that would be great. thanks.
[16:48:36] <robh>	 will do
[16:49:21] <robh>	 i'll toss in /tmp with you as owner similar to last time
[16:49:44] <robh>	 I should have been more detailed on why i did that restart in sal, heh
[16:50:03] <robh>	 !log parsoid-vd restart was due to subbu irc request (i wasnt just randomly restarting things ;)
[16:50:07] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:50:48] <subbu>	 k
[16:51:30] <wikibugs>	 6operations, 10Traffic: varnishkafka integration with Varnish 4  for analytics - https://phabricator.wikimedia.org/T124278#1981852 (10Ottomata) Just so it doesn't get lost in this process: https://gerrit.wikimedia.org/r/#/c/230173/  I still want to merge that and use it one day... :)
[16:52:32] <wikibugs>	 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1981854 (10RobH) @ssastry:  Since you already have shell, L3, etc... there are two things needed for this:  1.) Your managers approval for this access request expansion. 2.) Ops meeting review and approva...
[16:53:58] <wikibugs>	 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1981856 (10RobH) I take that back, I actually don't see your signature on the L3 document either?  (it is new and your access predates it, so that is not unusual.)  Would you additionally review and sign...
[16:55:51] <robh>	 subbu: logs are in tmp for ya
[16:56:19] <subbu>	 thanks.
[16:56:40] <robh>	 quite welcome, hopefully we get it all approved on monday so you can do without waiting on us =]
[16:56:56] <subbu>	 yes. indeed.
[16:57:03] <robh>	 a chunk of ops is at a conference, hence the low response rate 
[16:57:24] <subbu>	 looks like i have a config problem for the client (probably some path issue in my puppet code) ... time to fix it.
[16:58:48] * robh will be here all PDT AM so will be around for relevant puppet merges and service restarts
[16:59:01] <robh>	 I was supposed to go to ulsfo but they have not sent me the completion notice for the xconnect yet...
[17:00:19] <bd808>	 greg-g: anomie, tgr, and I were wondering if we could deploy a few sessionmanager related backports today.
[17:00:33] <greg-g>	 bd808: as I take a deep breath, yes
[17:00:49] <bd808>	 We have https://gerrit.wikimedia.org/r/#/q/status:open+topic:sessionmanager-backports,n,z and https://gerrit.wikimedia.org/r/#/c/267134/ right now
[17:01:35] <grrrit-wm>	 (03PS1) 10Ottomata: Respect $enabled param on kafka::server [puppet/kafka] - 10https://gerrit.wikimedia.org/r/267278 
[17:01:50] <jynus>	 !log restarting mysql at db1018
[17:01:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:03:51] <bd808>	 anomie: should we get started? I can run the deploys if you can help test them
[17:04:34] <anomie>	 bd808: ok
[17:05:02] <bd808>	 anomie: is there any ordering that is better or worse?
[17:05:29] <anomie>	 bd808: I don't think there are any cross-patch dependencies in the ones I did.
[17:06:04] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Respect $enabled param on kafka::server [puppet/kafka] - 10https://gerrit.wikimedia.org/r/267278 (owner: 10Ottomata)
[17:06:09] <bd808>	 yeah they look to be independent. Ok
[17:06:33] <grrrit-wm>	 (03PS1) 10Ottomata: Update kafka submodule with $enabled param fix [puppet] - 10https://gerrit.wikimedia.org/r/267279 
[17:07:19] <grrrit-wm>	 (03PS2) 10Ottomata: Update kafka submodule with $enabled param fix [puppet] - 10https://gerrit.wikimedia.org/r/267279 
[17:07:37] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032 V: 032] Update kafka submodule with $enabled param fix [puppet] - 10https://gerrit.wikimedia.org/r/267279 (owner: 10Ottomata)
[17:13:07] <grrrit-wm>	 (03PS1) 10Bmansurov: Stop the first survey in fawiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267282 (https://phabricator.wikimedia.org/T123770) 
[17:14:42] <bmansurov>	 greg-g: Hello. Could you please help SWAT https://gerrit.wikimedia.org/r/#/c/267282/ ?
[17:15:17] <grrrit-wm>	 (03PS1) 10Subramanya Sastry: parsoid-vd-client on ruthenium: Fix path to config file [puppet] - 10https://gerrit.wikimedia.org/r/267283 
[17:16:05] <bd808>	 bmansurov: If greg-g oks it I can push the change out
[17:16:09] <grrrit-wm>	 (03CR) 10Subramanya Sastry: "This should fix parsoid-vd-client errors seen on ruthenium." [puppet] - 10https://gerrit.wikimedia.org/r/267283 (owner: 10Subramanya Sastry)
[17:16:37] <bd808>	 Fridays aren't typically deploy days though. We should remember to point that out to folks who run quicksurveys
[17:16:54] <bmansurov>	 bd808: thanks, leila says in the task greg-g ok'ed it
[17:17:38] <bd808>	 bmansurov: I have a few changes queued up in front of you, so it will be a little while.
[17:17:51] <bmansurov>	 sure, i'm here
[17:17:54] <bd808>	 jenkins is taking his sweet time this morning
[17:18:06] <bmansurov>	 bd808: oh dear
[17:20:29] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] T110474: Point iegreview to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267269 (owner: 10Subramanya Sastry)
[17:21:00] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] T110474: Point restbase to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267270 (owner: 10Subramanya Sastry)
[17:21:10] <wikibugs>	 6operations, 10ops-codfw: Codfw-mw* IDRAC firmware upgrade - https://phabricator.wikimedia.org/T125088#1981944 (10RobH) @Papaul: While you do this, would you also document the process on the platform specific documentation pages on wikitech?  Thanks!
[17:23:40] <grrrit-wm>	 (03CR) 10BryanDavis: [C: 031] T110474: Point iegreview to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267269 (owner: 10Subramanya Sastry)
[17:24:35] <greg-g>	 bd808: I approved the turning off of the surveys in -staff to leila
[17:24:46] <bd808>	 greg-g: thanks
[17:25:00] <apergos>	 is audio seeming a bit crappy to anyone else via BlueJeans?
[17:25:13] <apergos>	 oh it just got better
[17:26:53] * bd808 glares at "Build has been executing for 18 min...."
[17:27:18] <apergos>	 sorry wrong channel
[17:28:16] <bmansurov>	 bd808: that's the new incarnation of the 'compiling' xkcd comic I guess?
[17:28:39] <bmansurov>	 https://xkcd.com/303/ ;)
[17:28:51] <apergos>	 rebuild all your docker base images and the containers derived from them
[17:29:01] <apergos>	 that's my equivalent...
[17:29:02] <bd808>	 *shudder*
[17:34:34] <bd808>	 anomie: finally ready to start syncing things
[17:34:39] <anomie>	 bd808: ok
[17:35:20] <bd808>	 syncs will take ~2m each because of https://phabricator.wikimedia.org/T125108
[17:35:40] <logmsgbot>	 !log bd808@mira Synchronized php-1.27.0-wmf.11/includes/session/SessionBackend.php: SessionManager: Save user name to metadata even if the user doesn't exist locally (a39b4ac) (duration: 01m 29s)
[17:35:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:36:09] <bd808>	 anomie: ^ do we have a reproduction case for that one?
[17:36:54] <anomie>	 bd808: Not separately. The three of mine combined should fix the auto-creation not auto-creating on loginwiki bug.
[17:37:30] <bd808>	 ok. So I should just power through and we can check at the end then I guess
[17:38:12] <grrrit-wm>	 (03PS2) 10BryanDavis: Grant autocreateaccount to anons on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267134 (https://phabricator.wikimedia.org/T125133) (owner: 10Anomie)
[17:38:19] <grrrit-wm>	 (03CR) 10BryanDavis: [C: 032] Grant autocreateaccount to anons on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267134 (https://phabricator.wikimedia.org/T125133) (owner: 10Anomie)
[17:38:55] <grrrit-wm>	 (03Merged) 10jenkins-bot: Grant autocreateaccount to anons on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267134 (https://phabricator.wikimedia.org/T125133) (owner: 10Anomie)
[17:39:25] <logmsgbot>	 !log bd808@mira Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php: CentralAuth: Take auto-creation into account (f526ef1) (duration: 01m 28s)
[17:39:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:39:58] <anomie>	 (the two code changes are needed to get CA to try triggering the auto-creation, and the config change is needed to have loginwiki allow it to happen)
[17:41:56] <logmsgbot>	 !log bd808@mira Synchronized wmf-config/CommonSettings.php: Grant autocreateaccount to anons on loginwiki (d916008) (duration: 01m 27s)
[17:41:58] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:42:14] <bd808>	 anomie: that's the 3 important ones
[17:42:18] <anomie>	 bd808: Worked!
[17:42:23] <bd808>	 w00t!
[17:42:29] * anomie sees "Anomie test 8" got created on loginwiki
[17:44:44] <logmsgbot>	 !log bd808@mira Synchronized php-1.27.0-wmf.11/includes/api/ApiMain.php: Log user-agents that are using HTTP when HTTPS is preferred (55ac0b7) (duration: 01m 26s)
[17:44:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:45:12] <bd808>	 anomie: can you trigger that one? ^
[17:45:17] <anomie>	 Sure, just a minute
[17:45:34] <anomie>	 bd808: Done. Got the warning.
[17:46:01] <anomie>	 bd808: And I see it going into logstash too
[17:46:07] <bd808>	 yup
[17:46:15] <bd808>	 52 already :(
[17:47:32] <grrrit-wm>	 (03PS2) 10BryanDavis: Stop the first survey in fawiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267282 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov)
[17:47:49] <grrrit-wm>	 (03CR) 10BryanDavis: [C: 032] Stop the first survey in fawiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267282 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov)
[17:47:58] <bd808>	 bmansurov: you are up next
[17:48:17] <grrrit-wm>	 (03Merged) 10jenkins-bot: Stop the first survey in fawiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267282 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov)
[17:48:55] <greg-g>	 I see a ":(", should I be worried or is it just not a fix to something that we're ok not fixing right now, bd808 anomie 
[17:49:16] <anomie>	 greg-g: It's sadness at how many bots are still hitting http:// instead of https://
[17:49:17] <bmansurov>	 i'm here 
[17:49:22] <bd808>	 greg-g: nothing bad. we just have logging of misbehaving bots now
[17:49:24] <anomie>	 (we just added logging to log that)
[17:50:12] * anomie decides to send an announcement to mediawiki-api-announce and wikitech-l
[17:51:33] <logmsgbot>	 !log bd808@mira Synchronized wmf-config/InitialiseSettings.php: Stop the first survey in fawiki and eswiki (f89621d) (duration: 01m 25s)
[17:51:36] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:51:37] <bd808>	 bmansurov: ^
[17:52:03] <bmansurov>	 bd808: thanks, surveys are gone as expected 
[17:52:10] <bd808>	 sweet
[17:52:15] <subbu>	 ori, good morning. when you get a chance, can you look at https://gerrit.wikimedia.org/r/#/c/267283/ and https://gerrit.wikimedia.org/r/#/c/267190/ ?
[17:52:56] <grrrit-wm>	 (03PS2) 10Ori.livneh: parsoid-vd-client on ruthenium: Fix path to config file [puppet] - 10https://gerrit.wikimedia.org/r/267283 (owner: 10Subramanya Sastry)
[17:53:02] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] parsoid-vd-client on ruthenium: Fix path to config file [puppet] - 10https://gerrit.wikimedia.org/r/267283 (owner: 10Subramanya Sastry)
[17:53:07] <greg-g>	 bd808: anomie whew, thanks :)
[17:53:40] <bd808>	 greg-g: I'm all done now and I documented what we did on [[Deployments]]
[17:53:43] <grrrit-wm>	 (03PS3) 10Ori.livneh: parsoid-vd-client & diffservice: Use uprightdiff for diffing images [puppet] - 10https://gerrit.wikimedia.org/r/267190 (owner: 10Subramanya Sastry)
[17:53:50] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] parsoid-vd-client & diffservice: Use uprightdiff for diffing images [puppet] - 10https://gerrit.wikimedia.org/r/267190 (owner: 10Subramanya Sastry)
[17:54:24] <greg-g>	 bd808: thank you
[17:56:13] <ori>	 subbu: merged, ran puppet on ruthenium, looks good
[17:56:20] <subbu>	 ori, great. thanks.
[17:58:19] <apergos>	 is there an ops session talk this week or not? I don't remember that one was set but better safe than sorry
[18:01:34] <jynus>	 !log creating special partitioning for db2034 and db2042 (ETA:5 days, lag)
[18:01:37] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:09:16] <subbu>	 bblack, i was about to send an email to wikitech-l and other lists about the parsoid-lb decommissioning. Should we pick a date for it that I can announce? or should I just say "soon, once we finish migrating all known services away from it"?
[18:10:39] <bblack>	 subbu: it's not terribly time-critical, I'd say announce that we plan to decom it 3 weeks from now, and offer them pointers to switching to using the in-wiki-domain RB URLs.
[18:10:56] <bblack>	 maybe even 2 weeks.  there's really not much traffic, I can't imagine there will be much objection
[18:11:42] <jynus>	 !log creating special partitioning for db2037 and db2044 (ETA:5 days, lag)
[18:11:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:12:59] <subbu>	 bblack, done. i mentioned 3 weeks.
[18:13:28] <subbu>	 robh, can i get another dump of the parsoid-vd-client logs? thanks.
[18:13:36] <robh>	 yep
[18:13:44] <bblack>	 subbu: thanks!
[18:14:19] <wikibugs>	 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1982145 (10ssastry)
[18:14:42] <robh>	 subbu: done
[18:14:59] <subbu>	 thanks.
[18:15:24] <robh>	 welcome
[18:24:09] <grrrit-wm>	 (03CR) 10GWicke: [C: 031] [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans)
[18:24:14] <nuria>	 Does anyone here know how transparency.wikimedia.org gets deployed?
[18:26:07] <bblack>	 nuria: puppet controls the software deployment
[18:26:16] <bblack>	 and in turn, the puppetization indicates the content comes from a git repo
[18:26:26] <bblack>	     $repo_dir = '/srv/org/wikimedia/TransparencyReport'
[18:26:26] <bblack>	     $docroot  = "${repo_dir}/build"
[18:26:26] <bblack>	     git::clone { 'wikimedia/TransparencyReport':
[18:26:26] <bblack>	         ensure    => latest,
[18:26:26] <bblack>	         directory => $repo_dir,
[18:26:28] <bblack>	     }
[18:26:54] <mutante>	 pretty sure deployment is ssh to the server 
[18:26:56] <mutante>	 and git pull
[18:27:09] <bblack>	 ensure => latest doesn't pull for you on automated puppet runs?
[18:27:13] <mutante>	 since that git::clone up there will do the initial setup but not automatically pull
[18:27:18] <bblack>	 ok
[18:27:19] <mutante>	 oh, right
[18:27:40] <mutante>	 i take it back, ensure => latest should do it
[18:27:45] <nuria>	 bblack: i see, so puppet is updating to latest then
[18:27:59] <bblack>	 yeah and that will run every half hour or so
[18:28:05] <nuria>	 bblack: ok, thank you, will add piwik there too
[18:28:32] <wikibugs>	 6operations, 10ops-codfw, 10hardware-requests: Codfw: ms-be2003 2TB order request - https://phabricator.wikimedia.org/T125223#1982179 (10RobH) @papaul: Please use one of the spare 2TB SATA dissk on the codfw spares hardware listing sheet.  These spares are specifically for the ms-be systems that are out of w...
[18:36:12] <wikibugs>	 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1982247 (10RobH) Please note that there are 10 spares on CODFW spare tracking to replace disks in out of warranty spares:  HDD - SATA Seagate ST2000DM001 7.2K 2TB   10  10  These shouldn't fall be...
[18:36:24] <wikibugs>	 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1982251 (10RobH) a:3Papaul
[18:36:52] <wikibugs>	 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1981123 (10RobH)
[18:36:55] <wikibugs>	 6operations, 10ops-codfw, 10hardware-requests: Codfw: ms-be2003 2TB order request - https://phabricator.wikimedia.org/T125223#1982252 (10RobH) 5Open>3Resolved I didn't want to move this into that private space, as anyone who cannot view the space is then stuck getting alerts and unable to unsubscribe.  I...
[18:36:56] <grrrit-wm>	 (03PS1) 10Elukey: Termporary disable puppet on kafka1012 for maintenance purposes [puppet] - 10https://gerrit.wikimedia.org/r/267293 
[18:37:58] <grrrit-wm>	 (03PS1) 10Jcrespo: Install Jessie on db1018 [puppet] - 10https://gerrit.wikimedia.org/r/267294 
[18:38:00] <grrrit-wm>	 (03Abandoned) 10Elukey: Termporary disable puppet on kafka1012 for maintenance purposes [puppet] - 10https://gerrit.wikimedia.org/r/267293 (owner: 10Elukey)
[18:38:43] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Install Jessie on db1018 [puppet] - 10https://gerrit.wikimedia.org/r/267294 (owner: 10Jcrespo)
[18:38:53] <grrrit-wm>	 (03PS1) 10Elukey: Temporary disable kafka1012 for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/267295 
[18:40:12] <grrrit-wm>	 (03PS2) 10Jcrespo: Install Jessie on db1018 [puppet] - 10https://gerrit.wikimedia.org/r/267294 (https://phabricator.wikimedia.org/T125215) 
[18:53:18] <grrrit-wm>	 (03PS2) 10Ottomata: Temporary disable kafka1012 for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/267295 (owner: 10Elukey)
[18:54:08] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032 V: 032] Temporary disable kafka1012 for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/267295 (owner: 10Elukey)
[19:00:54] <grrrit-wm>	 (03PS1) 10Dzahn: releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) 
[19:01:27] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) (owner: 10Dzahn)
[19:01:44] <grrrit-wm>	 (03PS2) 10Dzahn: releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) 
[19:04:16] <grrrit-wm>	 (03PS3) 10Dzahn: releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) 
[19:04:32] <grrrit-wm>	 (03PS4) 10Dzahn: releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) 
[19:04:37] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) (owner: 10Dzahn)
[19:05:59] <wikibugs>	 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982360 (10Dzahn) a:3Dzahn
[19:09:50] <wikibugs>	 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982415 (10Dzahn) looks like an OS got installed but the service  has not been implemented yet, and it's  ----> T98173
[19:10:23] <wikibugs>	 6operations, 5Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261185 (10Dzahn) T125056 asks for the status of this server
[19:13:41] <wikibugs>	 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982455 (10Dzahn) @ArielGlenn does that answer the status question sufficiently?  i'd close or merge it with T98173
[19:14:16] <wikibugs>	 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982462 (10Dzahn)
[19:14:27] <wikibugs>	 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982465 (10Dzahn) a:5Dzahn>3ArielGlenn
[19:36:07] <jynus>	 !log reinstall db1018
[19:36:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:53:25] <grrrit-wm>	 (03PS1) 10Subramanya Sastry: parsoid-vd-client: Fill out missing pieces of the config file [puppet] - 10https://gerrit.wikimedia.org/r/267311 
[19:56:33] <grrrit-wm>	 (03CR) 10Subramanya Sastry: "Fixes based on error logs on ruthenium." [puppet] - 10https://gerrit.wikimedia.org/r/267311 (owner: 10Subramanya Sastry)
[20:05:48] <grrrit-wm>	 (03PS5) 10Dzahn: releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) 
[20:07:50] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) (owner: 10Dzahn)
[20:10:11] <wikibugs>	 6operations, 5Patch-For-Review: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1982623 (10Dzahn) >>! In T125164#1981067, @Legoktm wrote: > Can we also do something similar for the `/mediawiki` page?  Yes. done  {F3291357}
[20:10:47] <mutante>	 legoktm: https://releases.wikimedia.org/mediawiki/
[20:10:59] <mutante>	 latest on top, not "Index of" 
[20:11:02] <grrrit-wm>	 (03CR) 10Aaron Schulz: [C: 032] Use the logical redis definition for GettingStarted. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671) (owner: 10Giuseppe Lavagetto)
[20:11:31] <mutante>	 except that "snapshot" is from 2009 , heh
[20:12:20] <grrrit-wm>	 (03Merged) 10jenkins-bot: Use the logical redis definition for GettingStarted. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671) (owner: 10Giuseppe Lavagetto)
[20:16:24] <logmsgbot>	 !log aaron@mira Synchronized wmf-config/CommonSettings.php: Use the logical redis definition for GettingStarted (duration: 01m 26s)
[20:16:27] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:18:37] <grrrit-wm>	 (03PS2) 10Dzahn: Enable base::firewall on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/267260 (owner: 10Muehlenhoff)
[20:20:42] <grrrit-wm>	 (03PS1) 10Jcrespo: Revert "Install Jessie on db1018" [puppet] - 10https://gerrit.wikimedia.org/r/267316 
[20:20:51] <grrrit-wm>	 (03PS2) 10Jcrespo: Revert "Install Jessie on db1018" [puppet] - 10https://gerrit.wikimedia.org/r/267316 
[20:22:11] <mutante>	 jynus: :( failed entirely on that hardware?
[20:22:49] <jynus>	 it goes into a crazy loop when the disk fails to mount, but it also doesn't let me mount it manually
[20:23:07] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Revert "Install Jessie on db1018" [puppet] - 10https://gerrit.wikimedia.org/r/267316 (owner: 10Jcrespo)
[20:23:09] <mutante>	 sigh, are they HPs?
[20:23:38] <jynus>	 no
[20:24:42] <mutante>	 hmm, i wonder.. trying trusty then?
[20:25:12] <jynus>	 yes, at least try, and I cannot leave replication paused for the whole weekend
[20:25:20] <mutante>	 gotcha
[20:27:00] <jynus>	 this is new, either the installer has been updated, or there is something wrong with the disk
[20:27:29] <jynus>	 and with the installer i mean jessie, not our recipe
[20:28:40] <grrrit-wm>	 (03PS1) 10Papaul: Decom: Remove caesium from dhcpd Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267320 (https://phabricator.wikimedia.org/T125165) 
[20:30:07] <jynus>	 trusty "just works"
[20:30:24] <jynus>	 it is an upstream change on jessie's installer
[20:31:23] <mutante>	 wow, interesting
[20:31:33] <mutante>	 i was lucky so far with the hardware
[20:31:45] <mutante>	 well, or just replaced stuff with virtual ones
[20:31:55] <jynus>	 this is new, like a few weeks new, and maybe RAID-specific
[20:32:03] <mutante>	 aha
[20:32:46] <grrrit-wm>	 (03PS3) 10Dzahn: Enable base::firewall on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/267260 (owner: 10Muehlenhoff)
[20:32:59] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] Enable base::firewall on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/267260 (owner: 10Muehlenhoff)
[20:33:22] <jynus>	 but wasting 1 hour installing trusty is no fun
[20:34:11] <mutante>	 yea :/
[20:35:06] <mutante>	 Reedy: https://releases.wikimedia.org/mediawiki/   should we even have that "snapshot" dir up anymore? check the date 
[20:35:24] <mutante>	 it's more noticeable now because of the "version sort" 
[20:35:48] <mutante>	 so the latest stuff should be on top
[20:36:11] <mutante>	 i would assume a snapshot is like from last night
[20:40:34] <grrrit-wm>	 (03CR) 10Addshore: [C: 031] Update WikidataBuildResources git source (github -> gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude)
[20:43:44] <jynus>	 I am thinking of trying jessie again now, or whether I will just waste another hour installing jessie and then trusty again
[20:44:03] <jynus>	 as it was a partition-related issue
[20:44:50] <jynus>	 I will check jessie-installer bugs first
[20:45:02] <mutante>	 jynus: yea, that might save time in the long run though
[20:45:45] <grrrit-wm>	 (03PS1) 10Papaul: Decom: Remove caesium from authoinstall Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267321 (https://phabricator.wikimedia.org/T125165) 
[20:46:17] <jynus>	 I suppose you were referring before to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=788156
[20:47:11] <jynus>	 (but this is not it)
[20:47:23] <mutante>	 i did not have that specific bug number in mind, but i had some vague memories about issues with HP hardware we had that did not happen on Dell
[20:48:29] <wikibugs>	 6operations, 5Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1982693 (10ArielGlenn) Should its current salt key be kept around or can I toss it? If I can toss it then that's good enough for me.
[20:49:14] <wikibugs>	 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982694 (10ArielGlenn) Well I knew about that but then it's stalled, and the think is that it has a salt key.  Anyways I asked on the other ticket, likely I'll be able to close this soon.
[20:50:47] <mutante>	 apergos: i dont actually see that salt key on neodymium
[20:50:57] <apergos>	 it was... I think...
[20:51:11] * apergos goes to look at their notes
[20:51:17] <mutante>	 my guess was it probably was "rhodium.wikimedia.org"
[20:51:21] <mutante>	 vs. eqiad.wmnet
[20:51:32] <mutante>	 i saw a change in gerrit that changed that name 
[20:51:36] <grrrit-wm>	 (03PS1) 10Jcrespo: Revert "Revert "Install Jessie on db1018"" [puppet] - 10https://gerrit.wikimedia.org/r/267322 
[20:51:38] <mutante>	 "wrong FQDN" etc
[20:51:40] <apergos>	 it might have been one of the hosts with a puppet cert and no salt key
[20:51:41] <grrrit-wm>	 (03PS2) 10Jcrespo: Revert "Revert "Install Jessie on db1018"" [puppet] - 10https://gerrit.wikimedia.org/r/267322 
[20:51:53] <apergos>	 I was trying to get all that cleaned up and got down to only two hosts left
[20:52:47] <grrrit-wm>	 (03PS1) 10Papaul: Decom: Remove rsync_caesium role from site.pp Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267323 (https://phabricator.wikimedia.org/T125165) 
[20:53:03] <mutante>	 apergos: yes, it's a puppet cert
[20:53:12] <mutante>	 for the current name
[20:53:20] <apergos>	 yeah I found the notes on the ticket
[20:53:24] <apergos>	 should have read before replying
[20:53:34] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Revert "Revert "Install Jessie on db1018"" [puppet] - 10https://gerrit.wikimedia.org/r/267322 (owner: 10Jcrespo)
[20:54:20] <mutante>	 apergos: maybe it's easier to just delete it.. the server is not running , it's been a couple months.. making a new one costs less time than asking
[20:55:05] <wikibugs>	 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982732 (10ArielGlenn) Er rather it has NO salt key but a valid (apparently) puppet cert. I just want those lists to be in sync.
[20:56:33] <wikibugs>	 6operations, 5Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1982744 (10ArielGlenn) Sorry, it's not about the salt key, it's about having a current puppet cert without having a salt key.  I'm trying to keep those lists in sync at l...
[20:56:57] <apergos>	 well I don't know where alex is in it so I might as well let him reply
[20:57:02] <apergos>	 it's not going to kill me to wait
[20:57:23] <apergos>	 hate when I can't remember stuff on my own tickets I wrote not more than a day or two ago though
[21:04:06] <grrrit-wm>	 (03PS2) 10Dzahn: Decom: Remove caesium from dhcpd Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267320 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul)
[21:04:16] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] Decom: Remove caesium from dhcpd Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267320 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul)
[21:04:46] <grrrit-wm>	 (03PS2) 10Dzahn: Decom: Remove caesium from authoinstall Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267321 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul)
[21:05:37] <grrrit-wm>	 (03PS3) 10Dzahn: Decom: Remove caesium from authoinstall Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267321 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul)
[21:05:56] <grrrit-wm>	 (03PS4) 10Dzahn: Decom: Remove caesium from autoinstall [puppet] - 10https://gerrit.wikimedia.org/r/267321 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul)
[21:07:26] <grrrit-wm>	 (03CR) 10Dzahn: "you can also delete the entire role class, it was just for this purpose and not used elsewhere" [puppet] - 10https://gerrit.wikimedia.org/r/267323 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul)
[21:08:01] <grrrit-wm>	 (03CR) 10Dzahn: [C: 04-1] "please also remove roles/rsync_caesium.pp" [puppet] - 10https://gerrit.wikimedia.org/r/267323 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul)
[21:08:14] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] Decom: Remove caesium from autoinstall [puppet] - 10https://gerrit.wikimedia.org/r/267321 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul)
[21:09:04] <grrrit-wm>	 (03PS2) 10Dzahn: Decom: Remove rsync_caesium role from site.pp Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267323 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul)
[21:09:06] <wikibugs>	 6operations: jessie installer fails when using db hosts- same recipe works on trusty and on other hosts/a few weeks ago - https://phabricator.wikimedia.org/T125256#1982775 (10jcrespo) 3NEW
[21:10:06] <wikibugs>	 6operations: jessie installer fails when using db hosts- same recipe works on trusty and on other hosts/a few weeks ago - https://phabricator.wikimedia.org/T125256#1982783 (10jcrespo) And yes, I tried formatting it manually, too.
[21:10:47] <grrrit-wm>	 (03PS1) 10Jcrespo: Revert "Revert "Revert "Install Jessie on db1018""" [puppet] - 10https://gerrit.wikimedia.org/r/267326 
[21:10:56] <grrrit-wm>	 (03PS2) 10Jcrespo: Revert "Revert "Revert "Install Jessie on db1018""" [puppet] - 10https://gerrit.wikimedia.org/r/267326 
[21:12:07] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] Decom: Remove rsync_caesium role from site.pp Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267323 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul)
[21:13:00] <mutante>	 !log bromine - stop and remove rsync service
[21:13:04] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:16:08] <grrrit-wm>	 (03PS1) 10Jcrespo: Make sure the installation is fully unattended [puppet] - 10https://gerrit.wikimedia.org/r/267328 
[21:17:09] <grrrit-wm>	 (03PS2) 10Jcrespo: Make sure the installation is fully unattended [puppet] - 10https://gerrit.wikimedia.org/r/267328 
[21:17:47] <grrrit-wm>	 (03PS3) 10Jcrespo: Make sure the installation is fully unattended [puppet] - 10https://gerrit.wikimedia.org/r/267328 
[21:19:15] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Make sure the installation is fully unattended [puppet] - 10https://gerrit.wikimedia.org/r/267328 (owner: 10Jcrespo)
[21:19:25] <grrrit-wm>	 (03PS3) 10Jcrespo: Revert "Revert "Revert "Install Jessie on db1018""" [puppet] - 10https://gerrit.wikimedia.org/r/267326 
[21:21:03] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Revert "Revert "Revert "Install Jessie on db1018""" [puppet] - 10https://gerrit.wikimedia.org/r/267326 (owner: 10Jcrespo)
[21:21:33] <Platonides>	 aren't you taking it too far, jynus ? :)
[21:21:49] <jynus>	 oh, wait until the next time
[21:22:07] <jynus>	 :-) that is only 2 tries, actually
[21:24:03] <jynus>	 I've been told, however, that if you stick a lot of reverts there, it eventually works
[21:25:03] <Platonides>	 oh, keep trying then
[21:29:08] <grrrit-wm>	 (03CR) 10Jcrespo: "This has only deleted one of the 2 confirmations, it still asks for comfirmation to write the partition table." [puppet] - 10https://gerrit.wikimedia.org/r/267328 (owner: 10Jcrespo)
[21:32:10] <grrrit-wm>	 (03CR) 10Mobrovac: [C: 031] parsoid-vd-client: Fill out missing pieces of the config file [puppet] - 10https://gerrit.wikimedia.org/r/267311 (owner: 10Subramanya Sastry)
[21:33:20] <wikibugs>	 6operations, 5Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#1982829 (10Dzahn) merged @papaul's changes:  https://gerrit.wikimedia.org/r/#/c/267323/  https://gerrit.wikimedia.org/r/#/c/267321/  https://gerrit.wikimedia.org/r/#/c/267320/
[21:34:08] <grrrit-wm>	 (03PS1) 10Dzahn: delete rsync_caesium class, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/267333 (https://phabricator.wikimedia.org/T125165) 
[21:35:06] <grrrit-wm>	 (03PS2) 10Dzahn: parsoid-vd-client: Fill out missing pieces of the config file [puppet] - 10https://gerrit.wikimedia.org/r/267311 (owner: 10Subramanya Sastry)
[21:36:00] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "only affects testing server" [puppet] - 10https://gerrit.wikimedia.org/r/267311 (owner: 10Subramanya Sastry)
[21:38:01] <grrrit-wm>	 (03PS1) 10Dzahn: varnish/misc-web: remove caesium backend, decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/267334 (https://phabricator.wikimedia.org/T125165) 
[21:38:03] <grrrit-wm>	 (03CR) 10Jdlrobson: [C: 031] "Is someone able to get this SWATed in one of the many SWAT windows and then test it is working as expected?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000) (owner: 10Dereckson)
[21:39:47] <grrrit-wm>	 (03PS2) 10Dzahn: delete rsync_caesium class, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/267333 (https://phabricator.wikimedia.org/T125165) 
[21:40:14] <mutante>	 subbu: [parsoid-vd-client]/Service[parsoid-vd-client]: Triggered 'refresh' ...
[21:41:42] <subbu>	 mutante, what is that from? code update?
[21:41:55] <mutante>	 subbu: from the puppet run on ruthenium
[21:41:59] <subbu>	 ah, yes, you +2ed the patch. thanks.
[21:42:10] <mutante>	 subbu: it was just a lazy way to say "i merged that, ran puppet, and it restarted the service"
[21:42:20] <mutante>	 yw
[21:42:56] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] delete rsync_caesium class, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/267333 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn)
[21:44:18] <subbu>	 robh: whenever you are free, one more dump of the parsoid-vd-client logs if you don't mind.
[21:45:18] <robh>	 done
[21:46:29] <grrrit-wm>	 (03CR) 10Mobrovac: [C: 031] "Yup, confirmed it's a noop - https://puppet-compiler.wmflabs.org/1666/" [puppet] - 10https://gerrit.wikimedia.org/r/267270 (owner: 10Subramanya Sastry)
[21:46:37] <grrrit-wm>	 (03PS1) 10Dzahn: admin: rm reprepro from exceptions in enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/267377 
[21:47:14] <grrrit-wm>	 (03PS2) 10Dzahn: admin: rm reprepro from exceptions in enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T125165) 
[21:48:18] <subbu>	 progress .. but, Jan 29 21:44:59 ruthenium nodejs[22720]: Error: libjpeg.so.8: cannot open shared object file: No such file or directory ... via canvas ... mobrovac mutante either i am missing some package or my node_modules build on vm (on trusty) doesn't translate over to jessie.
[21:48:51] <subbu>	 anyway, i think at this point, i should simply wait till monday and have my root access and mess around directly at that point.
[21:49:04] <grrrit-wm>	 (03CR) 10Mobrovac: MW parsoid URLs: s/parsoidcache/parsoid/ (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267234 (https://phabricator.wikimedia.org/T110472) (owner: 10BBlack)
[21:49:14] <mutante>	 subbu: maybe you need "libjpeg-dev" installed or a similar libjpeg package ?
[21:49:29] <mobrovac>	 yup, +1
[21:49:31] <mutante>	 subbu: dpkg -l | grep libjpeg    on the VM ?
[21:49:39] <mobrovac>	 subbu: need to install the -dev pkgs
[21:50:12] <mobrovac>	 subbu: it'd be better to build the deps in a Jessie VM, especially for things like canvas which have binaries
[21:50:28] <mutante>	 yea, but labs would not let him create a jessie instance :/
[21:50:37] <subbu>	 mutante, mobrovac ah yes .. $ sudo apt-get install libcairo2-dev libjpeg8-dev libpango1.0-dev libgif-dev build-essential g++
[21:50:40] <mutante>	 resist the temptation to just install it as root, let's 
[21:50:43] <subbu>	 from https://github.com/Automattic/node-canvas/wiki/Installation---Ubuntu-and-other-Debian-based-systems
[21:50:49] <mutante>	 let's do it via puppet right away
[21:50:50] <subbu>	 mutante, ok. :)
[21:50:53] <mutante>	 just saves time later :)
[21:50:54] <grrrit-wm>	 (03CR) 10Dereckson: "Yes, I've already included this change for the next SWAT: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000) (owner: 10Dereckson)
[21:51:08] <subbu>	 sure. works for me. how do i get puppet to install those packages?
[21:51:39] <mobrovac>	 mutante: why couldn't he create a jessie instance in labs? there are 8.2 images
[21:51:41] <mutante>	 package { 'foo':
[21:51:47] <mutante>	    ensure => present,
[21:51:48] <mutante>	 }
[21:51:54] <mutante>	 subbu: ^
[21:52:01] <subbu>	 mutante, i'll pm him :)
[21:52:11] <mutante>	 ok :)
[21:52:51] <grrrit-wm>	 (03CR) 10Papaul: [V: 031] admin: rm reprepro from exceptions in enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn)
[21:53:35] <subbu>	 mutante, ah .. via package? got it.
[21:53:38] <subbu>	 will update.
[21:53:49] <mobrovac>	 subbu: package { 'foo': } suffices
[21:54:19] <grrrit-wm>	 (03CR) 10Alex Monk: "Is it present on bromine?" [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn)
[21:54:54] <mutante>	 yes, this is also fine:
[21:54:55] <mutante>	  ensure_packages(['virtualenv', 'gcc', 'python-dev', 'libmysqlclient-dev'])
[21:55:37] <mutante>	 ensure_packages will not conflict if the same package gets installed by multiple classes on the same machine
[21:56:03] <subbu>	 ok.
[21:56:14] <mobrovac>	 mutante: it's actually require_package()
[21:56:17] <mobrovac>	 subbu: ^
[21:56:28] <mutante>	 mobrovac: no, both exist
[21:56:32] <mobrovac>	 ah
[21:56:33] <mobrovac>	 kk
[21:56:59] <mutante>	 yea, eh.. we use both of them
[21:59:21] <grrrit-wm>	 (03PS1) 10Subramanya Sastry: visuadiff: add dependences on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 
[21:59:24] <icinga-wm>	 PROBLEM - Host cp3049 is DOWN: PING CRITICAL - Packet loss = 100%
[22:00:03] <subbu>	 mobrovac, mutante there is the patch.
[22:01:46] <mutante>	 mobrovac: had to check the difference again, so require_package has been written by Ori, and ensure_packages is from puppet's stdlib and:
[22:01:53] <icinga-wm>	 PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:02:02] <p858snake>	 <mutante> Reedy: https://releases.wikimedia.org/mediawiki/   should we even have that "snapshot" dir up anymore? check the date  <It's been longstanding that we don't delete files from there, maybe rename zSnapshots or something (depending on how the file sorting works)
[22:02:03] <mutante>	     wmflib: add require_package() from vagrant
[22:02:03] <mutante>	     
[22:02:04] <icinga-wm>	 PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:02:04] <mutante>	     It's similar to ensure_packages(), but it's cleaner and faster. 
[22:02:04] <icinga-wm>	 PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:02:04] <icinga-wm>	 PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:02:23] <icinga-wm>	 PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:02:23] <icinga-wm>	 PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:02:27] <mutante>	 oh
[22:02:35] <icinga-wm>	 PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3049_v4, cp3049_v6
[22:02:35] <icinga-wm>	 PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:02:41] <mutante>	 bblack: around ^ ?
[22:02:43] <icinga-wm>	 PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:02:53] <icinga-wm>	 PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3049_v4, cp3049_v6
[22:03:04] <icinga-wm>	 PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:03:04] <icinga-wm>	 PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:03:04] <icinga-wm>	 PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3049_v4, cp3049_v6
[22:03:04] <icinga-wm>	 PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3049_v4, cp3049_v6
[22:03:14] <icinga-wm>	 PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:03:15] <icinga-wm>	 PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3049_v4, cp3049_v6
[22:03:23] <icinga-wm>	 PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:03:23] <icinga-wm>	 PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:03:47] <mutante>	 all of them about cp3049 .. ok.. looking
[22:05:30] <jynus>	 it seems down from icinga
[22:06:00] <mutante>	 yes, i am connecting to mgmt
[22:06:32] <mutante>	 nothing on console
[22:06:38] <mutante>	 !log powercycle cp3049 
[22:06:41] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:07:41] <jynus>	 ah, that is why it was denying me access
[22:08:00] <mutante>	 UEFI0030: A keyboard device is not connected to the system.
[22:08:04] <mutante>	 yea.. go on...
[22:08:17] <mutante>	 i see it coming back now
[22:08:29] <jynus>	 no, it is something else, I think my ssh config is wrong
[22:08:41] <mutante>	 jynus: maybe because this has .esams. in it?
[22:08:44] <mutante>	 unlike the others
[22:09:02] <jynus>	 ah! no, worse than that, I was thinking sfo
[22:09:07] <jynus>	 so human error
[22:09:10] <mutante>	 ok
[22:10:38] <mutante>	 hmmm.. it doesnt finish the boot process. something broke
[22:12:03] <jynus>	 still no ssh access
[22:12:17] <mutante>	 yea, and no more output either
[22:12:30] <mutante>	 this was the last i saw:
[22:12:37] <mutante>	 [  OK  [    9.862350] ipmi_si ipmi_si.0: Using irq 10
[22:12:37] <mutante>	 [    9.863607] ------------[ cut here ]------------
[22:12:37] <mutante>	 ] Started Create
[22:13:07] <mutante>	 shortly after [    9.838257] systemd[1]: Mounted Huge Pages File System.
[22:13:17] <jynus>	 that looks like kernel panic
[22:13:34] <mutante>	 [  OK  ] Mounted Debug File System.
[22:13:35] <mutante>	 [    9.794621] EXT4-fs (md0): re-mounted. Opts: errors=remount-ro
[22:14:07] <jynus>	 I would try one powercycle, even if only to get more info
[22:14:14] <mutante>	 [    6.350863] ata3: SATA link down (SStatus 0 SControl 300)
[22:14:14] <mutante>	 [    6.675177] ata4: SATA link down (SStatus 0 SControl 300)
[22:14:24] <jynus>	 ah, disk issue
[22:14:25] <mutante>	 that's almost like controller died
[22:14:32] <mutante>	 they show up, then they all disappear
[22:14:37] <mutante>	 as if the controller broke
[22:15:33] <jynus>	 then I will leave it for reverse(kram) to check it in person
[22:15:57] <mutante>	 yea, should just make a ticket in esams 
[22:16:31] <robh>	 its under warranty as well so he'll be able to get a replacement.
[22:16:52] <mutante>	 usb 1-1.6: New USB device found,
[22:17:00] <mutante>	 usb 1-1.6: Product: Gadget USB HUB
[22:17:01] <jynus>	 are any of you going to be any time longer?
[22:17:02] <mutante>	 ?
[22:17:03] <ori>	 mutante: no, we agreed to standardize on require_package, which (in addition to other things) also makes the package a requirement for the current class, so you don't have to declare the package and _then_ add require => Package['foo'] 
[22:17:26] <robh>	 jynus: be around?  i need to take a lunch in a moment and run to the store (im out of food here)
[22:17:30] <robh>	 but i'll be back
[22:17:54] <icinga-wm>	 PROBLEM - Host cp1049 is DOWN: PING CRITICAL - Packet loss = 100%
[22:17:57] <jynus>	 so, I believe db1018 will conserve its downtimes, despite the reinstall
[22:18:01] <grrrit-wm>	 (03PS1) 10Ottomata: Allow access to Analytlics mysql metadata instance on analytics1027 from analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/267379 
[22:18:28] <mutante>	 ori: ok, thanks! we have a lot of places to replace it then
[22:18:34] <icinga-wm>	 PROBLEM - Host cp3042 is DOWN: PING CRITICAL - Packet loss = 100%
[22:18:37] <jynus>	 but just in case it doesnt, and icinga trolls us, know it is under maintenance and depooled
[22:18:42] <grrrit-wm>	 (03PS2) 10Ottomata: Allow access to Analytlics mysql metadata instance on analytics1027 from analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/267379 
[22:19:31] <robh>	 good to konw
[22:19:33] <robh>	 know
[22:19:52] <robh>	 Ok, im away for about an hour (only mentioning it in here since im on clinic duty)
[22:20:36] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032] Allow access to Analytlics mysql metadata instance on analytics1027 from analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/267379 (owner: 10Ottomata)
[22:20:53] <icinga-wm>	 PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:03] <icinga-wm>	 PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:13] <icinga-wm>	 PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:14] <icinga-wm>	 PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:14] <icinga-wm>	 PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:25] <icinga-wm>	 PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:35] <icinga-wm>	 PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:35] <icinga-wm>	 PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:35] <icinga-wm>	 PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:35] <icinga-wm>	 PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:35] <icinga-wm>	 PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:35] <icinga-wm>	 PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:36] <icinga-wm>	 PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:36] <icinga-wm>	 PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:44] <icinga-wm>	 PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:44] <icinga-wm>	 PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:21:53] <icinga-wm>	 PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:04] <icinga-wm>	 PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:04] <icinga-wm>	 PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:13] <icinga-wm>	 PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:14] <icinga-wm>	 PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:14] <icinga-wm>	 PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:14] <icinga-wm>	 PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:15] <icinga-wm>	 PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:15] <icinga-wm>	 PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:23] <icinga-wm>	 PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:34] <icinga-wm>	 PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:34] <ori>	 bblack: ^
[22:22:43] <jynus>	 it is not traffic
[22:22:49] <jynus>	 it is the server
[22:22:53] <icinga-wm>	 PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6
[22:22:56] <ori>	 1049?
[22:22:59] <jynus>	 yes
[22:23:05] <mutante>	 and 3042
[22:23:18] <mutante>	 i think they just both had hardware fail
[22:23:21] <mutante>	 but starting to be strange
[22:23:24] <jynus>	 it doesn't even respond to console
[22:23:35] <mutante>	 !log powercycled cp1049
[22:23:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:23:59] <jynus>	 ah, in that case it was really you
[22:24:00] <mutante>	 maybe the monitoring got more verbose recently?
[22:24:15] <jynus>	 that is the ipsec
[22:24:33] <mutante>	 yes, but did it output that many lines when a single server went down?
[22:24:34] <jynus>	 it was discussed (this?) morning (europe's)
[22:24:36] <mutante>	 guess it did
[22:24:50] <mutante>	 what was discussed?
[22:24:58] <jynus>	 it is "relatively new" for ipsec servers
[22:25:05] <mutante>	 ah, ok
[22:25:11] <mutante>	 so cp1049 is coming back
[22:25:16] <mutante>	 at login: again
[22:25:20] <mutante>	 3049 is not
[22:25:23] <icinga-wm>	 RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 38 ESP OK
[22:25:24] <icinga-wm>	 RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 38 ESP OK
[22:25:24] <icinga-wm>	 RECOVERY - Host cp1049 is UP: PING OK - Packet loss = 0%, RTA = 2.82 ms
[22:25:26] <mutante>	 eh, 3042
[22:25:34] <icinga-wm>	 RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 38 ESP OK
[22:25:34] <icinga-wm>	 RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 38 ESP OK
[22:25:44] <icinga-wm>	 RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 38 ESP OK
[22:25:45] <icinga-wm>	 RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 38 ESP OK
[22:25:45] <icinga-wm>	 RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 38 ESP OK
[22:25:47] <grrrit-wm>	 (03PS1) 10Ottomata: Add ferm::service{ 'analytics-mysql-meta' to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/267380 
[22:25:53] <icinga-wm>	 RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 38 ESP OK
[22:25:53] <icinga-wm>	 RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 38 ESP OK
[22:25:53] <icinga-wm>	 RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 38 ESP OK
[22:25:53] <icinga-wm>	 RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 38 ESP OK
[22:26:04] <icinga-wm>	 RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 38 ESP OK
[22:26:05] <icinga-wm>	 RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 38 ESP OK
[22:26:13] <grrrit-wm>	 (03CR) 10Ottomata: [C: 032 V: 032] Add ferm::service{ 'analytics-mysql-meta' to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/267380 (owner: 10Ottomata)
[22:26:14] <icinga-wm>	 RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 38 ESP OK
[22:26:24] <icinga-wm>	 RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 38 ESP OK
[22:26:24] <icinga-wm>	 RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 38 ESP OK
[22:26:33] <icinga-wm>	 RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 38 ESP OK
[22:26:34] <icinga-wm>	 RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 38 ESP OK
[22:26:44] <mutante>	 3042 -  md0: unknown partition table
[22:26:44] <icinga-wm>	 RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 38 ESP OK
[22:26:46] <mutante>	 oh well
[22:26:53] <icinga-wm>	 RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 38 ESP OK
[22:26:53] <icinga-wm>	 RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 38 ESP OK
[22:26:53] <icinga-wm>	 RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 38 ESP OK
[22:26:53] <icinga-wm>	 RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 38 ESP OK
[22:26:53] <icinga-wm>	 RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 38 ESP OK
[22:26:53] <icinga-wm>	 RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 38 ESP OK
[22:26:54] <icinga-wm>	 RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 38 ESP OK
[22:26:54] <icinga-wm>	 RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 38 ESP OK
[22:27:04] <icinga-wm>	 RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 38 ESP OK
[22:27:04] <mutante>	 !log cp3042 -  md0: unknown partition table
[22:27:07] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:28:42] <wikibugs>	 6operations, 10MediaWiki-API, 6Services, 10Traffic, 7Monitoring: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1982972 (10GWicke) @faidon, is your view that this should be handled by somebody outside ops?
[22:29:26] <grrrit-wm>	 (03PS1) 10GWicke: WIP / untested: Don't decode percent encoding for rest.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/267381 (https://phabricator.wikimedia.org/T125176) 
[22:30:54] <wikibugs>	 6operations, 10ops-esams: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1982978 (10Dzahn) 3NEW
[22:31:23] <grrrit-wm>	 (03PS2) 10Subramanya Sastry: visuadiff: add dependences on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 
[22:31:49] <subbu>	 mutante, updated that as per ori's comment above.
[22:32:02] <mutante>	 subbu: perfect :)
[22:32:50] <jynus>	 ack'ed 3042
[22:33:41] <mutante>	 just saw, thx
[22:34:29] <jynus>	 and you are using, I suppose, 3049
[22:35:10] <grrrit-wm>	 (03PS3) 10Subramanya Sastry: visuadiff: add dependencies on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 
[22:35:12] <mutante>	 jynus: just disconnected now
[22:35:41] <jynus>	 powercycle?
[22:37:35] <jynus>	 !log powercycle cp3042
[22:37:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:37:45] <grrrit-wm>	 (03CR) 10Dzahn: "N: Can't select versions from package 'libjpeg8-dev' as it is purely virtual" [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry)
[22:37:53] <jynus>	 !log powercycle cp3049, not 42
[22:37:56] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:39:00] <jynus>	 let's wait and see
[22:40:05] <mutante>	 jynus: correcting the ticket name.. it was 3049 
[22:40:17] <jynus>	 no
[22:40:19] <wikibugs>	 6operations, 10ops-esams: cp3049 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1982998 (10Dzahn)
[22:40:24] <mutante>	 14:10 < mutante> !log powercycle cp3049 
[22:40:25] <jynus>	 ahhhh
[22:40:44] <jynus>	 wait, I am confused now
[22:40:50] <jynus>	 49 booted for me
[22:40:54] <icinga-wm>	 RECOVERY - Host cp3049 is UP: PING OK - Packet loss = 0%, RTA = 86.22 ms
[22:40:55] <mutante>	 14:06 < icinga-wm> PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:41:05] <mutante>	 jynus: started to boot  or actually finished?
[22:41:11] <jynus>	 finished
[22:41:15] <mutante>	 eh.. 
[22:41:23] <jynus>	 are you sure 42 wasn't the damaged one?
[22:41:27] <mutante>	 but where are the recoveries then
[22:41:35] <mutante>	 yea, see the lines like this
[22:41:37] <mutante>	 14:06 < icinga-wm> PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6
[22:41:40] <mutante>	 3049 at the end
[22:41:49] <jynus>	 let me ssh
[22:42:26] <jynus>	 yes, 3049 is alive
[22:42:51] <jynus>	 do not know about the service, but the machine is
[22:42:53] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[22:43:12] <jynus>	 there were 3 incidents, 2 came back alive, 42 didn't
[22:44:13] <jynus>	 1049 3049 up, 3042 down, agree?
[22:44:47] <jynus>	 (icinga agrees with me, at least)
[22:48:18] <mutante>	 jynus: yes, agree
[22:48:27] <jynus>	 let me powercycle 42 once again to be 100% sure that is the broken one
[22:48:57] <jynus>	 *3049*
[22:49:45] <mutante>	 icinga says 3042 is still borked
[22:49:51] <mutante>	 while 3049 is happy
[22:50:00] <jynus>	 yes, for that, there is nothing to lose
[22:53:20] <jynus>	 !log powercycling cp3042 to test it is really the broken one
[22:53:20] <wikibugs>	 6operations, 10ops-esams: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1983006 (10Dzahn)
[22:53:20] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:53:20] <grrrit-wm>	 (03PS2) 10Dzahn: varnish/misc-web: remove caesium backend, decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/267334 (https://phabricator.wikimedia.org/T125165) 
[22:53:20] <grrrit-wm>	 (03CR) 10Subramanya Sastry: "I see .. On my laptop (trusty), I see this ... so, I wonder if the fact that I built the packages on a trusty VM means this won't work on" [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry)
[22:53:24] <jynus>	 confirmed, 3042 is the one with a kernel panic (maybe you got confused with the other machine because of the similar name)
[22:53:53] <grrrit-wm>	 (03PS4) 10Subramanya Sastry: visuadiff: add dependencies on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 
[22:54:30] <wikibugs>	 6operations, 10ops-esams: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1983012 (10jcrespo) ``` [    0.000000] ACPI: LAPIC (acpi_id[0x2e] lapic_id[0x2a] enabled) [    0.000000] ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x10] enabled) [    0.000000] ACPI: LAPIC (acpi_id[0x30] lapic_...
[22:55:00] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] varnish/misc-web: remove caesium backend, decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/267334 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn)
[22:56:44] <grrrit-wm>	 (03PS1) 10Dzahn: releases: switch reprepro upload server to bromine [puppet] - 10https://gerrit.wikimedia.org/r/267385 (https://phabricator.wikimedia.org/T124261) 
[22:56:45] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[22:56:50] <grrrit-wm>	 (03CR) 10Subramanya Sastry: [C: 04-1] "Actually hold on ... since I switched to upright diff, I may not need canvas anyway since that is used by the resemble package that I am n" [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry)
[22:57:46] <grrrit-wm>	 (03PS2) 10Dzahn: releases: switch reprepro upload server to bromine [puppet] - 10https://gerrit.wikimedia.org/r/267385 (https://phabricator.wikimedia.org/T124261) 
[22:58:12] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] releases: switch reprepro upload server to bromine [puppet] - 10https://gerrit.wikimedia.org/r/267385 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn)
[22:59:00] <wikibugs>	 6operations, 10MediaWiki-Authentication-and-authorization: ~3000% increase in session redis memory usage, causing evictions and session loss - https://phabricator.wikimedia.org/T125267#1983035 (10ori) 3NEW
[23:01:52] <wikibugs>	 6operations, 5Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#1983056 (10Dzahn) @Papaul see my changes above. Could you follow-up with the DNS removal and then move the ticket to ops-eqiad or make a new ticket for the final decom steps (wipe disk, remove from rack etc)? thanks
[23:03:03] <grrrit-wm>	 (03PS5) 10Subramanya Sastry: visualdiff: add dependencies on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 
[23:06:08] <grrrit-wm>	 (03CR) 10Subramanya Sastry: "https://github.com/wikimedia/integration-visualdiff/commit/39f522385f28f4b6b6f4129bea4f3f48721b5573 removes resemblejs which removes the c" [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry)
[23:13:18] <grrrit-wm>	 (03CR) 10Dzahn: "yes, it is. thanks, amending" [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn)
[23:13:22] <grrrit-wm>	 (03PS3) 10Dzahn: admin: replace caesium with bromine in enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T124261) 
[23:14:55] <grrrit-wm>	 (03CR) 10Alex Monk: [C: 031] "probably correct, but I'm not a releaser or ops" [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn)
[23:15:32] <grrrit-wm>	 (03PS4) 10Dzahn: admin: replace caesium with bromine in enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T124261) 
[23:15:55] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] "yes, same role class, same user, just different hostname" [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn)
[23:21:26] * bd808 is running sync-file
[23:22:04] <bd808>	 ori: is that "LightProcess" fix easy?
[23:22:09] <bd808>	 it's f'ing annoying
[23:22:43] <logmsgbot>	 !log bd808@mira Synchronized php-1.27.0-wmf.11/includes/session/SessionBackend.php: Testing proposed fix for T125267 (duration: 01m 26s)
[23:22:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:22:56] <bd808>	 anomie, ori, greg-g: ^
[23:23:39] <ori>	 graph to watch: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Memcached+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report
[23:24:27] <ori>	 the periodicity of that graph is interesting
[23:24:53] <ori>	 peaks every 15m
[23:27:14] <ori>	 bd808: my idea of disabling LightProcess entirely for CLI mode is probably not great, since I believe we have long-running maintenance scripts that shell out
[23:27:47] <ori>	 it may have been fixed upstream, we are pretty far behind
[23:28:36] <ori>	 bytes_out is climbing again toward the next peak
[23:29:08] <ori>	 bytes_in is slightly depressed, almost certainly as a result of the patch, but it didn't fix the issue by the looks of it
[23:29:52] <bd808>	 and we are pretty sure this is just caused by redis traffic, right? Not some memc regression somewhere else?
[23:30:29] <ori>	 of that i am completely sure; redis memory usage is very stable at ~15mb normally, it's at 500
[23:30:39] <bd808>	 *nod*
[23:30:52] <bd808>	 and we don't store anything other than sessions in this redis?
[23:31:44] <bd808>	 peak matches 15 min ago. so no joy yet
[23:32:05] <ori>	 i think GettingStarted stores cleanup category memberships, but it has been doing that for ages, and it hasn't had code changes recently AFAIK
[23:32:17] <ori>	 I'll run MONITOR on a redis instance to confirm that the traffic is due to session-related keys
[23:32:50] <bd808>	 No GettingStarted changes shown on https://www.mediawiki.org/wiki/MediaWiki_1.27/wmf.11
[23:33:07] <grrrit-wm>	 (03PS6) 10Dzahn: visualdiff: add dependencies on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry)
[23:34:09] <grrrit-wm>	 (03PS1) 10Mobrovac: MobileApps: Change RESTBase URI [puppet] - 10https://gerrit.wikimedia.org/r/267392 (https://phabricator.wikimedia.org/T125252) 
[23:34:55] <greg-g>	 it's 3:35, btw, I'd like us to resolve this or rollback sessionmanager by 4:15 to give us some bake time before we all leave for the night
[23:35:53] <ori>	 100 lines of redis activity from mc1001: https://dpaste.de/SJzw/raw
[23:35:56] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] visualdiff: add dependencies on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry)
[23:37:23] <mutante>	 !log ruthenium - git pull origin in /srv/visualdiff/
[23:37:26] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:38:16] <ori>	 so, there are a few other users, it looks like (abusefilter, chronologyprotector) but it's definitely session stuff
[23:38:37] <ori>	 to rule out the possibility of it being something else that is storing huge values, i ran redis-cli --big-keys: https://dpaste.de/WcaH/raw
[23:38:59] <chasemp>	 91 session lines and 8 others
[23:40:20] <bd808>	 So the call to init the session used to be guarded by "if ( $wgRequest->checkSessionCookie() || isset( $_COOKIE[$wgCookiePrefix . 'Token'] ) )". That seems to be gone now.
[23:40:32] <bd808>	 anomie: are we fetching from redis much more often?
[23:40:46] <anomie>	 bd808: Maybe.
[23:40:57] <anomie>	 Is the problem fetches and not storing data?
[23:41:19] <ori>	 yes
[23:41:23] <bd808>	 the traffic spike is more data being fetched from redis to MW
[23:41:30] <mutante>	 !log ruthenium - restart parsoid-rt-client, parsoid-vd-client
[23:41:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:41:35] <bd808>	 but apparently data stored in redis is up as well
[23:41:57] <bd808>	 but if we are starting more sessions I suppose that would increase the stored data
[23:41:58] <grrrit-wm>	 (03PS1) 10Subramanya Sastry: testreduce: Remove ensure => latest from the repo declaration [puppet] - 10https://gerrit.wikimedia.org/r/267393 
[23:42:23] <grrrit-wm>	 (03CR) 10Dzahn: "i did the git pull origin in both places, visualdiff got updated and testreduce was already latest version" [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry)
[23:42:47] <ori>	 bytes_in is up, but the scale is different
[23:42:48] <ori>	 http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&title=&vl=&x=&n=&hreg[]=mc1001&mreg[]=bytes_in&gtype=line&glegend=show&aggregate=1&embed=1&_=1454110869000
[23:42:54] <grrrit-wm>	 (03CR) 10Jcrespo: "It is probably one of those:" [puppet] - 10https://gerrit.wikimedia.org/r/267328 (owner: 10Jcrespo)
[23:43:02] <ori>	 it went from 1.5 to 2.2mb
[23:49:55] <icinga-wm>	 PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: puppet fail
[23:50:23] <mutante>	 subbu: we got a new problem on ruthenium 
[23:50:48] <subbu>	 oh?
[23:50:59] <mutante>	 Error: Could not retrieve catalog from remote server: Error 400 on SERVER: undefined method `function_create_resources' for nil:NilClass at /etc/puppet/modules/visualdiff/manifests/init.pp:5 
[23:51:05] <mutante>	 uhm..
[23:53:51] <jynus>	 !log restarted db1018 replication (and its codfw slaves) after a (somewhat) failed maintenance
[23:53:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master