[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160129T0000). Please do the needful. [00:00:05] ebernhardson yurik Jdlrobson bmansurov Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:11] delaying swat [00:00:13] (03Merged) 10jenkins-bot: Revert "wgRCWatchCategoryMembership true on wikipedias & commons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267189 (owner: 10Alex Monk) [00:00:23] 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1980344 (10Yurik) @bblack, the 4 test servers have been performing admirably, so if possible, it would be good to keep them as production and match them in another DC for redundancy. [00:00:37] yes, SWAT is delayed until further notice, sorry for the convenience [00:01:15] greg-g, thanks, that is very convenient, yes :D [00:01:46] (03PS1) 10Subramanya Sastry: parsoid-vd-client & diffservice: Use uprightdiff for diffing images [puppet] - 10https://gerrit.wikimedia.org/r/267190 [00:02:01] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/267189/2 (duration: 01m 11s) [00:02:03] yurik: the convenience is for our users who will be happier when we fix this UBN! first :) [00:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:02:23] also, a lovely joke from the late great Mitch Hedberg [00:02:39] 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1980347 (10Tfinc) This will likely fit under the strategic budget so we'll need a brief narrative about going default on Wikipedia and any other projects to explain the increase of machines. [00:03:00] greg-g, what's a UBN? 
[00:03:10] Unbreak Now [00:03:33] ah yes, they usually are kinda nasty, aren't they [00:03:49] (03PS9) 10Andrew Bogott: Keystone: Adopt a multi-domain model [puppet] - 10https://gerrit.wikimedia.org/r/244350 [00:03:51] (03PS1) 10Andrew Bogott: Define wgOpenStackManagerProject [puppet] - 10https://gerrit.wikimedia.org/r/267192 (https://phabricator.wikimedia.org/T115029) [00:04:12] yurik: too many of them this week [00:04:16] 6operations: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1980354 (10Dzahn) [00:04:18] 6operations, 5Patch-For-Review: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1980352 (10Dzahn) 5Open>3Resolved maybe some day round 2 , adding a CSS with the mediawiki.org style and a logo? [00:04:18] hence my high blood pressure :/ [00:04:33] 6operations: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1980355 (10Dzahn) [00:05:50] greg-g, i heard herbal tea somehow brings down the blood pressure as well as causes drinker not to distance oneself from the world's ills [00:06:04] greg-g: unbreak now is what we do best [00:06:08] ;-) [00:06:08] (03PS1) 10Rush: diamond: nfsiostat as a collector [puppet] - 10https://gerrit.wikimedia.org/r/267193 [00:06:09] without the "not" [00:06:13] jdlrobson: :) [00:06:20] no need for high blood pressure [00:06:23] we 0wn at fixing those [00:06:27] (03PS2) 10Rush: diamond: nfsiostat as a collector [puppet] - 10https://gerrit.wikimedia.org/r/267193 [00:07:25] ok, SWAT is back on the menu [00:07:29] WOOO [00:07:33] * greg-g is tired and making too many cultural references [00:07:33] okay [00:09:30] greg-g, you should talk to oliver - he loves to make all sorts of weird references ... 
that only he himself gets [00:09:36] :-P [00:10:59] (03PS1) 10Dereckson: Enable SandboxLink on or.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267194 (https://phabricator.wikimedia.org/T124614) [00:12:22] (03CR) 10Bmansurov: "Nope, swatters suggested that I used the window." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [00:12:37] yurik: yeah, he's too extreme for even me [00:12:38] :) [00:13:04] 9 patches? [00:13:20] ebernhardson, one of yours is V-1'd [00:13:43] Hi. Krenair: I added the 9th, it's a throttle rule, so we need it. [00:15:13] yurik first [00:15:21] (03PS3) 10Alex Monk: Update graph settings - should be noop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267060 (owner: 10Yurik) [00:15:26] Krenair, naturally! [00:15:27] (03CR) 10Alex Monk: [C: 032] Update graph settings - should be noop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267060 (owner: 10Yurik) [00:15:50] (03PS1) 10Dereckson: Enable WikidataPageBanner on es.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000) [00:16:02] (03Merged) 10jenkins-bot: Update graph settings - should be noop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267060 (owner: 10Yurik) [00:16:09] * yurik hides [00:17:31] Krenair: checking [00:17:46] !log krenair@mira Synchronized wmf-config: https://gerrit.wikimedia.org/r/#/c/267060/ (duration: 01m 12s) [00:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:17:51] yurik, ^ [00:18:38] (03PS2) 10EBernhardson: Point CirrusSearch queries to local datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267053 [00:18:45] was just a unit test i forgot to update... [00:18:46] Krenair, seems to be ok [00:18:52] yurik, this second patch is not reviewed by someone else? 
[00:19:19] (03CR) 10jenkins-bot: [V: 04-1] Point CirrusSearch queries to local datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267053 (owner: 10EBernhardson) [00:19:25] Krenair, no, but i can very quickly get it reviewed - it s a oneliner [00:19:31] ok, please do so [00:20:22] :S [00:20:58] jdlrobson, bmansurov: okay, there's some dependencies thing here [00:21:08] Krenair, max just merged it [00:21:23] Krenair: what do you need? [00:21:35] jdlrobson proposes https://gerrit.wikimedia.org/r/#/c/267025/ which depends on https://gerrit.wikimedia.org/r/#/c/264909/ [00:21:47] which is not on either prod branch yet [00:22:13] Krenair: that's fine. It's harmless and will be riding the train next week. [00:22:15] but I suppose we can do the config early [00:22:18] ok [00:22:32] (03PS3) 10EBernhardson: Point CirrusSearch queries to local datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267053 [00:22:34] yurik, he merged it to the deployment branch directly? sigh... [00:22:48] Krenair, yeah, i guess he didn't realize it was not on master [00:23:05] oh well [00:23:06] not a bigie, right? :) [00:23:21] since you wanted to deploy it anyway :D [00:24:16] not a big deal because I'm looking at it as part of this swat, it's not taking me completely by surprise [00:25:01] !log krenair@mira Synchronized php-1.27.0-wmf.11/extensions/Graph/modules/graph2.js: https://gerrit.wikimedia.org/r/#/c/267065/ (duration: 01m 11s) [00:25:01] I actually have someone present in the channel who knows what the patch is, which is better than most other times people send me surprises via the deployment branches [00:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:25:05] yurik, ^ [00:25:25] checking... [00:25:50] (03CR) 10Mobrovac: "/usr/local/bin/uprightdiff is somehow magically present on the node?" 
[puppet] - 10https://gerrit.wikimedia.org/r/267190 (owner: 10Subramanya Sastry) [00:26:00] 6operations: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1980389 (10yuvipanda) 3NEW [00:26:10] bblack: ^ [00:26:50] 6operations, 10Wikimedia-DNS: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1980396 (10Krenair) [00:27:20] 6operations, 10Wikimedia-DNS: Internal DNS resolver responds with NXDOMAIN for localhost AAAA - https://phabricator.wikimedia.org/T125170#1980408 (10yuvipanda) local testing with dnsmasq (for example) returns ::1 for localhost. [00:30:27] everything okay, yurik? [00:30:37] Krenair, yep, all's good ,thx! [00:30:56] my graphoid service is acting up, but that's to be expected i guess :) [00:31:05] to be fixed on monday [00:31:23] (03PS3) 10Alex Monk: Add sampling rates for mobile web language switcher in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267025 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [00:31:28] (03CR) 10Alex Monk: [C: 032] Add sampling rates for mobile web language switcher in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267025 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [00:31:47] yurik: your skills at inspiring confidence could use some updating :P [00:31:57] (03Merged) 10jenkins-bot: Add sampling rates for mobile web language switcher in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267025 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [00:31:57] 6operations, 6Labs: Manual creation of labs account - https://phabricator.wikimedia.org/T125172#1980416 (10Cobi) 3NEW [00:32:20] greg-g, being pessimist ensures that life is full of positives ;) [00:32:51] you are assuming that things turn out better than the pessimist expected, which is an optimistic thing to do [00:32:52] and when there are no positives, it 
was expected ) [00:32:52] heh [00:33:02] I wonder if anyone remembers how to let an old SVN account into labs [00:33:20] what have I become? why am I reading nginx source on a thursday evening... [00:33:42] YuviPanda: could be worse, i got max to read hhvm source on a thursday evening ;) [00:33:47] haha [00:33:48] * yurik googles how to create a firmware virus in qbasic [00:34:05] so far I've run into: bugs in nginx, a bug in docker, a bug in my code [00:34:19] the holy trinity! [00:34:25] Krenair: https://phabricator.wikimedia.org/T55793 [00:35:09] an RT reference, lovely [00:35:09] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/267025/ (duration: 01m 12s) [00:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:35:24] -> https://phabricator.wikimedia.org/T83042 [00:35:49] thanks Dereckson [00:35:53] You're welcome. [00:36:23] bmansurov, ^ [00:36:44] ok thanks [00:36:52] and jdlrobson ^ [00:37:27] second patch is in jenkins [00:37:32] thanks Krenair [00:38:40] while that's going, ebernhardson [00:38:52] * jdlrobson waits for jenkins [00:39:47] (03CR) 10MarcoAurelio: "Looks OK except for the minor alphabetical issue in PS3. Deployer should check and run optiPNG in the logo to ensure it displays OK in the" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: 10MtDu) [00:39:49] (03PS4) 10Alex Monk: Point CirrusSearch queries to local datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267053 (owner: 10EBernhardson) [00:40:01] (03CR) 10Alex Monk: [C: 032] Point CirrusSearch queries to local datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267053 (owner: 10EBernhardson) [00:40:27] ebernhardson, you are still here, right? 
[00:40:31] (03Merged) 10jenkins-bot: Point CirrusSearch queries to local datacenter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267053 (owner: 10EBernhardson) [00:40:44] Krenair: yup [00:40:48] ok, just checking :) [00:41:35] looks like CirrusSearch-common has to go before InitialiseSettings [00:42:21] !log krenair@mira Synchronized tests/cirrusTest.php: https://gerrit.wikimedia.org/r/#/c/267053/ (duration: 01m 11s) [00:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:42:29] Krenair: yes [00:42:41] i just looked at the graphoid service - seems like for some strange reason the deployed version is different from the tip of the graphoid deploy. If greg-g is ok with it, I would like to git deploy sync graphoid again. Its not a huge issue, but a number of romanian graphs are not drawing correctly [00:43:07] is graphoid deployed by trebuchet? [00:43:12] Krenair, correct [00:43:27] yurik: finnnne [00:43:36] * yurik gives greg-g a flower [00:43:47] !log krenair@mira Synchronized wmf-config/CirrusSearch-common.php: https://gerrit.wikimedia.org/r/#/c/267053/ (duration: 01m 10s) [00:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:43:56] yurik: I'll turn it into tea [00:44:28] ebernhardson, syncing InitialiseSettings now [00:44:33] are you sure i didn't pick a mildly poisonous one? just enough to make you sleepy? and in your absence do all sorts of nasty deployments? [00:44:51] yurik: I'll take the rest :) [00:45:24] greg-g, and you think that it is me who is a risk taker?!? 
[00:45:28] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/267053/ (duration: 01m 10s) [00:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:45:56] well, search still works [00:46:13] yea, morelike still works as well [00:46:18] (the one switched to codfw) [00:46:24] Krenair, should i git deploy now, or wait for you? [00:46:42] I don't think git deploys affect me [00:46:52] yurik: go ahead [00:47:06] let's get things wrapped up before it's EOD, ideally [00:48:59] greg-g: technically, i think its about 10 hours past EOD for yuri (3:45am :P) [00:49:09] !log krenair@mira Synchronized php-1.27.0-wmf.11/extensions/MobileFrontend/resources/skins.minerva.editor/init.js: https://gerrit.wikimedia.org/r/#/c/267168/ (duration: 01m 12s) [00:49:20] ebernhardson is spying on me! [00:49:24] ebernhardson: my timezone is the only one that matters, I guess [00:49:27] :) [00:49:30] (03PS2) 10Alex Monk: Return more like search queries to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266995 (owner: 10EBernhardson) [00:49:39] (03CR) 10Alex Monk: [C: 032] Return more like search queries to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266995 (owner: 10EBernhardson) [00:50:07] (03Merged) 10jenkins-bot: Return more like search queries to codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266995 (owner: 10EBernhardson) [00:50:28] !log synced latest graphoid [00:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:50:31] all's good [00:50:33] thanks! [00:50:36] jdlrobson, syncing [00:50:38] jdlrobson, ^ [00:50:42] greg-g, ^^^ [00:51:01] greg-g, and that is why you are moving to east coast :-P [00:51:22] jdlrobson, can you confirm please? [00:51:31] to be closer to the proletariat [00:51:32] Krenair: on it [00:51:48] RTL fixed! yay! 
[00:51:52] ebernhardson, syncing [00:52:54] Krenair: kk [00:53:00] !log krenair@mira Synchronized wmf-config/CirrusSearch-production.php: https://gerrit.wikimedia.org/r/#/c/266995/ (duration: 01m 11s) [00:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:53:04] ebernhardson, ^ [00:54:11] Krenair: queries look to be working, no log explosion so probably good (also i tested this before) [00:54:15] but i'll keep an eye on my dashboards [00:54:21] k [00:55:39] (03PS2) 10Alex Monk: Bump up the QuickSurveys sampling rates for es and fa wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267071 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov) [00:55:43] bmansurov, ping [00:55:46] yes [00:55:48] (03CR) 10Alex Monk: [C: 032] Bump up the QuickSurveys sampling rates for es and fa wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267071 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov) [00:56:13] (03Merged) 10jenkins-bot: Bump up the QuickSurveys sampling rates for es and fa wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267071 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov) [00:57:44] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/267071/ (duration: 01m 11s) [00:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:58:04] bmansurov, ^ [00:58:18] Krenair: looks good [00:59:14] (03PS6) 10Alex Monk: Add sampling rates for mobile web language switcher on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [00:59:24] (03CR) 10Alex Monk: [C: 032] Add sampling rates for mobile web language switcher on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [00:59:50] (03Merged) 10jenkins-bot: Add sampling 
rates for mobile web language switcher on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [01:01:19] (03PS3) 10Alex Monk: Santiago Editatón throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267186 (https://phabricator.wikimedia.org/T125081) (owner: 10Dereckson) [01:01:22] !log krenair@mira Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/265292/ (duration: 01m 14s) [01:01:25] (03CR) 10Alex Monk: [C: 032] Santiago Editatón throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267186 (https://phabricator.wikimedia.org/T125081) (owner: 10Dereckson) [01:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:01:29] bmansurov, ^ [01:01:39] Krenair: thanks! [01:01:54] (03Merged) 10jenkins-bot: Santiago Editatón throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267186 (https://phabricator.wikimedia.org/T125081) (owner: 10Dereckson) [01:02:57] Krenair: does it take time before I see the change? [01:03:09] bmansurov, yes, beta doesn't receiving our syncs [01:03:14] receive* [01:03:18] !log krenair@mira Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/267186/ (duration: 01m 09s) [01:03:21] it automatically updates every so often [01:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:03:23] Dereckson, fyi ^ [01:03:27] greg-g, think that's it [01:03:31] Krenair: ok thanks [01:03:43] bmansurov, it should happen soon (tm) [01:03:54] fingers crossed [01:03:57] Krenair: thank you muchly [01:04:36] Thanks for the deploy Krenair. 
[01:05:27] alright, I'm going afk for the evening, thanks all [01:06:01] (03PS1) 10MaxSem: Reduce Kafka timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267200 (https://phabricator.wikimedia.org/T125084) [01:07:33] (03PS1) 10Ori.livneh: Revert "Revert "Autopromotion: remove deprecated onView event, fix INGROUPS"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267201 [01:07:40] (03CR) 10Ori.livneh: [C: 032] Revert "Revert "Autopromotion: remove deprecated onView event, fix INGROUPS"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267201 (owner: 10Ori.livneh) [01:08:25] Something weird's going on. Page categorization events are showing up on enwiki watchlists, but $wgRCWatchCategoryMembership is still set to false on enwiki [01:08:30] Any idea what's up there? [01:08:39] (03Merged) 10jenkins-bot: Revert "Revert "Autopromotion: remove deprecated onView event, fix INGROUPS"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267201 (owner: 10Ori.livneh) [01:08:58] 6operations, 6Labs: Manual creation of labs account - https://phabricator.wikimedia.org/T125172#1980554 (10Krenair) Instructions in T83042, LDAP admins CC'd [01:09:14] tto, it was enabled and disabled [01:09:38] So the events are still in the watchlist table, then. Right [01:09:53] (or recentchanges or whatever you call it) [01:10:01] that's not quite how the watchlist works, but sure [01:10:06] yes [01:21:15] (03CR) 10MtDu: "I ran optipng on the logo before I pushed the patch. Is that enough or what else do I need to do?" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267170 (https://phabricator.wikimedia.org/T124881) (owner: 10MtDu) [01:24:33] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). 
[01:25:43] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/). [01:26:58] (03PS2) 10Ori.livneh: Enable persistent redis connections for job runners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261306 (owner: 10Aaron Schulz) [01:27:23] (03CR) 10Ori.livneh: [C: 032] "“Oh well," McWatt sang, "what the hell.”" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261306 (owner: 10Aaron Schulz) [01:27:47] (03Merged) 10jenkins-bot: Enable persistent redis connections for job runners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/261306 (owner: 10Aaron Schulz) [01:29:22] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [01:29:53] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [01:31:56] !log ori@mira Synchronized wmf-config: I83da57cf: Enable persistent redis connections for job runners (duration: 01m 11s) [01:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:02:59] mkdir: cannot create directory ‘/sys/fs/cgroup/memory/mediawiki/job/13186’: File exists [02:03:00] limit.sh: failed to create the cgroup. 
[02:03:00] sigh [02:03:03] didn't this get fixed once [02:03:06] (silver) [02:11:03] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [02:14:33] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [02:15:22] (03CR) 10Andrew Bogott: [C: 032] Define wgOpenStackManagerProject [puppet] - 10https://gerrit.wikimedia.org/r/267192 (https://phabricator.wikimedia.org/T115029) (owner: 10Andrew Bogott) [02:17:35] (03PS3) 10Rush: diamond: nfsiostat as a collector [puppet] - 10https://gerrit.wikimedia.org/r/267193 [02:20:45] (03PS4) 10Rush: diamond: nfsiostat as a collector [puppet] - 10https://gerrit.wikimedia.org/r/267193 [02:23:22] (03CR) 10Rush: [C: 032] diamond: nfsiostat as a collector [puppet] - 10https://gerrit.wikimedia.org/r/267193 (owner: 10Rush) [02:25:29] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 10m 40s) [02:25:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:26:01] (03PS1) 10Rush: diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - 10https://gerrit.wikimedia.org/r/267204 [02:26:36] (03PS2) 10Rush: diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - 10https://gerrit.wikimedia.org/r/267204 [02:32:56] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Jan 29 02:32:56 UTC 2016 (duration 7m 28s) [02:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:33:47] (03PS3) 10Rush: diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - 10https://gerrit.wikimedia.org/r/267204 [02:35:37] (03PS4) 10Rush: diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - 10https://gerrit.wikimedia.org/r/267204 [02:35:48] (03PS5) 10Rush: diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - 10https://gerrit.wikimedia.org/r/267204 [02:35:54] PROBLEM - High load average on 
labstore1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0] [02:38:10] (03CR) 10Rush: [C: 032] diamond: enable nfsiostat on labs instances with nfs mounts [puppet] - 10https://gerrit.wikimedia.org/r/267204 (owner: 10Rush) [02:42:54] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [02:57:07] (03PS2) 10Rush: diamond: monitor nscd behavior for ldap clients [puppet] - 10https://gerrit.wikimedia.org/r/265847 [03:17:19] 6operations, 10OTRS, 7HTTPS: ssl certificate replacement: ticket.wikimedia.org (expires 2016-02-16) - https://phabricator.wikimedia.org/T122320#1980721 (10Matthewrbowker) [04:13:42] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 12.00% of data above the critical threshold [100000000.0] [04:28:41] (03PS1) 10BBlack: dnsrecursor: add localhost data [puppet] - 10https://gerrit.wikimedia.org/r/267208 [04:36:13] (03CR) 10Subramanya Sastry: "Right now, it exists because Tim built it on ruthenium using the /srv/uprightdiff repo that has been checked out via puppet. In the future" [puppet] - 10https://gerrit.wikimedia.org/r/267190 (owner: 10Subramanya Sastry) [04:41:53] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [04:49:58] (03CR) 10Mobrovac: [C: 031] "I see. This is not the happiest solution, but it'll work for the time being." [puppet] - 10https://gerrit.wikimedia.org/r/267190 (owner: 10Subramanya Sastry) [05:08:31] (03PS2) 10Yuvipanda: dnsrecursor: add localhost data [puppet] - 10https://gerrit.wikimedia.org/r/267208 (https://phabricator.wikimedia.org/T125170) (owner: 10BBlack) [05:49:10] I'm getting logged out within the same browser session again. It seemed to go away for a few days after the first day it happened, but it's returned. [05:53:21] repeatedly? all user sessions were killed the other day for security reasons [05:59:12] the session is limited to today. 
[06:00:23] PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50430 bytes in 0.133 second response time [06:00:42] PROBLEM - HHVM rendering on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50430 bytes in 0.101 second response time [06:07:24] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.058 second response time [06:07:43] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 70151 bytes in 0.160 second response time [06:30:13] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:13] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:33] PROBLEM - puppet last run on chromium is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:34] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:52] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:33] PROBLEM - puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:42] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:12] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:28] wctaiwan: can you get us some details of the sessions this is happening on? [06:39:48] (03CR) 10Luke081515: [C: 031] Enable WikidataPageBanner on es.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000) (owner: 10Dereckson) [06:40:08] what kind of details? I'm using Firefox (latest stable); I accept session cookies, but not third party ones. Cookies are not kept beyond the current session. [06:40:43] Indication that it's happening is that I would be logged out, but when I go to login I'd see my username filled in, which it wouldn't be had I not been logged in (unless I'd logged out, but I generally don't bother to). 
[06:42:23] it seems to happen after a period of inactivity? I don't recall evert being logged in and navigating to another page to find that I'd been logged out. But I'm not 100% sure that's not a coincidence. [06:43:02] are you moving across wikis? Logging in to a particular wiki? [06:43:46] we are having some vaguely similar reports here -- https://phabricator.wikimedia.org/T124252#1979688 [06:43:48] hmm, that's a good point, actually. I might have logged in on meta and not enwiki. In which case it's PEBKAC. [06:43:54] but no actionable details yet [06:44:51] okay, I think that's unlikely, since the username wouldn't be pre-filled on enwiki if I logged in on meta (I just tested). [06:45:15] logging in on meta and then being logged in on enwiki should generally work for sure [06:45:54] assuming that you got all the interactions with loginwiki either via the 3rd party cookie + javascript or by the 1x1 images [06:46:01] well, not for me, since meta wouldn't be able to set a cookie for *.wikipedia.org [06:46:11] yeah, I wouldn't have, because I block third-party cookies. [06:46:37] right. that's the scenario that the images are meant to work with [06:46:51] I block 3rd party cookies too [06:46:57] I think Firefox catches those. Otherwise it'd be trivial to work around its tracking protection. [06:47:37] wctaiwan: are you running incognito too? [06:47:41] yes. [06:47:50] http://i.imgur.com/8pZHgIS.png are my privacy settings in firefox [06:47:56] ah. that may certainly play into this [06:48:45] Yeah, it could be related. But this is difficult to pin down because steps for reproduction would be "log in, stop looking at wikipedia, and wait for a few hours and then remember to check" [06:49:01] I'm not even sure at this point I'm just logging out and forgetting I did. [06:49:21] s/I'm just/if I'm not just/ [06:49:59] wctaiwan: I think it's worth filing a bug about with the description you have given thus far [06:50:20] sure, I can do that. 
Anything I should look for next time I suspect it's happening? [06:50:22] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Puppet has 1 failures [06:51:17] getting the cookies that you have on when you suspect you've been logged out would be good. Having the cookies form before that as well would be even better [06:51:54] s/form/from/ [06:52:39] Hmm, I don't think Firefox shows any cookies when you're using private browsing :/ [06:54:38] they should show in the developer tools [06:55:00] nope [06:55:00] https://bugzilla.mozilla.org/show_bug.cgi?id=823941 [06:55:12] RECOVERY - puppet last run on chromium is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:56:19] anyway, I'll file the bug. Thanks. [06:56:29] wctaiwan: I'm looking at cookies attached to a GET of enwiki in an incognito FF 44 sesion right now [06:56:42] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:56:54] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:56:54] but I'm looking specifically at the response in the network tab [06:56:57] ohh [06:57:03] I was looking in the storage tab [06:57:14] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:57:35] okay, I'll try to get that then. [06:57:54] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:57] cool. thanks for reporting and being willing to help debug a bit [06:58:05] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:09] np. thanks for looking into it. 
[06:58:23] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:32] wctaiwan: please cc me on the bug you file [06:58:33] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:37] will do [07:16:52] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [08:24:54] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM, 5Patch-For-Review: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1980984 (10Joe) I think we should just backport this patch to our current package while we are confident releasing a new one. This... [08:45:03] PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:46:42] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 70151 bytes in 0.307 second response time [09:07:23] (03PS1) 10Muehlenhoff: Update debian-targets patch for 1.0.2f [debs/openssl] - 10https://gerrit.wikimedia.org/r/267218 [09:07:47] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update debian-targets patch for 1.0.2f [debs/openssl] - 10https://gerrit.wikimedia.org/r/267218 (owner: 10Muehlenhoff) [09:20:55] (03Abandoned) 10Ema: eqiad: add text nodes to mobile cluster [puppet] - 10https://gerrit.wikimedia.org/r/266503 (https://phabricator.wikimedia.org/T109286) (owner: 10Ema) [09:26:11] Krenair: still awake? 
[09:26:40] *highly doubts it* [09:32:05] or greg-g (I'm trying to hunt down the reason for https://gerrit.wikimedia.org/r/#/c/267189/) [09:38:53] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0] [09:48:44] (03CR) 10Addshore: "At a guess the revert is due to https://phabricator.wikimedia.org/T125147" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267189 (owner: 10Alex Monk) [09:49:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [09:59:05] (03CR) 10Addshore: [C: 04-1] "Per https://phabricator.wikimedia.org/T125147" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 (owner: 10Addshore) [09:59:11] (03CR) 10Addshore: [C: 04-1] "Per https://phabricator.wikimedia.org/T125147" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264734 (owner: 10Addshore) [10:01:54] 6operations, 10DBA: upgrade db servers to jessie - https://phabricator.wikimedia.org/T125028#1981051 (10jcrespo) 5Open>3Invalid a:3jcrespo As per Ops Meeting. [10:01:56] 6operations, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#1981054 (10jcrespo) [10:02:53] RECOVERY - Disk space on ms-be2015 is OK: DISK OK [10:05:21] 6operations: upgrade iron to jessie (or get rid of it) - https://phabricator.wikimedia.org/T125025#1981064 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [10:07:08] 6operations: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1981067 (10Legoktm) >>! In T125164#1980139, @Dzahn wrote: > - add a custom header file, so we display "Wikimedia Software Releases" > instead of just "Index of /" > (https://httpd.apache.org/docs/2.4/mod/m... 
[10:15:13] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:17:17] !log rolling restart of swift in esams [10:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:20:15] 6operations, 10ops-codfw: ms-be2015.codfw.wmnet: slot=8 dev=sdi failed - https://phabricator.wikimedia.org/T124056#1981088 (10fgiunchedi) 5Open>3Resolved disk rebuilding, resolving [10:23:10] 6operations: Move bacula director and storage daemon off helium? - https://phabricator.wikimedia.org/T123723#1981097 (10akosiaris) The storage daemon must be on hardware as it needs access to the disk shelf. The big hurdle to having the storage daemon to a VM is access to a lot of disk space which right now is s... [10:25:26] 6operations, 10MediaWiki-General-or-Unknown, 5Patch-For-Review: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1981104 (10akosiaris) I think we should merge T32452 in this one [10:31:13] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet last ran 21 hours ago [10:31:43] RECOVERY - salt-minion processes on mw1120 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:32:10] 6operations, 10ops-eqiad: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1981110 (10elukey) 3NEW [10:33:03] (03PS1) 10Giuseppe Lavagetto: Add support for float timeouts in socket streams [debs/hhvm] - 10https://gerrit.wikimedia.org/r/267228 (https://phabricator.wikimedia.org/T125084) [10:35:12] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 845 [10:35:22] RECOVERY - RAID on ms-be2003 is OK: OK: optimal, 13 logical, 13 physical [10:35:27] 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1981122 
(10akosiaris) Just sure of something. Currently the maps cache cluster is 2 boxes and is performing quite well (with minimal load). 4 does not sound bad to me, but do we have any numbers... [10:36:14] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1981123 (10fgiunchedi) 3NEW [10:36:42] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [10:37:13] 6operations, 7Swift: swift: puppetized mkfs/parted fails on ms-be2003, ms-be2015 / disk error - https://phabricator.wikimedia.org/T125013#1981129 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi re: ms-be2003 is {T125200} and ms-be2015 was {T124056} resolving this one, thanks @dzahn ! [10:38:01] (03PS2) 10BBlack: eqiad: remove most mobile frontends from cache_mobile [puppet] - 10https://gerrit.wikimedia.org/r/267160 (https://phabricator.wikimedia.org/T109286) [10:38:03] (03PS1) 10BBlack: eqiad: remove last cache_mobile frontend [puppet] - 10https://gerrit.wikimedia.org/r/267230 (https://phabricator.wikimedia.org/T122651) [10:40:12] RECOVERY - check_mysql on db1008 is OK: Uptime: 846116 Threads: 2 Questions: 5842269 Slow queries: 5686 Opens: 2405 Flush tables: 2 Open tables: 429 Queries per second avg: 6.904 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:41:05] (03PS3) 10BBlack: eqiad: remove most mobile frontends from cache_mobile [puppet] - 10https://gerrit.wikimedia.org/r/267160 (https://phabricator.wikimedia.org/T109286) [10:41:30] (03CR) 10BBlack: [C: 032 V: 032] eqiad: remove most mobile frontends from cache_mobile [puppet] - 10https://gerrit.wikimedia.org/r/267160 (https://phabricator.wikimedia.org/T109286) (owner: 10BBlack) [10:42:49] (03CR) 10BBlack: [C: 04-1] "Needs to wait for ok from analytics, probably Monday" [puppet] - 10https://gerrit.wikimedia.org/r/267230 (https://phabricator.wikimedia.org/T122651) (owner: 10BBlack) [10:43:51] (03PS4) 10BBlack: cache_parsoid: remove 
citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) [10:44:00] (03CR) 10BBlack: [C: 032 V: 032] cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [10:44:15] (03PS4) 10BBlack: cache_parsoid: remove restbase pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266748 (https://phabricator.wikimedia.org/T110475) [10:44:25] (03CR) 10BBlack: [C: 032 V: 032] cache_parsoid: remove restbase pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266748 (https://phabricator.wikimedia.org/T110475) (owner: 10BBlack) [10:46:29] 6operations, 10RESTBase, 6Services, 10Traffic, 5Patch-For-Review: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1981161 (10BBlack) 5Open>3Resolved a:3BBlack [10:46:33] 6operations, 6Services, 10Traffic: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1981163 (10BBlack) [10:46:37] 6operations, 6Services, 10Traffic: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1578453 (10BBlack) [10:47:16] 6operations, 5Patch-For-Review, 7Swift: swift upgrade plans - https://phabricator.wikimedia.org/T117972#1981172 (10fgiunchedi) FWIW swift 2.6 has been released 4 days ago, https://github.com/openstack/swift/blob/master/CHANGELOG [10:50:57] 6operations: upgrade swift servers from precise to jessie - https://phabricator.wikimedia.org/T125024#1981174 (10fgiunchedi) note these might get upgraded to trusty first, see also related {T117972} [10:54:31] (03PS1) 10Jcrespo: Add mysql grants for racktables [puppet] - 10https://gerrit.wikimedia.org/r/267232 [10:54:36] 6operations: Move bacula director and storage daemon off helium? 
- https://phabricator.wikimedia.org/T123723#1981176 (10akosiaris) 5Open>3declined a:3akosiaris Declining per IRC OK from @MoritzMuehlenhoff [10:55:16] (03Abandoned) 10Jcrespo: Racktables user placeholder [puppet] - 10https://gerrit.wikimedia.org/r/254288 (owner: 10Jcrespo) [10:56:25] (03PS2) 10Jcrespo: Add mysql grants for racktables [puppet] - 10https://gerrit.wikimedia.org/r/267232 [10:57:39] (03CR) 10Jcrespo: [C: 032] Add mysql grants for racktables [puppet] - 10https://gerrit.wikimedia.org/r/267232 (owner: 10Jcrespo) [11:10:41] (03PS1) 10BBlack: MW parsoid URLs: s/parsoidcache/parsoid/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267234 (https://phabricator.wikimedia.org/T110472) [11:12:27] 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1981202 (10Yurik) The 4x4 varnishes + 4x2 backends was initially suggested by @bblack as the minimal platform to serve all of Wikipedias. I tried stress-testing maps from multiple labs instances... [11:12:39] robh about? [11:13:08] jynus? [11:14:08] it is urgent. [11:14:19] !log disabled puppet on analytics1027 due to issues with Camus and HDFS [11:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:14:51] BBlack [11:15:53] Steinsplitter, yes? [11:16:19] <_joe_> Steinsplitter: if something is urgent, I guess there is an UBN! ticket on phabricator, right? [11:16:38] _joe_ can you please check how much mass messeges are queued on commons? [11:16:46] and if there are RIP ones. [11:17:46] <_joe_> I have no internal knowledge of that extension, it would take quite some time and I'm already working on an UBN! 
ticket right now [11:18:05] <_joe_> I can take a look at the jobqueue, if that's what that extension uses [11:18:13] maybe jynu know how [11:18:28] yes, they are in the jobqueue [11:18:49] <_joe_> the jobqueues are in good health atm [11:18:56] <_joe_> https://grafana.wikimedia.org/dashboard/db/job-queue-health [11:19:25] <_joe_> also https://grafana.wikimedia.org/dashboard/db/job-queue-rate [11:19:32] strange :/ [11:19:40] will file a bug on phab [11:19:50] thx [11:20:22] <_joe_> Steinsplitter: but this is by no means a measure of that extension working correctly [11:22:13] !log rolling restart of swift in codfw [11:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:11] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1981220 (10BBlack) Here's a better list: cxserver deploy: https://github.com/wikimedia/mediawiki-services-cxser... [11:32:18] ACKNOWLEDGEMENT - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi broken disk, https://phabricator.wikimedia.org/T125200 [11:35:11] 6operations, 5Patch-For-Review: move racktables to a VM - https://phabricator.wikimedia.org/T105555#1981235 (10jcrespo) [11:35:13] 6operations, 10DBA: mysql privs: restrict access to racktables to krypton - https://phabricator.wikimedia.org/T118816#1981233 (10jcrespo) 5Open>3Resolved Done on https://gerrit.wikimedia.org/r/#/c/267232/ [11:38:40] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Search-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1981236 (10akosiaris) >>! In T120281#1971185, @EBernhardson wrote: > Another option for analytics<->codfw that me and @S... 
[11:40:13] PROBLEM - Check size of conntrack table on mw1228 is CRITICAL: Connection refused by host [11:40:33] PROBLEM - Disk space on mw1228 is CRITICAL: Connection refused by host [11:40:42] PROBLEM - DPKG on mw1228 is CRITICAL: Connection refused by host [11:40:53] PROBLEM - salt-minion processes on mw1228 is CRITICAL: Connection refused by host [11:41:12] PROBLEM - NTP on mw1228 is CRITICAL: NTP CRITICAL: No response from NTP server [11:41:13] PROBLEM - configured eth on mw1228 is CRITICAL: Connection refused by host [11:41:14] PROBLEM - dhclient process on mw1228 is CRITICAL: Connection refused by host [11:41:24] PROBLEM - RAID on mw1228 is CRITICAL: Connection refused by host [11:41:34] PROBLEM - nutcracker process on mw1228 is CRITICAL: Connection refused by host [11:41:53] PROBLEM - nutcracker port on mw1228 is CRITICAL: Connection refused by host [11:43:17] <_joe_> !log uploaded hhvm_3.6.5+dfsg1-1+wm8 to trusty-wikimedia [11:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:49:06] 6operations, 6Collaboration-Team-Backlog, 10Flow: Flow messages are not editable and new topics can't be posted (API outage) - https://phabricator.wikimedia.org/T125080#1981245 (10mark) 5Open>3Resolved a:3mark We have no indication that anything is wrong other than some brief effects shortly after the... [11:50:57] (03PS1) 10KartikMistry: CX: Remove ContentTranslationCorpora setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267236 [11:53:31] 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1981249 (10BBlack) I don't think I've ever made recommendations about the backend service, just the 4x4 cache/termination layer. That part isn't really a "suggestion", it's an operational minimu... 
[11:56:07] (03PS1) 10Alexandros Kosiaris: package_builder: Improve README.md networking part [puppet] - 10https://gerrit.wikimedia.org/r/267237 [11:58:58] 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1981251 (10akosiaris) OK, that answers my question. Thanks! [12:02:00] (03PS2) 10Alexandros Kosiaris: package_builder: Improve README.md networking part [puppet] - 10https://gerrit.wikimedia.org/r/267237 [12:02:18] 6operations, 6Discovery, 10MediaWiki-Logging, 7HHVM, 5Patch-For-Review: MediaWiki monolog doesn't handle Kafka failures gracefully - https://phabricator.wikimedia.org/T125084#1981252 (10Joe) Patch applied and new package built. The package was installed on labs and my test of fsockopen now shows that th... [12:03:56] (03CR) 10Alexandros Kosiaris: [C: 032] package_builder: Improve README.md networking part [puppet] - 10https://gerrit.wikimedia.org/r/267237 (owner: 10Alexandros Kosiaris) [12:18:23] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:53] PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:20:02] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.176 second response time [12:21:35] 6operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 2 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#1981274 (10BBlack) 3NEW [12:22:23] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 70567 bytes in 0.961 second response time [12:22:37] PROBLEM - Corp OIT LDAP Mirror on pollux is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:23:11] hey [12:23:12] ldap down [12:23:21] ? [12:23:25] anyone investigating it? 
[12:23:31] I will now [12:23:42] PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.523 second response time [12:23:46] we're at lunch, but can open a laptop if needed [12:24:12] PROBLEM - salt-minion processes on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:24:13] I think the whole server is down, I can handle for now [12:24:23] PROBLEM - RAID on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:24:33] PROBLEM - dhclient process on pollux is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:25:23] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 454 bytes in 0.046 second response time [12:25:48] oh, it is a virtual server [12:26:03] RECOVERY - RAID on pollux is OK: OK: no RAID installed [12:28:04] should I open my laptop? [12:28:11] yes, probably [12:29:09] here now [12:29:48] I do not have the dns list downladed [12:30:40] (03PS1) 10Muehlenhoff: Add base::firewall to jobrunners mw1161-mw1169 (reprovisioned app servers) [puppet] - 10https://gerrit.wikimedia.org/r/267238 [12:30:57] !log force-rebooting pollux [12:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:33:12] RECOVERY - salt-minion processes on pollux is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:33:28] RECOVERY - Corp OIT LDAP Mirror on pollux is OK: LDAP OK - 0.118 seconds response time [12:33:29] RECOVERY - dhclient process on pollux is OK: PROCS OK: 0 processes with command name dhclient [12:33:38] ok [12:33:41] back to lunch now :) [12:33:46] ttyl! [12:34:15] paravoid: enjoy your lunch! 
pollux is one of the ganeti vms which doesn't have the aio workaround yet, it was probably that: https://etherpad.wikimedia.org/p/disk_aio_setting [12:37:22] yeah I figured [12:37:47] i know about the aio issue, I've spoken with the debian maint and qemu upstreams btw [12:38:05] the recommendation was to upgrade to latest qemu [12:42:19] 6operations, 10ops-codfw, 10ops-eqiad, 10ops-esams, and 2 others: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#1981316 (10BBlack) For the record, salt on all hosts (which says it hit 1210 machines) gives this list for machines with 4-digit+ kern.log alerts presently: ``` {'a... [13:16:45] 6operations, 6Services, 10Traffic, 5Patch-For-Review: Decom parsoidcache cluster - https://phabricator.wikimedia.org/T110472#1981369 (10BBlack) FWIW, in a 1 hour snapshot of all traffic to parsoidcache (regardless of internal vs external IPs), when varnish/pybal monitoring checks are excluded, we're left w... [13:18:18] (03PS1) 10Aude: Update WikidataBuildResources git source (github -> gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) [13:21:01] (03PS1) 10Mforns: Remove kafka1012 from EventLogging brokers array [puppet] - 10https://gerrit.wikimedia.org/r/267243 (https://phabricator.wikimedia.org/T125199) [13:21:39] 6operations, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#1981389 (10Yurik) @bblack, thanks for the link. The purpose of this task is exactly that - to have enough hardware for this service to gain a full production status. 
[13:28:23] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [13:43:46] <_joe_> !log installing the new HHVM package to the canary appservers (main and api) [13:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:51:44] (03CR) 10Alex Monk: "no, T125209" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267189 (owner: 10Alex Monk) [13:52:40] (03CR) 10Alex Monk: [C: 04-1] "T125209" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264735 (owner: 10Addshore) [13:52:56] (03CR) 10Alex Monk: [C: 04-1] "T125209" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264734 (owner: 10Addshore) [13:54:43] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:03:59] <_joe_> !log installing the new hhvm package on all the codfw appserver [14:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:44] (03CR) 10Luke081515: [C: 04-1] Configure default Echo subscriptions user options on he.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/246171 (https://phabricator.wikimedia.org/T114982) (owner: 10Dereckson) [14:05:33] bblack, is there anything i can help with for the mobile->desktop? [14:13:45] yurik: no, we're pretty close to done, but we need to wait on analytics, and they're busy with other issues [14:14:22] bblack, thanks for pushing it forward :) [14:24:32] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1981540 (10BBlack) Status update: We're pretty much done with the cache traffic migration, but there's still 1x eqiad mobile cache (cp1060) pooled with low weight to keep mobile... 
[14:24:46] (03PS1) 10Mark Bergsma: Add BGP MED support [debs/pybal] - 10https://gerrit.wikimedia.org/r/267251 [14:24:46] !log rebooting bohrium for kernel update [14:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:37] (03CR) 10Mark Bergsma: [C: 032] Add BGP MED support [debs/pybal] - 10https://gerrit.wikimedia.org/r/267251 (owner: 10Mark Bergsma) [14:26:01] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add support for float timeouts in socket streams [debs/hhvm] - 10https://gerrit.wikimedia.org/r/267228 (https://phabricator.wikimedia.org/T125084) (owner: 10Giuseppe Lavagetto) [14:27:39] (03CR) 10Giuseppe Lavagetto: [C: 032] Add IPv6 support to all monitors [debs/pybal] - 10https://gerrit.wikimedia.org/r/267008 (owner: 10Mark Bergsma) [14:35:06] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1981548 (10Ottomata) Ok great! We’re having some issues with jobs right now due to some Kafka problems, and we’ll want to make sure everything is fine before we try to move on t... [14:36:30] (03PS1) 10Giuseppe Lavagetto: scap: re-add servers to mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/267252 (https://phabricator.wikimedia.org/T124642) [14:37:02] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] scap: re-add servers to mediawiki-installation [puppet] - 10https://gerrit.wikimedia.org/r/267252 (https://phabricator.wikimedia.org/T124642) (owner: 10Giuseppe Lavagetto) [14:39:29] !log stopped kafka (service) on kafka1012 (the host that caused the outage) [14:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:46] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1981554 (10elukey) Kafka stopped on the node, no more services actively running on it. 
[14:42:30] PROBLEM - Kafka Broker Server on kafka1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties [14:43:16] ----^ sorry it's me, turning down icinga [14:43:28] <_joe_> schedule downtime :) [14:43:37] well, kafka didn't caused the outage, mediawiki did :-) [14:43:53] <_joe_> jynus: hhvm did [14:43:57] <_joe_> not mediawiki [14:43:57] :-) [14:45:03] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1981558 (10Ottomata) @cmjohnson are you in the DC today? Can we get this disk swapped asap? Thanks! [14:45:10] 6operations, 10ops-eqiad, 5Patch-For-Review: SMART errors on kafka1012.eqiad.wmfnet - https://phabricator.wikimedia.org/T125199#1981560 (10Ottomata) p:5Triage>3High [14:46:26] ACKNOWLEDGEMENT - Host ms-be2003 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff Host didnt come up after reboot, stuck in BIOS [14:47:22] _joe_, do you want me to take T124642 or do you want to do it yourself? [14:47:42] <_joe_> jynus: I'm almost done [14:47:51] <_joe_> see my change up there :) [14:48:17] <_joe_> but if you want to practice pooling a server, be my guest :)) [14:48:51] let me at least resync them for you [14:49:04] <_joe_> I am almost done with that too [14:49:12] you are too fast [14:49:30] <_joe_> we just need to repool them [14:50:02] <_joe_> which means a) readding them to conftool-data [14:50:41] 6operations, 10Traffic: Evaluate and Test Limited Deployment of Varnish 4 - https://phabricator.wikimedia.org/T122880#1981574 (10ema) [14:50:43] 6operations, 10Traffic: Forward-port Varnish 3 patches to Varnish 4 - https://phabricator.wikimedia.org/T124277#1981572 (10ema) 5Open>3Resolved Some of the patches have to been tackled in https://phabricator.wikimedia.org/T124281. Some other patches are not needed anymore. The remaining ones have been forw... 
[14:52:39] (03PS1) 10Giuseppe Lavagetto: conftool: re-pool recovered servers [puppet] - 10https://gerrit.wikimedia.org/r/267256 (https://phabricator.wikimedia.org/T124642) [14:53:25] is that live now? [14:53:36] <_joe_> what? [14:53:43] <_joe_> the servers are still depooled [14:53:45] conftool [14:53:53] <_joe_> yes, see ops@ emails [14:53:53] not the patch, the tool for equiad [14:54:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 68.00% of data above the critical threshold [5000000.0] [14:54:14] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 69.57% of data above the critical threshold [5000000.0] [14:54:51] (03PS2) 10Giuseppe Lavagetto: conftool: re-pool recovered servers [puppet] - 10https://gerrit.wikimedia.org/r/267256 (https://phabricator.wikimedia.org/T124642) [14:55:24] I can help, then correcting documentation [14:56:05] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] conftool: re-pool recovered servers [puppet] - 10https://gerrit.wikimedia.org/r/267256 (https://phabricator.wikimedia.org/T124642) (owner: 10Giuseppe Lavagetto) [14:56:26] (I know you did the main thing, but thare are many other places) [14:57:01] <_joe_> jynus: well, check the docs, it should be ok now, I updated it this morning :) [14:57:22] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1020 is CRITICAL: CRITICAL: 65.22% of data above the critical threshold [10.0] [14:57:52] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1014 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [10.0] [14:57:53] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1022 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [10.0] [14:57:53] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1013 is CRITICAL: CRITICAL: 59.09% of data above the critical threshold [10.0] [14:58:31] ---^ kafka is not happy about the loss of 1012 :) [14:59:14] <_joe_> elukey: expected I guess? 
[14:59:30] kafka is ok with it, just a little cranky [15:00:17] _joe_: yep sorry [15:00:47] ottomata: we could think about reducing the false positives [15:01:12] it is not the only place, there are links with high visibility such as https://wikitech.wikimedia.org/wiki/Depooling_servers [15:01:20] false positives? [15:01:22] PROBLEM - Kafka Broker Under Replicated Partitions on kafka1018 is CRITICAL: CRITICAL: 77.27% of data above the critical threshold [10.0] [15:01:22] !log re-enabled puppet on analytics1027 [15:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:52] these alarms are legitimate ones but we are basically dropping them [15:02:05] ah, yeah if we did depenencies in icinca somehow properly [15:02:11] it could know not to alert about them [15:02:13] but, dunno [15:02:16] sounds messy :) [15:02:22] we could just schedule downtime for those services [15:02:43] ACKNOWLEDGEMENT - Host ms-be2007 is DOWN: PING CRITICAL - Packet loss = 100% Muehlenhoff The reboot for the kernel update triggered a reimage, needs to be sorted out [15:02:55] but then we'd hide errors on the remaining kafka brokers :( [15:04:32] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:04:34] not the whole servers [15:04:49] just things like under replicated partitions [15:04:59] which we know are going to be there while 1012 is down [15:05:12] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:05:14] taking notes, I'll check it :) [15:07:00] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1981605 (10Papaul) Note: This is one of the other server that is out of warranty. Once i get the drives from Chris I will replace the bad drive. Thanks [15:08:19] !log powering off nas1001-b.eqiad.wmnet. 
https://phabricator.wikimedia.org/T124156 [15:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:30] !log powering off nas1001-a.eqiad.wmnet. https://phabricator.wikimedia.org/T124156 [15:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:15] 6operations, 10ops-eqiad, 5Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1981619 (10akosiaris) [15:12:40] 6operations, 6Commons, 10MassMessage: Not all mass messages sent out. - https://phabricator.wikimedia.org/T125214#1981621 (10Steinsplitter) 3NEW [15:13:02] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 206, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/3/2: down - nas1001-b {#2993} [10Gbps DF]BR [15:13:27] 6operations, 10DBA: Prepare db1018 for s2 master failover - https://phabricator.wikimedia.org/T125215#1981628 (10jcrespo) 3NEW [15:14:05] 6operations, 10DBA: Prepare db1018 for s2 master failover - https://phabricator.wikimedia.org/T125215#1981638 (10jcrespo) [15:14:07] 6operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1981637 (10jcrespo) [15:15:11] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 227, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/3/2: down - nas1001-a {#2994} [10Gbps DF]BR [15:18:24] (03PS1) 10Muehlenhoff: Enable base::firewall on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/267260 [15:18:54] 6operations, 10DBA: Prepare db1018 and s2-slaves for s2 master failover - https://phabricator.wikimedia.org/T125215#1981644 (10jcrespo) [15:19:16] (03PS1) 10Jcrespo: Depool db1018 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267261 (https://phabricator.wikimedia.org/T125215) [15:20:28] (03CR) 10Jcrespo: [C: 032] Depool 
db1018 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267261 (https://phabricator.wikimedia.org/T125215) (owner: 10Jcrespo) [15:21:04] (03PS1) 10Faidon Liambotis: reprepro: add HP's MCP repository to updates [puppet] - 10https://gerrit.wikimedia.org/r/267262 (https://phabricator.wikimedia.org/T97998) [15:21:13] !log upgrading packages (incl kernel) on esams cache hosts (cp3xxx) (codfw, ulsfo already done) [15:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:22:51] RECOVERY - mediawiki-installation DSH group on mw1217 is OK: OK [15:23:04] 6operations, 7Monitoring, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Add RAID monitoring for HP servers - https://phabricator.wikimedia.org/T97998#1981658 (10faidon) >>! In T97998#1979159, @jcrespo wrote: > AFAIK, hpcacucli is non-free. This is the basic, free, debian-included option... [15:23:42] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1172, mw1178,mw1217, mw1257 are unresponsive, mgmt interface unreachable - https://phabricator.wikimedia.org/T124642#1981659 (10Joe) 5Open>3Resolved [15:25:19] (03CR) 10Tim Landscheidt: [C: 04-1] "I missed If56f5be90411db7895e8dbd34b8cadea95ff510b, so the rationale in the commit message is not true." [puppet] - 10https://gerrit.wikimedia.org/r/267039 (https://phabricator.wikimedia.org/T123271) (owner: 10Tim Landscheidt) [15:31:02] RECOVERY - mediawiki-installation DSH group on mw1172 is OK: OK [15:32:20] !log remove all networking configuration from asw-b-eqiad switch for nas1001-a, nas1001-b. 
Leave just descriptions [15:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:21] (03CR) 10Eevans: [C: 031] [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [15:36:13] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Puppet has 3 failures [15:37:24] 6operations, 10ops-eqiad, 5Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1981686 (10akosiaris) [15:37:41] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Depool db1018 for maintenance (duration: 01m 49s) [15:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:37] 6operations, 10ops-eqiad, 5Patch-For-Review: decomission the netapps in EQIAD: nas1001-a, nas1001-b - https://phabricator.wikimedia.org/T124156#1947460 (10akosiaris) I' ve powered off nas1001-a and nas1001-b. I 've also removed most configuration for nas1001-a, nas1001-b from asw-b-eqiad. I 've left only t... [15:39:49] papaul: i don't have that size disk....i thought you needed 600GB SAS not 2TB....i need one for kafka1012 as well. 
Gonna have to order it (robh) [15:39:51] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [15:41:50] 6operations, 6Labs: evaluate possibility for nscd use with useldap - https://phabricator.wikimedia.org/T124991#1981692 (10mark) [15:42:02] cmjohnson1ok [15:42:05] 6operations, 6Labs: evaluate possibility for nscd use with useldap - https://phabricator.wikimedia.org/T124991#1971715 (10mark) [15:42:29] cmjohnson1: will open a task for that thanks [15:45:01] !log upgrade packages (incl kernel) on eqiad caches hosts (cp1xxx) [15:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:12] RECOVERY - mediawiki-installation DSH group on mw1257 is OK: OK [15:51:09] robh, could you restart parsoid-vd and parsoid-vd-client on ruthenium? i'll then ask you for access to logs after a bit to see what is going on with them. thanks. [15:53:21] RECOVERY - mediawiki-installation DSH group on mw1178 is OK: OK [16:09:50] (03PS2) 10Mforns: Remove kafka1012 from EventLogging brokers array [puppet] - 10https://gerrit.wikimedia.org/r/267243 (https://phabricator.wikimedia.org/T125199) [16:10:39] (03PS3) 10Ottomata: Remove kafka1012 from EventLogging brokers array [puppet] - 10https://gerrit.wikimedia.org/r/267243 (https://phabricator.wikimedia.org/T125199) (owner: 10Mforns) [16:11:58] (03PS2) 10Subramanya Sastry: parsoid-vd-client & diffservice: Use uprightdiff for diffing images [puppet] - 10https://gerrit.wikimedia.org/r/267190 [16:12:00] (03PS1) 10Subramanya Sastry: T110474: Point iegreview to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267269 [16:12:02] (03PS1) 10Subramanya Sastry: T110474: Point restbase to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267270 [16:14:34] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - 
https://phabricator.wikimedia.org/T110474#1981719 (10ssastry) >>! In T110474#1981220, @BBlack wrote: > integration-visualdiff: > https://github.com/wikime... [16:15:36] (03CR) 10GWicke: [C: 031] "This is overridden in hiera to use parsoid.svc & a specific instance in labs, so this change should not affect any running instances." [puppet] - 10https://gerrit.wikimedia.org/r/267270 (owner: 10Subramanya Sastry) [16:15:38] (03CR) 10Ottomata: [C: 032] Remove kafka1012 from EventLogging brokers array [puppet] - 10https://gerrit.wikimedia.org/r/267243 (https://phabricator.wikimedia.org/T125199) (owner: 10Mforns) [16:17:07] robh, sorry i now see you are away .. i was just operating off the clinic duty topic line in my irc client. :) [16:18:34] _joe_, could you restart parsoid-vd and parsoid-vd-client services on ruthenium? [16:19:05] or whichever root is around. [16:19:29] (03PS2) 10BBlack: MW parsoid URLs: s/parsoidcache/parsoid/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267234 (https://phabricator.wikimedia.org/T110472) [16:21:35] (03PS1) 10Ottomata: Unpuppetize impala in Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/267271 (https://phabricator.wikimedia.org/T125141) [16:27:16] (03CR) 10GWicke: [C: 031] MW parsoid URLs: s/parsoidcache/parsoid/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267234 (https://phabricator.wikimedia.org/T110472) (owner: 10BBlack) [16:33:28] !log uinstalling impala in analytics cluster [16:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:26] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1981816 (10greg) @legoktm: Update, please? [16:38:52] (03CR) 10Ottomata: [C: 032 V: 032] Unpuppetize impala in Analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/267271 (https://phabricator.wikimedia.org/T125141) (owner: 10Ottomata) [16:46:35] subbu: did anyone help ya out? 
[16:46:47] not yet. :) [16:47:05] will do now [16:47:19] thanks. [16:47:35] 6operations, 10ops-codfw: Codfw: ms-be2003 2TB order request - https://phabricator.wikimedia.org/T125223#1981842 (10Papaul) 3NEW a:3RobH [16:47:48] !log restarting parsoid-vd & parsoid-vd-client on ruthenium [16:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:48:23] subbu: so want the log output in like 5 minutes from now? [16:48:28] I can just set a timer to remind me [16:48:32] yes,that would be great. thanks. [16:48:36] will do [16:49:21] i'll toss in /tmp with you as owner similar to last time [16:49:44] I should have been more detailed on why i did that restart in sal, heh [16:50:03] !log parsoid-vd restart was due to subbu irc request (i wasnt just randomly restarting things ;) [16:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:50:48] k [16:51:30] 6operations, 10Traffic: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#1981852 (10Ottomata) Just so it doesn't get lost in this process: https://gerrit.wikimedia.org/r/#/c/230173/ I still want to merge that and use it one day... :) [16:52:32] 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1981854 (10RobH) @ssastry: Since you already have shell, L3, etc... there are two things needed for this: 1.) Your managers approval for this access request expansion. 2.) Ops meeting review and approva... [16:53:58] 10Ops-Access-Requests, 6operations: add subbu to parsoid-roots - https://phabricator.wikimedia.org/T125166#1981856 (10RobH) I take that back, I actually don't see your signature on the L3 document either? (it is new and your access predates it, so that is not unusual.) Would you additionally review and sign... [16:55:51] subbu: logs are in tmp for ya [16:56:19] thanks. 
[16:56:40] quite welcome, hopfully we get it all approved on monday so you can do without waiting on us =] [16:56:56] yes. indeed. [16:57:03] a chunk of ops is at a conference, hence the low response rate [16:57:24] looks like i have a config problem for the client (probably some path issue in my puppet code) ... time to fix it. [16:58:48] * robh will be here all PDT AM so will be around for relevant puppet merges and service restarts [16:59:01] I was supposed to go to ulsfo but they have not sent me the completion notice for the xconnect yet... [17:00:19] greg-g: anomie, tgr, and I were wondering if we could deploy a few sessionmanager related backports today. [17:00:33] bd808: as I take a deep breath, yes [17:00:49] We have https://gerrit.wikimedia.org/r/#/q/status:open+topic:sessionmanager-backports,n,z and https://gerrit.wikimedia.org/r/#/c/267134/ right now [17:01:35] (03PS1) 10Ottomata: Respect $enabled param on kafka::server [puppet/kafka] - 10https://gerrit.wikimedia.org/r/267278 [17:01:50] !log restarting mysql at db1018 [17:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:03:51] anomie: should we get started? I can run the deploys if you can help test them [17:04:34] bd808: ok [17:05:02] anomie: is there any ordering that is better or worse? [17:05:29] bd808: I don't think there are any cross-patch dependencies in the ones I did. [17:06:04] (03CR) 10Ottomata: [C: 032] Respect $enabled param on kafka::server [puppet/kafka] - 10https://gerrit.wikimedia.org/r/267278 (owner: 10Ottomata) [17:06:09] yeah they look to be independent. 
Ok [17:06:33] (03PS1) 10Ottomata: Update kafka submodule with $enabled param fix [puppet] - 10https://gerrit.wikimedia.org/r/267279 [17:07:19] (03PS2) 10Ottomata: Update kafka submodule with $enabled param fix [puppet] - 10https://gerrit.wikimedia.org/r/267279 [17:07:37] (03CR) 10Ottomata: [C: 032 V: 032] Update kafka submodule with $enabled param fix [puppet] - 10https://gerrit.wikimedia.org/r/267279 (owner: 10Ottomata) [17:13:07] (03PS1) 10Bmansurov: Stop the first survey in fawiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267282 (https://phabricator.wikimedia.org/T123770) [17:14:42] greg-g: Hello. Could you please help SWAT https://gerrit.wikimedia.org/r/#/c/267282/ ? [17:15:17] (03PS1) 10Subramanya Sastry: parsoid-vd-client on ruthenium: Fix path to config file [puppet] - 10https://gerrit.wikimedia.org/r/267283 [17:16:05] bmansurov: If greg-g oks it I can push the change out [17:16:09] (03CR) 10Subramanya Sastry: "This should fix parsoid-vd-client errors seen on ruthenium." [puppet] - 10https://gerrit.wikimedia.org/r/267283 (owner: 10Subramanya Sastry) [17:16:37] Fridays aren't typically deploy days though. We should remember to point that out to folks who run quicksurveys [17:16:54] bd808: thanks, leila says in the task greg-g ok'ed it [17:17:38] bmansurov: I have a few changes queued up in front of you, so it will be a little while. 
[17:17:51] sure, i'm here [17:17:54] jenkins is taking his sweet time this morning [17:18:06] bd808: oh dear [17:20:29] (03CR) 10BBlack: [C: 031] T110474: Point iegreview to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267269 (owner: 10Subramanya Sastry) [17:21:00] (03CR) 10BBlack: [C: 031] T110474: Point restbase to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267270 (owner: 10Subramanya Sastry) [17:21:10] 6operations, 10ops-codfw: Codfw-mw* IDRAC firmware upgrade - https://phabricator.wikimedia.org/T125088#1981944 (10RobH) @Papaul: While you do this, would you also document the process on the platform specific documentation pages on wikitech? Thanks! [17:23:40] (03CR) 10BryanDavis: [C: 031] T110474: Point iegreview to internal parsoid url [puppet] - 10https://gerrit.wikimedia.org/r/267269 (owner: 10Subramanya Sastry) [17:24:35] bd808: I approved the turning off of the surveys in -staff to leila [17:24:46] greg-g: thanks [17:25:00] is audio seeming a bit crappy to anyone else via blujeans? [17:25:13] oh it just got better [17:26:53] * bd808 glares at "Build has been executing for 18 min...." [17:27:18] sorry wrong channel [17:28:16] bd808: that's the new incarnation of the 'compiling' xkcd comic I guess? [17:28:39] https://xkcd.com/303/ ;) [17:28:51] rebuild all your docker base images and the containers direved from them [17:29:01] that's my equivalent... [17:29:02] *shudder* [17:34:34] anomie: finally ready to start syncing things [17:34:39] bd808: ok [17:35:20] syncs will take ~2m each because of https://phabricator.wikimedia.org/T125108 [17:35:40] !log bd808@mira Synchronized php-1.27.0-wmf.11/includes/session/SessionBackend.php: SessionManager: Save user name to metadata even if the user doesn't exist locally (a39b4ac) (duration: 01m 29s) [17:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:36:09] anomie: ^ do we have a reproduction case for that one? 
[17:36:54] bd808: Not separately. The three of mine combined should fix the auto-creation not auto-creating on loginwiki bug. [17:37:30] ok. So I should just power through and we can check at the end then I guess [17:38:12] (03PS2) 10BryanDavis: Grant autocreateaccount to anons on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267134 (https://phabricator.wikimedia.org/T125133) (owner: 10Anomie) [17:38:19] (03CR) 10BryanDavis: [C: 032] Grant autocreateaccount to anons on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267134 (https://phabricator.wikimedia.org/T125133) (owner: 10Anomie) [17:38:55] (03Merged) 10jenkins-bot: Grant autocreateaccount to anons on loginwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267134 (https://phabricator.wikimedia.org/T125133) (owner: 10Anomie) [17:39:25] !log bd808@mira Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php: CentralAuth: Take auto-creation into account (f526ef1) (duration: 01m 28s) [17:39:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:39:58] (the two code changes are needed to get CA to try triggering the auto-creation, and the config change is needed to have loginwiki allow it to happen) [17:41:56] !log bd808@mira Synchronized wmf-config/CommonSettings.php: Grant autocreateaccount to anons on loginwiki (d916008) (duration: 01m 27s) [17:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:42:14] anomie: that's the 3 important ones [17:42:18] bd808: Worked! [17:42:23] w00t! 
[17:42:29] * anomie sees "Anomie test 8" got created on loginwiki [17:44:44] !log bd808@mira Synchronized php-1.27.0-wmf.11/includes/api/ApiMain.php: Log user-agents that are using HTTP when HTTPS is preferred (55ac0b7) (duration: 01m 26s) [17:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:12] anomie: can you trigger that one? ^ [17:45:17] Sure, just a minute [17:45:34] bd808: Done. Got the warning. [17:46:01] bd808: And I see it going into logstash too [17:46:07] yup [17:46:15] 52 already :( [17:47:32] (03PS2) 10BryanDavis: Stop the first survey in fawiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267282 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov) [17:47:49] (03CR) 10BryanDavis: [C: 032] Stop the first survey in fawiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267282 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov) [17:47:58] bmansurov: you are up next [17:48:17] (03Merged) 10jenkins-bot: Stop the first survey in fawiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267282 (https://phabricator.wikimedia.org/T123770) (owner: 10Bmansurov) [17:48:55] I see a ":(", should I be worried or is it just not a fix to something that we're ok not fixing right now, bd808 anomie [17:49:16] greg-g: It's sadness at how many bots are still hitting http:// instead of https:// [17:49:17] i'm here [17:49:22] greg-g: nothign bad. 
we just have logging of misbehaving bots now [17:49:24] (we just added logging to log that) [17:50:12] * anomie decides to send an announcement to mediawiki-api-announce and wikitech-l [17:51:33] !log bd808@mira Synchronized wmf-config/InitialiseSettings.php: Stop the first survey in fawiki and eswiki (f89621d) (duration: 01m 25s) [17:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:51:37] bmansurov: ^ [17:52:03] bd808: thanks, surveys are gone as expected [17:52:10] sweet [17:52:15] ori, good morning. when you get a chance, can you look at https://gerrit.wikimedia.org/r/#/c/267283/ and https://gerrit.wikimedia.org/r/#/c/267190/ ? [17:52:56] (03PS2) 10Ori.livneh: parsoid-vd-client on ruthenium: Fix path to config file [puppet] - 10https://gerrit.wikimedia.org/r/267283 (owner: 10Subramanya Sastry) [17:53:02] (03CR) 10Ori.livneh: [C: 032 V: 032] parsoid-vd-client on ruthenium: Fix path to config file [puppet] - 10https://gerrit.wikimedia.org/r/267283 (owner: 10Subramanya Sastry) [17:53:07] bd808: anomie whew, thanks :) [17:53:40] greg-g: I'm all done now and I documented what we did on [[Deployments]] [17:53:43] (03PS3) 10Ori.livneh: parsoid-vd-client & diffservice: Use uprightdiff for diffing images [puppet] - 10https://gerrit.wikimedia.org/r/267190 (owner: 10Subramanya Sastry) [17:53:50] (03CR) 10Ori.livneh: [C: 032 V: 032] parsoid-vd-client & diffservice: Use uprightdiff for diffing images [puppet] - 10https://gerrit.wikimedia.org/r/267190 (owner: 10Subramanya Sastry) [17:54:24] bd808: thank you [17:56:13] subbu: merged, ran puppet on ruthenium, looks good [17:56:20] ori, great. thanks. [17:58:19] is there an ops session talk this week or not? 
I don't remember that one was set but better safe than sorry [18:01:34] !log creating special partitioning for db2034 and db2042 (ETA:5 days, lag) [18:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:09:16] bblack, i was about to send an email to wikitech-l and other lists about the parsoid-lb decommissioning. Should we pick a date for it that I can announce? or should I just say "soon, once we finish migrating all known services away from it"? [18:10:39] subbu: it's not terribly time-critical, I'd say announce that we plan to decom it 3 weeks from now, and offer them pointers to switching to using the in-wiki-domain RB URLs. [18:10:56] maybe even 2 weeks. there's really not much traffic, I can't imagine there will be much objection [18:11:42] !log creating special partitioning for db2037 and db2044 (ETA:5 days, lag) [18:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:12:59] bblack, done. i mentioned 3 weeks. [18:13:28] robh, can i get another dump of the parsoid-vd-client logs? thanks. [18:13:36] yep [18:13:44] subbu: thanks! [18:14:19] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1982145 (10ssastry) [18:14:42] subbu: done [18:14:59] thanks. [18:15:24] welcome [18:24:09] (03CR) 10GWicke: [C: 031] [production]: match restbase config to current Cassandra cluster [puppet] - 10https://gerrit.wikimedia.org/r/266297 (https://phabricator.wikimedia.org/T123869) (owner: 10Eevans) [18:24:14] Does anyone here know how transparency.wikimedia.org gets deployed?
[18:26:07] nuria: puppet controls the software deployment [18:26:16] and in turn, the puppetization indicates the content comes from a git repo [18:26:26] $repo_dir = '/srv/org/wikimedia/TransparencyReport' [18:26:26] $docroot = "${repo_dir}/build" [18:26:26] git::clone { 'wikimedia/TransparencyReport': [18:26:26] ensure => latest, [18:26:26] directory => $repo_dir, [18:26:28] } [18:26:54] pretty sure deployment is ssh to the server [18:26:56] and git pull [18:27:09] ensure => latest doesn't pull for you on automated puppet runs? [18:27:13] since that git::clone up there will do the initial setup but not automatically pull [18:27:18] ok [18:27:19] oh, right [18:27:40] i take it back, ensure => latest should do it [18:27:45] bblack: i see, so puppet is updating to latest then [18:27:59] yeah and that will run every half hour or so [18:28:05] bblack: ok, thank you , will add piwik there too [18:28:32] 6operations, 10ops-codfw, 10hardware-requests: Codfw: ms-be2003 2TB order request - https://phabricator.wikimedia.org/T125223#1982179 (10RobH) @papaul: Please use one of the spare 2TB SATA dissk on the codfw spares hardware listing sheet. These spares are specifically for the ms-be systems that are out of w... [18:36:12] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1982247 (10RobH) Please note that there are 10 spares on CODFW spare tracking to replace disks in out of warranty spares: HDD - SATA Seagate ST2000DM001 7.2K 2TB 10 10 These shouldn't fall be... 
[18:36:24] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1982251 (10RobH) a:3Papaul [18:36:52] 6operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T125200#1981123 (10RobH) [18:36:55] 6operations, 10ops-codfw, 10hardware-requests: Codfw: ms-be2003 2TB order request - https://phabricator.wikimedia.org/T125223#1982252 (10RobH) 5Open>3Resolved I didn't want to move this into that private space, as anyone who cannot view the space is then stuck getting alerts and unable to unsubscribe. I... [18:36:56] (03PS1) 10Elukey: Termporary disable puppet on kafka1012 for maintenance purposes [puppet] - 10https://gerrit.wikimedia.org/r/267293 [18:37:58] (03PS1) 10Jcrespo: Install Jessie on db1018 [puppet] - 10https://gerrit.wikimedia.org/r/267294 [18:38:00] (03Abandoned) 10Elukey: Termporary disable puppet on kafka1012 for maintenance purposes [puppet] - 10https://gerrit.wikimedia.org/r/267293 (owner: 10Elukey) [18:38:43] (03CR) 10Jcrespo: [C: 032] Install Jessie on db1018 [puppet] - 10https://gerrit.wikimedia.org/r/267294 (owner: 10Jcrespo) [18:38:53] (03PS1) 10Elukey: Temporary disable kafka1012 for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/267295 [18:40:12] (03PS2) 10Jcrespo: Install Jessie on db1018 [puppet] - 10https://gerrit.wikimedia.org/r/267294 (https://phabricator.wikimedia.org/T125215) [18:53:18] (03PS2) 10Ottomata: Temporary disable kafka1012 for maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/267295 (owner: 10Elukey) [18:54:08] (03CR) 10Ottomata: [C: 032 V: 032] Temporary disable kafka1012 for maintenance. 
[puppet] - 10https://gerrit.wikimedia.org/r/267295 (owner: 10Elukey) [19:00:54] (03PS1) 10Dzahn: releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) [19:01:27] (03CR) 10jenkins-bot: [V: 04-1] releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) (owner: 10Dzahn) [19:01:44] (03PS2) 10Dzahn: releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) [19:04:16] (03PS3) 10Dzahn: releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) [19:04:32] (03PS4) 10Dzahn: releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) [19:04:37] (03CR) 10jenkins-bot: [V: 04-1] releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) (owner: 10Dzahn) [19:05:59] 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982360 (10Dzahn) a:3Dzahn [19:09:50] 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982415 (10Dzahn) looks like an OS got installed but the service has not been implemented yet, and it's ----> T98173 [19:10:23] 6operations, 5Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1261185 (10Dzahn) T125056 asks for the status of this server [19:13:41] 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982455 (10Dzahn) @ArielGlenn does that answer the status question sufficiently? i'd close or merge it with T98173 [19:14:16] 6operations: rhodium.eqiad.wmnet status? 
- https://phabricator.wikimedia.org/T125056#1982462 (10Dzahn) [19:14:27] 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982465 (10Dzahn) a:5Dzahn>3ArielGlenn [19:36:07] !log reinstall db1018 [19:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:53:25] (03PS1) 10Subramanya Sastry: parsoid-vd-client: Fill out missing pieces of the config file [puppet] - 10https://gerrit.wikimedia.org/r/267311 [19:56:33] (03CR) 10Subramanya Sastry: "Fixes based on error logs on ruthenium." [puppet] - 10https://gerrit.wikimedia.org/r/267311 (owner: 10Subramanya Sastry) [20:05:48] (03PS5) 10Dzahn: releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) [20:07:50] (03CR) 10Dzahn: [C: 032] releases: add header for mediawiki release dir [puppet] - 10https://gerrit.wikimedia.org/r/267299 (https://phabricator.wikimedia.org/T125164) (owner: 10Dzahn) [20:10:11] 6operations, 5Patch-For-Review: make the releases.wm.org index page look nicer - https://phabricator.wikimedia.org/T125164#1982623 (10Dzahn) >>! In T125164#1981067, @Legoktm wrote: > Can we also do something similar for the `/mediawiki` page? Yes. done {F3291357} [20:10:47] legoktm: https://releases.wikimedia.org/mediawiki/ [20:10:59] latest on top, not "Index of" [20:11:02] (03CR) 10Aaron Schulz: [C: 032] Use the logical redis definition for GettingStarted. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671) (owner: 10Giuseppe Lavagetto) [20:11:31] except that "snapshot" is from 2009 , heh [20:12:20] (03Merged) 10jenkins-bot: Use the logical redis definition for GettingStarted. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671) (owner: 10Giuseppe Lavagetto) [20:16:24] !log aaron@mira Synchronized wmf-config/CommonSettings.php: Use the logical redis definition for GettingStarted (duration: 01m 26s) [20:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:37] (03PS2) 10Dzahn: Enable base::firewall on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/267260 (owner: 10Muehlenhoff) [20:20:42] (03PS1) 10Jcrespo: Revert "Install Jessie on db1018" [puppet] - 10https://gerrit.wikimedia.org/r/267316 [20:20:51] (03PS2) 10Jcrespo: Revert "Install Jessie on db1018" [puppet] - 10https://gerrit.wikimedia.org/r/267316 [20:22:11] jynus: :( failed entirely on that hardware? [20:22:49] it goes into a crazy loop when the disk fails to mount, but it also doesn't let me mount it manually [20:23:07] (03CR) 10Jcrespo: [C: 032] Revert "Install Jessie on db1018" [puppet] - 10https://gerrit.wikimedia.org/r/267316 (owner: 10Jcrespo) [20:23:09] sigh, are they HPs? [20:23:38] no [20:24:42] hmm, i wonder.. trying trusty then? 
[20:25:12] yes, at least try, and I cannot leave replication paused for the whole weekend [20:25:20] gotcha [20:27:00] this is new, either the installer has been updated, or there is something wrong with the disk [20:27:29] and with the installer i mean jessie, not our recipe [20:28:40] (03PS1) 10Papaul: Decom: Remove caesium from dhcpd Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267320 (https://phabricator.wikimedia.org/T125165) [20:30:07] trusty "just works" [20:30:24] it is an upstream change on jessie's installer [20:31:23] wow, interesting [20:31:33] i was lucky so far with the hardware [20:31:45] well, or just replaced stuff with virtual ones [20:31:55] this is new, like a few weeks new, and maybe RAID-specific [20:32:03] aha [20:32:46] (03PS3) 10Dzahn: Enable base::firewall on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/267260 (owner: 10Muehlenhoff) [20:32:59] (03CR) 10Dzahn: [C: 032] Enable base::firewall on alsafi [puppet] - 10https://gerrit.wikimedia.org/r/267260 (owner: 10Muehlenhoff) [20:33:22] but wasting 1 hour installing trusty is no fun [20:34:11] yea :/ [20:35:06] Reedy: https://releases.wikimedia.org/mediawiki/ should we even have that "snapshot" dir up anymore?
check the date [20:35:24] it's more noticeable now because of the "version sort" [20:35:48] so the latest stuff should be on top [20:36:11] i would assume a snapshot is like from last night [20:40:34] (03CR) 10Addshore: [C: 031] Update WikidataBuildResources git source (github -> gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/267242 (https://phabricator.wikimedia.org/T111173) (owner: 10Aude) [20:43:44] I am thinking of trying jessie again now, or whether I will just waste another hour installing jessie and then trusty again [20:44:03] as it was a partition-related issue [20:44:50] I will check jessie-installer bugs first [20:45:02] jynus: yea, that might save time in the long run though [20:45:45] (03PS1) 10Papaul: Decom: Remove caesium from authoinstall Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267321 (https://phabricator.wikimedia.org/T125165) [20:46:17] I suppose you were referring before to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=788156 [20:47:11] (but this is not it) [20:47:23] i did not have that specific bug number in mind, but i had some vague memories about issues with HP hardware we had that did not happen on Dell [20:48:29] 6operations, 5Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1982693 (10ArielGlenn) Should its current salt key be kept around or can I toss it? If I can toss it then that's good enough for me. [20:49:14] 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982694 (10ArielGlenn) Well I knew about that but then it's stalled, and the thing is that it has a salt key. Anyways I asked on the other ticket, likely I'll be able to close this soon. [20:50:47] apergos: i don't actually see that salt key on neodymium [20:50:57] it was... I think... [20:51:11] * apergos goes to look at their notes [20:51:17] my guess was it probably was "rhodium.wikimedia.org" [20:51:21] vs.
eqiad.wmnet [20:51:32] i saw a change in gerrit that changed that name [20:51:36] (03PS1) 10Jcrespo: Revert "Revert "Install Jessie on db1018"" [puppet] - 10https://gerrit.wikimedia.org/r/267322 [20:51:38] "wrong FQDN" etc [20:51:40] it might have been one of the hosts with a puppet cert and no salt key [20:51:41] (03PS2) 10Jcrespo: Revert "Revert "Install Jessie on db1018"" [puppet] - 10https://gerrit.wikimedia.org/r/267322 [20:51:53] I was trying to get all that cleaned up and got down to only two hosts left [20:52:47] (03PS1) 10Papaul: Decom: Remove rsync_caesium role from site.pp Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267323 (https://phabricator.wikimedia.org/T125165) [20:53:03] apergos: yes, it's a puppet cert [20:53:12] for the current name [20:53:20] yeah I found the notes on the ticket [20:53:24] should have read before replying [20:53:34] (03CR) 10Jcrespo: [C: 032] Revert "Revert "Install Jessie on db1018"" [puppet] - 10https://gerrit.wikimedia.org/r/267322 (owner: 10Jcrespo) [20:54:20] apergos: maybe it's easier to just delete it.. the server is not running , it's been a couple months.. making a new one costs less time than asking [20:55:05] 6operations: rhodium.eqiad.wmnet status? - https://phabricator.wikimedia.org/T125056#1982732 (10ArielGlenn) Er rather it has NO salt key but a valid (apparently) puppet cert. I just want those lists to be in sync. [20:56:33] 6operations, 5Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1982744 (10ArielGlenn) Sorry, it's not about the salt key, it's about having a current puppet cert without having a salt key. I'm trying to keep those lists in sync at l... 
[20:56:57] well I don't know where alex is in it so I might as well let him reply [20:57:02] it's not going to kill me to wait [20:57:23] hate when I can't remember stuff on my own tickets I wrote not more than a day or two ago though [21:04:06] (03PS2) 10Dzahn: Decom: Remove caesium from dhcpd Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267320 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul) [21:04:16] (03CR) 10Dzahn: [C: 032] Decom: Remove caesium from dhcpd Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267320 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul) [21:04:46] (03PS2) 10Dzahn: Decom: Remove caesium from authoinstall Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267321 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul) [21:05:37] (03PS3) 10Dzahn: Decom: Remove caesium from authoinstall Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267321 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul) [21:05:56] (03PS4) 10Dzahn: Decom: Remove caesium from autoinstall [puppet] - 10https://gerrit.wikimedia.org/r/267321 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul) [21:07:26] (03CR) 10Dzahn: "you can also delete the entire role class, it was just for this purpose and not used elsewhere" [puppet] - 10https://gerrit.wikimedia.org/r/267323 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul) [21:08:01] (03CR) 10Dzahn: [C: 04-1] "please also remove roles/rsync_caesium.pp" [puppet] - 10https://gerrit.wikimedia.org/r/267323 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul) [21:08:14] (03CR) 10Dzahn: [C: 032] Decom: Remove caesium from autoinstall [puppet] - 10https://gerrit.wikimedia.org/r/267321 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul) [21:09:04] (03PS2) 10Dzahn: Decom: Remove rsync_caesium role from site.pp Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267323 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul) 
[21:09:06] 6operations: jessie installer fails when using db hosts- same recipe works on trusty and on other hosts/a few weeks ago - https://phabricator.wikimedia.org/T125256#1982775 (10jcrespo) 3NEW [21:10:06] 6operations: jessie installer fails when using db hosts- same recipe works on trusty and on other hosts/a few weeks ago - https://phabricator.wikimedia.org/T125256#1982783 (10jcrespo) And yes, I tried formatting it manually, too. [21:10:47] (03PS1) 10Jcrespo: Revert "Revert "Revert "Install Jessie on db1018""" [puppet] - 10https://gerrit.wikimedia.org/r/267326 [21:10:56] (03PS2) 10Jcrespo: Revert "Revert "Revert "Install Jessie on db1018""" [puppet] - 10https://gerrit.wikimedia.org/r/267326 [21:12:07] (03CR) 10Dzahn: [C: 032] Decom: Remove rsync_caesium role from site.pp Bug:T125165 [puppet] - 10https://gerrit.wikimedia.org/r/267323 (https://phabricator.wikimedia.org/T125165) (owner: 10Papaul) [21:13:00] !log bromine - stop and remove rsync service [21:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:16:08] (03PS1) 10Jcrespo: Make sure the installation is fully unattended [puppet] - 10https://gerrit.wikimedia.org/r/267328 [21:17:09] (03PS2) 10Jcrespo: Make sure the installation is fully unattended [puppet] - 10https://gerrit.wikimedia.org/r/267328 [21:17:47] (03PS3) 10Jcrespo: Make sure the installation is fully unattended [puppet] - 10https://gerrit.wikimedia.org/r/267328 [21:19:15] (03CR) 10Jcrespo: [C: 032] Make sure the installation is fully unattended [puppet] - 10https://gerrit.wikimedia.org/r/267328 (owner: 10Jcrespo) [21:19:25] (03PS3) 10Jcrespo: Revert "Revert "Revert "Install Jessie on db1018""" [puppet] - 10https://gerrit.wikimedia.org/r/267326 [21:21:03] (03CR) 10Jcrespo: [C: 032] Revert "Revert "Revert "Install Jessie on db1018""" [puppet] - 10https://gerrit.wikimedia.org/r/267326 (owner: 10Jcrespo) [21:21:33] aren't you taking it too far, jynus ? 
:) [21:21:49] oh, wait until the next time [21:22:07] :-) that is only 2 tries, actually [21:24:03] I've been told, however, that if you stick a lot of reverts there, it eventually works [21:25:03] oh, keep trying then [21:29:08] (03CR) 10Jcrespo: "This has only deleted one of the 2 confirmations, it still asks for confirmation to write the partition table." [puppet] - 10https://gerrit.wikimedia.org/r/267328 (owner: 10Jcrespo) [21:32:10] (03CR) 10Mobrovac: [C: 031] parsoid-vd-client: Fill out missing pieces of the config file [puppet] - 10https://gerrit.wikimedia.org/r/267311 (owner: 10Subramanya Sastry) [21:33:20] 6operations, 5Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#1982829 (10Dzahn) merged @papaul's changes: https://gerrit.wikimedia.org/r/#/c/267323/ https://gerrit.wikimedia.org/r/#/c/267321/ https://gerrit.wikimedia.org/r/#/c/267320/ [21:34:08] (03PS1) 10Dzahn: delete rsync_caesium class, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/267333 (https://phabricator.wikimedia.org/T125165) [21:35:06] (03PS2) 10Dzahn: parsoid-vd-client: Fill out missing pieces of the config file [puppet] - 10https://gerrit.wikimedia.org/r/267311 (owner: 10Subramanya Sastry) [21:36:00] (03CR) 10Dzahn: [C: 032] "only affects testing server" [puppet] - 10https://gerrit.wikimedia.org/r/267311 (owner: 10Subramanya Sastry) [21:38:01] (03PS1) 10Dzahn: varnish/misc-web: remove caesium backend, decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/267334 (https://phabricator.wikimedia.org/T125165) [21:38:03] (03CR) 10Jdlrobson: [C: 031] "Is someone able to get this SWATed in one of the many SWAT windows and then test it is working as expected?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000) (owner: 10Dereckson) [21:39:47] (03PS2) 10Dzahn: delete rsync_caesium class, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/267333 (https://phabricator.wikimedia.org/T125165) [21:40:14] subbu: [parsoid-vd-client]/Service[parsoid-vd-client]: Triggered 'refresh' ... [21:41:42] mutante, what is that from? code update? [21:41:55] subbu: from the puppet run on ruthenium [21:41:59] ah, yes, you +2ed the patch. thanks. [21:42:10] subbu: it was just a lazy way to say "i merged that, ran puppet, and it restarted the service" [21:42:20] yw [21:42:56] (03CR) 10Dzahn: [C: 032] delete rsync_caesium class, not needed anymore [puppet] - 10https://gerrit.wikimedia.org/r/267333 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn) [21:44:18] robh: whenever you are free, one more dump of the parsoid-vd-client logs if you don't mind. [21:45:18] done [21:46:29] (03CR) 10Mobrovac: [C: 031] "Yup, confirmed it's a noop - https://puppet-compiler.wmflabs.org/1666/" [puppet] - 10https://gerrit.wikimedia.org/r/267270 (owner: 10Subramanya Sastry) [21:46:37] (03PS1) 10Dzahn: admin: rm reprepro from exceptions in enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/267377 [21:47:14] (03PS2) 10Dzahn: admin: rm reprepro from exceptions in enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T125165) [21:48:18] progress .. but, Jan 29 21:44:59 ruthenium nodejs[22720]: Error: libjpeg.so.8: cannot open shared object file: No such file or directory ... via canvas ... mobrovac mutante either i am missing some package or my node_modules build on vm (on trust) doesn't translate over to jessie. [21:48:51] anyway, i think at this point, i should simply wait till monday and have my root access and mess around directly at that point. 
[21:49:04] (03CR) 10Mobrovac: MW parsoid URLs: s/parsoidcache/parsoid/ (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267234 (https://phabricator.wikimedia.org/T110472) (owner: 10BBlack) [21:49:14] subbu: maybe you need "libjpeg-dev" installed or a similar libjpeg package ? [21:49:29] yup, +1 [21:49:31] subbu: dpkg -l | grep libjpeg on the VM ? [21:49:39] subbu: need to install the -dev pkgs [21:50:12] subbu: it'd be better to build the deps in a Jessie VM, especially for things like canvas which have binaries [21:50:28] yea, but labs would not let him create a jessie instance :/ [21:50:37] mutante, mobrovac ah yes .. $ sudo apt-get install libcairo2-dev libjpeg8-dev libpango1.0-dev libgif-dev build-essential g++ [21:50:40] resist the temptation to just install it as root, let's [21:50:43] from https://github.com/Automattic/node-canvas/wiki/Installation---Ubuntu-and-other-Debian-based-systems [21:50:49] let's do it via puppet right away [21:50:50] mutante, ok. :) [21:50:53] just saves time later :) [21:50:54] (03CR) 10Dereckson: "Yes, I've already included this change for the next SWAT: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267195 (https://phabricator.wikimedia.org/T125000) (owner: 10Dereckson) [21:51:08] sure. works for me. how do i get puppet to install those packages? [21:51:39] mutante: why couldn't he create a jessie instance in labs? there are 8.2 images [21:51:41] package { 'foo': [21:51:47] ensure => present, [21:51:48] } [21:51:54] subbu: ^ [21:52:01] mutante, i'll pm him :) [21:52:11] ok :) [21:52:51] (03CR) 10Papaul: [V: 031] admin: rm reprepro from exceptions in enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn) [21:53:35] mutante, ah .. via package? got it. [21:53:38] will update. 
[21:53:49] subbu: package { 'foo': } suffices [21:54:19] (03CR) 10Alex Monk: "Is it present on bromine?" [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn) [21:54:54] yes, this is also fine: [21:54:55] ensure_packages(['virtualenv', 'gcc', 'python-dev', 'libmysqlclient-dev']) [21:55:37] ensure_packages will not conflict if the same package gets installed by multiple classes on the same machine [21:56:03] ok. [21:56:14] mutante: it's actually require_package() [21:56:17] subbu: ^ [21:56:28] mobrovac: no, it exists both [21:56:32] ah [21:56:33] kk [21:56:59] yea, eh.. we use both of them [21:59:21] (03PS1) 10Subramanya Sastry: visuadiff: add dependences on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 [21:59:24] PROBLEM - Host cp3049 is DOWN: PING CRITICAL - Packet loss = 100% [22:00:03] mobrovac, mutante there is the patch. [22:01:46] mobrovac: had to check the difference again, so require_package has been written by Ori, and ensure_packages is from puppet's stdlib and: [22:01:53] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:02:02] Reedy: https://releases.wikimedia.org/mediawiki/ should we even have that "snapshot" dir up anymore? check the date wmflib: add require_package() from vagrant [22:02:03] [22:02:04] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:02:04] It's similar to ensure_packages(), but it's cleaner and faster. 
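The three ways of installing packages mentioned above can be sketched side by side. This is only an illustration under stated assumptions: the class name `visualdiff_deps_example` is hypothetical, the package names are taken from the apt-get line quoted earlier, `ensure_packages()` comes from puppetlabs-stdlib, and `require_package()` is the wmflib function discussed in the channel, so it is assumed to be available only where wmflib is on the module path.

```puppet
# Hypothetical class, for illustration only; not the actual visualdiff module.
class visualdiff_deps_example {
    # 1. Plain resource declaration. Declaring the same package again in a
    #    second class on the same node raises a duplicate declaration error.
    package { 'libcairo2-dev':
        ensure => present,
    }

    # 2. ensure_packages() from puppetlabs-stdlib: declares each package only
    #    if it has not been declared yet, so several classes can ask for it.
    ensure_packages(['libgif-dev', 'build-essential'])

    # 3. require_package() from wmflib (assumed to be loaded): described in
    #    the channel as similar to ensure_packages().
    require_package('libpango1.0-dev')
}
```

Each package appears under only one form here so the sketch does not conflict with itself; a real manifest would probably pick a single form and use it for every package.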
[22:02:04] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:02:04] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:02:23] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:02:23] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:02:27] oh [22:02:35] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3049_v4, cp3049_v6 [22:02:35] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:02:41] bblack: around ^ ? [22:02:43] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:02:53] PROBLEM - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3049_v4, cp3049_v6 [22:03:04] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:03:04] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:03:04] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3049_v4, cp3049_v6 [22:03:04] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3049_v4, cp3049_v6 [22:03:14] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:03:15] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 164 not-conn: cp3049_v4, cp3049_v6 [22:03:23] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:03:23] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:03:47] all of them about cp3049 .. ok.. 
looking [22:05:30] it seems down from icinga [22:06:00] yes, i am connecting to mgmt [22:06:32] nothing on console [22:06:38] !log powercycle cp3049 [22:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:07:41] ah, that is why it was denying me access [22:08:00] UEFI0030: A keyboard device is not connected to the system. [22:08:04] yea.. go on... [22:08:17] i see it coming back now [22:08:29] no, it is something else, I think my ssh config wrong [22:08:41] jynus: maybe because this has .esams. in it? [22:08:44] unlike the others [22:09:02] ah! no, worse than that, I was thinking sfo [22:09:07] so human error [22:09:10] ok [22:10:38] hmmm.. it doesnt finish the boot process. something broke [22:12:03] still no ssh access [22:12:17] yea, and no more output either [22:12:30] this was the last i saw: [22:12:37] [ OK [ 9.862350] ipmi_si ipmi_si.0: Using irq 10 [22:12:37] [ 9.863607] ------------[ cut here ]------------ [22:12:37] ] Started Create [22:13:07] shortly after [ 9.838257] systemd[1]: Mounted Huge Pages File System. [22:13:17] that looks like kernel panic [22:13:34] [ OK ] Mounted Debug File System. [22:13:35] [ 9.794621] EXT4-fs (md0): re-mounted. Opts: errors=remount-ro [22:14:07] I would try one powercycle, even if only to get more info [22:14:14] [ 6.350863] ata3: SATA link down (SStatus 0 SControl 300) [22:14:14] [ 6.675177] ata4: SATA link down (SStatus 0 SControl 300) [22:14:24] ah, disk issue [22:14:25] that's almost like controller died [22:14:32] they show up, then they all disappear [22:14:37] as if the controller broke [22:15:33] then I will leave it for reverse(kram) to check it in person [22:15:57] yea, should just make a ticket in esams [22:16:31] its under warranty as well so he'll be able to get a replacement. [22:16:52] usb 1-1.6: New USB device found, [22:17:00] usb 1-1.6: Product: Gadget USB HUB [22:17:01] are any of you going to be any time longer? [22:17:02] ? 
[22:17:03] mutante: no, we agreed to standardize on require_package, which (in addition to other things) also makes the package a requirement for the current class, so you don't have to declare the package and _then_ add require => Package['foo'] [22:17:26] jynus: be around? i need to take a lunch in a moment and run to the store (im out of food here) [22:17:30] but i'll be back [22:17:54] PROBLEM - Host cp1049 is DOWN: PING CRITICAL - Packet loss = 100% [22:17:57] so, I believe db1018 will conserve its downtimes, despite the resintall [22:18:01] (03PS1) 10Ottomata: Allow access to Analytlics mysql metadata instance on analytics1027 from analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/267379 [22:18:28] ori: ok, thanks! we have a lot of places to replace it then [22:18:34] PROBLEM - Host cp3042 is DOWN: PING CRITICAL - Packet loss = 100% [22:18:37] but just in case it doesnt, and icinga trolls us, know it is under maintenance and depooled [22:18:42] (03PS2) 10Ottomata: Allow access to Analytlics mysql metadata instance on analytics1027 from analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/267379 [22:19:31] good to konw [22:19:33] know [22:19:52] Ok, im away for about an hour (only mentioning it in here since im on clinic duty) [22:20:36] (03CR) 10Ottomata: [C: 032] Allow access to Analytlics mysql metadata instance on analytics1027 from analytics networks [puppet] - 10https://gerrit.wikimedia.org/r/267379 (owner: 10Ottomata) [22:20:53] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:03] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:13] PROBLEM - IPsec on cp3033 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:14] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:14] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - 
ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:25] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:35] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:35] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:35] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:35] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:35] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:35] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:36] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:36] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:44] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:44] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:21:53] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:04] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:04] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:13] PROBLEM - IPsec on cp3043 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:14] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:14] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:14] PROBLEM - IPsec on cp4006 is CRITICAL: 
Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:15] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:15] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:23] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:34] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:34] bblack: ^ [22:22:43] it is not traffic [22:22:49] it is the server [22:22:53] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 36 connecting: cp1049_v4, cp1049_v6 [22:22:56] 1049? [22:22:59] yes [22:23:05] and 3042 [22:23:18] i think they just both had hardware fail [22:23:21] but starting to be strange [22:23:24] it doesn't even respond to console [22:23:35] !log powercycled cp1049 [22:23:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:23:59] ah, in that case it was really you [22:24:00] maybe the monitoring got more verbose recently? [22:24:15] that is the ipsec [22:24:33] yes, but did it output that many lines when a single server went down? [22:24:34] it was discussed (this?) morning (europe's) [22:24:36] guess it did [22:24:50] what was discussed? 
[22:24:58] it is "relativelly new" for ipsec servers [22:25:05] ah, ok [22:25:11] so cp1049 is coming back [22:25:16] at login: again [22:25:20] 3049 is not [22:25:23] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 38 ESP OK [22:25:24] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 38 ESP OK [22:25:24] RECOVERY - Host cp1049 is UP: PING OK - Packet loss = 0%, RTA = 2.82 ms [22:25:26] eh, 3042 [22:25:34] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 38 ESP OK [22:25:34] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 38 ESP OK [22:25:44] RECOVERY - IPsec on cp3043 is OK: Strongswan OK - 38 ESP OK [22:25:45] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 38 ESP OK [22:25:45] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 38 ESP OK [22:25:47] (03PS1) 10Ottomata: Add ferm::service{ 'analytics-mysql-meta' to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/267380 [22:25:53] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 38 ESP OK [22:25:53] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 38 ESP OK [22:25:53] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 38 ESP OK [22:25:53] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 38 ESP OK [22:26:04] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 38 ESP OK [22:26:05] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 38 ESP OK [22:26:13] (03CR) 10Ottomata: [C: 032 V: 032] Add ferm::service{ 'analytics-mysql-meta' to analytics1027 [puppet] - 10https://gerrit.wikimedia.org/r/267380 (owner: 10Ottomata) [22:26:14] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 38 ESP OK [22:26:24] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 38 ESP OK [22:26:24] RECOVERY - IPsec on cp3033 is OK: Strongswan OK - 38 ESP OK [22:26:33] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 38 ESP OK [22:26:34] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 38 ESP OK [22:26:44] 3042 - md0: unknown partition table [22:26:44] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 38 ESP OK [22:26:46] oh well 
[22:26:53] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 38 ESP OK [22:26:53] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 38 ESP OK [22:26:53] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 38 ESP OK [22:26:53] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 38 ESP OK [22:26:53] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 38 ESP OK [22:26:53] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 38 ESP OK [22:26:54] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 38 ESP OK [22:26:54] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 38 ESP OK [22:27:04] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 38 ESP OK [22:27:04] !log cp3042 - md0: unknown partition table [22:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:28:42] 6operations, 10MediaWiki-API, 6Services, 10Traffic, 7Monitoring: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#1982972 (10GWicke) @faidon, is your view that this should be handled by somebody outside ops? [22:29:26] (03PS1) 10GWicke: WIP / untested: Don't decode percent encoding for rest.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/267381 (https://phabricator.wikimedia.org/T125176) [22:30:54] 6operations, 10ops-esams: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1982978 (10Dzahn) 3NEW [22:31:23] (03PS2) 10Subramanya Sastry: visuadiff: add dependences on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 [22:31:49] mutante, updated that as per ori's comment above. [22:32:02] subbu: perfect :) [22:32:50] ack'ed 3042 [22:33:41] just saw, thx [22:34:29] and you are using, I supose, 3049 [22:35:10] (03PS3) 10Subramanya Sastry: visuadiff: add dependencies on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 [22:35:12] jynus: just disconnected now [22:35:41] powercycle? 
[22:37:35] !log powercycle cp3042 [22:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:37:45] (03CR) 10Dzahn: "N: Can't select versions from package 'libjpeg8-dev' as it is purely virtual" [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry) [22:37:53] !log powercycle cp3049, not 42 [22:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:39:00] let's wait and see [22:40:05] jynus: correcting the ticket name.. it was 3049 [22:40:17] no [22:40:19] 6operations, 10ops-esams: cp3049 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1982998 (10Dzahn) [22:40:24] 14:10 < mutante> !log powercycle cp3049 [22:40:25] ahhhh [22:40:44] wait, I have a confusion now [22:40:50] 49 booted for me [22:40:54] RECOVERY - Host cp3049 is UP: PING OK - Packet loss = 0%, RTA = 86.22 ms [22:40:55] 14:06 < icinga-wm> PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:41:05] jynus: started to boot or actually finished? [22:41:11] finished [22:41:15] eh.. [22:41:23] are you sure it wasn't 42 the damaged one? [22:41:27] but where are the recoveries then [22:41:35] yea, see the lines like this [22:41:37] 14:06 < icinga-wm> PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3049_v4, cp3049_v6 [22:41:40] 3049 at the end [22:41:49] let me ssh [22:42:26] yes, 3049 is alive [22:42:51] do not know about the service, but the machine is [22:42:53] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:43:12] there where 3 incidents, 2 came back alice 42 didnt [22:44:13] 1049 3049 up, 3042 down, agree? 
[22:44:47] (icinga agrees with me, at least) [22:48:18] jynus: yes, agree [22:48:27] let me powercycle 42 once again to be 100% sure that is the broken one [22:48:57] *3049* [22:49:45] icinga says 3042 is still borked [22:49:51] while 3049 is happy [22:50:00] yes, for that, there is nothing to lose [22:53:20] !log powercycling cp3042 to test it is really the broken one [22:53:20] 6operations, 10ops-esams: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1983006 (10Dzahn) [22:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:53:20] (03PS2) 10Dzahn: varnish/misc-web: remove caesium backend, decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/267334 (https://phabricator.wikimedia.org/T125165) [22:53:20] (03CR) 10Subramanya Sastry: "I see .. On my laptop (trusty), I see this ... so, I wonder if the fact that I built the packages on a trusty VM means this won't work on" [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry) [22:53:24] confirmed, 3042 is the one with a kernel panic (maybe you got confuse with the other machine because the similar name) [22:53:53] (03PS4) 10Subramanya Sastry: visuadiff: add dependencies on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 [22:54:30] 6operations, 10ops-esams: cp3042 - controller / hardware issue - https://phabricator.wikimedia.org/T125265#1983012 (10jcrespo) ``` [ 0.000000] ACPI: LAPIC (acpi_id[0x2e] lapic_id[0x2a] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x0c] lapic_id[0x10] enabled) [ 0.000000] ACPI: LAPIC (acpi_id[0x30] lapic_... 
[22:55:00] (03CR) 10Dzahn: [C: 032] varnish/misc-web: remove caesium backend, decom'ed [puppet] - 10https://gerrit.wikimedia.org/r/267334 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn) [22:56:44] (03PS1) 10Dzahn: releases: switch reprepro upload server to bromine [puppet] - 10https://gerrit.wikimedia.org/r/267385 (https://phabricator.wikimedia.org/T124261) [22:56:45] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:56:50] (03CR) 10Subramanya Sastry: [C: 04-1] "Actually hold on ... since I switched to upright diff, I may not need canvas anyway since that is used by the resemble package that I am n" [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry) [22:57:46] (03PS2) 10Dzahn: releases: switch reprepro upload server to bromine [puppet] - 10https://gerrit.wikimedia.org/r/267385 (https://phabricator.wikimedia.org/T124261) [22:58:12] (03CR) 10Dzahn: [C: 032] releases: switch reprepro upload server to bromine [puppet] - 10https://gerrit.wikimedia.org/r/267385 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn) [22:59:00] 6operations, 10MediaWiki-Authentication-and-authorization: ~3000% increase in session redis memory usage, causing evictions and session loss - https://phabricator.wikimedia.org/T125267#1983035 (10ori) 3NEW [23:01:52] 6operations, 5Patch-For-Review: decom caesium - https://phabricator.wikimedia.org/T125165#1983056 (10Dzahn) @Papaul see my changes above. Could you follow-up with the DNS removal and then move the ticket to ops-eqiad or make a new ticket for the final decom steps (wipe disk, remove from rack etc)? 
thanks [23:03:03] (03PS5) 10Subramanya Sastry: visualdiff: add dependencies on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 [23:06:08] (03CR) 10Subramanya Sastry: "https://github.com/wikimedia/integration-visualdiff/commit/39f522385f28f4b6b6f4129bea4f3f48721b5573 removes resemblejs which removes the c" [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry) [23:13:18] (03CR) 10Dzahn: "yes, it is. thanks, amending" [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T125165) (owner: 10Dzahn) [23:13:22] (03PS3) 10Dzahn: admin: replace caesium with bromine in enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T124261) [23:14:55] (03CR) 10Alex Monk: [C: 031] "probably correct, but I'm not a releaser or ops" [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn) [23:15:32] (03PS4) 10Dzahn: admin: replace caesium with bromine in enforce-users-groups.sh [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T124261) [23:15:55] (03CR) 10Dzahn: [C: 032] "yes, same role class, same user, just different hostname" [puppet] - 10https://gerrit.wikimedia.org/r/267377 (https://phabricator.wikimedia.org/T124261) (owner: 10Dzahn) [23:21:26] * bd808 is running sync-file [23:22:04] ori: is that "LightProcess" fix easy? 
[23:22:09] it's f'ing annoying [23:22:43] !log bd808@mira Synchronized php-1.27.0-wmf.11/includes/session/SessionBackend.php: Testing proposed fix for T125267 (duration: 01m 26s) [23:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:56] anomie, ori, greg-g: ^ [23:23:39] graph to watch: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Memcached+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report [23:24:27] the periodicity of that graph is interesting [23:24:53] peaks every 15m [23:27:14] bd808: my idea of disabling LightProcess entirely for CLI mode is probably not great, since I believe we have long-running maintenance scripts that shell out [23:27:47] it may have been fixed upstream, we are pretty far behind [23:28:36] bytes_out is climbing again toward the next peak [23:29:08] bytes_in is slightly depressed, almost certainly as a result of the patch, but it didn't fix the issue by the looks of it [23:29:52] and we are pretty sure this is just redis traffic caused right? Not some memc regression somewhere else? [23:30:29] of that i am completely sure; redis memory usage is very stable at ~15mb normally, it's at 500 [23:30:39] *nod* [23:30:52] and we don't store anything other than sessions in this redis? [23:31:44] peak matches 15 min ago. 
so no joy yet [23:32:05] i think GettingStarted stores cleanup category memberships, but it has been doing that for ages, and it hasn't had code changes recently AFAIK [23:32:17] I'll run MONITOR on a redis instance to confirm that the traffic is due to session-related keys [23:32:50] No GettingStarted changes shown on https://www.mediawiki.org/wiki/MediaWiki_1.27/wmf.11 [23:33:07] (03PS6) 10Dzahn: visualdiff: add dependencies on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry) [23:34:09] (03PS1) 10Mobrovac: MobileApps: Change RESTBase URI [puppet] - 10https://gerrit.wikimedia.org/r/267392 (https://phabricator.wikimedia.org/T125252) [23:34:55] it's 3:35, btw, I'd like us to resolve this or rollback sessionmanager by 4:15 to give us some bake time before we all leave for the night [23:35:53] 100 lines of redis activity from mc1001: https://dpaste.de/SJzw/raw [23:35:56] (03CR) 10Dzahn: [C: 032] visualdiff: add dependencies on required deb packages [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry) [23:37:23] !log ruthenium - git pull origin in /srv/visualdiff/ [23:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:38:16] so, there are a few other users, it looks like (abusefilter, chronologyprotector) but it's definitely session stuff [23:38:37] to rule out the possibility of it being something else that is storing huge values, i ran redis-cli --big-keys: https://dpaste.de/WcaH/raw [23:38:59] 91 session lines and 8 others [23:40:20] So the call to init the session used to be guarded by "if ( $wgRequest->checkSessionCookie() || isset( $_COOKIE[$wgCookiePrefix . 'Token'] ) )". That seems to be gone now. [23:40:32] anomie: are we fetching from redis much more often? [23:40:46] bd808: Maybe. [23:40:57] Is the problem fetches and not storing data? 
[23:41:19] yes [23:41:23] the traffic spike is more data being fetch from redis to MW [23:41:30] !log ruthenium - restart parsoid-rt-client, parsoid-vd-client [23:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:35] but apparently data sotred in redis is up as well [23:41:57] but if we are starting more sessions I suppose that would increase the stored data [23:41:58] (03PS1) 10Subramanya Sastry: testreduce: Remove ensure => latest from the repo declaration [puppet] - 10https://gerrit.wikimedia.org/r/267393 [23:42:23] (03CR) 10Dzahn: "i did the git pull origin in both places, visualdiff got updated and testreduce was already latest version" [puppet] - 10https://gerrit.wikimedia.org/r/267378 (owner: 10Subramanya Sastry) [23:42:47] bytes_in is up, but the scale is different [23:42:48] http://ganglia.wikimedia.org/latest/graph.php?r=month&z=xlarge&title=&vl=&x=&n=&hreg[]=mc1001&mreg[]=bytes_in>ype=line&glegend=show&aggregate=1&embed=1&_=1454110869000 [23:42:54] (03CR) 10Jcrespo: "It is probably one of those:" [puppet] - 10https://gerrit.wikimedia.org/r/267328 (owner: 10Jcrespo) [23:43:02] it went from 1.5 to 2.2mb [23:49:55] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: puppet fail [23:50:23] subbu: we got a new problem on ruthenium [23:50:48] oh? [23:50:59] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: undefined method `function_create_resources' for nil:NilClass at /etc/puppet/modules/visualdiff/manifests/init.pp:5 [23:51:05] uhm.. [23:53:51] !log restarted db1018 replication (and its codfw slaves) after a (somewhat) failed maintenance [23:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master