[00:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180206T0000). Please do the needful. [00:00:05] Jdlrobson and tgr: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:35] \O [00:01:52] (03Draft1) 10Paladox: Set replication.maxRetries to 50 [puppet] - 10https://gerrit.wikimedia.org/r/408468 [00:01:54] (03PS2) 10Paladox: Set replication.maxRetries to 50 [puppet] - 10https://gerrit.wikimedia.org/r/408468 [00:01:56] (03CR) 10Chad: [C: 031] "Per IRC: this is safe and can land whenever" [puppet] - 10https://gerrit.wikimedia.org/r/408468 (owner: 10Paladox) [00:04:52] I can SWAT [00:06:10] w00t [00:07:06] (03PS2) 10Thcipriani: Configure settings feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408354 (https://phabricator.wikimedia.org/T182217) (owner: 10Jdlrobson) [00:07:13] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408354 (https://phabricator.wikimedia.org/T182217) (owner: 10Jdlrobson) [00:09:58] (03Merged) 10jenkins-bot: Configure settings feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408354 (https://phabricator.wikimedia.org/T182217) (owner: 10Jdlrobson) [00:10:00] (03PS1) 10BryanDavis: mariadb: remove labsdb1001 & labsdb1003 special behavior [puppet] - 10https://gerrit.wikimedia.org/r/408469 (https://phabricator.wikimedia.org/T184832) [00:10:02] (03PS1) 10BryanDavis: toolschecker: remove labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/408470 (https://phabricator.wikimedia.org/T184832) [00:10:04] (03PS1) 10BryanDavis: prometheus: remove labsdb1001 and labsdb1003 [puppet] - 
10https://gerrit.wikimedia.org/r/408471 (https://phabricator.wikimedia.org/T184832) [00:10:10] (03CR) 10jenkins-bot: Configure settings feedback link [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408354 (https://phabricator.wikimedia.org/T182217) (owner: 10Jdlrobson) [00:11:38] jdlrobson: ^ is live on mwdebug1002, check please [00:11:54] sync away! [00:11:56] (03CR) 10Madhuvishy: [C: 031] toolschecker: remove labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/408470 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [00:12:04] k, doing [00:12:04] (03CR) 10Madhuvishy: [C: 032] toolschecker: remove labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/408470 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [00:12:33] (03CR) 10Madhuvishy: [C: 032] prometheus: remove labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/408471 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [00:13:29] (03CR) 10Madhuvishy: [C: 032] mariadb: remove labsdb1001 & labsdb1003 special behavior [puppet] - 10https://gerrit.wikimedia.org/r/408469 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [00:13:59] (03PS4) 10Thcipriani: Update the ps mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408348 (https://phabricator.wikimedia.org/T184442) (owner: 10Jdlrobson) [00:14:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408348 (https://phabricator.wikimedia.org/T184442) (owner: 10Jdlrobson) [00:14:24] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:408354|Configure settings feedback link]] T182217 (duration: 00m 56s) [00:14:35] ^ jdlrobson first one live now [00:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:36] T182217: Deploy `specialpages` feature branch (promote new Special:MobileOptions, fontchanger etc) - 
https://phabricator.wikimedia.org/T182217 [00:15:42] thcipriani: sweet [00:16:05] (03Merged) 10jenkins-bot: Update the ps mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408348 (https://phabricator.wikimedia.org/T184442) (owner: 10Jdlrobson) [00:17:12] (03PS2) 10Madhuvishy: prometheus: remove labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/408471 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [00:17:30] (03CR) 10jenkins-bot: Update the ps mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408348 (https://phabricator.wikimedia.org/T184442) (owner: 10Jdlrobson) [00:17:46] jdlrobson: ^ is live on mwdebug1002, check please [00:18:40] thcipriani: looks good to me but just double checking wiht designer [00:19:33] (03CR) 10BryanDavis: "jynus and marostegui: can we get a +1 from either of you before we merge this?" [puppet] - 10https://gerrit.wikimedia.org/r/408469 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [00:19:35] (03PS2) 10Madhuvishy: toolschecker: remove labsdb1001 and labsdb1003 [puppet] - 10https://gerrit.wikimedia.org/r/408470 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [00:19:44] kk, let me know when/if it looks good. [00:21:05] you can sync thcipriani [00:21:06] all good [00:23:14] cool, going [00:24:37] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0 [00:25:03] (03CR) 10Madhuvishy: [C: 031] "lgtm, leaving here so one of the dbas can confirm that it looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/408469 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [00:25:23] !log thcipriani@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-ps.svg: SWAT: [[gerrit:408348|Update the ps mobile wordmark]] T184442 (duration: 00m 55s) [00:25:34] ^ jdlrobson live now [00:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:35] T184442: Use the correct Pashto Wikipedia wordmark on mobile site - https://phabricator.wikimedia.org/T184442 [00:25:37] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 [00:26:14] tgr: around for SWAT? [00:26:20] thanks thcipriani [00:26:21] thcipriani: here [00:26:39] (03PS2) 10Thcipriani: Enable AICaptcha data collection on group0/group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408363 (https://phabricator.wikimedia.org/T186244) (owner: 10Gergő Tisza) [00:26:50] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408363 (https://phabricator.wikimedia.org/T186244) (owner: 10Gergő Tisza) [00:26:52] jdlrobson: yw :) [00:29:05] (03Merged) 10jenkins-bot: Enable AICaptcha data collection on group0/group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408363 (https://phabricator.wikimedia.org/T186244) (owner: 10Gergő Tisza) [00:29:19] (03CR) 10jenkins-bot: Enable AICaptcha data collection on group0/group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408363 (https://phabricator.wikimedia.org/T186244) (owner: 10Gergő Tisza) [00:29:40] tgr: ^ is live on mwdebug1002, check please [00:32:13] hm, the i18n message seems to be missing but only on some wikis? 
[00:32:25] https://en.wikinews.org/w/index.php?title=Special:CreateAccount&returnto=Main+Page [00:32:33] https://de.wikinews.org/w/index.php?title=Spezial:Benutzerkonto_anlegen&returnto=Main+Page [00:32:43] the latter works even though it also falls back to English [00:33:49] I ran scap sync for the patch with that message earlier today [00:34:18] also works on testwiki which has English as content language [00:35:51] hrm that is strange... [00:36:10] otherwise the patch seems to be working [00:38:08] https://en.wikibooks.org/wiki/MediaWiki:Wikimediaevents-aicaptcha-datacollection-description [00:38:41] er, it was news, but it seems to be there, too: https://en.wikinews.org/wiki/MediaWiki:Wikimediaevents-aicaptcha-datacollection-description [00:38:47] maybe just resourceloader cache lag then? [00:38:53] that should be about 5 min [00:40:25] yeah, seeing if I can force using: https://wikitech.wikimedia.org/wiki/How_to_deploy_code#A_note_on_JavaScript_and_CSS [00:40:38] but that doesn't seem to be having any effect afaict. [00:43:40] RL does not see those messages for sure: https://en.wikinews.org/w/load.php?debug=true&lang=en&modules=ext.wikimediaEvents.aiCaptcha&skin=vector [00:44:48] and MediaWiki does: https://en.wikinews.org/w/api.php?action=query&meta=allmessages&amprefix=wikimediaevents-aicaptcha- [00:47:37] scap does the update of mediawiki messages, but resource loader is its own thing and my understanding is that it's smart about when it updates (although I don't know how to trigger an update, evidently -- although those instructions I pasted worked at some point). [00:48:19] Yeah there are sometimes issues with RL picking up message changes [00:48:55] The MessageBlobStore thing from the link you pasted should generally work for that though [00:49:17] ah, yeah, I was doing it for enwikibooks. Just did it for enwikinews and it seems to work. [00:50:24] should it be done for all wikis then? [00:51:17] e.g. 
https://en.wikisource.org/w/index.php?title=Special:CreateAccount&returnto=Main+Page is still broken [00:51:37] I don't know how to do that offhand, but possibly? RoanKattouw is that how we get it to pick up message changes? [00:52:02] What's broken right now [00:52:12] On that createaccount link I don't immediately see anything bad [00:52:24] the code is still on mwdebug1002 [00:53:39] forallwikisindblist group1 eval.php '(new ResourceLoader)->getMessageBlobStore()->updateMessage("wikimediaevents-aicaptcha-datacollection-description");' [00:53:42] I think [00:53:54] except for all three affected messages [00:54:14] Wait wait [00:54:23] You're trying to fix broken messages but the code is only on mwdebug? [00:54:49] no, just the config patch to display them [00:54:52] Oh OK [00:54:58] anyway the fix worked [00:55:03] for that one wiki [00:55:03] OK cool [00:55:35] Yes what you wrote should work [00:55:48] ok, running. [00:55:55] As long as the code for this (sans config patch) really is on all servers already [00:56:14] And if that's the case then I'm confused about why this is even necessary, sounds like a failure/bug in RL, unless this code was deployed very recently [00:56:38] four hours ago, not sure if that counts as recently [00:57:42] there used to be a bug like this in RL but I think it was supposed to be fixed (T47877 maybe?) [00:57:50] T47877: During deployment old servers may populate new cache URIs - https://phabricator.wikimedia.org/T47877 [00:59:21] Yeah [00:59:29] Was it deployed with a full scap four hours ago? [00:59:33] yeah [01:01:16] and it was only enabled on testwiki at that point so as far as I can see the mechanism described in that bug does not apply [01:03:56] hrm, so the foreachwikiindblist oneliner just seems to hang on the first wiki in the dblist. [01:06:30] waiting for stdin? Dunno. [01:08:03] maybe it's something like eval.php -e? 
[01:09:32] thcipriani: standard input, rather [01:09:51] so echo '(new ResourceLoader)->getMessageBlobStore()->updateMessage("wikimediaevents-aicaptcha-datacollection-description");' | cat eval.php ... [01:09:58] sorry [01:10:16] not quite sure how that works with foreachwiki [01:11:09] hrm, yeah, me either offhand... [01:13:12] Maybe something like for wiki in `expanddblist group1`; do echo '(new ResourceLoader)->etc->etc();' | mwscript eval.php $wiki; done ? [01:13:13] foreachwiki is kludgy, it's easy to break its assumptions [01:13:24] That's a better idea ^ [01:13:26] it uses "${@}" so probably not at all? [01:13:36] Ugh stupid formatting, I mean for wiki in $(expanddblist group1); do .... [01:13:51] ah, yeah, lemme cook something up with that. [01:17:26] seem sane? https://gist.github.com/thcipriani/869dc99a53dba8c99eaaf3cbcad1b8f5 [01:19:08] As sane as piping raw PHP into eval() while in a loop over multiple wikis will ever be :p [01:19:41] ugggh [01:19:46] :) [01:23:01] yeah looks good [01:23:35] except it would have to be done for all three messages [01:23:50] (and then group 1 as well) [01:24:22] yep, doing now for all 3 messages for group0, then I'll move on to group1 [01:29:39] still running for group1 fyi [01:35:02] we're in the j's. slow going. [01:35:10] er k's now, rather. [01:36:13] sorry for taking up so much of your time with this :/ [01:36:49] (03CR) 10Mattflaschen: [C: 04-2] "Under discussion by the team. Do not merge at this time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: 10Zoranzoki21) [01:37:12] no worries, rather have it out and right than out and wrong or reverted rather than run this short script. [01:38:21] plus I get to unleash my mad bash skillz which is always a bonus. [01:43:38] done! [01:44:00] tgr: final sanity check: everything look good? If so I'll go ahead and sync. [01:45:11] thcipriani: looks good, yeah [01:45:26] cool, going live! 
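(Editor's sketch, not part of the log.) The loop discussed above, iterating over `expanddblist group1` and piping the PHP snippet to `mwscript eval.php` on standard input, might look roughly like the following. This is a hedged reconstruction and not the contents of the linked gist: the function names `php_for_message` and `refresh_blobs` and the `DRY_RUN` switch are hypothetical, and the `mwscript eval.php $wiki` invocation is copied from the chat rather than verified against the tool.

```shell
#!/usr/bin/env bash
# Sketch only. `expanddblist` and `mwscript` are the commands named in the
# chat; everything else here (function names, DRY_RUN) is made up.

# Print the PHP one-liner that refreshes the message blob for one message key.
php_for_message() {
    printf '(new ResourceLoader)->getMessageBlobStore()->updateMessage("%s");\n' "$1"
}

# Read wiki database names from stdin (one per line) and refresh every
# message key given as an argument on each of them.
refresh_blobs() {
    local wiki msg
    while IFS= read -r wiki; do
        [ -n "$wiki" ] || continue          # skip blank lines
        for msg in "$@"; do
            if [ "${DRY_RUN:-0}" = 1 ]; then
                # Only show what would be sent to eval.php.
                printf '%s: %s\n' "$wiki" "$(php_for_message "$msg")"
            else
                # eval.php reads PHP from standard input, hence the pipe.
                # Invocation form taken from the chat (assumption).
                php_for_message "$msg" | mwscript eval.php "$wiki"
            fi
        done
    done
}

# Intended use on the deployment host (commented out so the sketch is inert):
#   expanddblist group1 | refresh_blobs wikimediaevents-aicaptcha-datacollection-description
```

Setting `DRY_RUN=1` prints one `wiki: php` line per wiki and message, which makes it possible to check the loop before anything is piped into `eval.php`. The other two affected message keys are not named in the log, so they are left out here.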
[01:47:30] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:408363|Enable AICaptcha data collection on group0/group1]] T186244 (duration: 00m 56s) [01:47:38] ^ tgr live everywhere! [01:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:44] T186244: Deploy AICaptcha data collection - https://phabricator.wikimedia.org/T186244 [01:47:56] thanks! [01:48:05] yw :) [01:56:13] (03PS1) 10Chad: tests: sync defines.php from core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408475 [01:56:24] (03CR) 10Gergő Tisza: "Already done in https://gerrit.wikimedia.org/r/#/c/408363/ . (FWIW, the testwiki line is superfluous as that wiki is part of group0; looks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408381 (https://phabricator.wikimedia.org/T186244) (owner: 10Groovier1) [01:58:52] (03CR) 10Gergő Tisza: "Already done in https://gerrit.wikimedia.org/r/#/c/408364/ . Also, should be on top of the group0/group1 patch (have that patch as a paren" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408386 (https://phabricator.wikimedia.org/T186244) (owner: 10Groovier1) [02:26:45] (03CR) 10Chad: [C: 032] tests: sync defines.php from core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408475 (owner: 10Chad) [02:29:09] (03Merged) 10jenkins-bot: tests: sync defines.php from core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408475 (owner: 10Chad) [02:29:19] (03CR) 10jenkins-bot: tests: sync defines.php from core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408475 (owner: 10Chad) [02:30:12] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.17) (duration: 06m 15s) [02:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:48] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#3948115 (10Papaul) Dear Mr Papaul Hewlett Packard Enterprise Reference Number: 5326746523 STATUS: Customer Self 
Repair Part has been shipped Part/s shipped: 843307-001 Part description: SPS-PCA DL380/DL360... [02:32:30] !log demon@tin Synchronized tests/Defines.php: no op (duration: 00m 55s) [02:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:17:35] !log reset email for User:Andrewman327 [03:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:25:26] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 778.75 seconds [03:31:52] (03CR) 10Chad: [V: 032 C: 032] Adding gitiles.jar @ stable-2.14 [software/gerrit] - 10https://gerrit.wikimedia.org/r/408437 (owner: 10Chad) [03:32:23] !log demon@tin Started deploy [gerrit/gerrit@f25f017]: adding gitiles plugin [03:32:33] !log demon@tin Finished deploy [gerrit/gerrit@f25f017]: adding gitiles plugin (duration: 00m 10s) [03:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:33:35] Wheee, we get two links now :) [03:33:41] (diffusion) (browse) [03:33:42] :D [03:34:13] (03PS1) 10Dzahn: gerrit: retab secure.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/408482 [03:35:19] (03CR) 10Dzahn: [C: 032] "https://gerrit.wikimedia.org/r/#/c/408482/" [puppet] - 10https://gerrit.wikimedia.org/r/407932 (owner: 10Paladox) [03:37:09] (03PS3) 10Dzahn: Set replication.maxRetries to 50 [puppet] - 10https://gerrit.wikimedia.org/r/408468 (owner: 10Paladox) [03:37:47] (03CR) 10Dzahn: [C: 032] Set replication.maxRetries to 50 [puppet] - 10https://gerrit.wikimedia.org/r/408468 (owner: 10Paladox) [03:38:38] (03CR) 10Dzahn: [C: 032] gerrit: retab secure.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/408482 (owner: 10Dzahn) [03:38:48] (03PS2) 10Dzahn: gerrit: retab secure.config.erb [puppet] - 10https://gerrit.wikimedia.org/r/408482 [03:48:46] ooh [03:49:25] no_justification: are the gitlies URLs stable / can I 
start linking to them? [03:49:44] I kinda wanna rewrite the urls if I can, they could be nicer. [03:49:52] https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/TemplateStyles/+/master/includes/TemplateStylesHooks.php#49 is a bit long [03:49:54] (03CR) 10Dzahn: [C: 032] "should have also had the "gerrit: " prefix" [puppet] - 10https://gerrit.wikimedia.org/r/408468 (owner: 10Paladox) [03:50:20] s/plugins\/gitlies/view/ (or something) [03:51:18] or can we take back git.wm.o now? [03:51:19] :) nice https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/b5a3581793afe2516bab6e4013873203674dff19 [03:52:29] there was a looong ticket to replace all the gitblit links [03:53:36] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 106.04 seconds [03:53:49] https://phabricator.wikimedia.org/tag/gitblit/ :p [03:54:28] yay gitiles, forget gitblit :) [06:00:59] (03CR) 10Hashar: "recheck" [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [06:01:00] (03CR) 10jerkins-bot: [V: 04-1] apertium-ukr: Initial Debian packaging [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [06:12:04] (03PS3) 10KartikMistry: apertium-ukr: Initial Debian packaging [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) [06:13:02] (03CR) 10jerkins-bot: [V: 04-1] apertium-ukr: Initial Debian packaging [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [06:17:27] akosiaris: E: apertium-ukr changes: bad-distribution-in-changes-file jessie-wikimedia - isn't this we're using? 
[06:18:14] (03PS4) 10KartikMistry: apertium-ukr: Initial Debian packaging [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) [06:19:08] (03CR) 10jerkins-bot: [V: 04-1] apertium-ukr: Initial Debian packaging [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [06:27:00] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3948239 (10Marostegui) >>! In T184832#3947771, @RobH wrote: > > > A quick grep of the repo shows the above for labsdb1001... [06:27:38] (03CR) 10KartikMistry: "@hashar, E: apertium-ukr changes: bad-distribution-in-changes-file jessie-wikimedia - should be disable this check? We're using jessie-wik" [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [06:28:09] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T186533#3948243 (10Marostegui) a:03Papaul Hi, Rebuild failed, do you happen to have another disk that we can try? ``` logicaldrive 1 (3.3 TB, RAID 1+0, Interim Recovery Mode) physicaldrive 1I:1:1 (p... 
[06:28:39] 10Operations, 10ops-codfw: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T186549#3948246 (10Marostegui) [06:28:42] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T186533#3948248 (10Marostegui) [06:36:26] (03PS2) 10Marostegui: filtered_tables: Add new columns [puppet] - 10https://gerrit.wikimedia.org/r/394254 (https://phabricator.wikimedia.org/T174569) [06:45:17] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 78, down: 1, dormant: 0, excluded: 0, unused: 0 [06:46:17] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 [06:49:23] !log Fix replication on labsdb1010 - T186579 [06:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:37] T186579: labsdb1010 crashed - https://phabricator.wikimedia.org/T186579 [06:54:42] (03PS1) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408493 [06:56:08] (03PS2) 10Marostegui: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408493 [06:57:49] (03PS1) 10Marostegui: db1077.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/408494 [06:57:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408493 (owner: 10Marostegui) [06:59:37] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408493 (owner: 10Marostegui) [06:59:52] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408493 (owner: 10Marostegui) [07:01:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1077 for MariaDB and kernel upgrade (duration: 00m 56s) [07:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:32] !log 
Stop MySQL on db1077 for a full upgrade [07:06:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:11] (03CR) 10Marostegui: [C: 032] db1077.yaml: Update socket location [puppet] - 10https://gerrit.wikimedia.org/r/408494 (owner: 10Marostegui) [07:11:51] (03PS1) 10Marostegui: db-eqiad.php: Repool db1077 with low traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408495 [07:25:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1077 with low traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408495 (owner: 10Marostegui) [07:27:34] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1077 with low traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408495 (owner: 10Marostegui) [07:27:44] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1077 with low traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408495 (owner: 10Marostegui) [07:28:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1077 with low weight (duration: 00m 53s) [07:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:12] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3948321 (10elukey) >>! In T182832#3947179, @Paladox wro... 
[07:30:23] (03PS3) 10Marostegui: filtered_tables: Add new columns [puppet] - 10https://gerrit.wikimedia.org/r/394254 (https://phabricator.wikimedia.org/T174569) [07:32:00] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408496 [07:32:19] (03CR) 10Marostegui: [C: 032] filtered_tables: Add new columns [puppet] - 10https://gerrit.wikimedia.org/r/394254 (https://phabricator.wikimedia.org/T174569) (owner: 10Marostegui) [07:39:59] (03PS1) 10Marostegui: filtered_tables.txt: Remove header [puppet] - 10https://gerrit.wikimedia.org/r/408497 [07:40:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408496 (owner: 10Marostegui) [07:40:50] (03CR) 10Marostegui: [C: 032] filtered_tables.txt: Remove header [puppet] - 10https://gerrit.wikimedia.org/r/408497 (owner: 10Marostegui) [07:41:51] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408496 (owner: 10Marostegui) [07:42:04] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408496 (owner: 10Marostegui) [07:43:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1077 weight (duration: 00m 55s) [07:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:31] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408498 [08:02:40] 10Operations, 10Patch-For-Review, 10User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3948350 (10Joe) 05Resolved>03Open [08:04:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408498 (owner: 10Marostegui) [08:06:36] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for 
db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408498 (owner: 10Marostegui) [08:07:00] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408498 (owner: 10Marostegui) [08:07:45] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase db1077 weight (duration: 00m 55s) [08:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:49] !log rollback apache/httpd changes on phab1001 (restart required) [08:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:57] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3948363 (10elukey) All apache httpd settings rolled bac... [08:33:58] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3948368 (10elukey) Next step in my opinion would be to... 
[08:36:19] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408504 [08:38:29] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408504 (owner: 10Marostegui) [08:40:01] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408504 (owner: 10Marostegui) [08:40:12] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1077 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408504 (owner: 10Marostegui) [08:41:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1077 (duration: 00m 55s) [08:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:03] (03PS1) 10Giuseppe Lavagetto: profile::etcd: allow tuning RAFT timeouts [puppet] - 10https://gerrit.wikimedia.org/r/408505 (https://phabricator.wikimedia.org/T162013) [08:49:22] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3948374 (10elukey) @akosiaris do you think that we can re-attempt to do the copy (maybe using ionice or rsync with limit bandwitdh or other) ? [09:01:51] 10Operations, 10monitoring: Unicode error while running smart-data-dump on kafka1023 - https://phabricator.wikimedia.org/T186583#3948381 (10elukey) [09:05:48] (03CR) 10Jcrespo: "The "official" socket location is on "/run/*", which is what debian fs standards recommend. 
"/var/run" only happens to be a link to the pr" [puppet] - 10https://gerrit.wikimedia.org/r/408469 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [09:09:16] PROBLEM - Hue Server on thorium is CRITICAL: PROCS CRITICAL: 2 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue [09:10:10] sorry me testing --^ [09:11:16] RECOVERY - Hue Server on thorium is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue [09:12:23] (03PS1) 10Jcrespo: wmfmariadbpy: remove labsdb1001 & labsdb1003 special behavior [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/408507 [09:33:24] (03CR) 10Hashar: [C: 04-1] "I don't think there is any place in production where we use php-fpm. It is probably better to stick to mod_php for consistency?" [puppet] - 10https://gerrit.wikimedia.org/r/407958 (https://phabricator.wikimedia.org/T182832) (owner: 10Paladox) [09:35:49] (03CR) 10Giuseppe Lavagetto: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/407608 (https://phabricator.wikimedia.org/T169144) (owner: 10Gilles) [09:38:19] !log Sanitizing s2 - T174569 [09:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:31] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [09:38:33] (03PS1) 10KartikMistry: apertium-rus-ukr: Initial Debian packaging [debs/contenttranslation/apertium-rus-ukr] - 10https://gerrit.wikimedia.org/r/408508 (https://phabricator.wikimedia.org/T184901) [09:39:04] (03CR) 10jerkins-bot: [V: 04-1] apertium-rus-ukr: Initial Debian packaging [debs/contenttranslation/apertium-rus-ukr] - 10https://gerrit.wikimedia.org/r/408508 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [09:47:14] (03CR) 10Volans: "LGTM, a couple of formatting nipticks online" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/408505 (https://phabricator.wikimedia.org/T162013) (owner: 10Giuseppe Lavagetto) [09:55:54] 
(03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/407433 (https://phabricator.wikimedia.org/T185216) (owner: 10Filippo Giunchedi) [10:01:41] kart_: yeah that's correct, but I am guessing lintian config needs some exception for jessie-wikimedia ? [10:12:51] (03CR) 10Hashar: "recheck" [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [10:14:21] (03PS1) 10Filippo Giunchedi: smart: ignore output decoding errors [puppet] - 10https://gerrit.wikimedia.org/r/408512 (https://phabricator.wikimedia.org/T186583) [10:15:17] (03PS1) 10Elukey: Disable WebHDFS support for Hue [puppet/cdh] - 10https://gerrit.wikimedia.org/r/408513 [10:15:37] akosiaris: hashar is fixing it.. [10:15:48] (03PS2) 10Giuseppe Lavagetto: profile::etcd: allow tuning RAFT timeouts [puppet] - 10https://gerrit.wikimedia.org/r/408505 (https://phabricator.wikimedia.org/T162013) [10:15:49] (03PS1) 10Giuseppe Lavagetto: profile::etcd: stagger software RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/408514 (https://phabricator.wikimedia.org/T162013) [10:16:07] akosiaris: yeah turns out ema already made the package_builder module to provide a "wikimedia" lintian profile [10:16:13] which has our distro listed [10:16:52] CI now runs lintian inside the build environement to use the lintian version that comes with the target distro [10:17:02] but ema customization is not in the build env [10:17:10] I have just added a bind mount to /usr/share/lintian/vendors/wikimedia [10:17:26] and apparently lintian is smart enough to find that "jessie-wikimedia" distro should use the "wikimedia" profile [10:17:57] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/408505 (https://phabricator.wikimedia.org/T162013) (owner: 10Giuseppe Lavagetto) [10:18:00] hashar: thanks! 
[10:20:11] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd: allow tuning RAFT timeouts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/408505 (https://phabricator.wikimedia.org/T162013) (owner: 10Giuseppe Lavagetto) [10:21:41] <_joe_> volans: meh :P [10:22:51] for what? :) [10:22:56] PROBLEM - Nginx local proxy to apache on mw1251 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [10:23:06] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:23:22] (03PS1) 10Giuseppe Lavagetto: etcd: fix parameter type [puppet] - 10https://gerrit.wikimedia.org/r/408516 [10:23:25] <_joe_> this ^^ [10:23:56] RECOVERY - Nginx local proxy to apache on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.041 second response time [10:24:16] (03PS2) 10Giuseppe Lavagetto: etcd: fix parameter type [puppet] - 10https://gerrit.wikimedia.org/r/408516 [10:26:11] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: fix parameter type [puppet] - 10https://gerrit.wikimedia.org/r/408516 (owner: 10Giuseppe Lavagetto) [10:28:07] RECOVERY - puppet last run on etcd1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [10:38:42] <_joe_> beware: pages might arrive from etcdmirror in the next few minutes [10:39:26] (03PS2) 10Elukey: Disable WebHDFS support for Hue [puppet/cdh] - 10https://gerrit.wikimedia.org/r/408513 [10:39:51] <_joe_> !log rolling restart of the codfw cluster to pick up the config changes [10:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:13] (03CR) 10Volans: [C: 04-1] "Although this is quick and nice solution, I've some reservation, see inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/407447 (https://phabricator.wikimedia.org/T185216) (owner: 10Filippo Giunchedi) [10:47:05] (03CR) 10Elukey: "Already running on thorium (puppet disabled) - https://puppet-compiler.wmflabs.org/compiler02/9863/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/408513 (owner: 10Elukey) [10:47:56] <_joe_> !log rolling restart of the eqiad etcd cluster [10:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:28] (03CR) 10Volans: "LGTM, I've just a doubt if maybe we should use 'replace' instead to see that we're missing some characters. What are the real examples of " [puppet] - 10https://gerrit.wikimedia.org/r/408512 (https://phabricator.wikimedia.org/T186583) (owner: 10Filippo Giunchedi) [10:51:01] PROBLEM - Etcd replication lag on conf2002 is CRITICAL: connect to address 10.192.32.141 and port 8000: Connection refused [10:51:01] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:51:17] PROBLEM - etcdmirror-conftool-eqiad-wmnet service on conf2002 is CRITICAL: CRITICAL - Expecting active but unit etcdmirror-conftool-eqiad-wmnet is failed [10:51:24] :( [10:51:33] _joe_: why not downtiming it? [10:51:41] is it maintenance, then? [10:51:42] I just got paged [10:51:46] <_joe_> volans: I hoped this wouldn't happen [10:51:53] or is it real? 
[10:51:55] <_joe_> yeah sorry, my fault [10:52:08] <_joe_> maintenance [10:52:33] <_joe_> I should've imagined it would fail when restarting the eqiad cluster [10:52:45] replication issues are hard to manage, you tell me :-) [10:52:54] (03CR) 10Hashar: "recheck" [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [10:52:56] <_joe_> jynus: I should've just downtimed it [10:53:01] RECOVERY - Etcd replication lag on conf2002 is OK: HTTP OK: HTTP/1.1 200 OK - 148 bytes in 0.074 second response time [10:53:04] <_joe_> but tbh I wasn't sure what would happen [10:53:05] e.g. 1 server down means others can page [10:53:06] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [10:53:17] RECOVERY - etcdmirror-conftool-eqiad-wmnet service on conf2002 is OK: OK - etcdmirror-conftool-eqiad-wmnet is active [10:53:45] (03CR) 10jerkins-bot: [V: 04-1] apertium-ukr: Initial Debian packaging [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [10:53:54] <_joe_> jynus: yeah with 20/20 hindsight, of course it would page when losing connection to the "master" [10:55:16] so we can ignore all of these or are there real issues too? [10:55:36] !log restart eqiad secondary LVSs to make them reconnect to etcd [10:55:41] <_joe_> what do you mean "all of these?" [10:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:56] <_joe_> it was one page, which was due to an ongoing maintenance [10:55:59] all of these pages [10:56:18] I just heard the two and came on [10:56:29] <_joe_> the second is the recovery I guess? 
[10:56:32] well "changed to the irc window" [10:56:41] yes now that I look at them [10:57:16] but "all of these" were actually the various problem messages from icinga actually, that's what I looked at first after hearing the pages [10:57:44] anyways, there is not some actual issue to look at? everything is ok? [10:57:58] !log restart pybal on eqiad primary LVSs to make them reconnect to etcd [10:58:01] <_joe_> yeah, everything is ok now :) [10:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:16] <_joe_> well, apart from pybal, with it's lovable bug when connecting to etcd [10:58:23] <_joe_> *its even [10:58:31] <_joe_> but ema and I are on it [10:58:31] :-) [10:58:35] ok, thanks! [10:59:07] <_joe_> so I'm gonna downtime that alarm as I'm gonna try to actually trigger it [11:01:07] eqiad done, now codfw [11:01:24] !log restart pybal on codfw secondary LVSs to make them reconnect to etcd [11:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:56] !log restart pybal on codfw primary LVSs to make them reconnect to etcd [11:03:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:34] <_joe_> !log forcing a resync of /dev/md1 on conf2001 to verify if the higher timeouts avoid consensus loss in etcd [11:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:21] (03CR) 10Hashar: "recheck" [debs/contenttranslation/apertium-ukr] - 10https://gerrit.wikimedia.org/r/408264 (https://phabricator.wikimedia.org/T184901) (owner: 10KartikMistry) [11:22:36] _joe_: it is possible that https://gerrit.wikimedia.org/r/#/c/408516/ has broken puppet in etcd servers in the toolsforge cluster? 
[11:22:49] https://www.irccloud.com/pastebin/tGiMjY3e/ [11:23:38] <_joe_> arturo: that should've fixed it [11:23:54] <_joe_> arturo: for some reason even deployment-prep seems not to have synced to that fchange [11:23:57] <_joe_> *change [11:24:17] <_joe_> arturo: ouch, actually [11:24:43] <_joe_> heh, I might have trusted my own comments on the manifest, that were wrong, apparently [11:25:01] <_joe_> arturo: fixing, sorry :) [11:25:44] thanks _joe_ np. I opened T186593 to track the thing [11:25:45] T186593: toolforge: broken puppet in etcd servers - https://phabricator.wikimedia.org/T186593 [11:26:16] (03CR) 10MarcoAurelio: Disable Flow extension on Commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: 10Zoranzoki21) [11:26:30] 10Operations, 10Traffic, 10media-storage: "Error: 404, Requested domainname does not exist" when accessing Commons categories/images; works on mobile page - https://phabricator.wikimedia.org/T181801#3948650 (10Aklapper) 05stalled>03declined Unfortunately closing this report as no further information has... [11:27:15] (03CR) 10MarcoAurelio: "> Under discussion by the team. Do not merge at this time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408073 (https://phabricator.wikimedia.org/T186463) (owner: 10Zoranzoki21) [11:31:42] (03PS1) 10Giuseppe Lavagetto: etcd: fix (again) the parameter type for peers_list [puppet] - 10https://gerrit.wikimedia.org/r/408524 (https://phabricator.wikimedia.org/T186593) [11:32:26] <_joe_> arturo: ^^ [11:32:44] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: fix (again) the parameter type for peers_list [puppet] - 10https://gerrit.wikimedia.org/r/408524 (https://phabricator.wikimedia.org/T186593) (owner: 10Giuseppe Lavagetto) [11:32:58] _joe_: great! 
thanks [11:34:34] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3948659 (10Paladox) @elukey do you mean phabricator ups... [11:36:38] 10Operations, 10Phabricator, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3948663 (10elukey) >>! In T182832#3948659, @Paladox wro... [11:37:09] (03PS1) 10Alexandros Kosiaris: Add legoktm to contint-docker admins [puppet] - 10https://gerrit.wikimedia.org/r/408525 (https://phabricator.wikimedia.org/T186475) [11:38:21] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/408512 (https://phabricator.wikimedia.org/T186583) (owner: 10Filippo Giunchedi) [11:40:20] (03CR) 10Alexandros Kosiaris: [C: 032] "Thankfully this is not a sudo request so no need to go through ops meeting (which happened just yesterday), so proceeding" [puppet] - 10https://gerrit.wikimedia.org/r/408525 (https://phabricator.wikimedia.org/T186475) (owner: 10Alexandros Kosiaris) [11:43:14] (03PS3) 10Elukey: Disable WebHDFS support for Hue [puppet/cdh] - 10https://gerrit.wikimedia.org/r/408513 (https://phabricator.wikimedia.org/T182242) [11:44:34] (03PS3) 10Filippo Giunchedi: raid: fix check-hpssacli for controllers in HBA mode [puppet] - 10https://gerrit.wikimedia.org/r/407433 (https://phabricator.wikimedia.org/T185216) [11:45:36] 10Operations, 10ops-eqiad, 10Analytics-Kanban: dbstore1002 possibly MEMORY issues - https://phabricator.wikimedia.org/T183771#3948687 (10elukey) As far as I can see there are no more actions to do on this particular task since: - the host is OOW so after a chat we Chris we'd be inclined not to replace any p... 
[11:48:46] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: legoktm can't deploy docker images on contint1001 - https://phabricator.wikimedia.org/T186475#3948703 (10akosiaris) 05Open>03Resolved a:03akosiaris I am thinking this is resolved now. Don't forget to logout + login to make sure... [11:51:10] (03CR) 10Filippo Giunchedi: [C: 032] raid: fix check-hpssacli for controllers in HBA mode [puppet] - 10https://gerrit.wikimedia.org/r/407433 (https://phabricator.wikimedia.org/T185216) (owner: 10Filippo Giunchedi) [11:53:32] (03CR) 10Volans: "Thanks for working on it and adding the nfs_mount capability. I've few comments inline." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/406779 (https://phabricator.wikimedia.org/T185967) (owner: 10Volans) [11:54:05] !log Sanitize s4 - T174569 [11:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:16] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [12:05:59] (03CR) 10MarcoAurelio: [C: 031] Remove old 'accountcreator' rules now handled by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408071 (https://phabricator.wikimedia.org/T185417) (owner: 10Framawiki) [12:07:26] RECOVERY - HP RAID on restbase1011 is OK: OK: Slot 0: no logical drives --- Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:1:5 - Controller: OK [12:08:41] expected ^ fixed check-hpssacli being deployed [12:11:16] PROBLEM - SSH on dbstore1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:14:36] PROBLEM - MariaDB Slave Lag: s4 on db1102 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 649.72 seconds [12:15:06] ^ that is me [12:15:25] I thought I had silenced it [12:15:40] (03CR) 10Volans: [C: 031] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/408512 (https://phabricator.wikimedia.org/T186583) (owner: 10Filippo Giunchedi) [12:17:56] PROBLEM - Host dbstore1001 is DOWN: PING CRITICAL - Packet loss = 100% 
[12:18:07] checking that [12:21:25] 10Operations, 10Icinga, 10monitoring, 10Wikimedia-Incident: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069#3948744 (10Volans) @Dzahn my understanding is that those have some defects, first to be a bit delayed, and second to have had in the past false positiv... [12:23:23] backups server down [12:23:37] RECOVERY - MariaDB Slave Lag: s4 on db1102 is OK: OK slave_sql_lag Replication lag: 0.27 seconds [12:23:46] jynus: I am checking it [12:24:11] * volans around if I can be of any help [12:24:12] are there network issues? [12:25:30] anyone already in console? I'm in the mgmt interface [12:25:34] I am [12:25:49] No more sessions are available for this type of connection! for me [12:25:57] 10Operations, 10Icinga, 10monitoring, 10Wikimedia-Incident: Icinga: page in case all MediaWiki are throwing 5xx - https://phabricator.wikimedia.org/T186069#3932784 (10Joe) We do have functional checks for all services but MediaWiki. there is an old ticket ( T136839) about enabling swagger specs for MediaWi... [12:26:43] it is totally frozen [12:26:45] I cannot see anything [12:27:21] you mean you cannot even log in? [12:27:37] or you logged in and it does not respond? [12:27:38] Nope, the serial connection doesn't show anything [12:27:48] I cannot see the OS [12:27:53] then force a reboot [12:27:57] no prompt, or errors or anything [12:28:11] port 22 seems closed and no pings [12:28:13] sometimes if storage freezes [12:28:29] going to power cycle it [12:28:31] high io servers like this one don't respond very well [12:28:43] I am taking a nap. Woke up at 4am this morning and I am tired. I might not be available for the European SWAT 1h30 from now [12:29:39] done [12:30:21] let's see if there is any error while it boots up [12:30:43] ?!log force dbstore1001 powercycling? [12:30:51] Memory/battery problems were detected. [12:30:51] The adapter has recovered, but cached data was lost.
[12:31:08] Multibit ECC errors were detected on the RAID controller. [12:31:08] If you continue, data corruption can occur [12:31:08] Please contact technical support to resolve this issue. [12:31:08] :-/ [12:31:13] that matches the errors on the logs [12:31:18] (although the date was wrong) [12:31:24] ouch [12:32:27] !log Power cycled dbstore1001 after it crashed - T186596 [12:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:39] T186596: dbstore1001 crashed - https://phabricator.wikimedia.org/T186596 [12:32:51] 10Operations, 10ops-eqiad, 10DBA: dbstore1001 crashed - https://phabricator.wikimedia.org/T186596#3948771 (10Marostegui) [12:32:57] I have pasted all the errors there [12:33:27] Should I hit continue? [12:34:18] yes, there is nothing else we can do [12:34:31] let's go [12:34:36] hw support is not going to magically save our data [12:34:59] it is booting now [12:35:31] 10Operations, 10ops-eqiad, 10DBA: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948773 (10Marostegui) [12:35:41] we may have to ditch our planning and rebuild dbstore1001 earlier than thought [12:36:21] it is up now [12:36:29] let's see what we find... [12:36:33] oh [12:36:34] not up [12:36:40] Welcome to emergGive root password for maintenance [12:36:40] (or type Control-D to continue): [12:36:45] :( [12:37:06] is anyone checking the HTTP errors? [12:37:09] this is unrelated [12:38:23] which errors jynus ? [12:39:16] sorry, I thought I saw high mediawiki errors [12:40:21] there actually are, but those are known [12:42:42] (03PS1) 10Giuseppe Lavagetto: remove compare-puppet-catalogs [software] - 10https://gerrit.wikimedia.org/r/408527 (https://phabricator.wikimedia.org/T186304) [12:48:54] 10Operations, 10ops-eqiad, 10DBA: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller.
- https://phabricator.wikimedia.org/T186596#3948781 (10Marostegui) ``` Begin: Mounting root file system ... Begin: Running /scripts/loc[ 9.644741] device-mapper: uevent: version 1.0.3 al-top ..... [12:49:46] RECOVERY - Host dbstore1001 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [12:49:51] 10Operations, 10ops-eqiad, 10DBA: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948782 (10Marostegui) ``` [ 171.926534] XFS (dm-0): Mounting V4 Filesystem [ 172.507461] XFS (dm-0): failed to locate log tail [ 172.507464] XFS (dm... [12:50:33] 10Operations, 10ops-eqiad, 10DBA: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948783 (10Marostegui) The LD can be seen finely though - but might be corrupted ``` root@dbstore1001:~# megacli -LDInfo -L0 -a0 Adapter 0 -- Virtual... [12:51:57] PROBLEM - MariaDB Slave IO: s3 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:51:57] PROBLEM - MariaDB Slave SQL: s6 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:51:57] PROBLEM - MariaDB Slave IO: s7 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:51:57] PROBLEM - MariaDB Slave SQL: s8 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:12] PROBLEM - mysqld processes on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:12] PROBLEM - MariaDB Slave IO: s6 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:12] PROBLEM - configured eth on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:16] PROBLEM - Disk space on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:16] PROBLEM - MariaDB Slave SQL: m3 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:16] PROBLEM - MariaDB Slave SQL: s3 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:26] PROBLEM - DPKG on 
dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:27] PROBLEM - MariaDB Slave SQL: s4 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:27] PROBLEM - MariaDB Slave IO: s1 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:27] PROBLEM - MariaDB Slave SQL: s1 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:27] PROBLEM - MariaDB Slave IO: s2 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:27] PROBLEM - MariaDB Slave SQL: m2 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:28] PROBLEM - dhclient process on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:28] PROBLEM - MariaDB Slave SQL: x1 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:36] PROBLEM - Check systemd state on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:36] PROBLEM - Check whether ferm is active by checking the default input chain on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:41] PROBLEM - MariaDB disk space on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:56] PROBLEM - MariaDB Slave IO: s8 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:56] PROBLEM - MariaDB Slave IO: m2 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:56] PROBLEM - MariaDB Slave IO: s5 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:57] PROBLEM - MariaDB Slave IO: x1 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:57] PROBLEM - Check size of conntrack table on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:57] PROBLEM - MariaDB Slave IO: s4 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:57] PROBLEM - MariaDB Slave SQL: s7 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:52:58] PROBLEM - MariaDB Slave IO: m3 on dbstore1001 is CRITICAL: Return code of 
255 is out of bounds [12:52:58] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: Return code of 255 is out of bounds [12:59:36] RECOVERY - DPKG on dbstore1001 is OK: All packages OK [12:59:37] RECOVERY - dhclient process on dbstore1001 is OK: PROCS OK: 0 processes with command name dhclient [12:59:51] RECOVERY - MariaDB disk space on dbstore1001 is OK: DISK OK [12:59:51] RECOVERY - Check whether ferm is active by checking the default input chain on dbstore1001 is OK: OK ferm input default policy is set [12:59:51] RECOVERY - Check systemd state on dbstore1001 is OK: OK - running: The system is fully operational [12:59:57] RECOVERY - SSH on dbstore1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [13:00:06] RECOVERY - Check size of conntrack table on dbstore1001 is OK: OK: nf_conntrack is 0 % full [13:00:26] RECOVERY - configured eth on dbstore1001 is OK: OK - interfaces up [13:00:27] RECOVERY - Disk space on dbstore1001 is OK: DISK OK [13:02:55] did you manage to boot it? 
[13:03:01] yeah [13:03:04] but srv looks unusable [13:03:10] I am on it now [13:03:12] we will see [13:03:26] don't worry too much [13:03:47] unusable as in not even able to mount it [13:04:43] ouch [13:05:00] let me know if you need help [13:05:26] (03CR) 10Joal: [C: 031] "LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/408251 (https://phabricator.wikimedia.org/T166248) (owner: 10Elukey) [13:05:41] akosiaris: I am going to try to run an xfs_check and see what we get [13:06:21] (03PS1) 10Ladsgroup: Add entityUsageModifierLimits config for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408528 (https://phabricator.wikimedia.org/T185693) [13:06:25] ok [13:08:34] (03CR) 10jerkins-bot: [V: 04-1] Add entityUsageModifierLimits config for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408528 (https://phabricator.wikimedia.org/T185693) (owner: 10Ladsgroup) [13:09:29] 10Operations, 10ops-eqiad, 10DBA: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948795 (10Marostegui) This is what xfs_repair (dry run) shows: ``` root@dbstore1001:~# xfs_repair -n -v /dev/mapper/tank-data Phase 1 - find and verify... [13:09:33] akosiaris: ^ [13:10:27] We can probably try to let it repair those [13:10:47] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3499267 (10mark) >>! In T172459#3939141, @jcrespo wrote: > @BBlack The thing is, we physically could do this in 2 weeks- if we put it on our top priority and do nothing else- I don'...
[13:11:21] (03PS1) 10Arturo Borrero Gonzalez: aptly: add more attributes to published repos [puppet] - 10https://gerrit.wikimedia.org/r/408529 (https://phabricator.wikimedia.org/T186539) [13:12:27] (03PS2) 10Ladsgroup: Add entityUsageModifierLimits config for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408528 (https://phabricator.wikimedia.org/T185693) [13:12:48] (03CR) 10Arturo Borrero Gonzalez: [C: 032] aptly: add more attributes to published repos [puppet] - 10https://gerrit.wikimedia.org/r/408529 (https://phabricator.wikimedia.org/T186539) (owner: 10Arturo Borrero Gonzalez) [13:12:57] (03PS2) 10Arturo Borrero Gonzalez: aptly: add more attributes to published repos [puppet] - 10https://gerrit.wikimedia.org/r/408529 (https://phabricator.wikimedia.org/T186539) [13:13:08] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] aptly: add more attributes to published repos [puppet] - 10https://gerrit.wikimedia.org/r/408529 (https://phabricator.wikimedia.org/T186539) (owner: 10Arturo Borrero Gonzalez) [13:13:55] (03PS1) 10Alexandros Kosiaris: Convert to non native debian package [software/service-checker] - 10https://gerrit.wikimedia.org/r/408530 [13:14:47] (03CR) 10Joal: [C: 031] "Thanks for that @elukey!" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/408513 (https://phabricator.wikimedia.org/T182242) (owner: 10Elukey) [13:17:33] (03CR) 10Alexandros Kosiaris: [C: 031] profile::etcd: stagger software RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/408514 (https://phabricator.wikimedia.org/T162013) (owner: 10Giuseppe Lavagetto) [13:17:35] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#3948818 (10elukey) All the work to move eventlogging to systemd is going to be tracked in https://phabricator.wikimedia.org/T114199, let's use this task only for the eventlog1002's productionization. 
[13:22:18] 10Operations, 10ops-eqiad, 10DBA: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948825 (10Marostegui) As read_only the FS can be mounted (only if the recovery is skipped): ``` root@dbstore1001:~# mount -o ro -o norecovery -n /dev/m... [13:25:09] (03CR) 10Giuseppe Lavagetto: [C: 032] Convert to non native debian package [software/service-checker] - 10https://gerrit.wikimedia.org/r/408530 (owner: 10Alexandros Kosiaris) [13:27:35] 10Operations, 10ops-eqiad, 10DBA: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948831 (10Marostegui) This is what triggered the crash: ``` Feb 6 12:06:54 dbstore1001 kernel: [10982464.366365] megaraid_sas 0000:03:00.0: Found FW i... [13:27:53] 10Operations, 10ops-eqiad, 10DBA: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948832 (10Marostegui) [13:30:03] (03PS6) 10Rush: openstack: nova-network and neutron nova::common split [puppet] - 10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) [13:30:44] (03CR) 10jerkins-bot: [V: 04-1] openstack: nova-network and neutron nova::common split [puppet] - 10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [13:35:35] !log upgrading prometheus-elasticsearch-exporter across all elasticsearch nodes [13:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:59] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3948843 (10elukey) >>! In T182993#3947432, @Ottomata wrote: > Thanks @bblack, it's at least good to know that we'll need to do the IPSec thing o... 
[13:40:39] (03PS9) 10Elukey: profile::analytics::refinery::job::camus: add netflow hourly job [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) [13:41:41] (03CR) 10Elukey: [C: 032] profile::analytics::refinery::job::camus: add netflow hourly job [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [13:41:45] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3948850 (10Gehel) elasticsearch exporter has been upgraded across all elasticsearch nodes. We should now have all the metrics we ne... [13:42:47] (03PS6) 10Arturo Borrero Gonzalez: apt: merge report-pending-upgrades script into apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) [13:43:14] 10Operations, 10Commons, 10Multimedia, 10media-storage: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822#3948853 (10Jojr149) a:03Jojr149 [13:43:41] RECOVERY - mysqld processes on dbstore1001 is OK: PROCS OK: 1 process with command name mysqld [13:46:27] 10Operations, 10ops-eqiad, 10DBA: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948856 (10Marostegui) xfs_repair was run, and /srv can be mounted. Some manual writes were good. I have started MySQL to see how it goes with the recov... 
[13:51:49] (03PS7) 10Arturo Borrero Gonzalez: apt: merge report-pending-upgrades script into apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) [13:55:13] (03PS2) 10Giuseppe Lavagetto: profile::etcd: stagger software RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/408514 (https://phabricator.wikimedia.org/T162013) [13:55:23] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::etcd: stagger software RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/408514 (https://phabricator.wikimedia.org/T162013) (owner: 10Giuseppe Lavagetto) [13:55:53] 10Operations, 10ops-eqiad, 10DBA: dbstore1001 crashed: Multibit ECC errors were detected on the RAID controller. - https://phabricator.wikimedia.org/T186596#3948857 (10Marostegui) MySQL won't start: ``` InnoDB: Database page corruption on disk or a failed InnoDB: file read of page 23926. InnoDB: You may have... [13:59:32] (03PS1) 10Alexandros Kosiaris: Rebuild for jessie-wikimedia [software/service-checker] (jessie) - 10https://gerrit.wikimedia.org/r/408532 [14:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180206T1400). [14:00:05] Urbanecm, subbu, matthiasmullie, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:10] o/ [14:00:17] I can [14:00:20] argh [14:00:23] I can swat today [14:00:25] :) [14:00:33] Amir1: want to deploy your change? :) [14:00:44] Sure! 
[14:00:47] (03PS1) 10Alexandros Kosiaris: Rebuild for stretch-wikimedia [software/service-checker] - 10https://gerrit.wikimedia.org/r/408533 [14:00:58] Amir1: go ahead, while I review other patches [14:01:11] (03CR) 10Ladsgroup: [C: 032] Add entityUsageModifierLimits config for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408528 (https://phabricator.wikimedia.org/T185693) (owner: 10Ladsgroup) [14:01:32] (03CR) 10Eranroz: "[please read just before merge]" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408528 (https://phabricator.wikimedia.org/T185693) (owner: 10Ladsgroup) [14:01:58] Urbanecm, subbu, matthiasmullie: around for swat? [14:02:05] here! [14:02:28] matthiasmullie: do you want to deploy your change? [14:02:32] (03CR) 10Volans: "-1 Given we're doing it custom let's move the time too so that if it does page at least it's not in the few hours nobody is around" [puppet] - 10https://gerrit.wikimedia.org/r/408514 (https://phabricator.wikimedia.org/T162013) (owner: 10Giuseppe Lavagetto) [14:02:53] sure [14:03:08] matthiasmullie: you are next, as soon as Amir1 is done [14:03:15] aight [14:03:17] (03Merged) 10jenkins-bot: Add entityUsageModifierLimits config for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408528 (https://phabricator.wikimedia.org/T185693) (owner: 10Ladsgroup) [14:03:37] (03PS2) 10Alexandros Kosiaris: Rebuild for stretch-wikimedia [software/service-checker] - 10https://gerrit.wikimedia.org/r/408533 [14:03:46] subbu, Urbanecm: your patches will not be deployed if you are not around for swat [14:03:46] (03CR) 10Ladsgroup: [C: 032] "1- This is not deployed yet 2- We got the numbers and reviewed everything and got to these the numbers 3- Will move on to other wikis afte" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408528 (https://phabricator.wikimedia.org/T185693) (owner: 10Ladsgroup) [14:04:28] (03PS1) 10Alexandros Kosiaris: Rebuild for stretch-wikimedia [software/service-checker] (stretch) - 
10https://gerrit.wikimedia.org/r/408534 [14:04:33] zeljkof, o/ [14:04:48] (03CR) 10Giuseppe Lavagetto: [C: 032] "> -1 Given we're doing it custom let's move the time too so that if" [puppet] - 10https://gerrit.wikimedia.org/r/408514 (https://phabricator.wikimedia.org/T162013) (owner: 10Giuseppe Lavagetto) [14:04:52] subbu: want to deploy your change? [14:05:07] <_joe_> volans: I did take that into account [14:05:08] (03Abandoned) 10Alexandros Kosiaris: Rebuild for stretch-wikimedia [software/service-checker] - 10https://gerrit.wikimedia.org/r/408533 (owner: 10Alexandros Kosiaris) [14:05:17] zeljkof, can you deploy? [14:05:41] (03PS2) 10Alexandros Kosiaris: Rebuild for jessie-wikimedia [software/service-checker] (jessie) - 10https://gerrit.wikimedia.org/r/408532 [14:05:48] subbu: sure, but for future reference, releng will decrease it's presence in swat, to you have to learn/practice :) [14:06:11] ok .. i might have to first get deploy rights. :) [14:06:30] you have to start working on it now :) [14:06:34] will do. [14:06:37] (03CR) 10Alexandros Kosiaris: [C: 032] Rebuild for stretch-wikimedia [software/service-checker] (stretch) - 10https://gerrit.wikimedia.org/r/408534 (owner: 10Alexandros Kosiaris) [14:06:48] I'm not sure how much releng will be around for swat [14:06:48] (03CR) 10Alexandros Kosiaris: [C: 032] Rebuild for jessie-wikimedia [software/service-checker] (jessie) - 10https://gerrit.wikimedia.org/r/408532 (owner: 10Alexandros Kosiaris) [14:06:56] understood. [14:07:12] (03CR) 10jenkins-bot: Add entityUsageModifierLimits config for Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408528 (https://phabricator.wikimedia.org/T185693) (owner: 10Ladsgroup) [14:07:20] _joe_: 6am UTC + 2h is still 8am CET, so probably you and manuel will the only ones being around anyway... 
but your call ;) [14:07:24] !log re-enable smartpath on restbase1010 (revert experiment) - T178177 [14:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:38] looks fine, moving forward [14:07:38] T178177: Investigate aberrant Cassandra columnfamily read latency of restbase101{0,2,4} - https://phabricator.wikimedia.org/T178177 [14:07:43] sorry 9am [14:07:57] (03CR) 10Eranroz: Add entityUsageModifierLimits config for Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408528 (https://phabricator.wikimedia.org/T185693) (owner: 10Ladsgroup) [14:08:40] subbu: please stand by, I will let you know when your patch is at mwdebug1002, you are third, the first patch is being deployed [14:09:10] ok. [14:09:50] <_joe_> volans: 6+2 = 8 UTC => 9 CET [14:09:55] (03CR) 10Zfilipin: [C: 031] Enable RemexHtml on fiwiki, hewiki, ruwiki, svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407702 (https://phabricator.wikimedia.org/T185945) (owner: 10Subramanya Sastry) [14:09:56] <_joe_> and 10 CEST [14:10:10] !log ladsgroup@tin Synchronized wmf-config/Wikibase.php: [[gerrit:408528|Add entityUsageModifierLimits config for Wikibase (T185693)]] (duration: 00m 55s) [14:10:15] <_joe_> or, am I missing something? 
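The timezone arithmetic in the exchange above can be checked with a short sketch. It assumes, as discussed, a check that starts at 06:00 UTC and is staggered by 2 hours. It uses fixed offsets (CET = UTC+1, CEST = UTC+2) rather than a timezone database. The function name and the date are illustrative only and do not come from the puppet change.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed for illustration (no DST database lookup).
CET = timezone(timedelta(hours=1), "CET")
CEST = timezone(timedelta(hours=2), "CEST")


def staggered_start(base_hour_utc: int, stagger_hours: int) -> datetime:
    """Return the UTC start time after applying the stagger (the date is arbitrary)."""
    base = datetime(2018, 2, 6, base_hour_utc, 0, tzinfo=timezone.utc)
    return base + timedelta(hours=stagger_hours)


start = staggered_start(6, 2)
print(start.strftime("%H:%M %Z"))                   # 08:00 UTC
print(start.astimezone(CET).strftime("%H:%M %Z"))   # 09:00 CET
print(start.astimezone(CEST).strftime("%H:%M %Z"))  # 10:00 CEST
```

This matches the correction in the chat: 06:00 UTC plus 2 hours is 08:00 UTC, which is 09:00 CET in winter and 10:00 CEST in summer.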
[14:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:22] T185693: Implement a (more liberal) usage aspect deduplicater (days: 3) - https://phabricator.wikimedia.org/T185693 [14:12:25] (03CR) 10Zfilipin: [C: 031] New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408281 (https://phabricator.wikimedia.org/T186530) (owner: 10Urbanecm) [14:13:02] I;m done [14:13:29] Amir1: great, matthiasmullie please deploy your patch [14:13:35] ok [14:13:36] (03PS2) 10Filippo Giunchedi: smart: ignore output decoding errors [puppet] - 10https://gerrit.wikimedia.org/r/408512 (https://phabricator.wikimedia.org/T186583) [14:14:35] !log mlitn@tin Synchronized php-1.31.0-wmf.17/extensions/UploadWizard/resources/details/uw.DescriptionsDetailsWidget.js: T184380 (duration: 00m 55s) [14:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:49] T184380: Upload Wizard ignores the description= field when using the upload campaign - https://phabricator.wikimedia.org/T184380 [14:15:36] zeljkof: I'm done [14:15:47] matthiasmullie: great, taking over swat [14:16:14] (03PS2) 10Zfilipin: Enable RemexHtml on fiwiki, hewiki, ruwiki, svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407702 (https://phabricator.wikimedia.org/T185945) (owner: 10Subramanya Sastry) [14:16:27] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407702 (https://phabricator.wikimedia.org/T185945) (owner: 10Subramanya Sastry) [14:16:55] (03CR) 10Filippo Giunchedi: [C: 032] smart: ignore output decoding errors [puppet] - 10https://gerrit.wikimedia.org/r/408512 (https://phabricator.wikimedia.org/T186583) (owner: 10Filippo Giunchedi) [14:17:24] (03PS2) 10Zfilipin: Typo, it's 2018 not 2017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408283 (https://phabricator.wikimedia.org/T185794) (owner: 10Urbanecm) [14:17:46] (03CR) 10Ottomata: [C: 031] Disable WebHDFS support for Hue 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/408513 (https://phabricator.wikimedia.org/T182242) (owner: 10Elukey) [14:18:08] (03CR) 10Elukey: [C: 032] Disable WebHDFS support for Hue [puppet/cdh] - 10https://gerrit.wikimedia.org/r/408513 (https://phabricator.wikimedia.org/T182242) (owner: 10Elukey) [14:18:43] (03Merged) 10jenkins-bot: Enable RemexHtml on fiwiki, hewiki, ruwiki, svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407702 (https://phabricator.wikimedia.org/T185945) (owner: 10Subramanya Sastry) [14:18:53] (03CR) 10jenkins-bot: Enable RemexHtml on fiwiki, hewiki, ruwiki, svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/407702 (https://phabricator.wikimedia.org/T185945) (owner: 10Subramanya Sastry) [14:19:05] 10Operations, 10Commons, 10Multimedia, 10media-storage: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#3948893 (10Jojr149) [14:19:10] (03PS1) 10Elukey: profile::analytics::refinery::job::json_refine: add netflow job [puppet] - 10https://gerrit.wikimedia.org/r/408535 (https://phabricator.wikimedia.org/T181036) [14:19:51] subbu: the commit is at mwdebug1002, please test and let me know if I can deploy [14:19:52] (03PS2) 10Elukey: profile::analytics::refinery::job::json_refine: add netflow job [puppet] - 10https://gerrit.wikimedia.org/r/408535 (https://phabricator.wikimedia.org/T181036) [14:20:02] zeljkof, ty. testing .. [14:23:03] (03CR) 10Ottomata: [C: 031] profile::analytics::refinery::job::camus: add netflow hourly job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/406951 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [14:23:49] zeljkof, looks good on the mwdebug1002 testing side .. do you see any logging / errors? if not, all good. [14:25:18] subbu: I don't see anything unusual in logs, I'll deploy and scap will complain if it sees spikes in logs [14:25:31] sounds good. 
[14:26:14] (03PS1) 10Anomie: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408536 (https://phabricator.wikimedia.org/T166733) [14:26:39] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:407702|Enable RemexHtml on fiwiki, hewiki, ruwiki, svwiki (T185945)]] (duration: 00m 55s) [14:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:51] T185945: Enable RemexHTML on ruwiki, fiwiki, svwiki, hewiki - https://phabricator.wikimedia.org/T185945 [14:27:07] subbu: deployed, please check and thanks for deploying with #releng ;) [14:27:13] zeljkof, ty! :) [14:27:16] (03CR) 10Ladsgroup: [C: 032] "Nope, 73 is not preferring good usage tracking in favor of large wbc_entity_usage, that number in cawiki actually cut the size of the daba" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408528 (https://phabricator.wikimedia.org/T185693) (owner: 10Ladsgroup) [14:27:36] subbu: please make sure you _can_ deploy, in case [14:27:48] releng is not around for swat [14:28:05] (new keyboard, clicking [14:28:24] ok. [14:28:27] zeljkof: Is there room to add https://gerrit.wikimedia.org/r/408536 to SWAT? If not, I'll do it myself after. 
[14:28:33] (03PS1) 10Elukey: Update the cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/408537 (https://phabricator.wikimedia.org/T182242) [14:28:33] return is where shift is on the old keyboard) [14:28:57] !log Changing triggers on s2 - T174569 [14:29:02] anomie: go ahead and deploy it, Urbanecm is not around for swat [14:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:11] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569 [14:29:18] I'll deploy throttle changes after you are done [14:29:27] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408536 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [14:30:50] Urbanecm: around for swat? if not, I'll only deploy throttle changes [14:31:34] (03Merged) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408536 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [14:31:36] (03CR) 10Eranroz: "Amir, thanks for clarification." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408528 (https://phabricator.wikimedia.org/T185693) (owner: 10Ladsgroup) [14:31:48] (03CR) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408536 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [14:33:06] (03PS7) 10Rush: openstack: nova-network and neutron nova::common split [puppet] - 10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) [14:33:14] (03CR) 10Ottomata: "We talking in IRC... 
:)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/408535 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [14:33:17] !log anomie@tin Synchronized wmf-config/InitialiseSettings.php: Setting wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on test wikis (duration: 00m 56s) [14:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:45] zeljkof: Done [14:34:10] anomie: ok, taking over swat, deploying Urbanecm's throttle changes [14:35:24] !log disable puppet on labs things for a cautious change rollout [14:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:47] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408281 (https://phabricator.wikimedia.org/T186530) (owner: 10Urbanecm) [14:37:31] (03Merged) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408281 (https://phabricator.wikimedia.org/T186530) (owner: 10Urbanecm) [14:37:33] (03CR) 10Rush: [C: 032] openstack: nova-network and neutron nova::common split [puppet] - 10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:37:42] (03PS8) 10Rush: openstack: nova-network and neutron nova::common split [puppet] - 10https://gerrit.wikimedia.org/r/405366 (https://phabricator.wikimedia.org/T171494) [14:38:59] (03CR) 10jenkins-bot: New throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408281 (https://phabricator.wikimedia.org/T186530) (owner: 10Urbanecm) [14:39:56] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:408281|New throttle rule (T186530)]] (duration: 00m 55s) [14:40:04] To current swatter: I'm very sory I'm late, I have some problems with my computer (more precisely: I've forgotten my charger at home) so I'm quite limited in my testing-possibilities [14:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:10] 
T186530: Lift IP cap on 2018-02-07 for Senior Citizen Write Wikipedia course - cs.wikipedia - https://phabricator.wikimedia.org/T186530 [14:40:13] But whatever you want, I can try to do [14:40:18] zeljkof ^^^ [14:40:35] Urbanecm_: great, just started deploying throttle changes [14:40:43] Great to know [14:40:44] I'll ping you for testing in a few minutes [14:40:45] jouncebot, current [14:40:49] jouncebot, now [14:40:50] For the next 0 hour(s) and 19 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180206T1400) [14:40:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Apart from a minor comment, LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/407197 (https://phabricator.wikimedia.org/T186204) (owner: 10KartikMistry) [14:41:10] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408283 (https://phabricator.wikimedia.org/T185794) (owner: 10Urbanecm) [14:41:26] (03CR) 10Zfilipin: Typo, it's 2018 not 2017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408283 (https://phabricator.wikimedia.org/T185794) (owner: 10Urbanecm) [14:41:30] (03PS3) 10Zfilipin: Typo, it's 2018 not 2017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408283 (https://phabricator.wikimedia.org/T185794) (owner: 10Urbanecm) [14:41:39] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408283 (https://phabricator.wikimedia.org/T185794) (owner: 10Urbanecm) [14:42:11] (03CR) 10Elukey: [C: 032] Update the cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/408537 (https://phabricator.wikimedia.org/T182242) (owner: 10Elukey) [14:42:17] (03PS2) 10Elukey: Update the cdh module to its latest version [puppet] - 10https://gerrit.wikimedia.org/r/408537 (https://phabricator.wikimedia.org/T182242) [14:42:53] (03CR) 10Alexandros Kosiaris: "Just saw this, any reason to revisit it ? Or abandon ?" 
[software/service-checker] - 10https://gerrit.wikimedia.org/r/308019 (owner: 10Legoktm) [14:43:11] (03Merged) 10jenkins-bot: Typo, it's 2018 not 2017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408283 (https://phabricator.wikimedia.org/T185794) (owner: 10Urbanecm) [14:43:24] (03CR) 10jenkins-bot: Typo, it's 2018 not 2017 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408283 (https://phabricator.wikimedia.org/T185794) (owner: 10Urbanecm) [14:45:11] Urbanecm_: there is 15 minutes left, is any patch urgent? we will not have the time for all of them [14:45:33] !log zfilipin@tin Synchronized wmf-config/throttle.php: SWAT: [[gerrit:408283|Typo, its 2018 not 2017 (T185794)]] (duration: 00m 55s) [14:45:34] I am deploying the second throttle change, so the first two patches are done [14:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:42] T185794: Lift IP cap for account creation for Art+Feminism at Kickstarter March 4th - https://phabricator.wikimedia.org/T185794 [14:45:43] 10Operations, 10Commons, 10Multimedia, 10media-storage: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822#3948962 (10Aklapper) a:05Jojr149>03None @Jojr149: Do you plan to work on this task? If not then please don't set yourself as assignee. Thanks! [14:46:04] zeljkof, 406486 should be deployed, it's a follow up [14:46:17] 10Operations, 10Commons, 10Multimedia, 10media-storage, 10User-Josve05a: Specific revisions of multiple files missing from Swift - 404 Not Found returned - https://phabricator.wikimedia.org/T124101#3948965 (10Aklapper) @Jojr149: Please do not remove user projects from tasks. 
[14:46:20] Urbanecm_ 408281, 408283 are deployed [14:46:30] zeljkof, ack, thanks [14:46:42] ok, will ping you when 406486 is at mwdebug [14:46:59] (03PS2) 10Zfilipin: Allow eliminators to undelete at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406486 (https://phabricator.wikimedia.org/T185829) (owner: 10Urbanecm) [14:47:50] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406486 (https://phabricator.wikimedia.org/T185829) (owner: 10Urbanecm) [14:47:54] zeljkof, can you do me a favour? Can you skip mwdebug this time? As I said, it's not possible to me to use my ordinary computer (it is right next to me, but battery is very low and charger is at home) [14:48:14] I cannot install extensions such as the one for mwdebug at borrowed computers [14:48:43] Urbanecm_: sure, the patch does not look like it would break anything, and if it does, I will revert [14:49:01] Thanks! [14:49:14] Urbanecm_: also, I can test, just let me know what to look for where [14:50:04] Special:Usergrouprights page at the particular wiki (you can append ?uselang=en if you want the page to be in english), scroll down to eliminators and find if there is row saying 'undelete' [14:50:07] If so, the patch is working [14:50:12] !log upgrade service-checker to 0.1.4 on scb1001 [14:50:18] Urbanecm_: will do [14:50:18] If not so, there's something bad happening (but this shouldn't happen) [14:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:22] (03Merged) 10jenkins-bot: Allow eliminators to undelete at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406486 (https://phabricator.wikimedia.org/T185829) (owner: 10Urbanecm) [14:51:35] (03CR) 10jenkins-bot: Allow eliminators to undelete at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/406486 (https://phabricator.wikimedia.org/T185829) (owner: 10Urbanecm) [14:51:52] 10Operations, 10monitoring, 10Patch-For-Review: Unicode error while running 
smart-data-dump on kafka1023 - https://phabricator.wikimedia.org/T186583#3948973 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Fixed! ``` root@kafka1023:~# /usr/local/sbin/smart-data-dump -d DEBUG:__main__:Fact 'raid' discov... [14:53:01] !log Poweroff db1051 for BBU replacement - T186049 [14:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:13] T186049: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049 [14:58:27] 10Operations, 10ops-eqiad, 10DBA: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049#3949018 (10Marostegui) @Cmjohnson this server is now off. Feel free to power it on once you've done the replacement Thanks! [14:59:42] Urbanecm_: I can see undelete at https://ur.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5:%D9%81%DB%81%D8%B1%D8%B3%D8%AA_%D8%A7%D8%AE%D8%AA%DB%8C%D8%A7%D8%B1%D8%A7%D8%AA_%DA%AF%D8%B1%D9%88%DB%81&uselang=en [14:59:44] deploying [15:01:27] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:406486|Allow eliminators to undelete at urwiki (T185829)]] (duration: 00m 55s) [15:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:39] T185829: Allow "eliminators" at urwiki to undelete pages - https://phabricator.wikimedia.org/T185829 [15:02:16] Urbanecm_: deployed, looks good https://ur.wikipedia.org/w/index.php?title=%D8%AE%D8%A7%D8%B5:%D9%81%DB%81%D8%B1%D8%B3%D8%AA_%D8%A7%D8%AE%D8%AA%DB%8C%D8%A7%D8%B1%D8%A7%D8%AA_%DA%AF%D8%B1%D9%88%DB%81&uselang=en [15:02:33] zeljkof, thanks! 
[15:02:41] !log EU SWAT finished [15:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:05] Urbanecm_: done for now, please reschedule the rest of the commits for another swat [15:04:09] 10Operations, 10hardware-requests: decommission mw1163 - https://phabricator.wikimedia.org/T175089#3949069 (10Joe) This server is out of production and hopefully de-racked since a long time as it was part of an older batch of servers. @Cmjohnson can confirm, but this should be decommissioned per T177387 [15:04:32] 10Operations, 10hardware-requests: decommission mw1163 - https://phabricator.wikimedia.org/T175089#3949073 (10Joe) 05Open>03Resolved [15:06:17] (03PS1) 10Elukey: camus::netflow: move the base path directory to /wmf/raw [puppet] - 10https://gerrit.wikimedia.org/r/408539 (https://phabricator.wikimedia.org/T181036) [15:06:46] (03CR) 10Ottomata: [C: 031] camus::netflow: move the base path directory to /wmf/raw [puppet] - 10https://gerrit.wikimedia.org/r/408539 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [15:10:00] (03CR) 10Elukey: [C: 032] camus::netflow: move the base path directory to /wmf/raw [puppet] - 10https://gerrit.wikimedia.org/r/408539 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [15:19:30] (03PS1) 10Tjones: Updates to enable transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408540 [15:28:36] (03PS2) 10Tjones: Updates to enable transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408540 (https://phabricator.wikimedia.org/T23582) [15:29:08] (03Draft1) 10Tulsi Bhagat: Install Translate extension on mai.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408541 (https://phabricator.wikimedia.org/T186597) [15:30:08] elukey: LMK when okay to shutdown an1038 for bbu swap [15:30:31] (03PS15) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'queens' [puppet] - 
10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [15:31:16] (03CR) 10Zoranzoki21: [C: 031] Updates to enable transliteration for crhwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408540 (https://phabricator.wikimedia.org/T23582) (owner: 10Tjones) [15:32:34] (03CR) 10Jayprakash12345: [C: 04-1] "where is the file?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408541 (https://phabricator.wikimedia.org/T186597) (owner: 10Tulsi Bhagat) [15:33:27] cmjohnson1: ack! going to drain it now [15:36:46] !log drain + shutdown of analytics1038 to replace faulty BBU - T185409 [15:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:58] T185409: BBU alarms flapping for analytics1038 - https://phabricator.wikimedia.org/T185409 [15:37:43] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T184832#3949214 (10RobH) No worries, I just didn't want to remove all the old references directly. I wasn't sure which needed rem... [15:38:36] 10Operations, 10Gerrit, 10Release-Engineering-Team (Someday): Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3949229 (10Paladox) p:05Lowest>03Normal Changing priority as we are now unlocked due to us updating to 2.14 :) [15:39:21] (03PS16) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [15:39:27] no_justification: hi! how's it going? just curious, about when do you plan to do the MW branch cut? (wondering how long I reasonably have to put some stuff on the train...) thanks!!!! 
[15:44:22] (03CR) 10Filippo Giunchedi: raid: report PDs from get-raid-status-hpssacli (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/407447 (https://phabricator.wikimedia.org/T185216) (owner: 10Filippo Giunchedi) [15:57:37] 10Operations, 10ops-codfw, 10Cloud-VPS: Connect labtestvirt2003 eth1 and eth2 interface(s) to switch fabric - https://phabricator.wikimedia.org/T183167#3949338 (10Papaul) a:05Papaul>03RobH labtestvirt2002:eth0 = ge-5/0/17 (ID=2187) labtestvirt2002:eth1 = ge-5/0/ 31 (ID=11518) labtestvirt2003:eth... [15:57:40] (03Abandoned) 10Tulsi Bhagat: Install Translate extension on mai.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408541 (https://phabricator.wikimedia.org/T186597) (owner: 10Tulsi Bhagat) [15:58:11] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T186533#3949343 (10Papaul) Trying another one. [15:58:12] cmjohnson1: an1038 is shutting down [16:00:11] (03PS1) 10Filippo Giunchedi: hieradata: extend SMART eqiad deployment [puppet] - 10https://gerrit.wikimedia.org/r/408543 (https://phabricator.wikimedia.org/T86552) [16:02:02] ok [16:04:42] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T186533#3949355 (10Papaul) a:05Papaul>03Marostegui Another disk is in place. 
[16:05:24] 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482#3949357 (10fgiunchedi) [16:10:13] !log upgrading kartotherian / tilerator on maps codfw [16:10:16] PROBLEM - Host analytics1038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:05] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2001.codfw.wmnet [16:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:18] (03CR) 10Volans: [C: 04-1] "replied inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/407447 (https://phabricator.wikimedia.org/T185216) (owner: 10Filippo Giunchedi) [16:13:26] PROBLEM - Host db1051.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:13:35] ^ expected [16:14:10] !log gehel@tin Started deploy [kartotherian/deploy@ecdda41]: new kartotherian packaging [16:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:32] !log gehel@tin Finished deploy [kartotherian/deploy@ecdda41]: new kartotherian packaging (duration: 00m 22s) [16:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:06] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2001.codfw.wmnet [16:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:27] RECOVERY - Host analytics1038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [16:15:27] PROBLEM - Host mc2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:17:03] (03CR) 10Rush: [C: 032] apt: merge report-pending-upgrades script into apt-upgrade [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [16:17:12] PROBLEM - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is CRITICAL: 
/{src}/info.json (tile service info for osm-intl) is CRITICAL: Test tile service info for osm-intl returned the unexpected status 400 (expecting: 200) [16:17:43] gehel: you I'm assuming ^ [16:17:43] gehel: anything going on? just got paged [16:17:45] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2001.codfw.wmnet [16:17:46] maps? [16:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:58] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3949382 (10Nuria) >We have a new employee starting next week who will be working in just the right area to do this review as well, but this isn'... [16:18:00] yep, that's me, slight issue during deployment... [16:18:05] I'm on it [16:18:13] ask for help if you need it [16:18:25] jynus: thanks! everything under control [16:18:37] RECOVERY - Host db1051.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [16:18:54] elukey: should be up now [16:19:19] cmjohnson1: checking, thanks! [16:19:51] (03CR) 10Rush: "* sort output of python3 apt-upgrade report so it shows per repo updates together? (sorted)" [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [16:20:23] cmjohnson1: the BBU seems charging, so far so good! [16:20:33] great [16:21:05] Relative State of Charge: 80 % [16:22:22] RECOVERY - Kartotherian LVS codfw on kartotherian.svc.codfw.wmnet is OK: All endpoints are healthy [16:22:46] 10Operations, 10Domains, 10Research, 10Traffic, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3949394 (10bmansurov) 05stalled>03Open @Dzahn the site is ready for launch. Gerrit has been updated with the latest content from Github. Please go... 
[16:23:17] mutante: o/ ^ fyi (excited to see the site live) [16:24:20] !log gehel@tin Started deploy [tilerator/deploy@29d633e]: new tilerator packaging [16:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:51] !log gehel@tin Finished deploy [tilerator/deploy@29d633e]: new tilerator packaging (duration: 00m 34s) [16:24:54] 10Operations, 10ops-eqiad, 10Analytics-Kanban: BBU alarms flapping for analytics1038 - https://phabricator.wikimedia.org/T185409#3949399 (10elukey) Much better now! ``` elukey@analytics1038:~$ sudo megacli -AdpBbuCmd -GetBbuCapacityInfo -aAll BBU Capacity Info for Adapter: 0 Relative State of Charge: 8... [16:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:03] 10Operations, 10ops-eqiad, 10Analytics-Kanban: BBU alarms flapping for analytics1038 - https://phabricator.wikimedia.org/T185409#3949400 (10elukey) 05Open>03Resolved a:03elukey [16:26:51] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2001.codfw.wmnet [16:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:23] !log gehel@tin Started deploy [kartotherian/deploy@ecdda41]: new kartotherian packaging [16:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:55] bmansurov: cool !:) ok, on it in a minute :) [16:27:56] PROBLEM - Apache HTTP on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [16:28:08] thanks! 
[16:28:08] PROBLEM - HHVM rendering on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [16:28:37] PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.007 second response time [16:29:22] !log mw1262 started hhvm, it had Unhandled server exception: Class undefined: Psr\Log\LogLevel [16:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:36] RECOVERY - Nginx local proxy to apache on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.064 second response time [16:29:46] mmmm [16:29:56] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.037 second response time [16:29:57] !log gehel@tin Finished deploy [kartotherian/deploy@ecdda41]: new kartotherian packaging (duration: 02m 34s) [16:30:03] <_joe_> stat cache issues? [16:30:07] RECOVERY - HHVM rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 75080 bytes in 0.135 second response time [16:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:09] (03PS3) 10Elukey: profile::analytics::refinery::job::json_refine: add netflow job [puppet] - 10https://gerrit.wikimedia.org/r/408535 (https://phabricator.wikimedia.org/T181036) [16:30:18] !log gehel@tin Started deploy [tilerator/deploy@29d633e]: new tilerator packaging [16:30:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:49] !log gehel@tin Finished deploy [tilerator/deploy@29d633e]: new tilerator packaging (duration: 00m 31s) [16:30:50] elukey: /o btw ;) [16:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:58] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2003.codfw.wmnet [16:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:09] bmansurov: hello! 
[16:34:23] !log gehel@tin Started deploy [kartotherian/deploy@ecdda41]: new kartotherian packaging [16:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:24] !log gehel@tin Finished deploy [kartotherian/deploy@ecdda41]: new kartotherian packaging (duration: 01m 01s) [16:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:03] !log gehel@tin Started deploy [tilerator/deploy@29d633e]: new tilerator packaging [16:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:33] !log gehel@tin Finished deploy [tilerator/deploy@29d633e]: new tilerator packaging (duration: 00m 30s) [16:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:59] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2003.codfw.wmnet [16:38:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:12] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2004.codfw.wmnet [16:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:24] !log gehel@tin Started deploy [kartotherian/deploy@ecdda41]: new kartotherian packaging [16:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:01] !log gehel@tin Finished deploy [kartotherian/deploy@ecdda41]: new kartotherian packaging (duration: 00m 36s) [16:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:34] !log gehel@tin Started deploy [tilerator/deploy@29d633e]: new tilerator packaging [16:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:04] !log gehel@tin Finished deploy [tilerator/deploy@29d633e]: new tilerator packaging (duration: 00m 31s) [16:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:26] !log gehel@puppetmaster1001 conftool action : 
set/pooled=yes; selector: name=maps2004.codfw.wmnet [16:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:32] AndyRussG: I usually do it about 10am pacific time, so in about 1h15m [16:43:51] (03PS1) 10Gehel: maps: new path to osm-bright-style [puppet] - 10https://gerrit.wikimedia.org/r/408552 [16:44:10] no_justification: ok cool, thanks! almost set infact :) [16:46:56] (03CR) 10Gehel: [C: 032] "Puppet compiler is happy: https://puppet-compiler.wmflabs.org/compiler02/9867/" [puppet] - 10https://gerrit.wikimedia.org/r/408552 (owner: 10Gehel) [16:47:53] 10Operations, 10Domains, 10Research, 10Traffic, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3949506 (10DarTar) [16:50:15] !log restarting jenkins for updates [16:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:33] !log upgrading kartotherian / tilerator on maps codfw completed [16:50:38] and sorry for all the noise... [16:50:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:47] phab dead? [16:51:56] PROBLEM - https://phabricator.wikimedia.org on phab1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Backend fetch failed - 2426 bytes in 2.329 second response time [16:52:10] :D [16:52:10] * volans looking [16:52:14] getting 503s here [16:52:18] checking [16:52:21] Exception: Failed to `proc_open()`: proc_open() expects parameter 2 to be array, integer given [16:52:21] lovely [16:52:55] urgh [16:52:58] was there a change to phab? [16:53:13] descriptor spec must be an integer indexed array at [/src/future/exec/ExecFuture.php:678] [16:53:25] just paged? [16:53:45] pretty sure entire team was, we're in awake hours for everyone =] [16:53:54] db looks good [16:54:05] https://en.wikipedia.beta.wmflabs.org/w/index.php?title=Selenium_Echo_link_test_0.7791232640420875&action=history < addshore is this anything to do with MCR ? 
[16:54:10] not aware of any work on it, pinged in releng [16:54:13] The last one that I know is me restarting httpd this morning, I rolled back my changes to force httpd to dump a core on segfaults [16:54:25] jdlrobson: there is a chance [16:54:31] httpd is regularly segfaulting but on php_request_shutdown [16:54:35] addshore: i can look into logstash.. [16:54:36] apache or code issuess, then? [16:54:44] it seems a code issue afaics [16:54:55] last i checked was last night if there was enough disk space for those dumps, was fine. seems like code issue but no update ?? [16:54:58] elukey: sorry I means, same issues or different? [16:55:02] Disable the translation extension? [16:55:07] different [16:55:34] the file ./libphutil/src/future/exec/ExecFuture.php is from Sep 28th [16:55:39] on phab1001 [16:55:40] (03PS1) 10Gehel: maps: new path to osm-bright-style is now the default [puppet] - 10https://gerrit.wikimedia.org/r/408554 [16:55:53] jdlrobson: PHP Fatal Error: Class undefined: MediaWikiServices /srv/mediawiki/php-master/extensions/Thanks/includes/ThanksHooks.php [16:56:02] acck [16:56:10] let's try to restart httpd to see if it works ok? [16:56:12] addshore: you raising a bug? [16:56:12] paladox: is it the latst one? [16:56:22] jdlrobson: I can do! 
[16:56:24] elukey: yea [16:56:26] once phab is back ;) [16:56:29] mutante it was created around the problem first started [16:56:33] addshore: thanks <3 hahaha [16:56:37] all chaos when phab is down [16:56:40] !log restart httpd on phab1001 [16:56:43] and it's in elukey debug traceback [16:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:15] phab back for me [16:57:16] back [16:57:17] we are back [16:57:33] yes [16:57:37] thanks [16:57:39] i was about to restart phd service itself, but i am not doing anything now [16:57:43] cool [16:57:43] let's file a ticket, let the team do a more thotugh incestigation [16:57:57] RECOVERY - https://phabricator.wikimedia.org on phab1001 is OK: HTTP OK: HTTP/1.1 200 OK - 31921 bytes in 0.225 second response time [16:58:05] so https://phabricator.wikimedia.org/T182832 is already a mess, a lot of things ongoing sadly [16:59:11] jdlrobson: https://phabricator.wikimedia.org/T186618 [16:59:23] thanks addshore <3 [16:59:59] elukey: I'd rather open a subtask for this, from the stacktrace it really seems a code issue [17:00:05] godog, moritzm, and _joe_: Your horoscope predicts another unfortunate Puppet SWAT(Max 8 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180206T1700). [17:00:05] thcipriani: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[17:00:19] where a method signature was changed from single int to array of ints and some places is still passing an integer [17:00:33] you can find it in the apache error logs [17:00:42] o/ around for puppet swat [17:00:58] addshore: and sorry i assumed it was MCR - I just know that's touching that code a lot so it seemed like a good guess [17:01:06] volans: oh yes I agree, didn't mean to aggregate, I was only sad [17:01:36] volans: the phab's apache error logs are another mess of errors too [17:01:43] (03PS1) 10Hashar: admin: contint-admins to restart Jenkins via systemd [puppet] - 10https://gerrit.wikimedia.org/r/408555 [17:01:45] to have more fun [17:01:52] jdlrobson: patch is up [17:02:02] So who is going to write that up? [17:02:11] (03CR) 10Hashar: [C: 031] admin: contint-admins to restart Jenkins via systemd [puppet] - 10https://gerrit.wikimedia.org/r/408555 (owner: 10Hashar) [17:02:19] if nobody has created one I'll do it in a bit [17:02:29] <_joe_> thcipriani: I'm here [17:02:38] I don't think anyone has, I was going to recommend you to write it up since you fixed ;D [17:02:46] elukey: ^ [17:02:51] <_joe_> thcipriani: we should really move that to service_checker [17:02:57] robh: ack! [17:03:11] I just wanted to make sure someone was and it wasn't going to be forgotten ;] [17:03:38] _joe_: I'm not opposed to that, we will also need service_checker on the deployment host to run the swagger spec against mediawiki as part of deployment in the near future. [17:04:06] <_joe_> thcipriani: so I have a question that is a bit meta for the change [17:04:56] <_joe_> why do we want to cache the datetime exactly? [17:05:07] so I think that the current CR is ok-ish to quickly fix the issue we have right now. 
Although I think that we also would need to consider additional improvements [17:05:12] like: [17:05:22] - rollback the deploy in the canary hosts but one [17:05:37] <_joe_> yeah something like that [17:05:46] - don't consider as canary hosts where the patches are deployed manually already (might skew the baseline) [17:06:12] yeah, rolling back the canaries would be ideal [17:06:17] elukey: robh: i started the page from template, to be edited: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180206-Phabricator [17:06:18] - if mediawiki/apache is down fail no matter what (baseline, cachedfile, etc...) [17:06:30] <_joe_> define "down" [17:06:43] <_joe_> but I fully agree [17:06:54] yeah need some definition of "down", but yeah [17:06:57] <_joe_> volans: can I trust your review of the python code? it's a lot of diff [17:07:24] <_joe_> else thcipriani: see you in 20 minutes when I'm done :P [17:07:25] there is a CI job to cover the python code! [17:07:40] hashar: the linting, not the content :-P [17:08:04] <_joe_> hashar: volans is an excellent linter too, but I was asking about the code functionality :P [17:08:11] _joe_: it looked sane to me and I had some comments along the way that were fixed [17:08:11] it's also been running in beta for a while, FWIW :) [17:08:28] it should really do what it says and AFAIK was tested in deployment prep for a week [17:09:19] <_joe_> still, I prefer to re-read it and add my comments, I have a couple [17:09:25] sure [17:10:56] and I forgot one thing to mention, should the baseline of errors be absolute? [17:11:05] 10Operations, 10hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3949564 (10faidon) a:05faidon>03RobH Approved. [17:11:47] 10Operations, 10Analytics-Kanban, 10User-Elukey: Expand meitnerium's root partition to 100G - https://phabricator.wikimedia.org/T186020#3949567 (10akosiaris) @elukey. Feel free to try.
If anything it will provide us with some more insight into T181121. FWIW I had refilled ganeti1005 with the VMs assigned to... [17:11:50] an absolute threshold could be a valid thing to check [17:12:43] although it would need to be sufficiently high to avoid too many false positives and ideally the delta in error rate could be a tighter limit. [17:13:15] we could do both, delta < X and absolute < Y [17:13:30] 10Operations, 10Cassandra, 10RESTBase-Cassandra, 10Services (doing), 10User-Eevans: Upload cassandra package(s) to wikimedia apt repository - https://phabricator.wikimedia.org/T186619#3949577 (10Eevans) [17:13:47] yep, I could add that. [17:14:47] (03PS1) 10Alexandros Kosiaris: Remove ores::stresstest as its no longer needed [puppet] - 10https://gerrit.wikimedia.org/r/408558 (https://phabricator.wikimedia.org/T171851) [17:14:49] (03PS1) 10Alexandros Kosiaris: ores: Set oresX00X hosts as role::ores [puppet] - 10https://gerrit.wikimedia.org/r/408559 (https://phabricator.wikimedia.org/T171851) [17:14:52] (03PS1) 10Alexandros Kosiaris: Remove ORES profile from scb [puppet] - 10https://gerrit.wikimedia.org/r/408560 (https://phabricator.wikimedia.org/T171851) [17:15:56] (03CR) 10Imarlier: [WIP] coal: Consume EventLogging from Kafka instead of ZMQ (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403560 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [17:16:09] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: legoktm can't deploy docker images on contint1001 - https://phabricator.wikimedia.org/T186475#3949603 (10Legoktm) @akosiaris thank you! Should we add the entire contint-admins team to contint-docker? I think people like @addshore are... 
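The combined check proposed at 17:13:15 ("delta < X and absolute < Y") could look roughly like the sketch below. It is only an illustration of the idea, not the scap or canary-check code under review: the function name `check_canary`, the `CanaryVerdict` type, and all parameter names are hypothetical, and the thresholds are left to the caller.

```python
from dataclasses import dataclass


@dataclass
class CanaryVerdict:
    """Result of a canary check (hypothetical type, for illustration only)."""
    ok: bool
    reason: str


def check_canary(baseline_rate: float, current_rate: float,
                 max_delta: float, max_absolute: float) -> CanaryVerdict:
    """Pass only if BOTH conditions hold: delta < X and absolute < Y.

    baseline_rate: error rate measured before the deploy (assumed unit: errors/s)
    current_rate:  error rate measured on the canaries after the deploy
    max_delta:     X, the tighter limit on the increase over the baseline
    max_absolute:  Y, the looser ceiling that applies whatever the baseline was
    """
    delta = current_rate - baseline_rate
    if current_rate >= max_absolute:
        return CanaryVerdict(False, "absolute error rate at or above limit")
    if delta >= max_delta:
        return CanaryVerdict(False, "error rate increase at or above limit")
    return CanaryVerdict(True, "within limits")
```

The absolute ceiling catches the case raised in the discussion where the baseline itself is already bad (for example, a canary that was patched by hand beforehand), since a small delta alone would otherwise pass.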
[17:17:42] 10Operations, 10Phabricator, 10Release-Engineering-Team: Phabricator down due to - https://phabricator.wikimedia.org/T186620#3949613 (10elukey) p:05Triage>03High [17:18:01] 10Operations, 10Phabricator, 10Release-Engineering-Team: Phabricator down due to "Failed to `proc_open()`: proc_open() expects parameter 2 to be array" - https://phabricator.wikimedia.org/T186620#3949625 (10elukey) [17:20:09] 10Operations, 10ops-eqiad, 10DBA: db1051 database host BBU issues - https://phabricator.wikimedia.org/T186049#3949637 (10Marostegui) BBU is now charging Thanks Chris! [17:20:29] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: legoktm can't deploy docker images on contint1001 - https://phabricator.wikimedia.org/T186475#3949638 (10Addshore) >>! In T186475#3949603, @Legoktm wrote: > @akosiaris thank you! Should we add the entire contint-admins team to contint... [17:20:39] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2039 - https://phabricator.wikimedia.org/T186533#3949642 (10Marostegui) Thanks @Papaul - let's hope it goes fine this time! ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 62% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB,... 
[17:21:08] (03CR) 10Faidon Liambotis: [C: 04-1] apt: merge report-pending-upgrades script into apt-upgrade (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/407465 (https://phabricator.wikimedia.org/T181647) (owner: 10Arturo Borrero Gonzalez) [17:22:33] 10Operations, 10Phabricator, 10Release-Engineering-Team: Phabricator down due to "Failed to `proc_open()`: proc_open() expects parameter 2 to be array" - https://phabricator.wikimedia.org/T186620#3949649 (10elukey) [17:22:43] 10Operations, 10Phabricator, 10Release-Engineering-Team: Phabricator down due to "Failed to `proc_open()`: proc_open() expects parameter 2 to be array" - https://phabricator.wikimedia.org/T186620#3949613 (10elukey) [17:23:43] <_joe_> thcipriani: I'm not very happy with that patch, I have to be honest [17:24:19] 10Operations, 10Phabricator, 10Release-Engineering-Team: Phabricator down due to "Failed to `proc_open()`: proc_open() expects parameter 2 to be array" - https://phabricator.wikimedia.org/T186620#3949613 (10Dzahn) https://wikitech.wikimedia.org/wiki/Incident_documentation/20180206-Phabricator [17:24:28] <_joe_> and let me explain why: apart from a few minor implementation doubts, I don't think that's what we want. We want to cache the deployment time, not the time we last launched this script [17:25:11] but scap is the only thing that invokes this script for mwdeploy afaik, which would be the last deployment time. [17:25:20] (03PS1) 10Dzahn: microsites/research: re-enable git cloning [puppet] - 10https://gerrit.wikimedia.org/r/408562 (https://phabricator.wikimedia.org/T183916) [17:25:23] <_joe_> thcipriani: I might invoke it by hand [17:25:36] I suppose I could create a --before-timestamp flag for this script that scap could invoke [17:25:36] <_joe_> thcipriani: does scap save its deployment times anywhere? 
[17:25:48] <_joe_> that would be my suggestion, tbh [17:26:28] (03PS2) 10Dzahn: microsites/research: re-enable git cloning [puppet] - 10https://gerrit.wikimedia.org/r/408562 (https://phabricator.wikimedia.org/T183916) [17:26:34] <_joe_> I realized that when reading the part where you save the file I started thinking of race conditions. That made me realize we were specializing the script to run in the context of scap, that has its own locking [17:26:49] <_joe_> which made me think, this is the wrong place to save that timestamp [17:27:23] that is a fair criticism. [17:27:25] interesting how a Gerrit change can now have "owner, reviewer AND assignee" [17:27:30] <_joe_> heh sorry [17:27:32] mutante yep [17:27:48] (03PS2) 10Alexandros Kosiaris: ores: Set oresX00X hosts as role::ores [puppet] - 10https://gerrit.wikimedia.org/r/408559 (https://phabricator.wikimedia.org/T171851) [17:27:48] _joe_: which is the worst kind of criticism :) [17:27:49] (03PS2) 10Alexandros Kosiaris: Remove ORES profile from scb [puppet] - 10https://gerrit.wikimedia.org/r/408560 (https://phabricator.wikimedia.org/T171851) [17:27:49] <_joe_> I know it sucks, you worked on it and the patch was good technically [17:28:03] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3949675 (10greg) >>! In T168407#3947728, @chasemp wrote: > @greg can you weigh in here? We are attempting to reallocate this if you don't object. #REDIRECT @... 
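One way to read the `--before-timestamp` suggestion is that the caller (scap, which already holds the deployment lock) passes in the deployment time, so the script never has to persist a timestamp itself. The sketch below assumes that reading. Only the flag name comes from the discussion; its semantics, the epoch-seconds format, and the helpers `parse_args` and `split_by_deploy` are assumptions made for illustration.

```python
import argparse


def parse_args(argv):
    """Parse the command line of a hypothetical canary log check."""
    parser = argparse.ArgumentParser(description="hypothetical canary log check")
    parser.add_argument(
        "--before-timestamp",
        type=float,
        required=True,
        help="deployment start time as Unix epoch seconds, supplied by the caller",
    )
    return parser.parse_args(argv)


def split_by_deploy(events, deploy_ts):
    """Split (timestamp, message) pairs into those before and those at/after deploy_ts."""
    before = [e for e in events if e[0] < deploy_ts]
    after = [e for e in events if e[0] >= deploy_ts]
    return before, after
```

Because the timestamp arrives as an argument, running the script by hand does not overwrite any cached state, which is the race condition concern raised at 17:26:34.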
[17:28:13] (03CR) 10Dzahn: [C: 032] microsites/research: re-enable git cloning [puppet] - 10https://gerrit.wikimedia.org/r/408562 (https://phabricator.wikimedia.org/T183916) (owner: 10Dzahn) [17:28:19] (03CR) 10jerkins-bot: [V: 04-1] ores: Set oresX00X hosts as role::ores [puppet] - 10https://gerrit.wikimedia.org/r/408559 (https://phabricator.wikimedia.org/T171851) (owner: 10Alexandros Kosiaris) [17:28:21] (03CR) 10jerkins-bot: [V: 04-1] Remove ORES profile from scb [puppet] - 10https://gerrit.wikimedia.org/r/408560 (https://phabricator.wikimedia.org/T171851) (owner: 10Alexandros Kosiaris) [17:28:29] _joe_: ok, I'll see what I can do to make this change on the scap side. [17:28:45] <_joe_> thcipriani: scap doesn't save deployment times? [17:29:04] it logs them [17:29:25] if the rollback is easier is even a cleaner solution [17:29:53] and you get the "lock" automatically given that it will fail again in the same way [17:29:55] the rollback is a bigger problem, but that would be the ideal. 
[17:30:12] (03CR) 10Giuseppe Lavagetto: [C: 04-2] "Leaving aside smaller comments on the code, which is overall very good, I do not think this is the right approach to solve the problem:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/403574 (https://phabricator.wikimedia.org/T183999) (owner: 10Thcipriani) [17:30:12] (03PS2) 10Dzahn: Revert "Revert "add research.wikimedia.org"" [dns] - 10https://gerrit.wikimedia.org/r/402465 [17:30:38] (03PS2) 10Alexandros Kosiaris: Remove ores::stresstest as its no longer needed [puppet] - 10https://gerrit.wikimedia.org/r/408558 (https://phabricator.wikimedia.org/T171851) [17:30:40] (03PS3) 10Alexandros Kosiaris: ores: Set oresX00X hosts as role::ores [puppet] - 10https://gerrit.wikimedia.org/r/408559 (https://phabricator.wikimedia.org/T171851) [17:30:42] (03PS3) 10Alexandros Kosiaris: Remove ORES profile from scb [puppet] - 10https://gerrit.wikimedia.org/r/408560 (https://phabricator.wikimedia.org/T171851) [17:30:44] the problem with rollback is we're deploying 5 git repos as one lump of code via rsync, so it's difficult to figure out what the last change was. [17:30:50] bmansurov: there is some conflict in the repo when it pulls latest content :/ [17:31:13] CONFLICT (content): Merge conflict in why-we-read-wikipedia.html and others [17:31:22] mutante: can we force pull? I had to force push the latest changes. [17:31:48] I wanted to get rid of the squashes and push directly from github to gerrit. [17:31:55] i can try deleting it and letting it clone from scratch [17:31:56] there's not even a single version number the represents the current correct state of deployed code (afa MediaWiki is concerned). [17:32:06] mutante: that'd be cool [17:32:15] <_joe_> thcipriani: that sounds reassuring! 
[17:32:37] <_joe_> but yeah we feel your pain :( [17:32:55] :( [17:33:05] anyway, although this isn't the ideal solution, it's one that won't take a quarter to fix :) [17:33:11] bmansurov: rm and running puppet fixed it :) [17:33:22] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: legoktm can't deploy docker images on contint1001 - https://phabricator.wikimedia.org/T186475#3949700 (10akosiaris) >>! In T186475#3949638, @Addshore wrote: >>>! In T186475#3949603, @Legoktm wrote: >> @akosiaris thank you! Should we a... [17:33:41] yay [17:34:06] so, ok, I'll try to get a flag for --before-timestamp and implement an absolute threshold and get a new version of scap out with those changes shortly. [17:34:15] _joe_: volans thanks both for the review [17:34:42] <_joe_> thcipriani: heh sorry for not taking a look earlier [17:34:55] <_joe_> I missed the conversation until basically yesterday [17:35:06] <_joe_> and it's totally my fault [17:35:12] (03CR) 10Dzahn: [C: 032] Revert "Revert "add research.wikimedia.org"" [dns] - 10https://gerrit.wikimedia.org/r/402465 (owner: 10Dzahn) [17:35:18] can't be in all the conversations all the time :) [17:35:27] (03CR) 10Dzahn: [C: 032] "ready to go live now" [dns] - 10https://gerrit.wikimedia.org/r/402465 (owner: 10Dzahn) [17:35:46] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0 [17:35:50] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: legoktm can't deploy docker images on contint1001 - https://phabricator.wikimedia.org/T186475#3949702 (10Legoktm) >>! In T186475#3949700, @akosiaris wrote: >> I think people like @addshore are also going to need to be able to deploy d... [17:36:27] bmansurov: you are live https://research.wikimedia.org/ [17:36:43] mmh I don't see any maintenace window... XioNoX you aware? ^^^ (router) [17:36:43] mutante: thanks! 
[17:36:45] yw [17:38:35] 10Operations, 10Domains, 10Research, 10Traffic, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3949712 (10Dzahn) I re-enabled the git cloning. At first there were some conflicts when puppet tried to git pull. Deleting the entire /srv/org/wikime... [17:38:39] volans: my bad, I jsut enabled a new interface [17:38:56] going to be a link to eqsin [17:39:11] XioNoX: ack, no problem ;) [17:39:29] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: legoktm can't deploy docker images on contint1001 - https://phabricator.wikimedia.org/T186475#3949716 (10akosiaris) >>! In T186475#3949702, @Legoktm wrote: >>>! In T186475#3949700, @akosiaris wrote: >>> I think people like @addshore a... [17:39:46] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [17:41:18] 10Operations, 10Domains, 10Research, 10Traffic, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3949721 (10bmansurov) Thanks, @Dzahn! [17:41:29] no_justification: all set, thanks!!! 
this just merged https://gerrit.wikimedia.org/r/#/c/408561/-1..1 so you should see a CN submodule pointer at 6942cdb67b5ab8e3896a72c2616baaf370eec952 [17:42:04] (03PS4) 10Elukey: profile::analytics::refinery::job::json_refine: add netflow job [puppet] - 10https://gerrit.wikimedia.org/r/408535 (https://phabricator.wikimedia.org/T181036) [17:43:01] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10User-Elukey: Phabricator down due to "Failed to `proc_open()`: proc_open() expects parameter 2 to be array" - https://phabricator.wikimedia.org/T186620#3949723 (10elukey) [17:44:26] (03CR) 10Ottomata: profile::analytics::refinery::job::json_refine: add netflow job (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/408535 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [17:45:27] 10Operations: setup/install bast1002 - https://phabricator.wikimedia.org/T186623#3949728 (10RobH) p:05Triage>03Normal [17:45:34] 10Operations, 10hardware-requests: hardware request for bast1001 replacement - https://phabricator.wikimedia.org/T184480#3949740 (10RobH) 05Open>03Resolved [17:45:36] 10Operations: replace bast1001 (new hardware) - https://phabricator.wikimedia.org/T183412#3949741 (10RobH) [17:48:46] 10Operations: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#3949758 (10RobH) [17:49:17] (03PS4) 10Alexandros Kosiaris: ores: Set oresX00X hosts as role::ores [puppet] - 10https://gerrit.wikimedia.org/r/408559 (https://phabricator.wikimedia.org/T171851) [17:49:45] (03PS4) 10Alexandros Kosiaris: Remove ORES profile from scb [puppet] - 10https://gerrit.wikimedia.org/r/408560 (https://phabricator.wikimedia.org/T171851) [17:49:47] (03PS1) 10Alexandros Kosiaris: ores: Allow oresX00X to reach respective oresrdb [puppet] - 10https://gerrit.wikimedia.org/r/408564 (https://phabricator.wikimedia.org/T171851) [17:50:59] (03CR) 10Elukey: profile::analytics::refinery::job::json_refine: add netflow job (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/408535 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [17:52:25] (03PS5) 10Elukey: profile::analytics::refinery::job::json_refine: add netflow job [puppet] - 10https://gerrit.wikimedia.org/r/408535 (https://phabricator.wikimedia.org/T181036) [17:52:42] (03CR) 10Eevans: [C: 031] hieradata: extend SMART eqiad deployment [puppet] - 10https://gerrit.wikimedia.org/r/408543 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [17:53:15] 10Operations: apply hostname labels to bast1002/WMF4749 - https://phabricator.wikimedia.org/T186625#3949763 (10RobH) p:05Triage>03Low [17:54:43] (03PS5) 10Alexandros Kosiaris: ores: Set oresX00X hosts as role::ores [puppet] - 10https://gerrit.wikimedia.org/r/408559 (https://phabricator.wikimedia.org/T171851) [17:55:15] (03PS5) 10Alexandros Kosiaris: Remove ORES profile from scb [puppet] - 10https://gerrit.wikimedia.org/r/408560 (https://phabricator.wikimedia.org/T171851) [17:57:05] (03CR) 10Ottomata: [C: 031] profile::analytics::refinery::job::json_refine: add netflow job [puppet] - 10https://gerrit.wikimedia.org/r/408535 (https://phabricator.wikimedia.org/T181036) (owner: 10Elukey) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: Time to snap out of that daydream and deploy Services – Graphoid / Parsoid / Citoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180206T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. 
[18:01:58] (03PS1) 10RobH: set dns entries for bast1002 [dns] - 10https://gerrit.wikimedia.org/r/408565 (https://phabricator.wikimedia.org/T186623) [18:03:09] (03CR) 10RobH: [C: 032] set dns entries for bast1002 [dns] - 10https://gerrit.wikimedia.org/r/408565 (https://phabricator.wikimedia.org/T186623) (owner: 10RobH) [18:04:18] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#3949822 (10RobH) [18:08:56] greg-g: ok if I update 3d2png now-ish? [18:09:44] matthiasmullie: yes, during service deploy windows makes sense [18:10:41] (03PS1) 10Ppchelko: Disable Redis JobQueue for refreshLinks. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408569 (https://phabricator.wikimedia.org/T185052) [18:12:14] (03CR) 10jerkins-bot: [V: 04-1] Disable Redis JobQueue for refreshLinks. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408569 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [18:15:48] (03PS17) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [18:15:58] Hi [18:15:58] !log arlolra@tin Started deploy [parsoid/deploy@211ea5d]: Updating Parsoid to 8a0ff6c [18:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:42] Today will be resolved problem with showing pictuers on phabricator? [18:20:12] ACKNOWLEDGEMENT - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi working on transport link to eqsin [18:21:04] (03PS2) 10Ppchelko: Disable Redis JobQueue for refreshLinks. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/408569 (https://phabricator.wikimedia.org/T185052) [18:22:17] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.16 (duration: 07m 29s) [18:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:36] (03PS18) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [18:29:25] !log demon@tin Started scap: bootstrap wmf.20 @ testwiki [18:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:26] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3949920 (10Papaul) 05Open>03Resolved For some reason the apple airport stop working , Resetting it didn't work as well. I really do not need this on site so no need to take time to troubleshoot... [18:31:57] 10Operations, 10Traffic, 10netops, 10Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459#3949929 (10ayounsi) Indeed not urgent, I was not aware of the DB requirements. Waiting for the next DC switchover works for me. [18:34:37] 10Operations, 10ops-codfw: mc2036 mainboard fuse failure - https://phabricator.wikimedia.org/T185587#3949935 (10Papaul) The tracking status on the main board says "Delay" as for today Feb. 
6th at 12:33pm CT [18:35:57] !log arlolra@tin Started deploy [parsoid/deploy@211ea5d]: Updating Parsoid to 8a0ff6c [18:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:42] !log arlolra@tin Finished deploy [parsoid/deploy@211ea5d]: Updating Parsoid to 8a0ff6c (duration: 03m 47s) [18:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:22] !log mlitn@tin Started deploy [3d2png/deploy@8135c2d]: Updating 3d2png repo [18:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:45] (03PS1) 10Ppchelko: Remove jobrunner config specific to htmlCacheUpdate. [puppet] - 10https://gerrit.wikimedia.org/r/408576 (https://phabricator.wikimedia.org/T182023) [18:44:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "without getting into the details of the puppet patch, I think allowing any web application to write inside its installation directory is o" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399101 (owner: 10Ayounsi) [18:44:54] (03PS19) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [18:45:23] (03CR) 10jerkins-bot: [V: 04-1] openstack horizon: rough in manifests for source deploy of Horizon 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [18:45:26] 10Operations, 10Domains, 10Research, 10Traffic, 10Patch-For-Review: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3949949 (10Dzahn) 05Open>03Resolved [18:45:41] (03PS2) 10BryanDavis: mariadb: remove labsdb1001 & labsdb1003 special behavior [puppet] - 10https://gerrit.wikimedia.org/r/408469 (https://phabricator.wikimedia.org/T184832) [18:46:45] !log mlitn@tin Finished deploy [3d2png/deploy@8135c2d]: Updating 3d2png repo (duration: 06m 23s) [18:46:56] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:00] (03PS20) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [18:47:38] (03CR) 10Jcrespo: [C: 031] "This is ok to me, although maybe there is some changes to be done regarding db1069 references, on a separate ticket @marostegui ?" [puppet] - 10https://gerrit.wikimedia.org/r/408469 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [18:47:51] !log Updated Parsoid to 8a0ff6c (T183515, T129372, T181408) [18:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:06] T129372: gallery syntax claims to require 'px' for widths/heights attributes, but actually discards all strings after the number - https://phabricator.wikimedia.org/T129372 [18:48:06] T183515: Possibly permit more ext tag types in directive - https://phabricator.wikimedia.org/T183515 [18:48:06] T181408: Rethink responsive references wrappers - https://phabricator.wikimedia.org/T181408 [18:55:16] !log mlitn@tin Started deploy [3d2png/deploy@8135c2d]: Updating 3d2png repo [18:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:31] !log mlitn@tin Finished deploy [3d2png/deploy@8135c2d]: Updating 3d2png repo (duration: 00m 15s) [18:55:34] !log demon@tin Finished scap: bootstrap wmf.20 @ testwiki (duration: 26m 09s) [18:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:55] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review: legoktm can't deploy docker images on contint1001 - https://phabricator.wikimedia.org/T186475#3950009 (10hashar) Both @Legoktm and @Addshore already have the privileges to run privileged code on CI and can really run any random Docker... 
[18:57:43] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1115 (tendril replacement database) - https://phabricator.wikimedia.org/T185788#3950016 (10Cmjohnson) racked in A6 wmf7316 [19:00:04] 10Operations, 10Gerrit, 10Patch-For-Review, 10Performance: New gerrit login ui is causing performance problems when going through gerrit.wikimedia.org - https://phabricator.wikimedia.org/T185506#3950028 (10Krinkle) @demon I understand. My point is merely about the "flash of unstyled content", which is actu... [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180206T1900) [19:00:04] No GERRIT patches in the queue for this window AFAICS. [19:01:49] (03PS21) 10Andrew Bogott: openstack horizon: rough in manifests for source deploy of Horizon 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) [19:02:04] 10Operations, 10Commons, 10Multimedia, 10media-storage: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822#3950029 (10Dispenser) @Aklapper I'm currently analyzing SVGs files for problems 404 file, missing `xmlns=`, font issues, etc. In a month or two,... [19:05:25] (03PS1) 10Chad: Group0 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408580 [19:06:47] 10Operations, 10Commons, 10Multimedia, 10media-storage: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822#3835699 (10ArielGlenn) >>! In T182822#3950029, @Dispenser wrote: > @Aklapper I'm currently analyzing SVGs files for problems 404 file, missing `xm... 
[19:07:07] 10Operations, 10Commons, 10Multimedia, 10media-storage, 10User-ArielGlenn: Generate a list of files that are supposed to exist but 404s - https://phabricator.wikimedia.org/T182822#3950039 (10ArielGlenn) [19:09:41] (03CR) 10Andrew Bogott: [V: 032 C: 032] openstack horizon: rough in manifests for source deploy of Horizon 'queens' [puppet] - 10https://gerrit.wikimedia.org/r/406853 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [19:14:57] (03PS1) 10Andrew Bogott: added 'newhorizon' domain [dns] - 10https://gerrit.wikimedia.org/r/408584 [19:15:48] (03PS4) 10Ayounsi: [WIP] Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056) [19:16:33] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056) (owner: 10Ayounsi) [19:18:26] (03PS1) 10Giuseppe Lavagetto: [WiP][DnR] Add support for jsonschema-based entities [software/conftool] - 10https://gerrit.wikimedia.org/r/408585 [19:19:27] (03CR) 10Andrew Bogott: [C: 032] added 'newhorizon' domain [dns] - 10https://gerrit.wikimedia.org/r/408584 (owner: 10Andrew Bogott) [19:20:06] (03CR) 10jerkins-bot: [V: 04-1] [WiP][DnR] Add support for jsonschema-based entities [software/conftool] - 10https://gerrit.wikimedia.org/r/408585 (owner: 10Giuseppe Lavagetto) [19:20:18] (03PS1) 10RobH: setting bast1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/408586 (https://phabricator.wikimedia.org/T186623) [19:20:20] (03PS5) 10Ayounsi: [WIP] Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 (https://phabricator.wikimedia.org/T106056) [19:20:22] (03PS1) 10Andrew Bogott: labweb: set up misc-web to service the 'newhorizon' domain [puppet] - 10https://gerrit.wikimedia.org/r/408587 [19:20:56] (03CR) 10jerkins-bot: [V: 04-1] setting bast1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/408586 (https://phabricator.wikimedia.org/T186623) (owner: 10RobH) 
[19:22:14] (03PS2) 10RobH: setting bast1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/408586 (https://phabricator.wikimedia.org/T186623) [19:23:39] (03CR) 10RobH: [C: 032] setting bast1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/408586 (https://phabricator.wikimedia.org/T186623) (owner: 10RobH) [19:23:41] (03CR) 10Andrew Bogott: [C: 032] labweb: set up misc-web to service the 'newhorizon' domain [puppet] - 10https://gerrit.wikimedia.org/r/408587 (owner: 10Andrew Bogott) [19:24:23] 10Operations, 10netops, 10Patch-For-Review: set up a looking glass for WMF ASes - https://phabricator.wikimedia.org/T106056#3950085 (10ayounsi) Gerrit change 390330 is up for reviews. @faidon ? or anyone else? It will then need to be deployed on netmon1002/2001 [19:24:28] andrewbogott: you merge my puppetmaster change? [19:24:40] i didn't see it but you also have some c:2 patches so maybe was you? [19:24:48] I did, sorry to confuse [19:24:51] no worries! [19:25:20] !log Restarted Zuul due to T186381 [19:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:38] T186381: Exception while launching job: TypeError: 'int' object has no attribute '__getitem__' - https://phabricator.wikimedia.org/T186381 [19:28:39] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3950118 (10RobH) 05Resolved>03stalled Just because we don't have time to fix it now doesn't resolve the task. I've reopened it to stalled. 
[19:28:49] (03PS2) 10Andrew Bogott: labweb: set up misc-web to service the 'newhorizon' domain [puppet] - 10https://gerrit.wikimedia.org/r/408587 [19:28:56] 10Operations, 10ops-codfw, 10DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#3950123 (10RobH) p:05Normal>03Low [19:31:10] 10Operations, 10ops-codfw, 10Cloud-VPS: Connect labtestvirt2003 eth1 and eth2 interface(s) to switch fabric - https://phabricator.wikimedia.org/T183167#3950129 (10chasemp) @papaul -- I'm still a bit confused. Really appreciate you sanity checking all this, I think something is still off >>! In T183167#3949... [19:33:38] (03CR) 10Mobrovac: [C: 031] Remove jobrunner config specific to htmlCacheUpdate. [puppet] - 10https://gerrit.wikimedia.org/r/408576 (https://phabricator.wikimedia.org/T182023) (owner: 10Ppchelko) [19:51:35] 10Operations, 10Domains, 10Research, 10Traffic: Create subdomain for Research landing page - https://phabricator.wikimedia.org/T183916#3950197 (10bmansurov) [19:53:20] (03PS1) 10Andrew Bogott: labweb: add profile::openstack::main::horizon::dashboard_source_deploy [puppet] - 10https://gerrit.wikimedia.org/r/408595 [19:57:19] 10Operations, 10ops-codfw, 10Cloud-VPS: Connect labtestvirt2003 eth1 and eth2 interface(s) to switch fabric - https://phabricator.wikimedia.org/T183167#3950222 (10RobH) a:05RobH>03Papaul Ok, synced up with Chase via IRC: @Papaul: Your cable trace of: >>! In T183167#3949338, @Papaul wrote: > labtestvir... [19:58:32] (03CR) 10Andrew Bogott: [C: 032] labweb: add profile::openstack::main::horizon::dashboard_source_deploy [puppet] - 10https://gerrit.wikimedia.org/r/408595 (owner: 10Andrew Bogott) [20:00:04] no_justification: Time to snap out of that daydream and deploy MediaWiki train. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180206T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. 
[20:00:33] (03PS1) 10Andrew Bogott: labweb: set profile::openstack::main::version [puppet] - 10https://gerrit.wikimedia.org/r/408596 [20:00:58] (03CR) 10Andrew Bogott: [C: 032] labweb: set profile::openstack::main::version [puppet] - 10https://gerrit.wikimedia.org/r/408596 (owner: 10Andrew Bogott) [20:03:32] (03PS1) 10Andrew Bogott: labweb: use new, temporary hostname: newhorizon.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/408597 [20:04:26] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3950254 (10chasemp) @MoritzMuehlenhoff when you get a chance can you help @Bstorm get setup with pwstore? She has a key in the public registery and confirmed during a hangout t... [20:04:28] (03CR) 10Andrew Bogott: [C: 032] labweb: use new, temporary hostname: newhorizon.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/408597 (owner: 10Andrew Bogott) [20:04:41] (03PS1) 10RobH: fixing bast1002 entry [dns] - 10https://gerrit.wikimedia.org/r/408598 (https://phabricator.wikimedia.org/T186623) [20:04:56] Choo choo it's train time [20:06:24] (03CR) 10Chad: [C: 032] Group0 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408580 (owner: 10Chad) [20:08:02] (03Merged) 10jenkins-bot: Group0 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408580 (owner: 10Chad) [20:08:13] (03CR) 10RobH: [C: 032] fixing bast1002 entry [dns] - 10https://gerrit.wikimedia.org/r/408598 (https://phabricator.wikimedia.org/T186623) (owner: 10RobH) [20:08:19] (03PS1) 10Andrew Bogott: labweb: fix host name [puppet] - 10https://gerrit.wikimedia.org/r/408600 [20:09:06] PROBLEM - restbase endpoints health on restbase1015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [20:09:08] (03CR) 10Andrew Bogott: [C: 032] labweb: fix host name [puppet] - 
10https://gerrit.wikimedia.org/r/408600 (owner: 10Andrew Bogott) [20:09:46] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed o [20:09:46] e was received [20:09:57] RECOVERY - restbase endpoints health on restbase1015 is OK: All endpoints are healthy [20:10:37] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [20:11:07] RECOVERY - Apache HTTP on labweb1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 620 bytes in 0.043 second response time [20:11:27] RECOVERY - HHVM rendering on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 75099 bytes in 0.138 second response time [20:11:35] !log demon@tin Synchronized php: symlink swap (duration: 01m 17s) [20:11:41] (03CR) 10jenkins-bot: Group0 to wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408580 (owner: 10Chad) [20:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:12] !log demon@tin rebuilt and synchronized wikiversions files: group0 to wmf.20 [20:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:53] !log andrew@tin Started deploy [horizon/deploy@fbf761e]: (no justification provided) [20:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:06] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#3950295 (10RobH) I'm getting an SDA error in the installer. I removed all auto partitioning, and it still has the error, so it is a defective disk. I'm rebooting into the ePSA to test things. Otherw... 
[20:25:13] !log andrew@tin Finished deploy [horizon/deploy@fbf761e]: (no justification provided) (duration: 01m 21s) [20:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:31] 10Operations, 10Puppet: Port puppetlabs PuppetDB 4.4 package to stretch - https://phabricator.wikimedia.org/T185502#3950321 (10herron) Upon closer inspection the puppetdb-termini package needed some adjustments to cooperate with the debian puppet packages (on a puppetmaster) * Change puppetdb-termini packag... [20:33:17] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3950325 (10Dzahn) @chasemp Adding the key to pwstore requires that it has at least 2 signatures on it. Since you already confirmed the key during hangout, could you add one of... [20:39:19] 10Operations, 10Domains, 10Traffic, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3950332 (10Dzahn) @Prtksxna I asked around a bit and yea there is already precedence for this. Others are doing it that way. So as... [20:40:00] (03PS1) 10Krinkle: gerrit: Move header styles back out of login section [puppet] - 10https://gerrit.wikimedia.org/r/408611 [20:40:01] (03PS1) 10Krinkle: gerrit: Apply class=loginParent earlier on login page. [puppet] - 10https://gerrit.wikimedia.org/r/408612 (https://phabricator.wikimedia.org/T185506) [20:40:05] (03PS1) 10Krinkle: gerrit: Scope login-specific styles to loginParent [puppet] - 10https://gerrit.wikimedia.org/r/408613 (https://phabricator.wikimedia.org/T185506) [20:40:17] paladox: Could you test these two patches? ^ [20:40:23] s/two/three [20:42:44] no_justification: The current swagger endpoint checks, during which stage will they be used initially? on tin? canaries? every server? [20:45:57] They're used in canary checks right now [20:46:08] Run from tin, I *think*? 
[20:46:12] thcipriani knows! [20:48:58] I am getting DB lock errors [20:49:27] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3950383 (10Dzahn) get key: gpg --search-keys bstorm@wikimedia.org gpg --recv-keys 2051251AF5172F75 show fingerprint: (verified against a file on bast1001 that Brooke uploade... [20:49:51] 10Operations, 10Patch-For-Review: setup/install bast1002(WMF4749) - https://phabricator.wikimedia.org/T186623#3950384 (10RobH) a:05RobH>03Cmjohnson I put in self dispatch case # SR960490901 and the replacement disk will ship directly to eqiad with a return tag for the old one. Escalating to Chris for the... [20:50:02] matanya: What wikis? Just rolled out to group0 a bit ago [20:50:12] hewiki [20:50:46] Hmm, definitely not from the wmf.20 deploy then [20:50:55] no_justification: not sure it is related, it says DB locked until replicas catch up [20:54:20] anomie: anomie: +1 for your edit summary on testwiki [20:57:48] lol [21:00:57] Krinkle yes [21:01:36] no_justification: Yeah, it'd be nice if the swagger checks happen from tin to tin, so that we don't break production traffic to canaries in the first place for obvious cases. [21:01:44] Especially given there's no built-in rollback. [21:02:24] Yeah [21:03:11] Krinkle: no_justification my plan was to run them from tin. [21:03:16] current checks run from tin [21:03:34] thcipriani: ah, 'from', but not 'on' [21:03:38] right [21:03:48] Sounds good. 
Also we should follow up to catch other projects and languages [21:03:49] I mean the logstash couldn't practically check on tin given it has no traffic [21:03:53] exactly [21:03:58] But the swagger check would be very useful to have on tin [21:04:04] but now that there's going to be a swagger spec it probably..yeah :) [21:04:17] (03CR) 10Paladox: [C: 031] "tested on https://gerrit.git.wmflabs.org/r/login/%23%2Fq%2Fstatus%3Aopen" [puppet] - 10https://gerrit.wikimedia.org/r/408611 (owner: 10Krinkle) [21:04:22] (03CR) 10Paladox: "tested on https://gerrit.git.wmflabs.org/r/login/%23%2Fq%2Fstatus%3Aopen" [puppet] - 10https://gerrit.wikimedia.org/r/408612 (https://phabricator.wikimedia.org/T185506) (owner: 10Krinkle) [21:04:28] especially combined with an mwscript invokation to apply zero tolerance on non-fatal notices/warnings on a simple invocation, which we already reached in prod so non-zero is always a regression. [21:04:31] (03CR) 10Paladox: [C: 031] "tested on https://gerrit.git.wmflabs.org/r/login/%23%2Fq%2Fstatus%3Aopen" [puppet] - 10https://gerrit.wikimedia.org/r/408613 (https://phabricator.wikimedia.org/T185506) (owner: 10Krinkle) [21:04:42] Krinkle: also we added closed.dblist to group0 [21:04:49] no_justification: Interesting. [21:04:57] So we have a slightly larger sample size [21:04:59] Yeah [21:05:01] For train [21:05:12] Krinkle tested @ https://gerrit.git.wmflabs.org/r/login/%23%2Fq%2Fstatus%3Aopen [21:05:15] I should add office wiki too [21:05:31] no_justification: So the case where the current canary checks detect a problem with average error rate, does it give a prompt or just abort by default? [21:05:39] E.g. [y/n] or, re-run with --force [21:05:51] Prompts [21:06:43] no_justification: Hm.. I guess prompt isn't so bad. Although one character could lead to disaster, maybe a bit too easy. [21:07:01] Indeed. 
Could prompt for a sentence [21:07:10] "yes I understand" [21:07:43] Mainly what I think is missing is that "canaries have high average error rate" doesn't directly trigger people to think "this patch just caused a 1% prod outage, sync a fix now if you know the problem based on , or create a revert and sync that afterwards immediately" [21:08:09] True [21:08:44] Also, maybe it makes sense to have a way to do the same sync-file logic for mwdebug. Right now we do pull instead of sync-file for mwdebug, thus giving false confidence in that "everything is fine" on mwdebug, given it is a full production app server, the risk area between mwdebug and app server is almost non-existent in people's minds. [21:08:55] But I understand why we use pull, because we want to undo local hacks, right? [21:09:13] Maybe we should change the workflow to do scap-pull before staging on fenari, and then sync-file-one-server or some such. [21:09:29] Or maybe we should require patches to be split, and disallow use of sync-file unless the commit itself was one file. [21:09:37] (or one dir, etc.) [21:09:46] We should just use scap pull unconditionally imho [21:09:48] So that you can still safely do scap pull, and do the same thing. [21:10:04] Right, but whatever the case, it *has* to be the same for debug and app servers, which it isn't right now. [21:10:13] And get rid of sync file [21:10:25] Scap pull is sufficiently fast [21:10:32] If we skip cdb stuff [21:10:32] Maybe keep it behind a force- prefix for emergency fixes [21:10:37] Yeah, or that. [21:10:45] Given a plain pull is actually pretty fast [21:10:51] we use it for debug servers already [21:11:01] ... which is another thing that's different. [21:11:07] Do we test l10n on debug right now? [21:11:23] (03PS1) 10Ladsgroup: Enable fine grained usage tracking, another batch. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408624 (https://phabricator.wikimedia.org/T186645) [21:11:28] I guess not, but is there a way to do it? 
[21:11:50] You could force a cdb refresh after pulling [21:12:39] Which is basically what scap does (we actually generate them twice and throw away one of them) [21:14:06] !log restarted zuul due to patch being stuck (T186381) [21:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:21] T186381: Exception while launching job: TypeError: 'int' object has no attribute '__getitem__' - https://phabricator.wikimedia.org/T186381 [21:15:21] no_justification: Right. I suppose something like a scap local-sync and scap local-sync-file might make sense as standard workflow for those cases. [21:15:30] e.g. pull instead of push, logically. [21:15:35] bbiab [21:16:25] 10Operations, 10Gerrit, 10Patch-For-Review, 10Performance: New gerrit login ui is causing performance problems when going through gerrit.wikimedia.org - https://phabricator.wikimedia.org/T185506#3950466 (10Krinkle) p:05Triage>03Low a:05Paladox>03Krinkle [21:33:38] (03PS1) 10Ladsgroup: Add edit and create rate limit for wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/408629 (https://phabricator.wikimedia.org/T184948) [21:34:48] !log andrew@tin Started deploy [horizon/deploy@a316e45]: (no justification provided) [21:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:48] !log andrew@tin Finished deploy [horizon/deploy@a316e45]: (no justification provided) (duration: 01m 00s) [21:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:35] !log Going to shutdown Zuul in a few for an emergency hotfix | T186381 [21:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:48] T186381: Exception while launching job: TypeError: 'int' object has no attribute '__getitem__' - https://phabricator.wikimedia.org/T186381 [21:44:56] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3950558 
(10chasemp) >>! In T185493#3950325, @Dzahn wrote: > @chasemp Adding the key to pwstore requires that it has at least 2 signatures on it. Since you already confirmed the... [21:47:47] 10Operations, 10ops-codfw, 10Cloud-VPS: Connect labtestvirt2003 eth1 and eth2 interface(s) to switch fabric - https://phabricator.wikimedia.org/T183167#3950562 (10Papaul) All the information I provide this morning on labtestvirt2002 indeed are correct labtestvirt2001:eth0 = ge-5/0/8 (ID=1080) labtestvirt... [21:49:19] !log Flushing Zuul queue and upgrading [21:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:50] !log Flushing Zuul queue and upgrading to zuul_2.5.1-wmf2 | T186381 [21:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:03] T186381: Exception while launching job: TypeError: 'int' object has no attribute '__getitem__' - https://phabricator.wikimedia.org/T186381 [21:56:29] PROBLEM - pdfrender on scb1001 is CRITICAL: connect to address 10.64.0.16 and port 5252: Connection refused [21:57:29] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [22:09:48] (03CR) 10Dzahn: "since i compiled this on * it found a bunch of unrelated compiler issues mostly due to non-existing fake secrets in labs/private as is oft" [puppet] - 10https://gerrit.wikimedia.org/r/406794 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:14:55] (03PS1) 10Ottomata: Add some date aliases to otto's shell [puppet] - 10https://gerrit.wikimedia.org/r/408697 [22:15:43] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3950669 (10Dzahn) I can't see your signature yet. I tried my default keyserver (hkps://hkps.pool.sks-keyservers.net) and pgp.mit.edu. Depending on which keyserver you used it mi... 
[22:15:59] (03CR) 10Ottomata: [C: 032] Add some date aliases to otto's shell [puppet] - 10https://gerrit.wikimedia.org/r/408697 (owner: 10Ottomata) [22:20:44] (03PS1) 10Dzahn: add missing profile::openstack::labtestn::rabbit_cleanup_pass to fix compiler runs [labs/private] - 10https://gerrit.wikimedia.org/r/408699 [22:21:24] 10Operations, 10ops-codfw, 10Cloud-VPS: Connect labtestvirt2003 eth1 and eth2 interface(s) to switch fabric - https://phabricator.wikimedia.org/T183167#3950688 (10chasemp) We sorted things out in real time and the definitive is: ```labtestvirt2001:eth0 = ge-5/0/17 (ID=2187) labtestvirt2001:eth1 = ge-5/0/31... [22:21:29] (03CR) 10Dzahn: [V: 032 C: 032] add missing profile::openstack::labtestn::rabbit_cleanup_pass to fix compiler runs [labs/private] - 10https://gerrit.wikimedia.org/r/408699 (owner: 10Dzahn) [22:22:15] wtf @ [22:22:15] Error: Evaluation Error: Error while evaluating a Function Call, nova-compute is only valid for [4.4.0-109-generic] and not 4.4.0-81-generic [22:23:14] mutante i think that is a whitelist for drivers [22:23:33] !log Zuul/CI seems to work all fine now [22:23:34] thanks, i just found $whitelist_kernels .. ehm.. [22:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:12] (03CR) 10Paladox: "@hashar though apache them self's recommend you use php-fpm over mod_php due to mod_php has known memory leak problems." [puppet] - 10https://gerrit.wikimedia.org/r/407958 (https://phabricator.wikimedia.org/T182832) (owner: 10Paladox) [22:26:00] (03PS1) 10Dzahn: openstack: add 4.4.0-81-generic to whitelisted kernel versions for compute [puppet] - 10https://gerrit.wikimedia.org/r/408706 [22:26:54] Could not find class ::ldap::role::client::labs [22:27:25] Evaluation Error: Operator '[]' is not applicable to an Undef Value. 
[22:27:52] moar bugs :) [22:29:30] mutante i think that was renamed to be a profile [22:30:20] mutante https://gerrit.wikimedia.org/r/c/407039/ [22:30:44] the "Operator '[]' is not applicable to an Undef Value" thing is in redis/slave code [22:31:03] paladox: thanks, yep [22:31:09] andrewbogott ^^ [22:32:58] it fails on labweb1002 [22:33:04] which directly includes that on site level [22:33:11] checks in prod [22:33:27] puppet disabled there :/ [22:33:53] mutante ah [22:34:03] it wasn't removed in that change from that server [22:34:31] see how it gets removed https://gerrit.wikimedia.org/r/#/c/407039/10/manifests/site.pp [22:34:39] but not from all [22:34:46] what is the right fix [22:35:09] that uses mediawiki::appserver role [22:35:19] but i don't see it in https://github.com/wikimedia/puppet/blob/production/manifests/site.pp#L986 [22:35:25] just removing it i guess [22:36:01] mutante: I'm working on the labweb boxes right now, are they alerting or something? [22:36:15] labtestweb != labweb @ paladox [22:36:24] ah ok [22:36:46] andrewbogott: should we remove the "include ::ldap::role::client::labs" from labweb* ? [22:36:52] you removed it from the other nodes [22:37:15] and it can't find the class [22:37:16] mutante: if it's possible for you to ignore those two hosts, please do [22:37:17] (03PS1) 10Ayounsi: Adding missing eqsin PTR [dns] - 10https://gerrit.wikimedia.org/r/408711 [22:37:45] they're downtimed, right? [22:38:10] it means i can't compile an unrelated change on everything. i was just trying to fix the compiler runs [22:38:19] but yea, i can ignore it [22:38:43] ah, I see, on everything [22:39:03] It's probably fine if you just remove those includes. 
[22:39:04] yea, i was touching the ganglia_cluster thing, so i wanted * and then i just saw a few special cases [22:39:07] that was all [22:39:30] ok :) [22:39:33] (03CR) 10Ayounsi: [C: 032] Adding missing eqsin PTR [dns] - 10https://gerrit.wikimedia.org/r/408711 (owner: 10Ayounsi) [22:43:43] andrewbogott: i had started the compiler run yesterday since it takes so long.. meanwhile things changed. that include isn't there anymore, ignore that stuff :) [22:44:07] cool :) I was wondering since it's working for me currently. (Well, not perfectly, but /that/ part is working) [22:44:12] i have another one though, heh [22:44:47] not important, but: https://gerrit.wikimedia.org/r/#/c/408706/ [22:45:38] (03PS2) 10Dzahn: add missing profile::openstack::labtestn::rabbit_cleanup_pass to fix compiler runs [labs/private] - 10https://gerrit.wikimedia.org/r/408699 [22:46:09] (03CR) 10Dzahn: [V: 032 C: 032] add missing profile::openstack::labtestn::rabbit_cleanup_pass to fix compiler runs [labs/private] - 10https://gerrit.wikimedia.org/r/408699 (owner: 10Dzahn) [22:46:12] (03CR) 10Andrew Bogott: [C: 04-1] "This guard is important to avoid security issues and/or kernel freezes and/or network failures on the labvirts. Any change would need to " [puppet] - 10https://gerrit.wikimedia.org/r/408706 (owner: 10Dzahn) [22:46:41] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): Onboard bstorm to WMF - https://phabricator.wikimedia.org/T185493#3950825 (10Bstorm) It's on there, now. [22:46:55] (03CR) 10Krinkle: "Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/408612 (https://phabricator.wikimedia.org/T185506) (owner: 10Krinkle) [22:47:47] (03Abandoned) 10Dzahn: openstack: add 4.4.0-81-generic to whitelisted kernel versions for compute [puppet] - 10https://gerrit.wikimedia.org/r/408706 (owner: 10Dzahn) [22:50:33] (03CR) 10Dzahn: "so yea, see compiler output. i can't fix all the failing ones but they are not related to this change. 
see the changes on lvs hosts though" [puppet] - 10https://gerrit.wikimedia.org/r/406794 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:52:01] (03CR) 10Dzahn: "attempts to fix: https://gerrit.wikimedia.org/r/408699 https://gerrit.wikimedia.org/r/408706" [puppet] - 10https://gerrit.wikimedia.org/r/406794 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [22:53:31] !log andrew@tin Started deploy [horizon/deploy@48c51e9]: (no justification provided) [22:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:16] !log andrew@tin Finished deploy [horizon/deploy@48c51e9]: (no justification provided) (duration: 02m 45s) [22:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:13] !log andrew@tin Started deploy [horizon/deploy@48c51e9]: (no justification provided) [23:00:16] !log andrew@tin Finished deploy [horizon/deploy@48c51e9]: (no justification provided) (duration: 00m 03s) [23:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:29] !log andrew@tin Started deploy [horizon/deploy@48c51e9]: (no justification provided) [23:02:33] !log andrew@tin Finished deploy [horizon/deploy@48c51e9]: (no justification provided) (duration: 00m 04s) [23:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:12] (03CR) 10Volans: [C: 04-1] "Seems correct, although I would have used a slightly different approach, see inline, plus one possible error (reason of the -1)." 
(034 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/408290 (owner: 10Giuseppe Lavagetto) [23:18:12] 10Operations, 10Gerrit, 10Release-Engineering-Team (Someday): Make sure replying to emails in gerrit 2.14 works - https://phabricator.wikimedia.org/T158915#3950905 (10demon) p:05Normal>03Low Actually I think this is kinda low priority since we don't offer the service yet--it would be an enhancement. [23:19:36] (03Abandoned) 10Nuria: PageCreate events are no longer flowing [puppet] - 10https://gerrit.wikimedia.org/r/383485 (https://phabricator.wikimedia.org/T171629) (owner: 10Nuria) [23:23:54] (03CR) 10Mobrovac: hieradata: extend SMART eqiad deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/408543 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [23:30:07] (03CR) 10Chad: [C: 031] gerrit: Apply class=loginParent earlier on login page. [puppet] - 10https://gerrit.wikimedia.org/r/408612 (https://phabricator.wikimedia.org/T185506) (owner: 10Krinkle) [23:30:25] (03CR) 10Chad: [C: 031] gerrit: Move header styles back out of login section [puppet] - 10https://gerrit.wikimedia.org/r/408611 (owner: 10Krinkle) [23:59:16] (03CR) 10Chad: [C: 031] gerrit: Scope login-specific styles to loginParent [puppet] - 10https://gerrit.wikimedia.org/r/408613 (https://phabricator.wikimedia.org/T185506) (owner: 10Krinkle)