[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160908T0000). [00:01:46] (03PS1) 10Yuvipanda: ldap: Factor out role that provides ldap.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/309209 [00:02:19] (03PS1) 10Mobrovac: Kartotherian: Do not use a proxy [puppet] - 10https://gerrit.wikimedia.org/r/309210 [00:02:23] (03PS2) 10Yuvipanda: ldap: Factor out role that provides ldap.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/309209 [00:02:29] (03CR) 10Yuvipanda: [C: 032 V: 032] ldap: Factor out role that provides ldap.yaml file [puppet] - 10https://gerrit.wikimedia.org/r/309209 (owner: 10Yuvipanda) [00:03:09] !log Phabricator upgrade starting momentarily. Service will be offline for a short time, most likely less than 5 minutes. [00:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:06:15] y [00:08:32] n [00:10:04] (03PS1) 10Yuvipanda: labs: Have puppet enc work for labtest too [puppet] - 10https://gerrit.wikimedia.org/r/309212 [00:10:36] (03PS2) 10Yuvipanda: labs: Have puppet enc work for labtest too [puppet] - 10https://gerrit.wikimedia.org/r/309212 [00:10:49] (03CR) 10Yuvipanda: [C: 032 V: 032] labs: Have puppet enc work for labtest too [puppet] - 10https://gerrit.wikimedia.org/r/309212 (owner: 10Yuvipanda) [00:11:38] (03PS2) 10Mobrovac: Kartotherian: Do not use a proxy [puppet] - 10https://gerrit.wikimedia.org/r/309210 [00:13:03] Dereckson: SWAT done? Can I/Krenair push something else out? VE UBN. :-( [00:14:00] James_F: just FYI, phab is down momentarily, if that matters [00:14:13] twentyafterfour: It doesn't, but yeah, looking forward to the upgrade. :-) [00:14:15] greg-g: y u n? 
[00:14:47] !log phabricator upgrade is running database migrations now, taking longer than expected [00:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:15:51] !log upgrade complete. Service restored and everything seems normal. [00:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:16:04] twentyafterfour: mostly for the lulz [00:16:11] heheh [00:16:17] I lol'd [00:16:38] LOL [00:16:44] anyways the logo looks good [00:16:45] now [00:16:47] :) [00:16:57] Wow, the font for the name of the site is a bit jarring. [00:17:00] Yeah thanks for taking care of that, paladox [00:17:12] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:17:16] Your welcome :) [00:17:20] James_F: yeah I'm not sure if I like the new font or not yet [00:17:26] * James_F shrugs. [00:17:38] Is there a change list? [00:17:44] twentyafterfour what do you mean the new front? [00:17:57] ooo, and the administrator notifications are now nicer than a bar of red text [00:18:02] Yep [00:18:09] paladox: the name "Phabricator" in the top left [00:18:14] Ah [00:18:19] Thanks [00:18:25] :) [00:18:25] greg-g check the new config section [00:18:31] (03PS3) 10Mobrovac: Kartotherian: Do not use a proxy [puppet] - 10https://gerrit.wikimedia.org/r/309210 [00:18:37] that seems to be the nice part [00:18:38] new icons [00:18:53] I haven't used it much before so it's all new to me anyways [00:19:05] only got admin to help with spam that day, mostly [00:19:17] Oh [00:19:48] They also fixed diffusion so now it will correctly identify renames. [00:20:19] We can always customise the font size for the top logo part if it is too small for anyone [00:20:48] James_F: yes, SWAT is done [00:21:07] Krenair: You OK to deploy? 
[00:21:40] (03CR) 10Mobrovac: "PCC - https://puppet-compiler.wmflabs.org/4016/" [puppet] - 10https://gerrit.wikimedia.org/r/309210 (owner: 10Mobrovac) [00:22:14] (03CR) 10Yurik: [C: 031] Kartotherian: Do not use a proxy [puppet] - 10https://gerrit.wikimedia.org/r/309210 (owner: 10Mobrovac) [00:22:27] magical ops powers needed ^ [00:22:47] (03PS1) 1020after4: phabricator php.ini: set always_populate_raw_post_data = -1 [puppet] - 10https://gerrit.wikimedia.org/r/309214 [00:23:05] i will reboot and verify the service myself, but the puppet +2 is still required ^^^ [00:23:16] James_F, ok [00:23:34] James_F: Phabricator change list could be found at https://secure.phabricator.com/w/changelog/ [00:23:45] (03PS1) 10Alex Monk: Use ::hostname instead of ::instancename to fix compatibility with labs ENC [puppet] - 10https://gerrit.wikimedia.org/r/309215 (https://phabricator.wikimedia.org/T101447) [00:23:52] yuvipanda, ^ [00:24:38] krenair let's just get rid of that branch completely? [00:24:39] the \h is equivalent [00:24:39] to hostname [00:24:41] James_F, want to check it yourself, or are there bug reproduction steps? [00:25:03] Krenair: Repro steps are "try to load VE a two–three times 'til it happens". [00:25:12] Krenair: Happy to check on mw1099 myself though. [00:25:16] partially yuvipanda [00:25:20] ok [00:25:43] oh, you already +2'd [00:25:47] ok [00:26:01] Dereckson: Yes, I know, but it's useless. I don't know what start and end points to look at. They have "minor" items based on their judgement of what is big (whereas e.g. "Paste now uses a typeahead for highlight language selection." is quite a useful change). [00:26:24] There is another document more interesting [00:26:34] Dereckson: It's mostly focussed on "here are the breaking DB changes", which is understandable but aimed at Phab admins, not Phab users. [00:26:40] Ah? Cool. :-) [00:26:46] yuvipanda, could you +2 a puppet change pls? 
[00:26:58] https://gerrit.wikimedia.org/r/#/c/309210 [00:27:02] the answer is 'it depends' [00:27:12] yuvipanda, its simple, and i will do all the testing :) [00:27:15] FOr one thing we should notice improvements in daemon [00:27:38] ie not seeing those time out errors because we were importing refs/changes/ [00:27:49] it should also free alot of resources for us. [00:27:49] (03PS4) 10Yuvipanda: Kartotherian: Do not use a proxy [puppet] - 10https://gerrit.wikimedia.org/r/309210 (owner: 10Mobrovac) [00:27:54] (03CR) 10Yuvipanda: [C: 032 V: 032] Kartotherian: Do not use a proxy [puppet] - 10https://gerrit.wikimedia.org/r/309210 (owner: 10Mobrovac) [00:28:10] James_F: a "companion to the changelog" explaining big changes, but it's only published when there is such big change. They could be found at https://secure.phabricator.com/phame/blog/view/111/ [00:28:46] Or you can look on the repos history [00:29:09] Dereckson: Thanks. I'll add that to my RSS reader, I guess. [00:29:22] I usually try to write up the big changes on our own phame blog but this time there aren't any huge changes really [00:29:32] (03PS2) 10Alex Monk: Use ::hostname instead of ::instancename to fix compatibility with labs ENC [puppet] - 10https://gerrit.wikimedia.org/r/309215 (https://phabricator.wikimedia.org/T101447) [00:29:32] twentyafterfour im wondering could you have a look at the settings for https://phabricator.wikimedia.org/phame/ [00:29:34] please [00:29:37] quite a lot of little user experience stuff [00:29:46] since it should work now, but probaly it only allows admins [00:30:24] twentyafterfour: Understood. Thank you! :-) [00:30:37] paladox: what setting am I looking for? 
[00:30:46] yuvipanda, I'm really tempted to get rid of that last branch [00:30:53] and fill labs terminals with colourful PS1s [00:31:22] krenair let's do it [00:32:36] Im not sure [00:32:38] let me see [00:32:40] which one [00:32:46] paladox: there aren't many settings for phame [00:32:49] (03PS3) 10Alex Monk: Use ::hostname instead of ::instancename to fix compatibility with labs ENC [puppet] - 10https://gerrit.wikimedia.org/r/309215 (https://phabricator.wikimedia.org/T101447) [00:33:05] Yep, but it seems to work for phab-01 but not on phabricator.wm.org [00:33:25] yuvipanda, are you deploying anything, or is it ok to scap3 kartotherian? [00:33:52] yurik nope, go ahead [00:33:57] twentyafterfour i guess it's because of the edit poly [00:33:59] yurik check with greg-g tho [00:33:59] policy [00:34:04] or twentyafterfour :) [00:34:24] fire in hall... kartotherian [00:34:26] twentyafterfour i guess you should ask upstream to implement a second poly for comments [00:34:49] Unless your willing to open up the edit policy on blogs [00:35:13] yuvipanda, actually, it might not because $TERM is xterm-256color instead of the expected xterm-color [00:35:43] (03PS1) 10BryanDavis: bigbrother: Rewrite as python script [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) [00:36:42] (03PS4) 10Alex Monk: Use ::hostname instead of ::instancename to fix compatibility with labs ENC [puppet] - 10https://gerrit.wikimedia.org/r/309215 (https://phabricator.wikimedia.org/T101447) [00:36:48] James_F, okay, looks like jenkins completed [00:37:31] James_F, we've got these: [00:37:37] ba3e6f753f0db21001ac51a4e15dd7099bd9342c Update VE core submodule to wmf/1.28.0-wmf.18 HEAD (d1a128f) [00:37:50] 83c3e976ae499f5cec275d0a1db89b5669132024 Merge "Fix parent constructor call" into wmf/1.28.0-wmf.18 [00:37:59] f1f9d88b754eed5af17e9f44bde4d659401e6ded Make ext.visualEditor.mediawiki a dependency of .mwcore [00:38:11] I know the last one's right, the others 
seem okay. were you expecting them? [00:38:30] Merge "Fix parent constructor call" into wmf/1.28.0-wmf.18 was exxplicitely required during the SWAT [00:38:51] well it didn't seem to get deployed? [00:39:00] it shows on git log HEAD..origin/wmf/1.28.0-wmf.18 [00:39:08] ah [00:39:13] Oh dear. [00:40:09] Krenair: SWA/srv/mediawiki-staging/php-1.28.0-wmf.18/extensions/VisualEditor is at 83c3e976ae499f5cec275d0a1db89b5669132024 [00:40:34] krenair lgtm, shall I merge? [00:40:39] twentyafterfour strange [00:40:43] it works for phab-01 [00:40:49] yuvipanda, maybe test puppet-compiler against a prod host or two first to be sure? [00:40:49] but not for phabricator.wm.org [00:41:00] Krenair: on /srv/mediawiki-staging/php-1.28.0-wmf.18/extensions/VisualEditor I've this: [00:41:07] $ git log HEAD..origin/wmf/1.28.0-wmf.18 [00:41:12] f1f9d88b754eed5af17e9f44bde4d659401e6ded Make ext.visualEditor.mediawiki a dependency of .mwcore [00:41:27] I'm looking at /srv/mediawiki-staging/php-1.28.0-wmf.18 [00:41:59] krenair ok, doing [00:42:07] Okay, I understood, what happened: I rebased core a little too soon, when Gerrit was still ending merge [00:42:12] ah [00:42:16] --- a/extensions/VisualEditor [00:42:16] +++ b/extensions/VisualEditor [00:42:19] -Subproject commit 14aa1e361daf57207385c860bf44d29ebc43cc35 [00:42:19] +Subproject commit 83c3e976ae499f5cec275d0a1db89b5669132024 [00:42:23] * James_F nods. [00:42:28] !log kartotherian synced T145042 [00:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:42:56] the version in the VE repo includes the patch requested earlier [00:42:59] so it probably was deployed [00:43:07] but the submodule update was handled incorrectly [00:43:28] oh submodules, how I love them [00:43:35] Aren't they great? [00:43:44] TBF, they're better than subtrees. 
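A minimal sketch of the submodule mix-up discussed above, assuming all we have is the pasted diff text: it pulls the old and new "Subproject commit" hashes out of the diff and reports whether the commit recorded in core still differs from the one checked out in extensions/VisualEditor. The helper names (parse_submodule_bump, pointer_is_stale) are hypothetical and not part of scap or any deployment tooling.

```python
import re

# Matches the "-Subproject commit <sha>" / "+Subproject commit <sha>" lines
# that git prints when a submodule pointer differs from the recorded commit.
SUBPROJECT_RE = re.compile(r"^([+-])Subproject commit ([0-9a-f]{40})$")


def parse_submodule_bump(diff_text):
    """Return (recorded_sha, checked_out_sha) from a submodule pointer diff."""
    old = new = None
    for line in diff_text.splitlines():
        match = SUBPROJECT_RE.match(line.strip())
        if not match:
            continue  # skips the "--- a/..." and "+++ b/..." header lines
        if match.group(1) == "-":
            old = match.group(2)
        else:
            new = match.group(2)
    return old, new


def pointer_is_stale(recorded_sha, checked_out_sha):
    """True when the parent repo still records a different commit."""
    return recorded_sha != checked_out_sha


# Diff text as pasted in the channel.
diff = """\
--- a/extensions/VisualEditor
+++ b/extensions/VisualEditor
-Subproject commit 14aa1e361daf57207385c860bf44d29ebc43cc35
+Subproject commit 83c3e976ae499f5cec275d0a1db89b5669132024
"""

old_sha, new_sha = parse_submodule_bump(diff)
stale = pointer_is_stale(old_sha, new_sha)
```

Here stale is True, which matches the conclusion in the channel: the extension checkout already had the requested patch, but the pointer recorded in core had not been updated to match.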
[00:44:20] (03PS2) 10BryanDavis: bigbrother: Rewrite as python script [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) [00:44:30] James_F, syncing [00:44:56] To 1099 or prod? [00:45:22] urgh, 1099 [00:45:23] right [00:45:26] forgot we added that step [00:45:27] :-) [00:45:57] twentyafterfour you may want to compare the setting shere https://phab-01.wmflabs.org/phame/ with phabricator.wm.org [00:46:11] otherwise it still needs fixing if that dosent work [00:46:19] !log krenair@tin Synchronized php-1.28.0-wmf.18/extensions/VisualEditor/extension.json: https://gerrit.wikimedia.org/r/#/c/309213/ (duration: 00m 46s) [00:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:47:22] it still works [00:47:31] not convinced it's fully fixed the issue yet [00:47:54] Me neither. [00:48:01] here's one I've not seen before [00:48:02] TypeError: Expecting a function in instanceof check, but got undefined [00:48:10] Exception in module-execute in module mediawiki.Upload.BookletLayout: [00:48:14] completely useless stack trace [00:48:20] Eurgh. [00:48:31] Krinkle: Stop bloody breaking RL, please. :-) [00:48:42] twentyafterfour [00:48:43] ah [00:48:47] maybe https://phabricator.wikimedia.org/transactions/editengine/phame.blog/ [00:48:49] Krenair: Sync it. [00:48:50] It looks empty [00:48:55] I did [00:49:02] Oh, everywhere? OK. [00:49:03] whereas isent [00:49:04] https://phab-01.wmflabs.org/transactions/editengine/phame.blog/ [00:49:08] * James_F tests in prod not 1099. [00:49:18] Did you sync the submodule? [00:49:22] twentyafterfour could you see if creating ^^ that in phabricator fixes it [00:50:06] I sync'd the only file we changed [00:50:14] By the submodule was dirty? [00:50:17] But, even. [00:50:31] twentyafterfour and https://phabricator.wikimedia.org/transactions/editengine/phame.post/ [00:50:33] From the SWAT? Or was that just an artefact and all's well? 
[00:50:37] https://phab-01.wmflabs.org/transactions/editengine/phame.post/view/5/ [00:50:52] I already fixed up the submodule, but the mw servers don't care, it's traditional scap [00:51:01] Ah, OK. [00:51:25] with scap3 I believe you'd need to deal with that [00:51:40] James_F: Exception in module-execute, meaning, the module is broken, not RL. It's just reporting that there was an uncaught exception, which RL caught to avoid other modules from breaking. The TypeError is the actual problem and should have a trace. [00:52:15] Krinkle: We've had three or four weeks in a row of new RL async issues. A breather would be appreciated. [00:52:23] (03PS5) 10Alex Monk: base: Use ::hostname instead of ::instancename [puppet] - 10https://gerrit.wikimedia.org/r/309215 (https://phabricator.wikimedia.org/T101447) [00:52:32] (03PS6) 10Yuvipanda: base: Use ::hostname instead of ::instancename [puppet] - 10https://gerrit.wikimedia.org/r/309215 (https://phabricator.wikimedia.org/T101447) (owner: 10Alex Monk) [00:52:36] James_F: Which RL async issues? [00:52:38] (03CR) 10Yuvipanda: [C: 032 V: 032] base: Use ::hostname instead of ::instancename [puppet] - 10https://gerrit.wikimedia.org/r/309215 (https://phabricator.wikimedia.org/T101447) (owner: 10Alex Monk) [00:52:42] I've not seen any reports in phab [00:52:45] log, handler, fire, self.fireWith, self.fire, mw.track, runScript, checkCssHandles, (anonymous function), fire, self.fireWith, self.fire, fireCallbacks, addEmbeddedCSS, (anonymous function) [00:52:49] not a useful stack trace [00:53:07] oh, wait [00:53:08] Krinkle: Because they're not bugs in RL, they're existing bugs in code that changes in RL are uncovering and making lots of UBN bugs. [00:53:16] there's another expand button [00:53:18] Krinkle: So we don't tag them as RL, obviously. [00:53:33] Object.oo.inheritClass, eval, mw.loader.implement.css, ... [00:53:34] I don't see how any recent RL changes would've uncovered existing race conditions. 
[00:53:43] Well, something is. :-( [00:53:48] twentyafterfour [00:53:52] i found the problem [00:54:01] yayayay, please create the above [00:54:10] and set them public please [00:54:16] krenair done [00:54:25] They wont be able to create posts [00:54:32] but will be able to posts comments again [01:00:08] twentyafterfour? [01:01:35] (03PS1) 10Alex Monk: bashrc: Fix TERM checks for xterm-256color [puppet] - 10https://gerrit.wikimedia.org/r/309222 [01:02:10] yuvipanda, ^ [01:02:20] (03PS2) 10Yuvipanda: bashrc: Fix TERM checks for xterm-256color [puppet] - 10https://gerrit.wikimedia.org/r/309222 (owner: 10Alex Monk) [01:02:34] (03CR) 10Yuvipanda: [C: 032 V: 032] bashrc: Fix TERM checks for xterm-256color [puppet] - 10https://gerrit.wikimedia.org/r/309222 (owner: 10Alex Monk) [01:02:48] there we go :) [01:02:56] now am out [01:03:00] thanks for the patches, krenair [01:03:07] thanks for merging [01:06:09] James_F, of course this other issue doesn't seem to occur if you wait for it with debug=true [01:07:08] Krenair: Indeed. 
:-( [01:10:09] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0] [01:14:13] (03CR) 10Paladox: "recheck" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/308785 (owner: 10Paladox) [01:17:31] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0] [01:30:10] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service: some icinga checks on WDQS do not send notifications - https://phabricator.wikimedia.org/T144948#2618101 (10Smalyshev) p:05Triage>03High [01:34:06] (03CR) 10Paladox: [C: 031] phabricator php.ini: set always_populate_raw_post_data = -1 [puppet] - 10https://gerrit.wikimedia.org/r/309214 (owner: 1020after4) [01:47:01] (03PS1) 10Dereckson: Set $wgDefaultExternalStore for wikitech before Flow settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309225 (https://phabricator.wikimedia.org/T127792) [02:05:36] (03PS1) 10Dereckson: Use === for $wgDBname comparison [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309226 [02:12:38] (03PS2) 10Dereckson: Set $wgDefaultExternalStore for wikitech before Flow settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309225 (https://phabricator.wikimedia.org/T127792) [02:14:01] (03CR) 10Dereckson: "PS2: rebased against I76324124c to fix ==/=== incoherence." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309225 (https://phabricator.wikimedia.org/T127792) (owner: 10Dereckson) [02:14:30] (03CR) 10MZMcBride: "Nice." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/309226 (owner: 10Dereckson) [02:36:28] (03Abandoned) 10Dereckson: Serve jQuery locally [software/dbtree] - 10https://gerrit.wikimedia.org/r/298011 (https://phabricator.wikimedia.org/T139762) (owner: 10Dereckson) [02:38:23] 06Operations, 10DBA, 10Traffic, 06WMF-Legal, and 2 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#2618187 (10Dereckson) [02:39:51] 06Operations, 10DBA, 10Traffic, 06WMF-Legal, and 2 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#1218427 (10Dereckson) [02:40:30] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.17) (duration: 17m 56s) [02:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:16:52] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 18m 14s) [03:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:24:15] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Sep 8 03:24:15 UTC 2016 (duration 7m 23s) [03:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:25:01] (03CR) 10Ottomata: [C: 032] Point archiva at meitnerium [dns] - 10https://gerrit.wikimedia.org/r/308997 (https://phabricator.wikimedia.org/T123725) (owner: 10Ottomata) [03:33:33] !log merging dns change to point archiva.wikimedia.org at new archiva node meitnerium [03:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:18:20] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1809.391397 Seconds [04:20:43] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds [04:30:29] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not 
found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.059 second response time [05:15:55] (03CR) 1020after4: [C: 031] "I haven't tested this but it looks ok to me." [puppet] - 10https://gerrit.wikimedia.org/r/308885 (https://phabricator.wikimedia.org/T137354) (owner: 10Paladox) [05:52:20] (03PS1) 10Jcrespo: prometheus: Remove db1075 from the s3 slaves; it was duplicated [puppet] - 10https://gerrit.wikimedia.org/r/309241 (https://phabricator.wikimedia.org/T126757) [06:15:01] (03CR) 10Dereckson: [C: 031] "Configuration looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [06:17:56] 06Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 06Release-Engineering-Team, and 2 others: Add section for long-running tasks on the Deployment page (specially for database maintenance) - https://phabricator.wikimedia.org/T144661#2618330 (10jcrespo) > Also, re the inclusion criteria part: if we list t... [06:34:13] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tshark] [06:44:43] !log reimaging mw2208->mw2211 to jessie [06:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:50:49] (03CR) 10Marostegui: [C: 031] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/309241 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [06:51:51] !log reimaging mw2212-mw2214 to jessie [06:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:58:52] (03CR) 10Muehlenhoff: [C: 04-1] "Please rather use a ferm service, for generic fules like this it's the better abstraction. 
See https://gerrit.wikimedia.org/r/#/c/309041/ " [puppet] - 10https://gerrit.wikimedia.org/r/309153 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [06:59:34] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:04:48] (03CR) 10Giuseppe Lavagetto: "I don't particularly like these classes encapsulating a single define, but it's ok if it keeps our syntax consistent." [puppet] - 10https://gerrit.wikimedia.org/r/309035 (owner: 10Elukey) [07:06:21] 06Operations: graphite-web cronspam - https://phabricator.wikimedia.org/T144797#2618348 (10elukey) Checked also https://github.com/graphite-project/graphite-web/blob/0.9.13-pre1/webapp/graphite/logger.py These are the rotation rules: ``` #Setup formatter & handlers self.formatter = logging.Formatter("%... [07:13:35] graphite-web has an hardcoded and not configurable logrotation sigh [07:13:51] (configurable only from 0.10 onwards) [07:18:12] (03PS2) 10Elukey: Add the apache::mod::substitute class [puppet] - 10https://gerrit.wikimedia.org/r/309035 [07:20:29] (03CR) 10Elukey: [C: 032] Add the apache::mod::substitute class [puppet] - 10https://gerrit.wikimedia.org/r/309035 (owner: 10Elukey) [07:23:52] (03PS1) 10Muehlenhoff: Update SSH key for Zhou Zhou [puppet] - 10https://gerrit.wikimedia.org/r/309249 (https://phabricator.wikimedia.org/T144624) [07:24:39] (03PS2) 10Jcrespo: prometheus: Remove db1075 from the s3 slaves; it was duplicated [puppet] - 10https://gerrit.wikimedia.org/r/309241 (https://phabricator.wikimedia.org/T126757) [07:27:06] (03CR) 10Muehlenhoff: [C: 032] Update SSH key for Zhou Zhou [puppet] - 10https://gerrit.wikimedia.org/r/309249 (https://phabricator.wikimedia.org/T144624) (owner: 10Muehlenhoff) [07:27:12] (03PS2) 10Muehlenhoff: Update SSH key for Zhou Zhou [puppet] - 10https://gerrit.wikimedia.org/r/309249 (https://phabricator.wikimedia.org/T144624) [07:33:21] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "See a few inline 
comments for small/mildly larger things I'd fix." (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/308520 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [07:33:28] RECOVERY - mediawiki-installation DSH group on mw2204 is OK: OK [07:34:11] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 for ZZhou (WMF) - https://phabricator.wikimedia.org/T144624#2618369 (10MoritzMuehlenhoff) 05Open>03Resolved Your key has been updated, you should now be able to log into stat1002 again. [07:35:59] PROBLEM - MD RAID on mw2211 is CRITICAL: Connection refused by host [07:36:48] PROBLEM - Apache HTTP on mw2211 is CRITICAL: Connection refused [07:37:16] (03PS2) 10Elukey: Add a Substitute Apache directive to fix broken links in the Yarn UI [puppet] - 10https://gerrit.wikimedia.org/r/309016 (https://phabricator.wikimedia.org/T116192) [07:37:18] PROBLEM - configured eth on mw2211 is CRITICAL: Connection refused by host [07:37:37] PROBLEM - dhclient process on mw2211 is CRITICAL: Connection refused by host [07:37:48] PROBLEM - mediawiki-installation DSH group on mw2211 is CRITICAL: Host mw2211 is not in mediawiki-installation dsh group [07:38:07] PROBLEM - nutcracker port on mw2211 is CRITICAL: Connection refused by host [07:38:29] PROBLEM - nutcracker process on mw2211 is CRITICAL: Connection refused by host [07:38:42] this is my host [07:38:47] PROBLEM - puppet last run on mw2211 is CRITICAL: Connection refused by host [07:38:53] already reimaged, silencing it [07:39:07] PROBLEM - salt-minion processes on mw2211 is CRITICAL: Connection refused by host [07:41:00] (03CR) 10Elukey: [C: 032] Add a Substitute Apache directive to fix broken links in the Yarn UI [puppet] - 10https://gerrit.wikimedia.org/r/309016 (https://phabricator.wikimedia.org/T116192) (owner: 10Elukey) [07:45:38] OC [07:45:49] sorry, wrong window [07:50:16] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor/exiftool deadlock, likely full pipe 
- https://phabricator.wikimedia.org/T144928#2618384 (10Gilles) Did that fix the problem? That feature isn't worth it if it leads to those problems. Now that only one temp file is used per thumbor i... [07:50:33] (03CR) 10Jcrespo: [C: 032] prometheus: Remove db1075 from the s3 slaves; it was duplicated [puppet] - 10https://gerrit.wikimedia.org/r/309241 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [07:50:39] (03PS3) 10Jcrespo: prometheus: Remove db1075 from the s3 slaves; it was duplicated [puppet] - 10https://gerrit.wikimedia.org/r/309241 (https://phabricator.wikimedia.org/T126757) [07:54:03] (03CR) 10Volans: "@jynus, @filippo: what will happen when switching master?" [puppet] - 10https://gerrit.wikimedia.org/r/309241 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [07:54:39] volans, the current setup [07:54:46] is only to test prometheus/graphs [07:55:03] I am now working on (obviouly) making it automatic [07:55:11] I accept help! :-) [07:55:19] but it is half done [07:55:37] we are just waiting on the puppet work to manage resources [07:56:17] jynus: I know it's a test and I'm ok not being automated right now, as long as we "tested" that can be done :) same thing for the old metrics [07:56:34] to know what will be the behaviour and if it's ok for us [07:56:43] then I do not understand [07:57:11] the original question? [07:57:34] read the comments in the CR, it's multiline :) [07:57:42] there were multiple questions :D [07:58:06] "could not be done using the $mysql_role?" 
- that is the whole idea [07:58:36] that is why I setup group, shard and role on puppet [07:59:01] just Filippo didn't want to roll it in yet at the time [07:59:07] and I agreed [07:59:09] that's ok [07:59:21] then metrics will not be gone [07:59:30] they will just have "tags" [07:59:46] 06Operations, 06Performance-Team, 10Thumbor: thumbor handling of originals 404 - https://phabricator.wikimedia.org/T144956#2618388 (10Gilles) Those failures from the log excerpts are on the thumbs, not the originals. I don't know why those ClientExceptions aren't caught, they should be. It's expected behavi... [07:59:58] that is up to the user on grafana to interpret those [08:00:01] so [08:00:07] but the tags are "per metric" or "per metric with time", like if I add a tag tomorrow to a metric, the whole metric will have the new tag also for old data? [08:00:15] no [08:00:20] and we do not want that either [08:00:29] so, let me give you an example [08:00:39] db1075 is current s3 master [08:01:33] so a metric could be mysql_lag{dc=eqiad,shard=s3,role=master,instance=db1075} = 0 [08:01:59] if we promote db1077 to be the new master [08:01:59] 06Operations, 06Performance-Team, 10Thumbor: thumbor error spawning ghostscript 'libcgroup initialization failed: Cgroup is not mounted' - https://phabricator.wikimedia.org/T144938#2618391 (10Gilles) Currently thumbor production puppet has this configuration: ``` SUBPROCESS_CGEXEC_PATH = '/usr/bin/cgexec'... [08:02:08] the old metrics will stay like that [08:02:12] and new ones will be [08:02:25] mysql_lag{dc=eqiad,shard=s3,role=slave,instance=db1075} = 0 [08:02:35] mysql_lag{dc=eqiad,shard=s3,role=master,instance=db1077} = 0 [08:02:43] it is up to you to interpret that [08:02:57] "show me the metrics of allways db1075" [08:03:09] vs. 
"show me the metrics of the master" [08:03:31] 06Operations, 06Performance-Team, 10Thumbor: thumbor handling of originals 404 - https://phabricator.wikimedia.org/T144956#2618393 (10Gilles) Yep: http://docs.openstack.org/developer/python-swiftclient/3.0.0/swiftclient.html#module-swiftclient.exceptions Fucking library breaking changes... [08:03:44] jynus: thanks for the feedback on the PageAssessments roll-out. It's much appreciated. FWIW, deploying PageAssessments to en.wiki is one of our quarterly goals, i.e. we're hoping to accomplish it before the end of September, so I'll be poking the performance team to help us get these spikes smoothed out. [08:03:52] ok [08:04:00] we have a month for that, don't we? [08:04:09] jynus: I agree, I just want to check how prometheus stores the tags, thanks [08:04:31] kaldari, reliability > goals, don't you agree [08:04:44] jyrus: certainly :) [08:04:51] also, in the future, I would suggest not commit to a specific wiki [08:04:56] in case there are issues [08:05:15] kaldari, I think this should only take a few days to solve [08:05:20] jynus: I don't mind missing goals as long as we have a good reason. 
[08:05:25] and I will help with that [08:05:31] thanks [08:05:44] PROBLEM - configured eth on mw2213 is CRITICAL: Connection refused by host [08:05:46] PROBLEM - Apache HTTP on mw2213 is CRITICAL: Connection refused [08:05:50] PROBLEM - configured eth on mw2214 is CRITICAL: Timeout while attempting connection [08:05:50] PROBLEM - Apache HTTP on mw2214 is CRITICAL: Connection timed out [08:05:53] kaldari, there is a chance [08:06:04] that the issues have nothing to do with the extension [08:06:10] but we should check that [08:06:11] PROBLEM - dhclient process on mw2213 is CRITICAL: Connection refused by host [08:06:20] PROBLEM - dhclient process on mw2214 is CRITICAL: Timeout while attempting connection [08:06:29] 06Operations, 06Performance-Team, 10Thumbor: thumbor handling of originals 404 - https://phabricator.wikimedia.org/T144956#2618394 (10Gilles) Or maybe it was never the right path to begin with? Anyway, I'll fix it. [08:06:37] I just got worried when you scheduled a next deployment without proper research of issues [08:06:54] let's research, and then take a decision based on data and facts [08:07:09] and if you do not have access, ask and I will do it for you [08:07:20] (which is what I was going to do anyway :-)) [08:12:06] good morning [08:12:33] 'morning [08:13:04] (03Abandoned) 10Phuedx: Disable Wikidata descriptions for 6 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308990 (https://phabricator.wikimedia.org/T143345) (owner: 10Phuedx) [08:19:06] 06Operations, 06Performance-Team, 10Thumbor: thumbor handling of originals 404 - https://phabricator.wikimedia.org/T144956#2618409 (10Gilles) Actually, due to an import in client.py, both paths should be fine. Verified on the version installed on the production servers: ``` gilles@thumbor1001:~$ cat /usr/... 
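The label example from the Prometheus discussion earlier in this log (mysql_lag tagged with dc, shard, role and instance) can be sketched as a toy model. This is an assumption-laden illustration, not Prometheus's storage format or query API: samples are plain dicts, the timestamps are made up, and select is a hypothetical helper. It only shows the point made in the channel, that old samples keep their old labels after a master switch, and that the reader chooses whether to follow a host or a role.

```python
def select(samples, **matchers):
    """Return the samples whose labels equal every given matcher (toy model)."""
    return [
        s for s in samples
        if all(s["labels"].get(key) == value for key, value in matchers.items())
    ]


# t=1: db1075 is the s3 master. t=2: db1077 has been promoted (hypothetical
# timestamps). The t=1 sample is not rewritten; it keeps role=master.
samples = [
    {"t": 1, "value": 0,
     "labels": {"dc": "eqiad", "shard": "s3", "role": "master", "instance": "db1075"}},
    {"t": 2, "value": 0,
     "labels": {"dc": "eqiad", "shard": "s3", "role": "slave", "instance": "db1075"}},
    {"t": 2, "value": 0,
     "labels": {"dc": "eqiad", "shard": "s3", "role": "master", "instance": "db1077"}},
]

# "show me the metrics of db1075, always": follows the host across roles.
by_instance = select(samples, instance="db1075")

# "show me the metrics of the master": follows the role across hosts.
by_role = select(samples, shard="s3", role="master")
master_hosts = [s["labels"]["instance"] for s in by_role]
```

Selecting by instance returns both db1075 samples (first as master, then as slave), while selecting by role returns db1075 at t=1 and db1077 at t=2.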
[08:23:12] !log Drop tables: ImageMetricsLoadingTime_10078363 and ImageMetricsCorsSupport_11686678 - T141407 [08:23:13] T141407: Drop EventLogging tables for ImageMetricsLoadingTime and ImageMetricsCorsSupport - https://phabricator.wikimedia.org/T141407 [08:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:24:26] volans, let me copy my own chat to gerrit as an answer [08:25:54] ok [08:28:43] 06Operations, 06Performance-Team, 10Thumbor: thumbor handling of originals 404 - https://phabricator.wikimedia.org/T144956#2618433 (10Gilles) I finally got it. I was being misled by the different format of the output you've pasted. Those swift error log entries happen locally as well. The exception are caugh... [08:28:43] (03PS2) 10Hashar: contint: allow ssh from contint1001 to labs instance [puppet] - 10https://gerrit.wikimedia.org/r/309153 (https://phabricator.wikimedia.org/T137323) [08:28:45] (03PS1) 10Hashar: contint: port labs fw rule to ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/309254 [08:29:41] (03CR) 10Jcrespo: " volans, the current setup" [puppet] - 10https://gerrit.wikimedia.org/r/309241 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [08:29:51] (03CR) 10Hashar: "Thank you Moritz for the suggestion, ferm::service looks much better. I have ported the currently deployed rule for gallium with the pare" [puppet] - 10https://gerrit.wikimedia.org/r/309153 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [08:29:55] (03PS1) 10Elukey: Fix and simplify mod_substitute's configuration for yarn.w.o [puppet] - 10https://gerrit.wikimedia.org/r/309255 (https://phabricator.wikimedia.org/T116192) [08:30:13] volans, indeed we need some training regarding the prometheus model [08:30:23] e.g. 
on our next large meeting [08:30:47] but I would prefer filippo to do that, as he was the real MVP for prometheus [08:30:55] yeah, I'll ping him [08:31:16] but I certainly think prometheus model is superior to graphite with the little information I have [08:31:47] (03CR) 10Elukey: [C: 032] Fix and simplify mod_substitute's configuration for yarn.w.o [puppet] - 10https://gerrit.wikimedia.org/r/309255 (https://phabricator.wikimedia.org/T116192) (owner: 10Elukey) [08:32:57] yeah I'll prepare sth for the offsite for sure, if that fails an ops session [08:33:03] (03PS2) 10Giuseppe Lavagetto: [WiP] scap: introduce scap_source type [puppet] - 10https://gerrit.wikimedia.org/r/308973 [08:33:47] <_joe_> let's see what rubocop thinks of this [08:34:10] (03CR) 10jenkins-bot: [V: 04-1] [WiP] scap: introduce scap_source type [puppet] - 10https://gerrit.wikimedia.org/r/308973 (owner: 10Giuseppe Lavagetto) [08:34:17] <_joe_> grrr [08:34:32] PROBLEM - Apache HTTP on mw2212 is CRITICAL: Connection refused [08:34:53] <_joe_> https://integration.wikimedia.org/ci/job/rake-jessie/56624/console [08:35:09] <_joe_> scap_source.rb:123:5: W: Unnecessary disabling of GuardClauses. 
[08:35:24] <_joe_> scap_source.rb:124:5: C: Use a guard clause instead [08:35:27] <_joe_> GRRRRR [08:35:30] PROBLEM - nutcracker process on mw2212 is CRITICAL: Connection refused by host [08:35:52] (03PS1) 10Aaron Schulz: Update constants file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309256 [08:36:27] (03CR) 10jenkins-bot: [V: 04-1] Update constants file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309256 (owner: 10Aaron Schulz) [08:37:06] 06Operations, 06Performance-Team, 10Thumbor: thumbor handling of originals 404 - https://phabricator.wikimedia.org/T144956#2618456 (10Gilles) [08:37:09] 06Operations, 06Performance-Team, 10Thumbor: thumbor handling of originals 404 - https://phabricator.wikimedia.org/T144956#2615390 (10Gilles) [08:38:07] (03PS3) 10Giuseppe Lavagetto: scap: introduce scap_source type [puppet] - 10https://gerrit.wikimedia.org/r/308973 [08:39:58] (03PS2) 10Muehlenhoff: zuul::merger: Convert to ferm service and restrict to labs + gallium [puppet] - 10https://gerrit.wikimedia.org/r/309041 [08:40:31] (03PS2) 10Aaron Schulz: Update constants file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309256 [08:40:49] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor/exiftool deadlock, likely full pipe - https://phabricator.wikimedia.org/T144928#2618461 (10fgiunchedi) p:05High>03Low yeah I haven't seen a reoccurence after disabling it. 
lowering priority since it isn't used now [08:40:49] jynus: perhaps you could quickly sketch something with pencil and paper to better illustrate how you imagine the ideal UI [08:41:01] 06Operations, 06Performance-Team, 10Thumbor: Unsupported header value None - https://phabricator.wikimedia.org/T145051#2618464 (10Gilles) [08:41:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] zuul::merger: Convert to ferm service and restrict to labs + gallium [puppet] - 10https://gerrit.wikimedia.org/r/309041 (owner: 10Muehlenhoff) [08:41:15] jynus: that would help people to make a better interface proposal [08:41:21] Dereckson, I do not have any constraint, I am actually not a user of dbtree [08:41:32] but I know a lot of devs use it [08:41:44] the main idea is that it is not a tree anymore [08:41:53] there are 2 datacenters [08:42:06] and master from eqiad replicates to master from codfw [08:42:10] and vice versa [08:42:21] so it is topologically not a tree anymore [08:42:31] so there is a cycle we should represent, and a graph is better suited for that, understood [08:42:40] yes, the rest can stay as it is [08:42:54] the current library cannot even show that [08:42:57] (03CR) 10Volans: "The goal of this script is (was?) to quickly help the ongoing effort in T143536 to reimage a large number of hosts.
If we want that we nee" (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/308520 (https://phabricator.wikimedia.org/T143536) (owner: 10Volans) [08:44:03] Dereckson, this is a good example http://erlycoder.com/43/mysql-master-slave-and-master-master-replication-step-by-step-configuration-instructions- [08:44:15] * Dereckson looks [08:44:16] 1 was our old setup [08:44:24] #2 is our current setup [08:44:42] ok [08:44:47] (03PS3) 10Aaron Schulz: Update constants file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309256 [08:45:02] 06Operations, 06Performance-Team, 10Thumbor: thumbor handling of originals 404 - https://phabricator.wikimedia.org/T144956#2618498 (10Gilles) Actually now that I look at the production logs again, I'm still unsure about whether swiftclient is just outputting noise or if the exception is truly uncaught. I gue... [08:46:44] (03PS1) 10Hashar: zuul::merger: allow contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/309261 (https://phabricator.wikimedia.org/T137323) [08:49:30] (03CR) 10Hashar: [C: 031] "This is a noop for production." 
[puppet] - 10https://gerrit.wikimedia.org/r/309153 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [08:51:29] (03PS1) 10Gilles: Don't retry swift requests for thumbnails [puppet] - 10https://gerrit.wikimedia.org/r/309264 [08:51:44] (03CR) 10Hashar: [C: 031] "This is a noop for production" [puppet] - 10https://gerrit.wikimedia.org/r/309254 (owner: 10Hashar) [08:52:55] !log initial data import on wdqs codfw cluster - T144380 [08:52:57] T144380: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380 [08:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:56:40] (03CR) 10Hashar: [C: 031] "Ran it in the puppet compiler for the sole production host having zuul::merger (scandium.eqiad.wmnet) https://puppet-compiler.wmflabs.org/" [puppet] - 10https://gerrit.wikimedia.org/r/309261 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [09:01:25] RECOVERY - Apache HTTP on mw2214 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.074 second response time [09:06:13] PROBLEM - Apache HTTP on mw2202 is CRITICAL: Connection timed out [09:06:33] PROBLEM - configured eth on mw2202 is CRITICAL: Timeout while attempting connection [09:07:03] PROBLEM - dhclient process on mw2202 is CRITICAL: Timeout while attempting connection [09:07:04] (03CR) 10Filippo Giunchedi: [C: 032] Don't retry swift requests for thumbnails [puppet] - 10https://gerrit.wikimedia.org/r/309264 (owner: 10Gilles) [09:07:06] PROBLEM - mediawiki-installation DSH group on mw2202 is CRITICAL: Host mw2202 is not in mediawiki-installation dsh group [09:07:06] (03CR) 10DCausse: [C: 031] "eqiad seems to be happy with this setting" [puppet] - 10https://gerrit.wikimedia.org/r/308561 (https://phabricator.wikimedia.org/T143571) (owner: 10Gehel) [09:07:34] PROBLEM - nutcracker port on mw2202 is CRITICAL: Timeout while attempting connection [09:07:53] PROBLEM - nutcracker process on mw2202 is CRITICAL: Timeout while attempting
connection [09:07:54] RECOVERY - Apache HTTP on mw2211 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.075 second response time [09:07:57] (03PS2) 10Gehel: elasticsearch - enable row aware shard allocation [puppet] - 10https://gerrit.wikimedia.org/r/308561 (https://phabricator.wikimedia.org/T143571) [09:08:07] PROBLEM - puppet last run on mw2202 is CRITICAL: Timeout while attempting connection [09:08:25] PROBLEM - salt-minion processes on mw2202 is CRITICAL: Timeout while attempting connection [09:08:32] (03PS1) 10Gilles: Upgrade to 0.1.14 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/309266 [09:11:17] 06Operations, 10Traffic, 10Continuous-Integration-Infrastructure (phase-out-gallium): Move gallium to an internal host? - https://phabricator.wikimedia.org/T133150#2618555 (10hashar) 05stalled>03declined From T140257#2595926 and follow up response from ops, we are keeping the status quo of using a public... [09:12:07] (03CR) 10Gehel: [C: 032] elasticsearch - enable row aware shard allocation [puppet] - 10https://gerrit.wikimedia.org/r/308561 (https://phabricator.wikimedia.org/T143571) (owner: 10Gehel) [09:16:17] RECOVERY - MD RAID on mw2211 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [09:17:14] RECOVERY - configured eth on mw2211 is OK: OK - interfaces up [09:17:24] RECOVERY - dhclient process on mw2211 is OK: PROCS OK: 0 processes with command name dhclient [09:17:45] RECOVERY - salt-minion processes on mw2211 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:17:48] (03PS1) 10Filippo Giunchedi: thumbor: enable memory cgroup [puppet] - 10https://gerrit.wikimedia.org/r/309267 (https://phabricator.wikimedia.org/T144938) [09:18:14] RECOVERY - nutcracker port on mw2211 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [09:18:44] RECOVERY - nutcracker process on mw2211 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [09:22:22] (03CR) 10Filippo Giunchedi: 
[C: 032] Upgrade to 0.1.14 (031 comment) [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/309266 (owner: 10Gilles) [09:31:40] 06Operations, 10hardware-requests: eqiad: (4) worker servers for kubernetes - https://phabricator.wikimedia.org/T141624#2618604 (10mark) @RobH: I think it's a much better idea to lease these new systems (with TPMs) separately. Kubernetes nodes are an excellent candidate for leasing, unlike most one-off request... [09:34:34] RECOVERY - Apache HTTP on mw2213 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.074 second response time [09:37:10] (03CR) 10Filippo Giunchedi: "@Volans: the way we did it now (i.e. tags on the prometheus configuration) the old metric will disappear and a new metric will appear. The" [puppet] - 10https://gerrit.wikimedia.org/r/309241 (https://phabricator.wikimedia.org/T126757) (owner: 10Jcrespo) [09:39:47] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T145057#2618640 (10hashar) [09:39:57] (03CR) 10Gilles: [C: 031] thumbor: enable memory cgroup [puppet] - 10https://gerrit.wikimedia.org/r/309267 (https://phabricator.wikimedia.org/T144938) (owner: 10Filippo Giunchedi) [09:40:29] 10Blocked-on-Operations, 06Operations, 10Continuous-Integration-Infrastructure, 07Zuul: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T140894#2479965 (10hashar) I have bumped the version to `zuul_2.5.0-8-gcbc7f62-wmf2jessie1` and the new package `.changes` f... 
[09:40:49] (03CR) 10Muehlenhoff: [C: 032] openldap: enable the memberof overlay [puppet] - 10https://gerrit.wikimedia.org/r/295357 (https://phabricator.wikimedia.org/T142817) (owner: 10Faidon Liambotis) [09:40:54] (03PS4) 10Muehlenhoff: openldap: enable the memberof overlay [puppet] - 10https://gerrit.wikimedia.org/r/295357 (https://phabricator.wikimedia.org/T142817) (owner: 10Faidon Liambotis) [09:42:14] 06Operations, 06Performance-Team, 10Thumbor, 13Patch-For-Review: thumbor/exiftool deadlock, likely full pipe - https://phabricator.wikimedia.org/T144928#2618659 (10Gilles) I don't think it's fixable, I'll likely get rid of the related code. We really have no control over how slow or fast exiftool consumes... [09:42:21] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: enable memory cgroup [puppet] - 10https://gerrit.wikimedia.org/r/309267 (https://phabricator.wikimedia.org/T144938) (owner: 10Filippo Giunchedi) [09:42:25] (03PS2) 10Filippo Giunchedi: thumbor: enable memory cgroup [puppet] - 10https://gerrit.wikimedia.org/r/309267 (https://phabricator.wikimedia.org/T144938) [09:42:29] (03CR) 10Filippo Giunchedi: [V: 032] thumbor: enable memory cgroup [puppet] - 10https://gerrit.wikimedia.org/r/309267 (https://phabricator.wikimedia.org/T144938) (owner: 10Filippo Giunchedi) [09:43:25] RECOVERY - configured eth on mw2214 is OK: OK - interfaces up [09:44:39] RECOVERY - dhclient process on mw2214 is OK: PROCS OK: 0 processes with command name dhclient [09:44:46] (03PS2) 10محمد شعیب: Enable Education Program extension at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309062 (https://phabricator.wikimedia.org/T144927) [09:44:55] (03CR) 10jenkins-bot: [V: 04-1] Enable Education Program extension at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309062 (https://phabricator.wikimedia.org/T144927) (owner: 10محمد شعیب) [09:46:10] (03PS1) 10Elukey: Fix the order of the ProxyPass directives for yarn.w.o [puppet] - 10https://gerrit.wikimedia.org/r/309269 
(https://phabricator.wikimedia.org/T116192) [09:46:18] !log roll-reboot thumbor machines to apply memory cgroup enablement T144938 [09:46:19] T144938: thumbor error spawning ghostscript 'libcgroup initialization failed: Cgroup is not mounted' - https://phabricator.wikimedia.org/T144938 [09:46:21] gilles: ^ [09:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:46:25] RECOVERY - configured eth on mw2213 is OK: OK - interfaces up [09:46:44] RECOVERY - dhclient process on mw2213 is OK: PROCS OK: 0 processes with command name dhclient [09:47:13] (03CR) 10Elukey: [C: 032] Fix the order of the ProxyPass directives for yarn.w.o [puppet] - 10https://gerrit.wikimedia.org/r/309269 (https://phabricator.wikimedia.org/T116192) (owner: 10Elukey) [09:51:13] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Move data storage to /srv/wdqs/ on codfw WDQS nodes - https://phabricator.wikimedia.org/T144536#2618669 (10Gehel) All known work is done and merged. Data import is in progress on codfw wdqs cluster which will validate that all t... 
[09:51:35] (03PS3) 10محمد شعیب: Enable Education Program extension at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309062 (https://phabricator.wikimedia.org/T144927) [09:52:05] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.033 second response time [09:52:13] (03PS5) 10Muehlenhoff: openldap: enable the memberof overlay [puppet] - 10https://gerrit.wikimedia.org/r/295357 (https://phabricator.wikimedia.org/T142817) (owner: 10Faidon Liambotis) [09:55:50] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Upgrade elasticsearch to 2.4.0 - https://phabricator.wikimedia.org/T145058#2618677 (10Gehel) [09:56:07] (03CR) 10Mobrovac: [C: 031] Change-Prop: Concurrency bump for transclusions [puppet] - 10https://gerrit.wikimedia.org/r/309077 (owner: 10Ppchelko) [09:58:56] (03PS1) 10Giuseppe Lavagetto: admin: stop passing around a huge data structure [puppet] - 10https://gerrit.wikimedia.org/r/309270 [09:59:11] <_joe_> akosiaris, godog ^^ [09:59:21] <_joe_> this should reduce catalog size dramatically IMHO [09:59:29] <_joe_> let's test it [10:00:44] (03PS2) 10Giuseppe Lavagetto: Change-Prop: Concurrency bump for transclusions [puppet] - 10https://gerrit.wikimedia.org/r/309077 (owner: 10Ppchelko) [10:02:33] _joe_: nice! I've added myself to the review but likely will have time to look later [10:03:08] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Make elasticsearch actually uses shard allocation awareness - https://phabricator.wikimedia.org/T143571#2618711 (10Gehel) Row awareness allocation is active on both eqiad and codfw cluster. File based confi... 
[10:05:39] <_joe_> godog: the catalog size goes, for a random appserver, down from 20 Mb to 1.5 Mb [10:06:05] dammit we should have made that a goal [10:06:14] (03CR) 10Giuseppe Lavagetto: [C: 032] Change-Prop: Concurrency bump for transclusions [puppet] - 10https://gerrit.wikimedia.org/r/309077 (owner: 10Ppchelko) [10:06:32] <_joe_> godog: 27 mb -> 1.5, actually [10:06:36] <_joe_> LOL [10:08:12] (03PS4) 10محمد شعیب: Enable Education Program extension at urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309062 (https://phabricator.wikimedia.org/T144927) [10:08:23] key achievement in TechOps for Q1: reduced catalogue size by 95% [10:08:58] it can go on that achievements slide ;) [10:09:05] (03CR) 10Giuseppe Lavagetto: "Sample compilation on one appserver: the catalog size goes down from 28 mb to 1.5 mb. see https://puppet-compiler.wmflabs.org/4020/" [puppet] - 10https://gerrit.wikimedia.org/r/309270 (owner: 10Giuseppe Lavagetto) [10:09:45] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 [10:09:54] https://puppet-compiler.wmflabs.org/4020/mw2231.codfw.wmnet/ -> the catalog difference is awesome [10:10:14] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [10:12:13] (03PS1) 10Hashar: ci::master: drop legacy definitions [puppet] - 10https://gerrit.wikimedia.org/r/309274 [10:12:13] (03PS1) 10Hashar: ci::master: drop mwext-sync leftover [puppet] - 10https://gerrit.wikimedia.org/r/309275 (https://phabricator.wikimedia.org/T51846) [10:13:39] (03CR) 10Paladox: [C: 031] ci::master: drop legacy definitions [puppet] - 10https://gerrit.wikimedia.org/r/309274 (owner: 10Hashar) [10:14:07] !log change-prop deploying a991e25 [10:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:18:01] RECOVERY - Apache HTTP on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 
bytes in 0.074 second response time [10:25:03] (03CR) 10Faidon Liambotis: [C: 031] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/309270 (owner: 10Giuseppe Lavagetto) [10:26:39] (03CR) 10Alexandros Kosiaris: [C: 031] "fine on my side. Probably some more PCCs are in order but looks good" [puppet] - 10https://gerrit.wikimedia.org/r/309270 (owner: 10Giuseppe Lavagetto) [10:34:48] <_joe_> the real issue is that pcc is not really usable for this [10:34:53] <_joe_> given the catalog size [10:36:46] RECOVERY - nutcracker process on mw2212 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [10:41:24] RECOVERY - Apache HTTP on mw2202 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time [10:42:06] RECOVERY - dhclient process on mw2202 is OK: PROCS OK: 0 processes with command name dhclient [10:42:35] RECOVERY - nutcracker port on mw2202 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [10:42:55] RECOVERY - nutcracker process on mw2202 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [10:42:58] (03CR) 10Muehlenhoff: "You can remove the outer brackets, your puppet manifests are not yet written in Lisp :-)" [puppet] - 10https://gerrit.wikimedia.org/r/309254 (owner: 10Hashar) [10:43:37] RECOVERY - salt-minion processes on mw2202 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:44:16] RECOVERY - configured eth on mw2202 is OK: OK - interfaces up [10:44:42] (03CR) 10Muehlenhoff: [C: 031] "Looks good, let me know when I should merge this." [puppet] - 10https://gerrit.wikimedia.org/r/309261 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [10:46:11] (03CR) 10Muehlenhoff: "Likewise wrt outer brackets." 
[puppet] - 10https://gerrit.wikimedia.org/r/309153 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [10:46:34] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [10:47:08] (03PS1) 10Filippo Giunchedi: thumbor: use native systemd memory limiting [puppet] - 10https://gerrit.wikimedia.org/r/309279 (https://phabricator.wikimedia.org/T144938) [10:47:29] (03PS7) 10Volans: Automation: automatically reimage host [puppet] - 10https://gerrit.wikimedia.org/r/308520 (https://phabricator.wikimedia.org/T143536) [10:48:00] _joe_: ^^^ [10:51:56] (03PS2) 10Hashar: contint: port labs fw rule to ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/309254 [10:51:58] (03PS3) 10Hashar: contint: allow ssh from contint1001 to labs instance [puppet] - 10https://gerrit.wikimedia.org/r/309153 (https://phabricator.wikimedia.org/T137323) [10:52:30] (03CR) 10Hashar: "I have removed the extra parenthesis. Did same in the child change :)" [puppet] - 10https://gerrit.wikimedia.org/r/309254 (owner: 10Hashar) [10:53:53] (03CR) 10Paladox: [C: 031] contint: allow ssh from contint1001 to labs instance [puppet] - 10https://gerrit.wikimedia.org/r/309153 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [10:59:53] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: use native systemd memory limiting [puppet] - 10https://gerrit.wikimedia.org/r/309279 (https://phabricator.wikimedia.org/T144938) (owner: 10Filippo Giunchedi) [10:59:57] (03PS2) 10Filippo Giunchedi: thumbor: use native systemd memory limiting [puppet] - 10https://gerrit.wikimedia.org/r/309279 (https://phabricator.wikimedia.org/T144938) [11:04:33] (03PS1) 10Ema: Revert "cache_upload: route around codfw in cache::route_table" [puppet] - 10https://gerrit.wikimedia.org/r/309280 [11:06:18] (03PS3) 10Muehlenhoff: contint: port labs fw rule to ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/309254 (owner: 10Hashar) [11:06:35] PROBLEM - 
mediawiki-installation DSH group on mw2208 is CRITICAL: Host mw2208 is not in mediawiki-installation dsh group [11:06:35] PROBLEM - mediawiki-installation DSH group on mw2210 is CRITICAL: Host mw2210 is not in mediawiki-installation dsh group [11:06:35] PROBLEM - mediawiki-installation DSH group on mw2209 is CRITICAL: Host mw2209 is not in mediawiki-installation dsh group [11:08:00] this is me --^ [11:10:32] (03CR) 10BBlack: [C: 031] Revert "cache_upload: route around codfw in cache::route_table" [puppet] - 10https://gerrit.wikimedia.org/r/309280 (owner: 10Ema) [11:13:03] (03PS2) 10Ema: Revert "cache_upload: route around codfw in cache::route_table" [puppet] - 10https://gerrit.wikimedia.org/r/309280 [11:13:26] (03CR) 10Muehlenhoff: [C: 032] contint: port labs fw rule to ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/309254 (owner: 10Hashar) [11:13:34] (03CR) 10Ema: [C: 032 V: 032] Revert "cache_upload: route around codfw in cache::route_table" [puppet] - 10https://gerrit.wikimedia.org/r/309280 (owner: 10Ema) [11:13:58] (03PS3) 10Ema: Revert "cache_upload: route around codfw in cache::route_table" [puppet] - 10https://gerrit.wikimedia.org/r/309280 [11:14:00] (03PS4) 10Muehlenhoff: contint: allow ssh from contint1001 to labs instance [puppet] - 10https://gerrit.wikimedia.org/r/309153 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [11:14:02] (03CR) 10Ema: [V: 032] Revert "cache_upload: route around codfw in cache::route_table" [puppet] - 10https://gerrit.wikimedia.org/r/309280 (owner: 10Ema) [11:16:37] (03CR) 10Muehlenhoff: [C: 032] contint: allow ssh from contint1001 to labs instance [puppet] - 10https://gerrit.wikimedia.org/r/309153 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [11:16:42] (03PS5) 10Muehlenhoff: contint: allow ssh from contint1001 to labs instance [puppet] - 10https://gerrit.wikimedia.org/r/309153 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [11:19:26] PROBLEM - mediawiki-installation DSH 
group on mw2162 is CRITICAL: Host mw2162 is not in mediawiki-installation dsh group [11:20:13] (03PS2) 10Muehlenhoff: zuul::merger: allow contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/309261 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [11:20:52] 06Operations, 06Performance-Team, 10Thumbor: invalid literal for int() with base 10: - https://phabricator.wikimedia.org/T145061#2618786 (10Gilles) [11:21:30] (03CR) 10Muehlenhoff: [C: 032] zuul::merger: allow contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/309261 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [11:22:08] Hey there, I have an issue regarding server access through ssh [11:24:36] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:04] akrause: to which server? [11:25:23] can you pm me? [11:25:58] sure [11:27:51] 06Operations, 06Performance-Team, 10Thumbor: invalid literal for int() with base 10: - https://phabricator.wikimedia.org/T145061#2618805 (10Gilles) I bet this is actually due to the cgroup thing, which makes the exiftool call fail silently. As far as I can see in the exiftool runner, we don't do anything wit... 
[11:32:45] 06Operations, 06Performance-Team, 10Thumbor: invalid literal for int() with base 10: - https://phabricator.wikimedia.org/T145061#2618811 (10Gilles) [11:35:03] !log reimaging mw2161, mw2162, mw2081, mw2082 to jessie [11:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:39:37] (03PS2) 10Faidon Liambotis: Add role::mirrors to sodium [puppet] - 10https://gerrit.wikimedia.org/r/308981 [11:39:44] (03CR) 10Faidon Liambotis: [C: 032] Add role::mirrors to sodium [puppet] - 10https://gerrit.wikimedia.org/r/308981 (owner: 10Faidon Liambotis) [11:39:57] (03CR) 10Faidon Liambotis: [V: 032] Add role::mirrors to sodium [puppet] - 10https://gerrit.wikimedia.org/r/308981 (owner: 10Faidon Liambotis) [11:43:34] (03PS1) 10Elukey: Avoid cron duplicates that might lead to inconsistencies with Camus [puppet] - 10https://gerrit.wikimedia.org/r/309286 [11:44:26] (03PS1) 10Mobrovac: RESTBaseUpdateJobs: Un-deploy the extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309287 (https://phabricator.wikimedia.org/T144843) [12:10:53] (03PS1) 10Faidon Liambotis: Point mirrors from carbon to sodium [dns] - 10https://gerrit.wikimedia.org/r/309293 [12:10:59] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Decrease time required to fully restart the Cirrus elasticsearch clusters - https://phabricator.wikimedia.org/T145065#2618884 (10Gehel) [12:11:42] RECOVERY - mediawiki-installation DSH group on mw2202 is OK: OK [12:37:11] PROBLEM - mediawiki-installation DSH group on mw2162 is CRITICAL: Host mw2162 is not in mediawiki-installation dsh group [12:37:41] PROBLEM - mediawiki-installation DSH group on mw2161 is CRITICAL: Host mw2161 is not in mediawiki-installation dsh group [12:38:29] 06Operations, 10DBA, 05Prometheus-metrics-monitoring: Create a script to regenerate prometheus mysqld exporter listing that works with puppetdb - https://phabricator.wikimedia.org/T145072#2619017 (10jcrespo) [12:38:31] PROBLEM - puppet last 
run on mw2161 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[mw-cgroup],Package[jobrunner] [12:39:41] (03PS1) 10Mobrovac: JobRunners: Remove the RESTBase runners [puppet] - 10https://gerrit.wikimedia.org/r/309298 (https://phabricator.wikimedia.org/T144843) [12:39:52] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 21 seconds ago with 2 failures. Failed resources (up to 3 shown): Service[mw-cgroup],Package[jobrunner] [12:40:19] Dereckson, aude & hashar: again, I am okay with doing the EU SWAT today (as yet again one of the patches is mine) :) [12:40:20] PROBLEM - mediawiki-installation DSH group on mw2082 is CRITICAL: Host mw2082 is not in mediawiki-installation dsh group [12:40:39] addshore: sounds good [12:40:40] PROBLEM - mediawiki-installation DSH group on mw2081 is CRITICAL: Host mw2081 is not in mediawiki-installation dsh group [12:40:47] mobrovac: don't you have a patch to SWAT deploy? [12:40:51] PROBLEM - puppet last run on mw2162 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Service[mw-cgroup],Package[jobrunner] [12:40:53] ok [12:41:08] yup hashar [12:41:15] hashar: yup, I've seen that one too :) [12:41:18] it's listed on the calendar [12:41:52] PROBLEM - puppet last run on mw2082 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[mw-cgroup],Package[jobrunner] [12:42:05] (03PS2) 10Mobrovac: JobRunners: Remove the RESTBase runners [puppet] - 10https://gerrit.wikimedia.org/r/309298 (https://phabricator.wikimedia.org/T144843) [12:42:18] hashar: the slot starts in 20 mins, right?
[12:42:58] (03PS1) 10Hashar: zuul: migrate server only settings out of merger [puppet] - 10https://gerrit.wikimedia.org/r/309299 [12:43:00] mobrovac: yeah [12:43:06] k [12:44:09] !log uploaded new linux package for jessie (based on 4.4.19 with bumped kernel ABI=2) [12:44:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:45:40] RECOVERY - mediawiki-installation DSH group on mw2211 is OK: OK [12:48:19] (03CR) 10Hashar: [C: 031] "https://puppet-compiler.wmflabs.org/4026/" [puppet] - 10https://gerrit.wikimedia.org/r/309299 (owner: 10Hashar) [12:52:04] !log redeploying wdqs on wdqs2002.codfw.wmnet - T144380 [12:52:07] T144380: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380 [12:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:52:16] (03CR) 10Giuseppe Lavagetto: "I did another round of PCC on some of the most critical systems, and everything seems ok. The change is straightforward and yields a 90%+ " [puppet] - 10https://gerrit.wikimedia.org/r/309270 (owner: 10Giuseppe Lavagetto) [12:54:02] RECOVERY - puppet last run on mw2162 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:54:21] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:55:12] gehel: what is the timeline for the new wdqs hosts being behind the load balancer and ready for queries? [12:55:41] probably early next week, but we won't actually send traffic to those servers [12:55:51] cool, okay! [12:56:20] addshore: those servers are there in case our primary datacenter (or the wdqs cluster in our primary DC) fails [12:56:34] gehel: but their data will be kept up to date, right, I guess? [12:57:04] (03CR) 10Hashar: "Added to Sep.
8th Puppet SWAT" [puppet] - 10https://gerrit.wikimedia.org/r/308805 (https://phabricator.wikimedia.org/T137525) (owner: 10Hashar) [12:57:06] (03CR) 10Hashar: "Added to Sep. 8th Puppet SWAT" [puppet] - 10https://gerrit.wikimedia.org/r/308778 (https://phabricator.wikimedia.org/T139527) (owner: 10Hashar) [12:57:09] (03CR) 10Hashar: "Added to Sep. 8th Puppet SWAT" [puppet] - 10https://gerrit.wikimedia.org/r/309299 (owner: 10Hashar) [12:57:12] (03CR) 10Hashar: "Added to Sep. 8th Puppet SWAT" [puppet] - 10https://gerrit.wikimedia.org/r/309154 (https://phabricator.wikimedia.org/T137323) (owner: 10Hashar) [12:57:13] addshore: yep, that's the idea... we want them to be fully operational in case we need to switch them to active [12:57:15] (03CR) 10Hashar: "Added to Sep. 8th Puppet SWAT" [puppet] - 10https://gerrit.wikimedia.org/r/309274 (owner: 10Hashar) [12:57:18] gehel: I'll add them to this script then, at some point next week! :) https://github.com/wikimedia/analytics-wmde-scripts/blob/master/src/wikidata/sparql/minutely.php#L17 [12:57:18] (03CR) 10Hashar: "Added to Sep. 8th Puppet SWAT" [puppet] - 10https://gerrit.wikimedia.org/r/309275 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [12:58:19] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/309270 (owner: 10Giuseppe Lavagetto) [12:58:24] addshore: cool! [12:58:39] _joe_: I'd keep the root shell on the puppet master, just in case [12:58:51] <_joe_> godog: of course I will [12:59:07] addshore: That's actually something that would probably benefit from being migrated to diamond. This way we could ensure that any new server we add is monitored in the same way. [12:59:25] <_joe_> godog: I wanted to wait for chase to take a peek [12:59:27] gehel: yes! :) [12:59:53] mobrovac: I will start with your change for SWAT!
[13:00:04] hashar, Dereckson, addshore, and aude: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160908T1300). Please do the needful. [13:00:04] Addshore and mobrovac: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:01:01] addshore: if you need some help to port it over, let me know [13:01:20] (03CR) 10Addshore: [C: 032] RESTBaseUpdateJobs: Un-deploy the extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309287 (https://phabricator.wikimedia.org/T144843) (owner: 10Mobrovac) [13:01:47] (03Merged) 10jenkins-bot: RESTBaseUpdateJobs: Un-deploy the extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309287 (https://phabricator.wikimedia.org/T144843) (owner: 10Mobrovac) [13:01:51] gehel: is there a file / place to point me at to give me some idea of where things should end up? (as right now I don't know much about diamond)! [13:02:16] addshore: let me check... [13:02:28] addshore: syncing? [13:02:40] mobrovac: your change is on 1099, please check :) [13:02:50] (03PS1) 10Elukey: Add the pivot.wikimedia.org VHost to stat1001 [puppet] - 10https://gerrit.wikimedia.org/r/309301 (https://phabricator.wikimedia.org/T138262) [13:02:52] (03PS1) 10Gilles: Upgrade to 0.1.15 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/309302 [13:03:31] PROBLEM - Apache HTTP on mw2203 is CRITICAL: Connection refused [13:04:08] Sorry for my lateness! I'm here for today's SWAT. [13:04:44] (03PS2) 10Elukey: Add the pivot.wikimedia.org VHost to stat1001 [puppet] - 10https://gerrit.wikimedia.org/r/309301 (https://phabricator.wikimedia.org/T138262) [13:04:46] mobrovac: it looks good to me, shall I continue?
[13:05:26] yup addshore, go ahead [13:05:42] (PS3) Elukey: Add the pivot.wikimedia.org VHost to stat1001 [puppet] - https://gerrit.wikimedia.org/r/309301 (https://phabricator.wikimedia.org/T138262) [13:06:16] addshore: Please notice I added one of my patches to the SWAT a moment ago. I had it in the morning one but I meant the European one. Sorry for it again. [13:06:24] Urbanecm: no worries [13:06:28] mobrovac: syncing now [13:06:30] PROBLEM - Check size of conntrack table on mw2203 is CRITICAL: Connection refused by host [13:06:31] !log addshore@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:309287|RESTBaseUpdateJobs: Un-deploy the extension]] 1/3 (duration: 00m 49s) [13:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:37] Thanks addshore [13:07:03] PROBLEM - DPKG on mw2203 is CRITICAL: Connection refused by host [13:07:16] (PS2) Gilles: Upgrade to 0.1.15 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/309302 [13:07:21] PROBLEM - Disk space on mw2203 is CRITICAL: Connection refused by host [13:07:24] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:309287|RESTBaseUpdateJobs: Un-deploy the extension]] 2/3 (duration: 00m 46s) [13:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:08:17] !log addshore@tin Synchronized wmf-config/extension-list: SWAT: [[gerrit:309287|RESTBaseUpdateJobs: Un-deploy the extension]] 3/3 (duration: 00m 47s) [13:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:08:28] mobrovac: all done [13:08:36] thnx addshore [13:08:59] (CR) Giuseppe Lavagetto: "I think this goes in the right direction; however, we really need to be able to change a config from puppet and see it have an effect wiho" [puppet] - https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) (owner: Mobrovac) [13:09:10] all good, the extension 
is gone from Special:Version [13:09:14] (PS3) Addshore: Add massmessage-sender group to urwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309047 (https://phabricator.wikimedia.org/T144701) (owner: محمد شعیب) [13:09:27] (CR) Addshore: [C: 2] Add massmessage-sender group to urwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309047 (https://phabricator.wikimedia.org/T144701) (owner: محمد شعیب) [13:09:34] Urbanecm: doing yours now [13:09:40] Okay, thanks. [13:09:47] RECOVERY - mediawiki-installation DSH group on mw2209 is OK: OK [13:09:47] RECOVERY - mediawiki-installation DSH group on mw2208 is OK: OK [13:09:47] RECOVERY - mediawiki-installation DSH group on mw2210 is OK: OK [13:09:55] (Merged) jenkins-bot: Add massmessage-sender group to urwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/309047 (https://phabricator.wikimedia.org/T144701) (owner: محمد شعیب) [13:10:01] will this be something that you will be able to check when it is on mw1099 or not? [13:10:24] (CR) Faidon Liambotis: [C: 2] Point mirrors from carbon to sodium [dns] - https://gerrit.wikimedia.org/r/309293 (owner: Faidon Liambotis) [13:10:45] I can see if the group will appear in Special:UserGroupRights. But I have no rights to send a test message addshore [13:10:58] Urbanecm: okay, the change is on mw1099 now! [13:11:04] Checking... [13:12:02] The group has appeared. You can deploy it everywhere I think. 
[13:12:08] syncing [13:12:18] (PS7) Addshore: Enable mention status notifications everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/304608 (https://phabricator.wikimedia.org/T143101) [13:12:52] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:309047|Add massmessage-sender group to urwiki]] (duration: 00m 47s) [13:12:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:03] (CR) Addshore: [C: 2] Enable mention status notifications everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/304608 (https://phabricator.wikimedia.org/T143101) (owner: Addshore) [13:13:11] Urbanecm: it should be everywhere now! [13:13:21] thx addshore ! [13:13:31] (PS1) Faidon Liambotis: Remove role::mirror from carbon [puppet] - https://gerrit.wikimedia.org/r/309306 [13:13:34] (Merged) jenkins-bot: Enable mention status notifications everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/304608 (https://phabricator.wikimedia.org/T143101) (owner: Addshore) [13:15:16] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:304608|Enable mention status notifications everywhere]] (duration: 00m 47s) [13:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:24] EU swat all done! [13:15:31] :) [13:15:39] (PS4) Elukey: Add the pivot.wikimedia.org VHost to stat1001 [puppet] - https://gerrit.wikimedia.org/r/309301 (https://phabricator.wikimedia.org/T138262) [13:15:45] hopefully that "mention" everywhere is not going to be an issue [13:16:05] hashar: it shouldnt be, it is disabled by default by a user preference :) [13:16:17] sounds ane [13:16:19] sane [13:19:59] addshore: possible to include one more patch in SWAT? 
[13:20:23] (CR) Elukey: [C: 2] Add the pivot.wikimedia.org VHost to stat1001 [puppet] - https://gerrit.wikimedia.org/r/309301 (https://phabricator.wikimedia.org/T138262) (owner: Elukey) [13:20:38] (PS1) Gehel: DNS configuration for maps service in eqiad [dns] - https://gerrit.wikimedia.org/r/309309 (https://phabricator.wikimedia.org/T142393) [13:21:15] (PS3) Gehel: LVS configuration for maps cluster in eqiad [puppet] - https://gerrit.wikimedia.org/r/303559 (https://phabricator.wikimedia.org/T142393) [13:21:20] kart_: sure! [13:21:53] addshore: adding to calendar. [13:23:01] (CR) Rush: [C: 1] "I guess this pulls $data from the `$data = add_all_users($base_data)` from init.pp class admin. seems pragmatic." [puppet] - https://gerrit.wikimedia.org/r/309270 (owner: Giuseppe Lavagetto) [13:23:31] (PS3) Gilles: Upgrade to 0.1.15 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/309302 [13:23:56] (CR) Filippo Giunchedi: [C: 2 V: 2] Upgrade to 0.1.15 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/309302 (owner: Gilles) [13:28:00] addshore: added. [13:28:10] !log upgrading cache_upload ulsfo to varnish 4, dns depooled T131502 [13:28:11] T131502: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502 [13:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:29:12] oooh, hashar this one is a bit bigger than the patches I have done before ;) [13:29:54] hashar: https://gerrit.wikimedia.org/r/#/c/309308/ would need a full scap? [13:30:19] eeeek [13:30:20] or not as they are only js i18n messages in some library? [13:31:46] seems there is at least one message being added https://gerrit.wikimedia.org/r/#/c/309308/1/lib/jquery.uls/i18n/en.json [13:32:05] so I guess it needs the l10n files to be rebuilt [13:32:23] hashar: yes. New message, so need full scap. I can't recall though. [13:32:39] okay! 
[13:33:14] (PS1) Ema: Upgrade upload ulsfo to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/309310 (https://phabricator.wikimedia.org/T131502) [13:33:15] not sure why we get the l10n update nor how it is going to play with the l10n updater [13:33:21] (PS3) Rush: labstore: statistics and scratch mount definitions [puppet] - https://gerrit.wikimedia.org/r/309063 [13:33:41] also [13:33:46] the patch in master got reverted [13:33:47] (CR) BBlack: [C: 1] Upgrade upload ulsfo to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/309310 (https://phabricator.wikimedia.org/T131502) (owner: Ema) [13:33:50] https://gerrit.wikimedia.org/r/#/c/307504/ [13:33:54] hashar: these l10n updates are coming from jquery.uls lib, so need to add them in uls. [13:34:06] hashar: that doesn't include the fix we have in this. [13:34:14] ahhhh [13:34:22] see commit msg. [13:34:24] so it is a follow up from yesterday's deploy isn't it ? [13:34:33] ok ok :) [13:34:34] hashar: yes. [13:34:46] too many commits / changes referenced I eventually got lost hehe [13:34:49] waiting on jenkins gate-submit now [13:34:54] (CR) Rush: [C: 2] labstore: statistics and scratch mount definitions [puppet] - https://gerrit.wikimedia.org/r/309063 (owner: Rush) [13:35:09] Operations, Traffic, Patch-For-Review: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2619143 (ema) codfw is running fine with v4 routed straight to the applayer. We're going to upgrade ulsfo back to v4 routed to codfw to test v4<->v4 behavior. [13:35:16] kart_: hi [13:35:18] addshore: thanks :) This needs some careful testing, so please give some time. [13:35:19] what's the issue? [13:35:41] I am not sure how you can get it on mw1099 [13:35:42] aharoni: called you for testing :) [13:35:47] might have to refresh the l10n files on mw1099 [13:36:02] hashar: I see. 
[13:36:11] (CR) Ema: [C: 2] Upgrade upload ulsfo to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/309310 (https://phabricator.wikimedia.org/T131502) (owner: Ema) [13:36:17] (PS2) Ema: Upgrade upload ulsfo to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/309310 (https://phabricator.wikimedia.org/T131502) [13:36:19] hashar: I can't find that in the deploy docs ;) [13:36:20] (CR) Ema: [V: 2] Upgrade upload ulsfo to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/309310 (https://phabricator.wikimedia.org/T131502) (owner: Ema) [13:36:28] on mw1099: scap pull ; scap l10n-update [13:36:31] kart_: do you have a URL where you had an error? [13:36:33] should do it [13:36:36] addshore: nice. [13:36:47] aharoni: not deployed yet. Wait :) [13:36:48] hashar: ack! [13:37:08] pass it --verbose [13:37:17] kart_: deployed on staging / canary / beta / whatever it's called? [13:37:17] the l10n-update ? :) [13:37:28] aharoni: basically, VE breaks when enter or selection is done. See console for that. [13:37:30] yeah it will spurt a bunch of progress messages [13:37:34] hashar: ack! :) [13:37:40] aharoni: addshore is deploying in canary. [13:37:41] PROBLEM - mediawiki-installation DSH group on mw2066 is CRITICAL: Host mw2066 is not in mediawiki-installation dsh group [13:37:44] (PS2) Giuseppe Lavagetto: admin: stop passing around a huge data structure [puppet] - https://gerrit.wikimedia.org/r/309270 [13:37:53] kart_: OK, but where exactly? I tested with a real article and didn't experience any issues. [13:38:07] Not with a table, though [13:38:50] aharoni: wait. It is being deployed. addshore will let us know. 
[13:38:50] (PS1) Rush: labstore: misc don't shadow remounts default [puppet] - https://gerrit.wikimedia.org/r/309311 (https://phabricator.wikimedia.org/T126083) [13:38:52] RECOVERY - mediawiki-installation DSH group on mw2162 is OK: OK [13:39:07] kart_: aharoni just waiting for jenkins to merge it on the branch now [13:39:13] (PS2) Rush: labstore: misc don't shadow remounts default [puppet] - https://gerrit.wikimedia.org/r/309311 (https://phabricator.wikimedia.org/T126083) [13:39:18] addshore: ack. [13:39:21] RECOVERY - mediawiki-installation DSH group on mw2161 is OK: OK [13:39:21] kart_: OK, but I'm curious where did you see an error [13:39:33] aharoni: Before the fix? [13:40:00] (CR) Giuseppe Lavagetto: [C: 2] admin: stop passing around a huge data structure [puppet] - https://gerrit.wikimedia.org/r/309270 (owner: Giuseppe Lavagetto) [13:40:09] kart_: before the fix, but where? I thought that the breakage was reverted already. [13:40:23] aharoni: yes. it is reverted. [13:40:44] aharoni: so, I'm asking to test once this patch is deployed in test server. [13:40:50] (PS3) Rush: labstore: misc don't shadow remounts default [puppet] - https://gerrit.wikimedia.org/r/309311 (https://phabricator.wikimedia.org/T126083) [13:40:59] aharoni: if anything appears broken, we can drop deployment. [13:41:01] (CR) Rush: [C: 2 V: 2] labstore: misc don't shadow remounts default [puppet] - https://gerrit.wikimedia.org/r/309311 (https://phabricator.wikimedia.org/T126083) (owner: Rush) [13:41:25] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 5 failures. Last run 10 seconds ago with 5 failures. 
Failed resources (up to 3 shown): Exec[generate varnish.pyconf],Service[varnishxcache],Service[varnishreqstats-frontend],Service[varnishkafka-webrequest] [13:41:52] RECOVERY - mediawiki-installation DSH group on mw2082 is OK: OK [13:42:02] kart_: [ sorry, misunderstanding ] [13:42:11] RECOVERY - mediawiki-installation DSH group on mw2081 is OK: OK [13:42:22] aharoni: no worries! [13:42:56] hashar: scap l10n-update hit an issue on 1099 [13:43:15] !log powering down mw2075-mw2079 for hardware maintenance (T142726) [13:43:16] T142726: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726 [13:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:43:29] (PS1) Elukey: Add the pivot.wikimedia.org domain [dns] - https://gerrit.wikimedia.org/r/309312 (https://phabricator.wikimedia.org/T138262) [13:43:31] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:45] the change is on mw1099 currently but without the l10n / i18n update [13:44:06] _joe_: labstore1003? [13:44:08] Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2619196 (MoritzMuehlenhoff) We're closing in, fourth batch: mw2075-mw2079 [13:44:21] <_joe_> volans: uhm [13:44:25] <_joe_> let's see [13:44:41] PROBLEM - Varnishkafka log producer on cp4005 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [13:44:50] kart_: aharoni can you check on mw1099 without the update i18n ? [13:45:03] addshore: testing. [13:45:21] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp4005 is CRITICAL: Connection refused [13:45:29] <_joe_> ema: ^^ you right? 
[13:45:41] PROBLEM - Varnish HTTP upload-frontend - port 80 on cp4005 is CRITICAL: Connection refused [13:45:43] _joe_: yep that's me [13:46:17] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:46:21] ulsfo is still depooled for users->upload [13:46:42] PROBLEM - puppet last run on cp4007 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 40 seconds ago with 2 failures. Failed resources (up to 3 shown): Service[varnish],Service[varnish-frontend] [13:46:50] (CR) Alexandros Kosiaris: [C: -1] DNS configuration for maps service in eqiad (1 comment) [dns] - https://gerrit.wikimedia.org/r/309309 (https://phabricator.wikimedia.org/T142393) (owner: Gehel) [13:47:07] somehow puppet is slower on cp4* hosts today, hence the alerts [13:47:21] I'll set some downtime [13:47:51] RECOVERY - Varnish HTTP upload-frontend - port 3127 on cp4005 is OK: HTTP OK: HTTP/1.1 200 OK - 323 bytes in 0.157 second response time [13:48:11] RECOVERY - Varnish HTTP upload-frontend - port 80 on cp4005 is OK: HTTP OK: HTTP/1.1 200 OK - 324 bytes in 0.163 second response time [13:49:12] RECOVERY - puppet last run on cp4007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:49:22] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [13:49:51] RECOVERY - Varnishkafka log producer on cp4005 is OK: PROCS OK: 1 process with command name varnishkafka [13:50:01] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2619197 (hashar) Puppet has a blacklist of files pattern to not pass to rdoc: ``` name=lib/puppet/util/rdoc.rb, lang=... [13:52:21] addshore: still testing. found some issues. [13:52:38] okay kart_ so I guess you're not going to want to deploy this everywhere now? 
:) [13:53:30] (PS1) Elukey: Add the pivot.w.o domain to the stat1001 misc Varnish director [puppet] - https://gerrit.wikimedia.org/r/309315 (https://phabricator.wikimedia.org/T138262) [13:53:48] addshore: please wait. aharoni is also testing. [13:54:07] kart_: I just found a good article for testing, give me just a minute [13:54:19] Strange. Beta is good. [13:55:00] kart_: [13:55:01] https://ca.wikipedia.org/wiki/Joan_Maragall_i_Gorina?veaction=edit [13:55:04] works for me [13:55:10] with the new "All languages" title [13:55:13] and VE doesn't break. [13:55:16] PASS [13:55:18] kart_: ^ [13:55:26] addshore: ^ [13:55:28] aharoni: what's wrong with mw.org? :/ [13:55:48] cawiki is good. [13:55:50] I wanted to test with a real article, and with seeing the real feature in action [13:56:12] so, good to go or not? :) [13:56:19] good from me! [13:56:51] kart_: ? :) [13:56:57] strange. No errors now. [13:57:04] addshore: go ahead. Cleared cache. [13:57:26] !log addshore@tin Started scap: SWAT: [[gerrit:309308|Update jquery.uls from upstream]] [13:58:11] https://www.mediawiki.org/wiki/Selenium?veaction=edit is good page too as it is only templates. [13:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:30] zeljkof might be happy to see we used it as 'test page' :) [13:59:23] kart_: you are welcome :) [13:59:31] it is "templates welcome" page [13:59:35] yeah, but we're talking about interlanguage links here, and it doesn't have any [13:59:46] I prefer to test while seeing the actual feature in question [14:00:12] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, Upstream: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2619213 (hashar) [14:02:29] ahh [14:02:34] env variables are magic [14:02:34] aharoni: I was testing VE there. [14:03:45] kart_: deployed? 
[14:05:48] aharoni: scap is still running :) [14:06:41] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, Upstream: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2619216 (hashar) I have manually hacked the Jenkins job via the web UI to add `RDOCOPT='--exclude=/modu... [14:06:50] addshore: kart_ - ok, waiting patiently [14:08:06] will take time [14:08:09] gotta flee sorry [14:12:37] (PS1) Gilles: Upgrade to 0.1.16 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/309317 [14:13:55] Operations, ChangeProp, MediaWiki-API, Services, and 2 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2619222 (mobrovac) [14:14:33] (PS1) Rush: labstore: misc apply /srv/dumps via puppet and mounted ensures [puppet] - https://gerrit.wikimedia.org/r/309318 [14:14:42] (PS2) Rush: labstore: misc apply /srv/dumps via puppet and mounted ensures [puppet] - https://gerrit.wikimedia.org/r/309318 [14:19:00] (PS2) Faidon Liambotis: Remove role::mirror from carbon [puppet] - https://gerrit.wikimedia.org/r/309306 [14:19:07] (CR) Faidon Liambotis: [C: 2 V: 2] Remove role::mirror from carbon [puppet] - https://gerrit.wikimedia.org/r/309306 (owner: Faidon Liambotis) [14:19:42] (CR) Filippo Giunchedi: [C: 2 V: 2] Upgrade to 0.1.16 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/309317 (owner: Gilles) [14:19:57] (PS2) Filippo Giunchedi: Upgrade to 0.1.16 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/309317 (owner: Gilles) [14:23:55] kart_: sorry, disconnected for a couple of minutes [14:23:58] scap still running? 
[14:24:23] Operations, ChangeProp, MediaWiki-API, Services, and 2 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2619265 (mobrovac) [One debug run](https://performance.wikimedia.org/xhgui/run/view?id=57d1697facaf1d4009962560) shows that most time is spent waiting on... [14:25:18] aharoni: yes. addshore will ping when done. [14:25:36] addshore: all good, so far? [14:25:38] Operations, ChangeProp, DBA, MediaWiki-API, and 3 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2619266 (mobrovac) [14:33:07] (PS3) Jgreen: Modify secret.rb to accept a file list and use first match, like http://www.puppetcookbook.com/posts/select-a-file-based-on-a-fact.html [puppet] - https://gerrit.wikimedia.org/r/294331 [14:34:14] Operations, ChangeProp, DBA, MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2619285 (Anomie) The query is simple enough: ``` SELECT tl_from,tl_namespace AS `bl_namespace`,tl_title AS `bl_title`,page_id,page_title,page_namespace,page_i... [14:34:36] bblack: I'd like to deploy https://gerrit.wikimedia.org/r/#/c/309309/, the doc warns about being potentially outdated, could you check me? [14:35:44] mobrovac: I'll be bouncing one cassandra instance on restbase-test2001 to test the CA extended expiration FYI [14:35:47] (PS2) Gehel: DNS configuration for maps service in eqiad [dns] - https://gerrit.wikimedia.org/r/309309 (https://phabricator.wikimedia.org/T142393) [14:36:08] (CR) Gehel: DNS configuration for maps service in eqiad (1 comment) [dns] - https://gerrit.wikimedia.org/r/309309 (https://phabricator.wikimedia.org/T142393) (owner: Gehel) [14:36:13] gehel: what doc? 
[14:36:22] (PS1) Faidon Liambotis: mirrors: remove ARCH_INCLUDE from Debian [puppet] - https://gerrit.wikimedia.org/r/309320 [14:36:24] https://wikitech.wikimedia.org/wiki/DNS#Changing_records_in_a_zonefile [14:36:27] bblack: ^ [14:36:39] !log bounce restbase-test2001 cassandra-a instance T143044 [14:36:40] T143044: Renew RESTBase self-signed root certificate authority - https://phabricator.wikimedia.org/T143044 [14:36:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:14] bblack: basically, merge, run "authdns-update", check if everything is ok. Looks reasonable to me, but what do I know... [14:37:27] (CR) Faidon Liambotis: [C: 2 V: 2] mirrors: remove ARCH_INCLUDE from Debian [puppet] - https://gerrit.wikimedia.org/r/309320 (owner: Faidon Liambotis) [14:38:05] gehel: that part is right. but be sure to "sudo -i" to a real shell before authdns-update, I think it still has problems with direct invocation. [14:38:31] bblack: thanks! I'll do that and scream if I have a problem... [14:39:13] (CR) BBlack: [C: 1] DNS configuration for maps service in eqiad [dns] - https://gerrit.wikimedia.org/r/309309 (https://phabricator.wikimedia.org/T142393) (owner: Gehel) [14:39:25] PROBLEM - puppet last run on ms-be1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/gen_fingerprints] [14:39:39] (CR) Gehel: [C: 2] DNS configuration for maps service in eqiad [dns] - https://gerrit.wikimedia.org/r/309309 (https://phabricator.wikimedia.org/T142393) (owner: Gehel) [14:41:45] !log deploying new DNS entries for kartotherian.svc.eqiad.wmnet [14:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:43:06] bblack: thanks, it all seems to work [14:44:47] (PS1) Alexandros Kosiaris: puppetmaster100[1,2]: Assign IPs [dns] - https://gerrit.wikimedia.org/r/309322 (https://phabricator.wikimedia.org/T143219) [14:44:59] (CR) jenkins-bot: [V: -1] puppetmaster100[1,2]: Assign IPs [dns] - https://gerrit.wikimedia.org/r/309322 (https://phabricator.wikimedia.org/T143219) (owner: Alexandros Kosiaris) [14:45:25] PROBLEM - cassandra-a service on restbase-test2001 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [14:48:03] !log addshore@tin Finished scap: SWAT: [[gerrit:309308|Update jquery.uls from upstream]] (duration: 50m 36s) [14:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:49:28] Operations, EventBus, hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2619347 (Ottomata) [14:51:01] (PS1) Joal: Update camus jar version [puppet] - https://gerrit.wikimedia.org/r/309323 (https://phabricator.wikimedia.org/T144716) [14:51:08] ottomata: --^ [14:51:25] RECOVERY - Disk space on mw2203 is OK: DISK OK [14:52:33] (PS1) Muehlenhoff: Update to new kernel package / drop transition package for 3.19 [debs/linux-meta] - https://gerrit.wikimedia.org/r/309326 [14:53:10] RECOVERY - Apache HTTP on mw2203 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.080 second response time [14:53:24] RECOVERY - Check size of conntrack table on mw2203 is OK: OK: nf_conntrack is 0 % full [14:53:33] 
Operations: Multiple servers in codfw fail to respond to IPMI commands during reimaging - https://phabricator.wikimedia.org/T142726#2619371 (Papaul) Complete. IPMI over LAN was already enabled on all 5 hosts. [14:53:35] RECOVERY - DPKG on mw2203 is OK: All packages OK [14:58:34] (Draft1) Paladox: Also add --exclude=/modules/[^/]*/bin/.*$ to the rdoc [puppet] - https://gerrit.wikimedia.org/r/309328 (https://phabricator.wikimedia.org/T143233) [14:58:44] RECOVERY - cassandra-a service on restbase-test2001 is OK: OK - cassandra-a is active [15:00:24] (CR) Nuria: [C: 1] "Looks good. Do we have documented anywhere this deployment procedure?" [puppet] - https://gerrit.wikimedia.org/r/309323 (https://phabricator.wikimedia.org/T144716) (owner: Joal) [15:00:38] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, Upstream: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2619387 (hashar) Looks like previously it took several minutes to build anyway:* | #25603 | 8 min 7 se... [15:02:01] Operations, Cassandra, Services, hardware-requests: 9x or 15x additional Cassandra/RESTBase nodes - https://phabricator.wikimedia.org/T139961#2619390 (GWicke) Here are some graphs of read latency during the test run: p50: {F4443475} p95: {F4443479} p99: {F4443482} There is no noticeable diffe... [15:02:08] (CR) Muehlenhoff: [C: 2] Update to new kernel package / drop transition package for 3.19 [debs/linux-meta] - https://gerrit.wikimedia.org/r/309326 (owner: Muehlenhoff) [15:03:09] kart_: yes, all done! [15:03:20] (PS1) Alexandros Kosiaris: Introduce puppetmaster100[12] to the fleet [puppet] - https://gerrit.wikimedia.org/r/309330 (https://phabricator.wikimedia.org/T143219) [15:03:32] apparently my IRC hung a little bit there.... 
[15:05:18] (Abandoned) Paladox: Also add --exclude=/modules/[^/]*/bin/.*$ to the rdoc [puppet] - https://gerrit.wikimedia.org/r/309328 (https://phabricator.wikimedia.org/T143233) (owner: Paladox) [15:06:16] Operations, hardware-requests: eqiad: (4) worker servers for kubernetes - https://phabricator.wikimedia.org/T141624#2619416 (RobH) Sounds good, I've already requested dell quotes on the linked sub-task. Once I have those back, I'll request similar quotes from HP. [15:06:19] (PS2) Alexandros Kosiaris: puppetmaster100[1,2]: Assign IPs [dns] - https://gerrit.wikimedia.org/r/309322 (https://phabricator.wikimedia.org/T143219) [15:06:26] RECOVERY - puppet last run on ms-be1011 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [15:06:42] Operations, ChangeProp, DBA, MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2619417 (Volans) Just to add some more info, in this specific `templatelinks ` table there are 4.8M rows with those conditions: ``` MariaDB PRODUCTION s2 loc... [15:07:09] (CR) Alexandros Kosiaris: [C: 2] Introduce puppetmaster100[12] to the fleet [puppet] - https://gerrit.wikimedia.org/r/309330 (https://phabricator.wikimedia.org/T143219) (owner: Alexandros Kosiaris) [15:08:10] (CR) Alexandros Kosiaris: [C: 2] puppetmaster100[1,2]: Assign IPs [dns] - https://gerrit.wikimedia.org/r/309322 (https://phabricator.wikimedia.org/T143219) (owner: Alexandros Kosiaris) [15:10:35] Operations, ChangeProp, DBA, MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2619422 (jcrespo) The problem with the straight join (or hints in general) is that you fix things now and break it elsewhere on other wikis, or for another se... 
[15:11:52] (PS1) Chad: group2 to wmf.18 [mediawiki-config] - https://gerrit.wikimedia.org/r/309331 [15:19:10] (CR) Ottomata: [C: 1] Add the pivot.w.o domain to the stat1001 misc Varnish director [puppet] - https://gerrit.wikimedia.org/r/309315 (https://phabricator.wikimedia.org/T138262) (owner: Elukey) [15:22:27] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, Upstream: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2619451 (hashar) Open>Resolved a:hashar I have updated the job in JJB and refreshed it http... [15:25:27] Operations, Puppet, Continuous-Integration-Config, Patch-For-Review, Upstream: post build failures for operations/puppet on operations-puppet-doc - https://phabricator.wikimedia.org/T143233#2619470 (Paladox) Resolved>Open Re opening per @hashar it seems that when you click on for exam... [15:28:34] (PS1) Andrew Bogott: labspuppetbackend: only allow POST from Horizon's IP [puppet] - https://gerrit.wikimedia.org/r/309333 [15:29:44] (CR) jenkins-bot: [V: -1] labspuppetbackend: only allow POST from Horizon's IP [puppet] - https://gerrit.wikimedia.org/r/309333 (owner: Andrew Bogott) [15:30:41] (PS1) Faidon Liambotis: mirrors: set MIRRORNAME in Debian's ftpsync config [puppet] - https://gerrit.wikimedia.org/r/309334 [15:30:48] addshore: forgot one thing - Thanks! 
[15:31:39] (PS2) Andrew Bogott: labspuppetbackend: only allow POST from Horizon's IP [puppet] - https://gerrit.wikimedia.org/r/309333 [15:34:29] (Abandoned) Elukey: Avoid cron duplicates that might lead to inconsistencies with Camus [puppet] - https://gerrit.wikimedia.org/r/309286 (owner: Elukey) [15:44:35] (CR) Giuseppe Lavagetto: [C: -1] contint: vary ssh from= for prod slave (1 comment) [puppet] - https://gerrit.wikimedia.org/r/309154 (https://phabricator.wikimedia.org/T137323) (owner: Hashar) [15:46:06] (PS1) Alex Monk: admin: allow matrix.py to output a wikitext table [puppet] - https://gerrit.wikimedia.org/r/309337 [15:47:13] (CR) jenkins-bot: [V: -1] admin: allow matrix.py to output a wikitext table [puppet] - https://gerrit.wikimedia.org/r/309337 (owner: Alex Monk) [15:47:58] (PS2) Alex Monk: admin: allow matrix.py to output a wikitext table [puppet] - https://gerrit.wikimedia.org/r/309337 [15:52:51] (CR) Paladox: "@Jcrespo would you be able to review please?" [puppet/mariadb] - https://gerrit.wikimedia.org/r/308785 (owner: Paladox) [15:53:04] (PS1) Alexandros Kosiaris: puppetmaster100[12]: Set different partman recipe [puppet] - https://gerrit.wikimedia.org/r/309338 [15:54:19] (CR) Alexandros Kosiaris: [C: 2] puppetmaster100[12]: Set different partman recipe [puppet] - https://gerrit.wikimedia.org/r/309338 (owner: Alexandros Kosiaris) [15:54:52] (CR) Giuseppe Lavagetto: [C: 1] ci::master: drop legacy definitions [puppet] - https://gerrit.wikimedia.org/r/309274 (owner: Hashar) [15:55:48] bblack: I'm ready to deploy https://gerrit.wikimedia.org/r/#/c/303559/ (LVS for kartotherian/eqiad). I should be ok on my own, but would feel better if there is someone around for when I touch LVS... 
[15:56:25] (03CR) 10Giuseppe Lavagetto: [C: 031] ci::master: drop mwext-sync leftover [puppet] - 10https://gerrit.wikimedia.org/r/309275 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [15:56:44] (03CR) 10Paladox: [C: 031] ci::master: drop mwext-sync leftover [puppet] - 10https://gerrit.wikimedia.org/r/309275 (https://phabricator.wikimedia.org/T51846) (owner: 10Hashar) [15:56:50] (03CR) 10Giuseppe Lavagetto: [C: 031] zuul: stop logging paramiko [puppet] - 10https://gerrit.wikimedia.org/r/308805 (https://phabricator.wikimedia.org/T137525) (owner: 10Hashar) [15:57:09] (03CR) 10Paladox: [C: 031] zuul: stop logging paramiko [puppet] - 10https://gerrit.wikimedia.org/r/308805 (https://phabricator.wikimedia.org/T137525) (owner: 10Hashar) [15:57:16] (03CR) 10Giuseppe Lavagetto: "not suitable for puppet-swat, will review at a later time." [puppet] - 10https://gerrit.wikimedia.org/r/308778 (https://phabricator.wikimedia.org/T139527) (owner: 10Hashar) [15:57:52] (03CR) 10Giuseppe Lavagetto: "Not suitable for puppetSWAT. Will review later." [puppet] - 10https://gerrit.wikimedia.org/r/309299 (owner: 10Hashar) [15:58:10] (03PS1) 10Elukey: Add mod_proxy_http and mod_xml2enc to the apache available modules. [puppet] - 10https://gerrit.wikimedia.org/r/309340 [15:59:50] (03Abandoned) 10Elukey: Add mod_proxy_http and mod_xml2enc to the apache available modules. [puppet] - 10https://gerrit.wikimedia.org/r/309340 (owner: 10Elukey) [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160908T1600). Please do the needful. [16:00:04] hashar: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[16:00:23] (03PS1) 10Muehlenhoff: fix typo [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/309341 [16:01:24] <_joe_> moritzm, godog I'll do the puppetswat patches tomorrow when hasharAway is around [16:01:27] (03CR) 10Muehlenhoff: [C: 032 V: 032] fix typo [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/309341 (owner: 10Muehlenhoff) [16:01:37] ok [16:01:43] (03PS1) 10Alexandros Kosiaris: Update puppetmaster::servers in hiera for puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/309342 [16:01:50] _joe_: thanks, sounds good [16:02:03] <_joe_> akosiaris: \o/ [16:03:52] (03CR) 10Alexandros Kosiaris: [C: 032] Update puppetmaster::servers in hiera for puppetmaster1001 [puppet] - 10https://gerrit.wikimedia.org/r/309342 (owner: 10Alexandros Kosiaris) [16:04:08] (03PS2) 10Faidon Liambotis: mirrors: set MIRRORNAME in Debian's ftpsync config [puppet] - 10https://gerrit.wikimedia.org/r/309334 [16:04:12] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mirrors: set MIRRORNAME in Debian's ftpsync config [puppet] - 10https://gerrit.wikimedia.org/r/309334 (owner: 10Faidon Liambotis) [16:05:05] did someone puppet-merge that? [16:06:14] (03PS1) 10Elukey: Add mod_proxy_http and mod_xml2enc to the apache available modules. [puppet] - 10https://gerrit.wikimedia.org/r/309344 [16:07:03] !log uploaded linux-meta 1.10 to carbon (pointing to the new 4.4.19 kernel image) [16:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:10] (03CR) 10BryanDavis: "This change breaks puppet on a lot of hosts in Labs: https://tools.wmflabs.org/watroles/role/role::aptly" [puppet] - 10https://gerrit.wikimedia.org/r/308152 (owner: 10Hashar) [16:10:47] gehel: ok [16:11:06] bblack: I'll just wait for puppet swat to be over... 
[16:11:34] bblack: thanks [16:11:44] ok [16:13:07] !log roll-restart cassandra instances on restbase-test cluster T143044 [16:13:08] T143044: Renew RESTBase self-signed root certificate authority - https://phabricator.wikimedia.org/T143044 [16:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:04] (03CR) 10Giuseppe Lavagetto: [C: 031] Add mod_proxy_http and mod_xml2enc to the apache available modules. [puppet] - 10https://gerrit.wikimedia.org/r/309344 (owner: 10Elukey) [16:15:48] (03CR) 10Elukey: [C: 032] Add mod_proxy_http and mod_xml2enc to the apache available modules. [puppet] - 10https://gerrit.wikimedia.org/r/309344 (owner: 10Elukey) [16:16:08] that commit message is wrong [16:16:10] too late [16:16:19] s/proxy_http/proxy_html/ [16:17:11] ouch you are right [16:18:14] paravoid: I also missed that in 2.2 html was a third party, going to fix it [16:18:20] thanks for catching that, need coffee [16:18:22] :) [16:20:38] (03PS1) 10Elukey: Fix mod_proxy_html Apache module availability [puppet] - 10https://gerrit.wikimedia.org/r/309345 [16:21:20] (03PS2) 10Elukey: Fix mod_proxy_html Apache module availability [puppet] - 10https://gerrit.wikimedia.org/r/309345 [16:22:16] (03CR) 10BryanDavis: "Puppet fixed for:" [puppet] - 10https://gerrit.wikimedia.org/r/308152 (owner: 10Hashar) [16:22:25] paravoid: --^ [16:23:42] (03CR) 10Elukey: [C: 032] Fix mod_proxy_html Apache module availability [puppet] - 10https://gerrit.wikimedia.org/r/309345 (owner: 10Elukey) [16:24:35] (03PS1) 10Mattflaschen: Add logging channel for NewUserMessage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309346 (https://phabricator.wikimedia.org/T131957) [16:24:40] The authenticity of host 'puppetmaster1002.eqiad.wmnet (10.64.48.45)' can't be established - mmmmm [16:25:59] (03PS9) 10BBlack: Remove geoiplookup service IPs from LVS [puppet] - 10https://gerrit.wikimedia.org/r/305420 (https://phabricator.wikimedia.org/T100902) [16:26:01] (03PS9) 
10BBlack: text VCL: remove JSON output support [puppet] - 10https://gerrit.wikimedia.org/r/305421 (https://phabricator.wikimedia.org/T100902) [16:26:03] (03PS1) 10BBlack: Remove /geoiplookup, but not geoiplookup.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/309347 (https://phabricator.wikimedia.org/T100902) [16:26:31] (03Abandoned) 10BBlack: text VCL: remove geoiplookup hostname support [puppet] - 10https://gerrit.wikimedia.org/r/306309 (https://phabricator.wikimedia.org/T100902) (owner: 10BBlack) [16:26:53] (03PS1) 10Urbanecm: Limit file uploads on Ladino Wikipedia to sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309348 (https://phabricator.wikimedia.org/T145090) [16:28:36] something is weird with puppet-merge, it stops for a while sometimes [16:28:48] (03CR) 10BBlack: [C: 032] Remove /geoiplookup, but not geoiplookup.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/309347 (https://phabricator.wikimedia.org/T100902) (owner: 10BBlack) [16:29:11] and now it seems executing yes?? [16:29:55] I had to confirm a new host key [16:29:59] jouncebot: next [16:29:59] In 0 hour(s) and 30 minute(s): Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160908T1700) [16:30:03] and now it's kinda stuck [16:30:17] jouncebot: now [16:30:17] For the next 0 hour(s) and 29 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160908T1600) [16:30:22] bblack: I did it too but it seems executing 'yes' in my tmux screen [16:30:25] it is stuck [16:30:33] nice [16:30:41] no ok now it is fine [16:31:37] so the yes part is probably due to me trying to accept the new key, the command got executed afterwards? [16:32:12] (03CR) 10Alex Monk: "Puppet fixed for integration-aptly01" [puppet] - 10https://gerrit.wikimedia.org/r/308152 (owner: 10Hashar) [16:32:12] bblack: how is your merge looking? [16:32:27] still stuck or unblocked? 
[16:32:40] (03CR) 10Chad: [C: 032] "Going to land this on beta for a bit before syncing to production." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306471 (owner: 10Chad) [16:32:51] I gave up and ctrl+c. it looked to have merged on at least 3 masters. [16:32:57] I have no idea [16:33:23] (03PS4) 10Chad: Remove "p" symlink from WMF config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/306471 [16:34:32] (03CR) 10Thcipriani: [C: 031] "Slowly but surely, whittling away :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308789 (owner: 10Chad) [16:34:41] I need to send another code review, maybe it was only a temp problem due to new settings? [16:35:06] let [16:35:10] let's check [16:35:14] same here re: host key while syncing private.git [16:35:47] mmm maybe we need to run puppet on palladium? [16:36:13] it's me [16:36:19] it's puppetmaster1001 and puppetmaster1002 [16:36:29] I am just finishing up provisioning them [16:36:34] should be fine in like 5 [16:36:39] ah nice [16:36:53] akosiaris: I just ran puppet to pick up the new keys on palladium [16:37:04] yeah, makes sense, thanks! 
[16:38:19] (03PS5) 10BBlack: Remove geoiplookup DNS entries [dns] - 10https://gerrit.wikimedia.org/r/305422 (https://phabricator.wikimedia.org/T100902) [16:41:00] and done [16:41:06] (03PS1) 10Elukey: Replace mod_substitute logic with proxy_html due to perf issues [puppet] - 10https://gerrit.wikimedia.org/r/309351 (https://phabricator.wikimedia.org/T116192) [16:41:13] just in time for a test :) [16:41:18] please do [16:43:07] !log demon@tin Started scap: removing obsolete p symlink [16:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:43:50] * elukey waits Jenkins [16:44:06] (03CR) 10Elukey: [C: 032] Replace mod_substitute logic with proxy_html due to perf issues [puppet] - 10https://gerrit.wikimedia.org/r/309351 (https://phabricator.wikimedia.org/T116192) (owner: 10Elukey) [16:44:40] (03CR) 10Alexandros Kosiaris: [C: 032] role::postgres: Restrict to labs networks [puppet] - 10https://gerrit.wikimedia.org/r/308711 (owner: 10Muehlenhoff) [16:44:44] (03PS3) 10Alexandros Kosiaris: role::postgres: Restrict to labs networks [puppet] - 10https://gerrit.wikimedia.org/r/308711 (owner: 10Muehlenhoff) [16:44:46] (03CR) 10Alexandros Kosiaris: [V: 032] role::postgres: Restrict to labs networks [puppet] - 10https://gerrit.wikimedia.org/r/308711 (owner: 10Muehlenhoff) [16:46:01] akosiaris: all good [16:46:44] ok [16:46:46] thanks [16:47:12] akosiaris: one thing.. I had puppet disabled on stat1001 and after re-enabling it, I get Error: Could not request certificate: The certificate retrieved from the master does not match the agent's private key. 
[16:47:32] !log demon@tin Finished scap: removing obsolete p symlink (duration: 04m 25s) [16:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:47:41] elukey: you forgot sudo me thinks [16:48:05] * elukey cries in a corner [16:48:15] and you probably now have a .puppet directory in your homedir [16:48:15] 06Operations, 10Analytics: Can't log into https://piwik.wikimedia.org/ - https://phabricator.wikimedia.org/T144326#2619710 (10Nuria) Moving to radar as ops are the ones that could shed some light here. [16:48:37] akosiaris: no no I need to disconnect my hands from a keyboard [16:48:45] this is the root cause :) [16:48:53] thanks! [16:48:58] yw [16:50:40] 06Operations, 10Cassandra, 06Services: Renew RESTBase self-signed root certificate authority - https://phabricator.wikimedia.org/T143044#2619719 (10fgiunchedi) this is complete with a 50y CA in the restbase test cluster, production cluster to follow monday week ``` root@cerium:/etc/cassandra-a/tls# keytool... [16:51:55] (03CR) 10Jcrespo: "I promised to do it ASAP. I will." [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/308785 (owner: 10Paladox) [16:52:01] (03CR) 10Chad: [C: 032] getMWVersion: Unused, dubiously useful [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308789 (owner: 10Chad) [16:52:07] a 50 year CA? 
[16:52:09] (03PS2) 10Chad: getMWVersion: Unused, dubiously useful [mediawiki-config] - 10https://gerrit.wikimedia.org/r/308789 [16:52:19] (03CR) 10Paladox: "Ok sorry" [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/308785 (owner: 10Paladox) [16:53:02] heh ok :) [16:54:18] bblack: yeah, there's some explanation earlier in task [16:54:42] it was surprisingly painless to roll over the CA anyways we could do it again with a shorter expiration [16:55:25] !log demon@tin Synchronized multiversion/: removing more junk - getMWVersion (duration: 01m 07s) [16:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:38] yeah I guess since this is entirely private to a controlled cluster, it doesn't matter much [16:55:49] in the event of compromise you'd simply delete it anyways [16:59:10] yep, also something I noticed with java keystores, you have to trust every single ca certificate, as opposed to being able to issue another ca certificate with a longer expiration but the same private key [17:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160908T1700). Please do the needful. 
[17:00:14] no parsoid deploys [17:00:19] Nothing for ORES [17:00:55] I've got a trivial change for Striker that I will roll out [17:04:30] (03CR) 10Gehel: [C: 032] LVS configuration for maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/303559 (https://phabricator.wikimedia.org/T142393) (owner: 10Gehel) [17:04:34] !log Updated Striker to 7d7c8ee [17:04:36] (03PS4) 10Gehel: LVS configuration for maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/303559 (https://phabricator.wikimedia.org/T142393) [17:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:56] (03CR) 10BBlack: [C: 031] Add the pivot.w.o domain to the stat1001 misc Varnish director [puppet] - 10https://gerrit.wikimedia.org/r/309315 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [17:05:24] (03PS2) 10Elukey: Add the pivot.w.o domain to the stat1001 misc Varnish director [puppet] - 10https://gerrit.wikimedia.org/r/309315 (https://phabricator.wikimedia.org/T138262) [17:06:38] !log deploying new LVS configuration for kartotherian.svc.eqiad.wmnet [17:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:19] (03CR) 10Elukey: [C: 032] Add the pivot.w.o domain to the stat1001 misc Varnish director [puppet] - 10https://gerrit.wikimedia.org/r/309315 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [17:07:26] (03PS3) 10Elukey: Add the pivot.w.o domain to the stat1001 misc Varnish director [puppet] - 10https://gerrit.wikimedia.org/r/309315 (https://phabricator.wikimedia.org/T138262) [17:08:41] PROBLEM - puppet last run on elastic2015 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:10:44] (03PS1) 10Gehel: Revert "LVS configuration for maps cluster in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/309354 [17:11:23] !log reverting deploying new LVS configuration for kartotherian.svc.eqiad.wmnet - puppet error, let's analyse slowly... [17:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:02] (03CR) 10Gehel: [C: 032] Revert "LVS configuration for maps cluster in eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/309354 (owner: 10Gehel) [17:14:15] (03PS1) 10Chad: Remove activeMWVersions wrapper [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309356 [17:14:43] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint, 13Patch-For-Review: Configure LVS in front of maps100? servers - https://phabricator.wikimedia.org/T142393#2619771 (10Gehel) Puppet error when deploying https://gerrit.wikimedia.org/r/303559: Error: Could not retrieve catalog from remote server:... [17:15:40] PROBLEM - puppet last run on lvs1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:16:45] interesting [17:17:40] ^ puppet issue is me, already reverted, looking into it... [17:17:54] and damn puppet error reporting! [17:18:12] RECOVERY - puppet last run on lvs1006 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [17:18:15] 06Operations, 10Analytics-Cluster: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2619776 (10Ottomata) Ok! DNS has been merged, and we just did our first refinery release using jenkins and archiva. ALL IS WELL! 
[17:18:33] gehel: ip_block022 is missing the eqiad IP, in hieradata/common/lvs/configuration.yaml [17:18:43] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T145057#2618640 (10elukey) @hashar let's sync and make this happen, maybe on Monday? [17:19:01] in other words, the change doesn't contain the IP you added to DNS anywhere in the patch :) [17:19:02] bblack: Thanks! I was getting to it, but you are definitely faster! [17:19:15] I missed that on review too heh [17:20:20] bblack: error is still mine... [17:20:53] bblack: my stand up is coming up, I'll send the fix, but do the actual merge later. Thanks for the help! [17:22:17] np [17:22:47] (03PS1) 10Gehel: LVS configuration for maps cluster in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/309357 (https://phabricator.wikimedia.org/T142393) [17:23:58] (03CR) 10Gehel: "This is a rework of https://gerrit.wikimedia.org/r/#/c/303559/ adding the missing ip_block022" [puppet] - 10https://gerrit.wikimedia.org/r/309357 (https://phabricator.wikimedia.org/T142393) (owner: 10Gehel) [17:24:51] (03CR) 10Jforrester: [C: 031] Update constants file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309256 (owner: 10Aaron Schulz) [17:27:32] (03PS1) 10Chad: Combine checkoutMediaWiki into one file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309359 [17:31:03] (03PS1) 10Chad: updateWikiversions: Combine into singular file, no outside uses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309361 [17:32:45] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup beryllium replacment - https://phabricator.wikimedia.org/T143902#2619843 (10Jgreen) [17:32:49] 06Operations, 06Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Install and configure new WDQS nodes on codfw - https://phabricator.wikimedia.org/T144380#2619845 (10Gehel) The RWStore.properties is not actually created during 
scap deployment (we get the default one). There is some issue with... [17:33:04] (03CR) 10Chad: "The fact that we have to copy this...." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309256 (owner: 10Aaron Schulz) [17:33:29] 06Operations, 10Analytics-Cluster: Migrate titanium to jessie (archiva.wikimedia.org upgrade) - https://phabricator.wikimedia.org/T123725#2619847 (10MoritzMuehlenhoff) Nice! Let's keep titanium around for another two weeks just in case we need to track something down that wasn't noticed. [17:35:22] RECOVERY - puppet last run on elastic2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:35:36] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup beryllium replacment - https://phabricator.wikimedia.org/T143902#2619850 (10Jgreen) @Cmjohnson I guess for now our only option is to unplug db1008 (which is currently powered off) and install frauth1001 on pfw1 2/0/9. [17:36:59] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup beryllium replacment - https://phabricator.wikimedia.org/T143902#2619852 (10Jgreen) [17:40:37] (03CR) 10Madhuvishy: [C: 031] labstore: misc apply /srv/dumps via puppet and mounted ensures [puppet] - 10https://gerrit.wikimedia.org/r/309318 (owner: 10Rush) [17:41:00] (03PS3) 10Rush: labstore: misc apply /srv/dumps via puppet and mounted ensures [puppet] - 10https://gerrit.wikimedia.org/r/309318 [17:41:13] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1009-a.eqiad.wmnet [17:41:15] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226 [17:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:23] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1014-a.eqiad.wmnet [17:41:24] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226 [17:41:27] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:32] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1015-a.eqiad.wmnet [17:41:33] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226 [17:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:03] (03CR) 10Rush: [C: 032] labstore: misc apply /srv/dumps via puppet and mounted ensures [puppet] - 10https://gerrit.wikimedia.org/r/309318 (owner: 10Rush) [17:45:29] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1007-b.eqiad.wmnet [17:45:30] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226 [17:45:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:39] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1010-b.eqiad.wmnet [17:45:40] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226 [17:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:49] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1011-b.eqiad.wmnet [17:45:50] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226 [17:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:46:06] (03CR) 10Greg Grossmeier: [C: 031] "Sweet." 
[puppet] - 10https://gerrit.wikimedia.org/r/309337 (owner: 10Alex Monk) [17:46:41] !log reboot labstore1004 & labstore1005 [17:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:54] (03PS1) 10Chad: Remove MWVersion, fold its two functions into MWMultiVersion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 [17:47:57] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2619896 (10jcrespo) >>! In T145079#2619417, @Volans wrote: > Just to add some more info, in this specific `templatelinks ` table there are 4.8M rows with those... [17:49:39] (03CR) 10Chad: Remove MWVersion, fold its two functions into MWMultiVersion (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309363 (owner: 10Chad) [17:53:12] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1008-b.eqiad.wmnet [17:53:13] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226 [17:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:24] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1012-b.eqiad.wmnet [17:53:25] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226 [17:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:35] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase1013-b.eqiad.wmnet [17:53:36] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226 [17:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:55:01] PROBLEM - swift-object-updater on ms-be1022 is CRITICAL: Connection refused by host [17:55:01] PROBLEM - HP RAID on ms-be1022 
is CRITICAL: Connection refused by host [17:55:01] PROBLEM - salt-minion processes on ms-be1022 is CRITICAL: Connection refused by host [17:55:11] PROBLEM - swift-container-server on ms-be1022 is CRITICAL: Connection refused by host [17:55:11] PROBLEM - swift-container-updater on ms-be1022 is CRITICAL: Connection refused by host [17:55:31] PROBLEM - very high load average likely xfs on ms-be1022 is CRITICAL: Connection refused by host [17:55:31] PROBLEM - MD RAID on ms-be1022 is CRITICAL: Connection refused by host [17:55:33] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 13Patch-For-Review: decommission aluminium, replace it with frqueue1002 - https://phabricator.wikimedia.org/T140676#2619956 (10Jgreen) [17:55:40] PROBLEM - dhclient process on ms-be1022 is CRITICAL: Connection refused by host [17:55:51] PROBLEM - NTP on ms-be1022 is CRITICAL: NTP CRITICAL: No response from NTP server [17:56:02] PROBLEM - configured eth on ms-be1022 is CRITICAL: Connection refused by host [17:56:16] PROBLEM - Check size of conntrack table on ms-be1022 is CRITICAL: Connection refused by host [17:56:16] PROBLEM - swift-object-auditor on ms-be1022 is CRITICAL: Connection refused by host [17:56:31] PROBLEM - swift-account-auditor on ms-be1022 is CRITICAL: Connection refused by host [17:56:32] PROBLEM - puppet last run on ms-be1022 is CRITICAL: Connection refused by host [17:56:33] PROBLEM - swift-object-replicator on ms-be1022 is CRITICAL: Connection refused by host [17:56:33] PROBLEM - swift-container-replicator on ms-be1022 is CRITICAL: Connection refused by host [17:56:33] PROBLEM - DPKG on ms-be1022 is CRITICAL: Connection refused by host [17:56:41] PROBLEM - swift-account-reaper on ms-be1022 is CRITICAL: Connection refused by host [17:56:54] PROBLEM - swift-account-replicator on ms-be1022 is CRITICAL: Connection refused by host [17:57:01] PROBLEM - swift-account-server on ms-be1022 is CRITICAL: Connection refused by host [17:57:02] (03PS1) 10Chad: MWMultiversion cleanups 
[puppet] - 10https://gerrit.wikimedia.org/r/309366 [17:57:11] PROBLEM - Disk space on ms-be1022 is CRITICAL: Connection refused by host [17:57:11] PROBLEM - swift-object-server on ms-be1022 is CRITICAL: Connection refused by host [17:57:30] (03PS2) 10Chad: MWMultiversion cleanups [puppet] - 10https://gerrit.wikimedia.org/r/309366 [17:57:31] PROBLEM - swift-container-auditor on ms-be1022 is CRITICAL: Connection refused by host [17:58:52] !log T143226: Perform major compaction on local_group_wikipedia_T_parsoid_html.data, restbase200[1-9]-b.codfw.wmnet [17:58:52] (03PS12) 10Mobrovac: service::node: Compile the file holding puppet-controlled vars [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) [17:58:54] T143226: Cluster-wide major compactions: parsoid.html table - https://phabricator.wikimedia.org/T143226 [17:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:00:04] anomie, ostriches, thcipriani, hashar, twentyafterfour, and aude: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160908T1800). Please do the needful. [18:00:21] no swats? [18:00:25] I'll do it! [18:00:27] ;-) [18:00:46] thcipriani: wanna review all that multiversion stuff and take over swat for it? :P [18:01:14] oh boy [18:01:16] (03CR) 10Mobrovac: "PS 12 adds the ability to perform a Scap3 config deploy and service restart when the contents of the variables file changes." [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) (owner: 10Mobrovac) [18:04:08] go team? 
[18:05:05] European Mid-Day SWAT is handling a lot of patches [18:07:54] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack and setup new fundraising queue servers frqueue1001 and frqueue1002 - https://phabricator.wikimedia.org/T136882#2620031 (10Jgreen) [18:08:46] thcipriani: Only 309363 is risky [18:08:48] Others are boring :) [18:10:30] (03PS13) 10Mobrovac: service::node: Compile the file holding puppet-controlled vars [puppet] - 10https://gerrit.wikimedia.org/r/308021 (https://phabricator.wikimedia.org/T144542) [18:11:08] ostriches: I can get these as part of SWAT: https://gerrit.wikimedia.org/r/#/c/309356/ https://gerrit.wikimedia.org/r/#/c/309359/ https://gerrit.wikimedia.org/r/#/c/309361/ all seem innocuous [18:11:23] Yeah those 3 are safe. [18:11:30] The 4th one is a little scarier and needs a close review [18:11:33] Final one is puppet [18:11:36] cool, doing [18:12:05] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309356 (owner: 10Chad) [18:12:32] (03Merged) 10jenkins-bot: Remove activeMWVersions wrapper [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309356 (owner: 10Chad) [18:13:04] I'm just gonna sync them all at once when they've merged. 
[18:13:45] (03PS1) 10Urbanecm: Allow sysops/'crats to assign massmessage-sender in urwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309368 (https://phabricator.wikimedia.org/T144701) [18:13:47] (03PS1) 10Yuvipanda: labs: Use ENC for labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/309369 [18:14:05] andrewbogott ^ for the moving to labs puppetmaster [18:14:09] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309359 (owner: 10Chad) [18:14:29] (03CR) 10Thcipriani: Combine checkoutMediaWiki into one file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309359 (owner: 10Chad) [18:14:33] (03PS2) 10Thcipriani: Combine checkoutMediaWiki into one file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309359 (owner: 10Chad) [18:14:41] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309359 (owner: 10Chad) [18:15:16] (03Merged) 10jenkins-bot: Combine checkoutMediaWiki into one file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309359 (owner: 10Chad) [18:16:24] I know this one is going to say: Cannot Merge but won't let me rebase. Damn you gerrit! 
[18:16:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309361 (owner: 10Chad) [18:16:39] (03CR) 10Thcipriani: updateWikiversions: Combine into singular file, no outside uses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309361 (owner: 10Chad) [18:16:42] (03PS2) 10Thcipriani: updateWikiversions: Combine into singular file, no outside uses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309361 (owner: 10Chad) [18:16:48] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309361 (owner: 10Chad) [18:16:54] ugh [18:17:16] (03Merged) 10jenkins-bot: updateWikiversions: Combine into singular file, no outside uses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309361 (owner: 10Chad) [18:17:38] ostriches: you said you're syncing? [18:19:22] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup beryllium replacement - https://phabricator.wikimedia.org/T143902#2620104 (10Jgreen) [18:22:03] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup beryllium replacement frauth1001 - https://phabricator.wikimedia.org/T143902#2620125 (10Jgreen) [18:22:29] thcipriani: Yep [18:22:56] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup beryllium replacement frauth1001 - https://phabricator.wikimedia.org/T143902#2582606 (10Jgreen) [18:23:00] ostriches: kk, already pulled on tin, and staged on mw1099 if there's anything to check there. [18:23:13] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup beryllium replacement frauth1001 - https://phabricator.wikimedia.org/T143902#2582606 (10Jgreen) [18:23:14] Not really. These are all cli scripts [18:23:17] Web stuff won't notice. 
[18:23:34] Doing sync-dir now [18:23:57] !log demon@tin Synchronized multiversion/: So much junk to remove (duration: 01m 06s) [18:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:26:27] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2620173 (10Volans) @jcrespo out of curiosity and learning, is this related to the fact that `templatelinks` is partitioned? Also if the partition key is fixed h... [18:26:47] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/Setup pay-lvs1001 and pay-lvs1002 - https://phabricator.wikimedia.org/T143900#2620174 (10Jgreen) [18:28:15] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: rack/setup beryllium replacement frauth1001 - https://phabricator.wikimedia.org/T143902#2620179 (10Jgreen) [18:28:18] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/Setup pay-lvs1001 and pay-lvs1002 - https://phabricator.wikimedia.org/T143900#2582571 (10Jgreen) [18:29:44] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/Setup pay-lvs1001 and pay-lvs1002 - https://phabricator.wikimedia.org/T143900#2620188 (10Jgreen) [18:29:57] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium): Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T145057#2620190 (10hashar) Sounds good thanks! 
[18:32:28] (03PS1) 10Chad: activeMWVersions.php: Remove script and get info for noc from scap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309372 [18:40:47] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB frdb1001 - https://phabricator.wikimedia.org/T136200#2620277 (10Jgreen) [18:45:05] 06Operations, 10netops: configure port for frdb1001 - https://phabricator.wikimedia.org/T143248#2620310 (10Jgreen) [18:45:09] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB frdb1001 - https://phabricator.wikimedia.org/T136200#2620313 (10Jgreen) [18:46:17] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB frdb1001 - https://phabricator.wikimedia.org/T136200#2326525 (10Jgreen) [18:48:34] (03PS1) 10Urbanecm: Add HD logos for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309375 (https://phabricator.wikimedia.org/T145017) [18:56:32] (03PS2) 10Urbanecm: Add HD logos for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309375 (https://phabricator.wikimedia.org/T145017) [18:57:06] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2620384 (10RobH) First off, that is easily one of the best damned requests ever (in terms of populated system info.) If everyone provided the full purc... [18:57:10] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2620387 (10RobH) a:03RobH [18:58:27] 06Operations, 10EventBus, 10hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2620407 (10Ottomata) Perfect, thank you! [19:00:05] ostriches: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160908T1900). Please do the needful. 
[19:00:31] (03PS2) 10Chad: group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309331 [19:01:08] ottomata: dude if everyone filled out requests like https://phabricator.wikimedia.org/T145082 i'd be fucking thrilled [19:01:13] hehe :) [19:01:17] =] [19:01:36] well, i was able to find a nice link in wikitech that told me exactly what to do [19:01:40] saved me 20 minutes of purchase history searching =] [19:01:43] (03CR) 10Chad: [C: 032] group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309331 (owner: 10Chad) [19:01:46] and luckily my phabricator search was able to find the tickets i was looking for [19:01:48] which is not always the cas! [19:02:10] with typos like that, no wonder! [19:02:11] (03Merged) 10jenkins-bot: group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/309331 (owner: 10Chad) [19:02:11] so yeah we have the system in eqiad, i'll get a quote from dell today (and hp either later today or tomorrow) for it [19:02:11] ;) [19:02:23] im hoping dell is cheaper so we can just go all dell, seems odd to have 3 dells and 1 hp in a service cluster [19:02:56] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.18 [19:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:19:25] (03PS1) 10Ppchelko: Change-Prop: Bump transcludes concurrency once again. [puppet] - 10https://gerrit.wikimedia.org/r/309377 [19:20:29] our average HTTP request-rate on the text cluster bumped up with group2->wmf.18 [19:20:42] by ~10% [19:20:59] possibly expected [19:23:02] I'm guessing that means something like "the average pageview now fetches 1 additional URL it never did before", among the several it usually does [19:27:13] bblack: I got distracted preparing dinner... I'll restart that LVS deployment tomorrow... [19:27:24] gehel: ok [19:28:21] (03PS4) 10Ppchelko: Change-Prop: Switch to new events. 
[puppet] - 10https://gerrit.wikimedia.org/r/308077 [19:37:15] 06Operations, 10ChangeProp, 10DBA, 10MediaWiki-API, and 4 others: Investigate slow transcludedin query - https://phabricator.wikimedia.org/T145079#2620594 (10Anomie) >>! In T145079#2620173, @Volans wrote: > @jcrespo out of curiosity and learning, is this related to the fact that `templatelinks` is partitio... [19:37:20] (03CR) 10Hashar: "I have added it to European SWAT window of Tuesday, September 13 at 13:00–14:00" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/301339 (https://phabricator.wikimedia.org/T129982) (owner: 10Hashar) [19:37:32] 06Operations, 06Labs, 10Labs-Infrastructure, 10Traffic, 07HTTPS: update *.wmflabs.org certificate - https://phabricator.wikimedia.org/T145120#2620595 (10RobH) [19:38:20] (03CR) 10Andrew Bogott: [C: 031] labs: Use ENC for labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/309369 (owner: 10Yuvipanda) [19:38:26] (03PS2) 10Andrew Bogott: labs: Use ENC for labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/309369 (owner: 10Yuvipanda) [19:38:40] (03PS1) 10RobH: updated *.wmflabs.org certificate [puppet] - 10https://gerrit.wikimedia.org/r/309379 (https://phabricator.wikimedia.org/T145120) [19:40:22] 06Operations, 06Labs, 10Labs-Infrastructure, 10Traffic, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2620630 (10RobH) a:05RobH>03chasemp [19:41:48] 06Operations, 06Labs, 10Labs-Infrastructure, 10Traffic, and 2 others: update *.wmflabs.org certificate (existing expires on 2016-09-16) - https://phabricator.wikimedia.org/T145120#2620595 (10RobH) p:05Normal>03High [19:42:53] jouncebot: next [19:42:53] In 3 hour(s) and 17 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160908T2300) [19:42:57] perfect [19:43:03] RoanKattouw: I wanna swat that cache fix [19:43:18] And by swat I mean use some unallocated deploy time now :) 
[19:47:02] RoanKattouw: https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/Echo+branch:wmf/1.28.0-wmf.18 :) [19:48:05] (03PS1) 10Rush: labstore: nfs-exportd refactor and updates [puppet] - 10https://gerrit.wikimedia.org/r/309382 [19:49:17] (03PS1) 10BBlack: Revert "depool upload in ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/309383 (https://phabricator.wikimedia.org/T131502) [19:55:38] (03CR) 10BBlack: [C: 032] Revert "depool upload in ulsfo" [dns] - 10https://gerrit.wikimedia.org/r/309383 (https://phabricator.wikimedia.org/T131502) (owner: 10BBlack) [19:57:44] !log repooling normal traffic to cache_upload in ulsfo [19:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:02:13] (03Draft1) 10Paladox: Workaround a bug in gerrit on Microsoft Edge [puppet] - 10https://gerrit.wikimedia.org/r/309385 [20:03:52] !log demon@tin Synchronized php-1.28.0-wmf.18/extensions/Echo/includes/SeenTime.php: Trying to stop some duplicate redis fetches (duration: 00m 52s) [20:03:52] (03PS2) 10Paladox: Workaround a bug in gerrit on Microsoft Edge [puppet] - 10https://gerrit.wikimedia.org/r/309385 [20:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:04:12] ostriches ^^ theres a bug, lol they do display none [20:04:18] on microsoft edge [20:04:52] but the commands arnt even generate in edge so you have to manually flic through the select box from ssh to http back to ssh for it to show [20:05:07] what that ^^ does is it shows the select box again [20:05:13] (03PS1) 10Rush: labstore: nfs-manage-binds a sync-exports replacments [puppet] - 10https://gerrit.wikimedia.org/r/309387 [20:06:40] RoanKattouw: Yay we win :) [20:08:27] Hehe https://logstash.wikimedia.org/goto/91308632c90c1534db210af6a6d6cb67 [20:13:47] (03PS3) 10Paladox: Workaround a bug in gerrit on Microsoft Edge [puppet] - 10https://gerrit.wikimedia.org/r/309385 (https://phabricator.wikimedia.org/T145130) [20:16:45] 
(03PS4) 10Paladox: Workaround a bug in gerrit on Microsoft Edge [puppet] - 10https://gerrit.wikimedia.org/r/309385 (https://phabricator.wikimedia.org/T145130) [20:31:14] *yawns* [20:31:27] 06Operations, 06Analytics-Kanban, 06Performance-Team, 10Traffic: Preliminary Design document for A/B testing - https://phabricator.wikimedia.org/T143694#2620857 (10Nuria) [20:31:32] thcipriani around? :) [20:31:43] addshore: yeah, what's up? [20:31:51] quick pm :) got some questions :D [20:40:18] 06Operations, 10Analytics: Can't log into https://piwik.wikimedia.org/ - https://phabricator.wikimedia.org/T144326#2620872 (10Tbayer) @Milimetric Are you able to log in with the credentials for the user "piwik", too? [20:46:20] (03PS3) 10Hashar: node deletion delay is now configurable [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/252953 [20:46:57] (03CR) 10Hashar: "Rebased to https://review.openstack.org/#/c/245220/5/" [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/252953 (owner: 10Hashar) [20:47:27] !log redeploy wdqs on wdqs2001.codfw.wmnet [20:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:55:38] (03CR) 10Andrew Bogott: [C: 032] labs: Use ENC for labs puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/309369 (owner: 10Yuvipanda) [20:56:34] (03CR) 10Andrew Bogott: [C: 032] labspuppetbackend: only allow POST from Horizon's IP [puppet] - 10https://gerrit.wikimedia.org/r/309333 (owner: 10Andrew Bogott) [20:56:39] (03PS3) 10Andrew Bogott: labspuppetbackend: only allow POST from Horizon's IP [puppet] - 10https://gerrit.wikimedia.org/r/309333 [20:59:28] 06Operations, 10Analytics: Can't log into https://piwik.wikimedia.org/ - https://phabricator.wikimedia.org/T144326#2620944 (10Milimetric) @Tbayer: curious, I didn't see a piwik user. So I created one, gave it view access to all sites, and stored the credentials in the same file: /home/milimetric/piwik.creden... 
[21:10:08] (03PS1) 10Andrew Bogott: Include labspuppetbackend on all Labs puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/309398 [21:14:43] (03PS2) 10Andrew Bogott: Include labspuppetbackend on all Labs puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/309398 [21:14:45] (03PS1) 10BryanDavis: Make pdebuild build the package cleanly [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/309404 [21:21:31] (03PS1) 10Hashar: WMF: stop triggering ListFloatingIPsTask entirely [debs/nodepool] (patch-queue/debian) - 10https://gerrit.wikimedia.org/r/309406 (https://phabricator.wikimedia.org/T143943) [21:31:13] (03PS1) 10Andrew Bogott: Grants for labspuppet (user for the labspuppetbackend tool) [puppet] - 10https://gerrit.wikimedia.org/r/309414 [21:31:25] (03Abandoned) 10Andrew Bogott: labspuppetbackend: Rudimentary security [puppet] - 10https://gerrit.wikimedia.org/r/309101 (owner: 10Andrew Bogott) [21:35:52] (03PS2) 10Andrew Bogott: Grants for labspuppet (user for the labspuppetbackend tool) [puppet] - 10https://gerrit.wikimedia.org/r/309414 [21:37:15] (03CR) 10BryanDavis: [C: 032] Make pdebuild build the package cleanly [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/309404 (owner: 10BryanDavis) [21:37:42] (03Merged) 10jenkins-bot: Make pdebuild build the package cleanly [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/309404 (owner: 10BryanDavis) [21:38:13] (03PS3) 10BryanDavis: webservice: Warn when using lighttpd-precise [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/307463 (https://phabricator.wikimedia.org/T143282) [21:40:52] (03PS2) 10Hashar: debian/gbp.conf upstream-tag = %(version)s [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/301891 [21:43:18] (03PS1) 10Hashar: Add patch stop triggering ListFloatingIPsTask entirely [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/309435 (https://phabricator.wikimedia.org/T143943) [21:43:30] (03CR) 10BryanDavis: [C: 032] webservice: 
Warn when using lighttpd-precise [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/307463 (https://phabricator.wikimedia.org/T143282) (owner: 10BryanDavis) [21:43:55] (03Merged) 10jenkins-bot: webservice: Warn when using lighttpd-precise [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/307463 (https://phabricator.wikimedia.org/T143282) (owner: 10BryanDavis) [21:47:15] (03PS1) 10Madhuvishy: dynamicproxy: Override nginx worker_connections default [puppet] - 10https://gerrit.wikimedia.org/r/309450 (https://phabricator.wikimedia.org/T143637) [21:47:17] (03CR) 10Hashar: [C: 032] "From debian glue / cowbuilder output:" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/301891 (owner: 10Hashar) [21:58:51] (03PS1) 10Hashar: Pre-Depends: python2.7 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/309459 [22:02:09] (03CR) 10Paladox: [C: 031] Pre-Depends: python2.7 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/309459 (owner: 10Hashar) [22:10:48] (03Abandoned) 10Hashar: Pre-Depends: python2.7 [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/309459 (owner: 10Hashar) [22:16:52] (03CR) 10Hashar: [C: 032] Add patch stop triggering ListFloatingIPsTask entirely [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/309435 (https://phabricator.wikimedia.org/T143943) (owner: 10Hashar) [22:18:58] PROBLEM - MD RAID on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:19:01] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch: Decrease time required to fully restart the Cirrus elasticsearch clusters - https://phabricator.wikimedia.org/T145065#2621168 (10debt) p:05Triage>03Normal [22:20:19] PROBLEM - HP RAID on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds. 
[22:20:27] 06Operations, 06Discovery, 06Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: Upgrade elasticsearch to 2.4.0 - https://phabricator.wikimedia.org/T145058#2621172 (10debt) p:05Triage>03Normal [22:21:20] PROBLEM - very high load average likely xfs on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:21:34] 06Operations, 10ops-ulsfo: atlas-ulsfo missing asset tag info in racktables - https://phabricator.wikimedia.org/T145141#2621174 (10RobH) [22:21:54] (03PS1) 10Hashar: 0.1.1-wmf5: gbp.conf / ListFloatingIPsTask [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/309464 (https://phabricator.wikimedia.org/T143943) [22:22:02] 06Operations, 10Traffic, 07HTTPS: wmflabs.org should enforce HTTPS - https://phabricator.wikimedia.org/T144790#2621191 (10AlexMonk-WMF) [22:22:13] 06Operations, 06Labs, 10Traffic, 07HTTPS: wmflabs.org should enforce HTTPS - https://phabricator.wikimedia.org/T144790#2610285 (10AlexMonk-WMF) [22:23:15] 06Operations, 06Labs, 10Traffic, 07HTTPS: wmflabs.org should enforce HTTPS - https://phabricator.wikimedia.org/T144790#2610285 (10AlexMonk-WMF) There's also T102367, but that's specific to tools [22:23:47] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:24:36] PROBLEM - swift-container-updater on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:24:55] PROBLEM - swift-account-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:24:56] PROBLEM - Check size of conntrack table on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:24:56] PROBLEM - configured eth on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:24:56] PROBLEM - swift-object-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:25:16] PROBLEM - swift-container-server on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:25:35] PROBLEM - DPKG on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:25:35] PROBLEM - swift-account-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:25:47] PROBLEM - swift-object-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:25:57] PROBLEM - swift-account-server on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:25:57] (03CR) 10Hashar: [C: 032] "Build and published at https://people.wikimedia.org/~hashar/debs/nodepool_0.1.1-wmf5/" [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/309464 (https://phabricator.wikimedia.org/T143943) (owner: 10Hashar) [22:26:02] PROBLEM - Disk space on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:02] PROBLEM - swift-object-server on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:05] PROBLEM - swift-container-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:26] PROBLEM - swift-account-reaper on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:27] PROBLEM - swift-container-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:27] PROBLEM - swift-object-updater on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:26:47] PROBLEM - dhclient process on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:07] PROBLEM - salt-minion processes on ms-be2019 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:27:27] nancy!.... [22:27:34] er, minus nancy! 
[22:31:31] 06Operations, 10Continuous-Integration-Infrastructure, 06Labs, 07Nodepool: Upgrade Nodepool to 0.1.1-wmf5 to reduce requests made to OpenStack API - https://phabricator.wikimedia.org/T145142#2621216 (10hashar) [22:43:57] jouncebot: next [22:43:57] In 0 hour(s) and 16 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160908T2300) [22:47:55] o/ [22:48:44] 06Operations, 10ops-eqiad: ripe-atlas should be renamed atlas-eqiad - https://phabricator.wikimedia.org/T145145#2621319 (10RobH) [22:52:17] PROBLEM - SSH on ms-be2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:52:56] !log demon@tin Synchronized php-1.28.0-wmf.18/includes/jobqueue/jobs: [22:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:53:08] Whoops, chopped my message off... [22:53:13] AaronSchulz: done ^ [22:53:36] lol [22:56:47] ...scaping kartotherian [22:58:47] yurik_: why? [22:59:02] ok [22:59:10] greg-g, we ran into some obscure issues with geoshapes service :() [22:59:13] done [22:59:39] try to do it semi-sync with swat - need to deploy a minor patch there too [22:59:45] AndyRussG: do you want we cherrypick both patches or a commit to rebase wmf18 against 61d8f9a6d1f78561601 Account for DB lag when refreshing cached ChoiceData ? [23:00:02] AaronSchulz Krinkle the WAN cache object and GeoIP patches for CentralNotice are expected to go out nowish fyi :) [23:00:04] RoanKattouw, ostriches, MaxSem, awight, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160908T2300). [23:00:04] AndyRussG, Krinkle, and Krenair: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:16] Dereckson: hi! 
I'm just making an update to wmf_deploy [23:00:29] branch, just merged master in [23:00:33] ok [23:00:36] I can SWAT this evening so. [23:01:08] Dereckson: just gimme a couple minutes to verify a Gerrit permissions issue we had with the wmf_deploy branch, now that I have a change up there... [23:01:16] thx in advance [23:02:06] Krenair: https://gerrit.wikimedia.org/r/#/c/309350/ is v+2 but https://gerrit.wikimedia.org/r/#/c/309469/ has a test failure: 22:42:24 cucumber features/popups_settings.feature:33 # Scenario: Popups can be enabled via the "Enable previews" footer link [23:02:14] AndyRussG: no problem [23:02:20] !log scaped kartotherian https://gerrit.wikimedia.org/r/#/c/309473/ [23:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:02:39] ostriches: wtf does phabricator have a confirmation for "mark as read"? [23:02:46] it's such a minor action [23:03:13] AaronSchulz: they like non useful confirmation [23:03:20] Dereckson, I know [23:03:33] AaronSchulz: https://secure.phabricator.com/T11384 [23:04:31] AaronSchulz: I'm not a big fan to ask confirmation instead of offering capability to undo action [23:05:55] Krenair: and that's fine? [23:06:14] Dereckson: done with the thing I had to check, but now I see the patch is not passing CI [23:06:17] https://gerrit.wikimedia.org/r/#/c/309479/ [23:06:26] DBQueryError: A database error has occurred. Did you forget to run maintenance/update.php after upgrading? [23:06:42] I imagine it's a CI issue... [23:06:48] Dereckson, I don't know anything about the selenium tests, I don't know. Krinkle? 
[23:06:58] 22:59:41 Query: UPDATE `unittest_cn_templates` SET tmp_display_anon = '1',tmp_display_account = '0',tmp_archived = '',tmp_category = 'fundraising' WHERE tmp_id = '94' [23:07:02] 22:59:41 Function: Banner::saveBasicData [23:07:07] 22:59:41 Error: 1366 Incorrect integer value: '' for column 'tmp_archived' at row 1 (127.0.0.1:3306) [23:07:34] AndyRussG: that's one of your test [23:07:43] did we switch to strict mode in CI? [23:08:00] OuKB: there was discussion about that [23:08:03] or that is just too loose? :P [23:08:16] greg-g: here? [23:08:20] Dereckson: yeah but these patches didn't move any schemas, and were passing fine on the master branch [23:09:29] OuKB: https://phabricator.wikimedia.org/T108255 [23:09:34] An update to package.json did also go through, but it's just grunt stuff in devDependencies [23:10:28] OuKB: Yes, we switched to strict mode for MySQL in CI this week. (Aaron's patch) [23:10:30] AndyRussG: open a task against your project to update queries and ensure they're more correct [23:10:44] Only for CI right now, not yet in prod. [23:10:46] Dereckson, I've added another patch to SWAT btw [23:10:53] AndyRussG: if you've an integer field for example, "" isn't an acceptable value in strict mode [23:11:07] AndyRussG: but yes, we can deploy [23:11:38] Dereckson, the feature it's testing still works for me [23:11:38] so [23:12:39] PROBLEM - NTP on ms-be2019 is CRITICAL: NTP CRITICAL: No response from NTP server [23:12:41] AndyRussG: The strict mode failure is due to the custom branching you have. The strict mode landed in master, not in wmf/* (that'll happen next week). But since CentralNotice has custom wmf_branches it tests with mediawiki-core master. [23:13:01] Otherwise it'd have passed. [23:13:30] Krinkle: if you're pretty sure that's the issue then I guess we're good :) [23:13:45] sounds reasonable [23:14:23] Krinkle: could you check https://integration.wikimedia.org/ci/job/mwext-mw-selenium/9879/console? 
[23:14:55] Dereckson: I don't know the Popups extension or its selenium tests well enough to judge that [23:15:07] At first, the error seems genuine. [23:15:45] hoo: yeah [23:16:15] greg-g: I will need to either revert Wikidata or push a partial revert [23:16:27] yuck, task? [23:16:28] there's no one from my time around right now [23:16:35] https://phabricator.wikimedia.org/T145138 [23:16:37] Krinkle: Dereckson: So to confirm, https://gerrit.wikimedia.org/r/#/c/309479/ CI error is due to CI config change and we can safely override CI, merge and deploy, correct? [23:16:37] AndyRussG: that means you need to fix as an high priority these SQL queries, http://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sql-mode-strict [23:16:52] AndyRussG: yes, right [23:17:00] OK cool [23:17:02] but not in a long term, not middle term [23:17:21] yep [23:17:38] hoo: ok, do the needful [23:18:18] I'll do the quick'n'dirty fix on the branch [23:19:30] K just waiting for it to merge then [23:19:51] Dereckson: there are no new i18n keys in this one, btw [23:20:28] Krinkle: 309388 RollbackAction: Allow 'from' to be an empty string live on mw1099 [23:20:34] AndyRussG: ack'ed [23:21:15] Dereckson: OK. Verifying now on test2 with mw10199 [23:21:28] Dereckson: Confirmed. All good. [23:21:57] Hmm not even sure if Jenkins is gonna be willing to merge this one... sez, "Gate pipeline build failed." [23:22:06] (03CR) 10Andrew Bogott: [C: 032] Include labspuppetbackend on all Labs puppetmasters. [puppet] - 10https://gerrit.wikimedia.org/r/309398 (owner: 10Andrew Bogott) [23:22:24] Krinkle: okay, syncing [23:22:35] Dereckson: ^ [23:23:37] 06Operations, 06Labs, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2621490 (10yuvipanda) Only things left are etcd and integration. 
[23:23:40] AndyRussG: Dereckson: You'll have to [x] Jenkins's V-1 first before you add your V+2 [23:23:52] and then after publishing those votes, refresh and click Submit [23:23:56] AndyRussG: no, gate-and-submit pipeline won't merge patches with test failures, but we can still merge them manually, as indicated by Krinkle [23:23:56] (which won't appear until then) [23:23:57] !log dereckson@tin Synchronized php-1.28.0-wmf.18/includes/actions/RollbackAction.php: RollbackAction: Allow 'from' to be an empty string (T141985, 1/2) (duration: 00m 46s) [23:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:09] Ah I see [23:25:12] !log dereckson@tin Synchronized php-1.28.0-wmf.18/resources/src/mediawiki/page/rollback.js: RollbackAction: Allow 'from' to be an empty string (T141985, 2/2) (duration: 00m 46s) [23:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:25:24] Krinkle: here you are ^ [23:25:31] Krinkle: Dereckson: K looks OK now https://gerrit.wikimedia.org/r/#/c/309479/ [23:25:46] (03PS3) 10Andrew Bogott: Grants for labspuppet (user for the labspuppetbackend tool) [puppet] - 10https://gerrit.wikimedia.org/r/309414 [23:25:48] (03PS1) 10Andrew Bogott: Quote the labspuppetbackend hiera settings. [puppet] - 10https://gerrit.wikimedia.org/r/309484 [23:27:36] Krenair: if we merge it, do you have an idea to also test the feature touched by the failed test? [23:28:42] (03CR) 10Andrew Bogott: [C: 032] Grants for labspuppet (user for the labspuppetbackend tool) [puppet] - 10https://gerrit.wikimedia.org/r/309414 (owner: 10Andrew Bogott) [23:28:53] (03CR) 10Andrew Bogott: [C: 032] Quote the labspuppetbackend hiera settings. 
[puppet] - 10https://gerrit.wikimedia.org/r/309484 (owner: 10Andrew Bogott) [23:29:29] Dereckson, I intend to test it in prod if that's what you mean, yes [23:29:36] it is a betafeature on enwiki [23:30:02] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:30:43] (03PS1) 10GWicke: Add security header filters [puppet] - 10https://gerrit.wikimedia.org/r/309486 [23:31:42] Is SWAT done? [23:31:53] hoo: no, we're at 33% [23:31:58] Dereckson: ah, ok [23:32:01] thanks for the heads up [23:32:15] hoo: I'll ping you after the SWAT if you wish [23:32:30] Dereckson: That would be great [23:32:54] 06Operations, 06Labs, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2621517 (10yuvipanda) Hi @joe. Do you still intend on using the etcd project as is? I tried looking through it, puppet has... [23:34:58] (03CR) 10Ppchelko: [C: 031] "+1 needs to be merged/deployed after https://github.com/wikimedia/restbase/pull/665" [puppet] - 10https://gerrit.wikimedia.org/r/309486 (owner: 10GWicke) [23:35:20] hoo: do you have any UBN emergency? [23:35:44] Dereckson: not sure how many pages are affected, but things went south for us [23:38:26] yurik_: Krenair: OuKB: live on mw1099 [23:38:33] AndyRussG: ^ [23:38:44] Dereckson: K checking [23:38:52] checking [23:39:14] Dereckson, works [23:39:25] OuKB, i have an example on the help page [23:39:48] yurik_: ack'ed [23:40:07] Dereckson, works [23:40:56] yurik_: OuKB: syncing [23:41:35] (03PS1) 10Yuvipanda: labs: move hiera settings around for the hiera god [puppet] - 10https://gerrit.wikimedia.org/r/309489 [23:41:38] !log dereckson@tin Synchronized php-1.28.0-wmf.18/extensions/Kartographer/modules/box/Map.js: Switch to geojson for geoshapes srv (T144777) (duration: 00m 48s) [23:41:39] andrewbogott: let's try ^? 
[23:41:39] T144777: Geojson object on -180/180 longitude draws incorrect view - https://phabricator.wikimedia.org/T144777 [23:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:42:06] andrewbogott: mostly as a 'I think this will work and confirm my fears that I do not fundamentally understand how hiera works' [23:42:42] but... [23:42:52] those settings are already set in hiera for labtestcontrol2001 [23:42:54] which will win? [23:42:56] Dereckson: K the front-end stuff is working fine, just lemme try something that'll jiggle the DB [23:43:08] andrewbogott: ah, I see. ok, let me rejig it a bit [23:43:58] (03PS2) 10Yuvipanda: labs: move hiera settings around for the hiera god [puppet] - 10https://gerrit.wikimedia.org/r/309489 [23:43:59] andrewbogott: how about ^? [23:44:55] (03CR) 10Andrew Bogott: [C: 032] "I can almost understand why this would work" [puppet] - 10https://gerrit.wikimedia.org/r/309489 (owner: 10Yuvipanda) [23:45:18] Krenair: syncing [23:45:46] I've opened T145152 for the test failure. [23:45:46] T145152: Scenario failure in Selenium tests: Popups can be enabled via the "Enable previews" footer link - https://phabricator.wikimedia.org/T145152 [23:47:09] !log dereckson@tin Synchronized php-1.28.0-wmf.18/extensions/Popups/extension.json: ext.popups.core depends on mediawiki.storage ([[Gerrit:309469]]) (duration: 00m 46s) [23:48:38] yuvipanda: no change [23:48:51] andrewbogott: ok! so now I have no idea wtf is going on :) [23:49:01] right on [23:49:03] Dereckson: looks fine! 
[23:49:25] AndyRussG: okay, syncing [23:50:11] Dereckson: \o/ [23:52:59] !log dereckson@tin Synchronized php-1.28.0-wmf.18/extensions/CentralNotice: Bump production version to 4dbd3f9 (duration: 00m 51s) [23:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:54:53] hoo: okay, I'm done [23:54:58] Dereckson: Thanks [23:55:44] Krinkle: AaronSchulz when was the dealine for https://phabricator.wikimedia.org/T108255 MariaDB strict mode on prod? [23:55:50] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:55:58] (or should I ask someone else?) [23:56:01] Dereckson: thanks much! [23:56:23] AndyRussG: it's in master as of a few days ago for unit tests. [23:56:48] So future commits to master will fail accordingly if not addressed [23:57:11] We fixed most violations we could find in extensions before merging the patch in mater [23:57:12] maste [23:59:01] Krinkle: right... mmm ETA for when would the config change be enabled on prod?