[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151211T0000). Please do the needful. [00:00:04] ebernhardson jdlrobson yurik jgirault Krinkle: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:16] * yurik yep [00:00:23] * yurik is ready [00:00:37] ready too [00:00:52] \o [00:01:13] whoever is doing the SWAT - should i +2 the backported patch? [00:01:38] o/ [00:02:20] \o [00:02:20] I'm doing it [00:02:22] Looking [00:02:45] yurik: No worries, I've +2ed it for you [00:02:52] thx ) [00:04:42] (03CR) 10Catrope: [C: 032] Bump portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258246 (owner: 10EBernhardson) [00:05:00] First doing jgirault's portals change while Jenkins works on the other patches [00:05:30] ok [00:05:31] (03Merged) 10jenkins-bot: Bump portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258246 (owner: 10EBernhardson) [00:05:44] (03CR) 10CSteipp: "And because 'staff' is a global group, that policy should have been added to $wgCentralAuthGlobalPasswordPolicies." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/222057 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [00:05:59] (03PS1) 10CSteipp: Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258385 (https://phabricator.wikimedia.org/T104370) [00:06:28] (03CR) 10CSteipp: "Done in Ie906bb646f8b8675e994432996b569f05ceff0be" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222057 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [00:06:34] (03CR) 10Catrope: [C: 032] Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258385 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [00:06:38] And also csteipp's patch [00:06:43] !log catrope@tin Synchronized portals/: SWAT (duration: 00m 30s) [00:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:06:59] jgirault: Yours is done, please verify [00:07:04] (03Merged) 10jenkins-bot: Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258385 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [00:08:21] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Staff password policy, take 2 (duration: 00m 28s) [00:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:10:28] RoanKattouw: all good [00:11:22] speaking of centralauth problems...https://meta.wikimedia.org/wiki/Special:CentralAuth/Lesabendio --> Exception [00:11:27] csteipp: Yours is done too, hopefully it works this time [00:11:29] :) [00:11:39] thedj: I think legoktm investigated that earlier [00:11:52] Something about a lock timeout I think [00:12:16] it was reported at meta already, so can be that someone already is on it. 
[00:18:04] RECOVERY - puppet last run on mw2196 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:19:08] !log catrope@tin Synchronized php-1.27.0-wmf.8/extensions/Graph: SWAT (duration: 00m 30s) [00:19:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:19:23] (03PS7) 10Chad: Fix redirections in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/257193 (owner: 10Paladox) [00:19:37] !log catrope@tin Synchronized php-1.27.0-wmf.8/extensions/ImageMap: SWAT (duration: 00m 29s) [00:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:20:07] !log catrope@tin Synchronized php-1.27.0-wmf.8/extensions/RelatedArticles: SWAT (duration: 00m 29s) [00:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:20:19] (03CR) 10Chad: [C: 031] Fix viewing raw in phabricator on a pc [puppet] - 10https://gerrit.wikimedia.org/r/257629 (owner: 10Paladox) [00:20:38] !log catrope@tin Synchronized php-1.27.0-wmf.8/extensions/CirrusSearch: SWAT (duration: 00m 30s) [00:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:22:06] yurik: Krinkle: jdlrobson: ebernhardson: Your wmf8 changes are done ---^^ [00:22:10] Config changes next [00:22:45] kk [00:22:45] RoanKattouw, did you scap? [00:22:56] yurik: No, should I have? [00:23:00] new i18n keys [00:23:03] I don't scap unless someone explicitly tells me to [00:23:04] RECOVERY - puppet last run on db2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:23:06] not a big deal though [00:23:24] new tracking category [00:23:24] (03CR) 10Catrope: [C: 032] A/B test for search lang detect via accept-language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258084 (https://phabricator.wikimedia.org/T119528) (owner: 10EBernhardson) [00:23:26] The auto-scap is in an hour and a half. [00:23:32] Right [00:23:41] RoanKattouw: mm... 
not seeing the desired impact [00:23:45] ebernhardson: I'm not gonna touch https://gerrit.wikimedia.org/r/#/c/255135/5 since it creates a new submodule, I'd prefer someone from releng do that [00:23:53] jdlrobson: FYI I haven't done https://gerrit.wikimedia.org/r/#/c/257435/ yet [00:23:55] Maybe that's why? [00:24:00] ahh okay that's why then yeh :) [00:24:04] (03Merged) 10jenkins-bot: A/B test for search lang detect via accept-language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258084 (https://phabricator.wikimedia.org/T119528) (owner: 10EBernhardson) [00:24:18] (03CR) 10Catrope: [C: 032] Enable RelatedArticles on all wikipedias in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257435 (https://phabricator.wikimedia.org/T116676) (owner: 10Jdlrobson) [00:24:28] RoanKattouw, actually i think you do need to scap - https://en.wikipedia.org/wiki/Special:TrackingCategories [00:24:33] "category is disablbed" [00:24:46] RoanKattouw: shouldn't be any different than a normal deploy...but ok [00:25:02] (03Merged) 10jenkins-bot: Enable RelatedArticles on all wikipedias in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257435 (https://phabricator.wikimedia.org/T116676) (owner: 10Jdlrobson) [00:27:01] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Search language detection A/B test and RelatedArticles on all Wikipedias in beta (duration: 00m 29s) [00:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:27:35] And all done [00:28:01] (03PS1) 10CSteipp: Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258387 (https://phabricator.wikimedia.org/T104370) [00:28:53] 6operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-Requests, 5codfw-appserver-setup, 5wikis-in-codfw: Configure mediawiki to operate in the Dallas DC - https://phabricator.wikimedia.org/T91754#1871776 (10demon) [[ https://gerrit.wikimedia.org/r/#/c/197499/ | Gerrit 197499 ]] was 
never merged... [00:29:09] (03CR) 10CSteipp: "And the CentralAuth.php default overrides this, since I set it too early. Fixed in Ief95dd1e40c0fd5b9631bd854a17f30a17f0684b." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258385 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [00:30:50] (03CR) 10Paladox: [C: 031] "Looks good. Haven't tested but please merge this." [puppet] - 10https://gerrit.wikimedia.org/r/257193 (owner: 10Paladox) [00:31:02] (03CR) 10Paladox: "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/257193 (owner: 10Paladox) [00:31:16] RoanKattouw_away, so are you scaping? [00:32:59] RoanKattouw_away: mm something weird is going on. Trying to understand what [00:35:18] jdlrobson: he left to a meeting upstairs, whats up? [00:35:30] ebernhardson: well the feature is showing in beta features [00:35:32] but not working [00:35:43] yurik: i don't see a scap in the process list for tin [00:35:47] so nothing super problematic [00:35:51] but looks bad [00:35:57] jdlrobson: you want to revert it? [00:36:03] ebernhardson, i think RoanKattouw_away didn't see my comment [00:36:07] trying to understand what's going on first [00:36:20] yurik: unfortunatly i have to leave in ~15 minutes so i can't scap [00:36:20] oh wait it's working [00:36:21] phew [00:36:23] jdlrobson: :) [00:36:39] ebernhardson, do you know if anyone else is doing scap today? [00:37:01] on mobile at least.. [00:37:09] yurik: you could try ostriches, but i havn't seen him do one in awhile :) [00:37:25] Krenair is also listed as a swatter ) [00:37:34] hi [00:37:35] 6operations, 6Analytics-Backlog: Create email alias that will send emails to all Analytics Engineers - https://phabricator.wikimedia.org/T121180#1871791 (10madhuvishy) 3NEW a:3yuvipanda [00:37:45] hey Krenair, seems like RoanKattouw_away ran away ) [00:37:53] without scaping [00:38:10] Krenair, could you do that? [00:38:16] putting something that needs a scap into swat is jsut mean [00:38:29] where are we in the list? 
[00:38:40] Krenair: roan finished and ran away to wine tasting [00:38:47] for varying definitions of finished ;) [00:38:50] * yurik gives himself a Mean Medal (it has lots of spikes) [00:38:57] It's still not clear what's missing. [00:39:06] Krenair: yuri's patch that he deployed requires scap [00:39:08] Krenair, https://en.wikipedia.org/wiki/Special:TrackingCategories [00:39:15] ebernhardson: Roan drinks wine? [00:39:22] Reedy: no, but he socializes [00:39:28] Ugh, because i18n [00:39:33] Is there any reason why `class_exists( 'BetaFeatures' ) && BetaFeatures::isFeatureEnabled( $out->getUser(), 'read-more' )` would not be true if the beta feature was enabled? [00:39:38] yurik, I don't think this patch is eligible for swat. [00:39:39] Krenair, the tracking category is disable :( [00:40:05] both those things should be true for me but the module that should be loaded is not being loaded [00:41:20] Krenair, it has already been deployed. We could just wait for i18n to auto-sync everything tomorrow [00:41:32] whenever the i18n is autoupdated [00:41:38] greg-g, ^ [00:42:30] l10nupdate will run in ~3 hours [00:42:40] totally fine with that [00:43:10] That doesn't run scap though, does it bd808? [00:43:19] oh, but it will update l10n [00:43:22] right [00:43:34] which I think is all yurik needs at this point [00:43:59] i think so [00:44:16] actually it will kick off at 02:00 so just over an hour [00:44:57] Is there any reason why `class_exists( 'BetaFeatures' ) && BetaFeatures::isFeatureEnabled( $out->getUser(), 'read-more' )` would not be true if the beta feature was enabled? [00:45:04] i don't understand why this works on beta labs but not production [00:45:12] where does the code you pasted live? 
[00:45:23] thx bd808 [00:45:51] 6operations, 6Analytics-Backlog: Create email alias that will send emails to all Analytics Engineers - https://phabricator.wikimedia.org/T121180#1871815 (10yuvipanda) 5Open>3Resolved Done [00:46:13] jdlrobson: ^ [00:46:13] In an extension called RelatedArticles [00:46:16] lemme get link ori [00:46:38] https://github.com/wikimedia/mediawiki-extensions-RelatedArticles/blob/master/includes/FooterHooks.php#L72 [00:47:00] it's working fine on mobile beta (handled by first part of if statement but module isn't loading in desktop for some reason) [00:47:14] it works for me [00:47:18] oh [00:47:21] i'm on mobile beta [00:47:37] yeh not seeing it in desktop despite beta feature enabled [00:47:39] let me try desktop with the beta feature enabled [00:47:41] which is puzzling [00:47:51] i don't see the beta feature [00:48:00] It's called "Read more" [00:49:12] since it's showing up in beta features I can deduce that $showReadMore is true [00:49:30] jdlrobson: because get_class( $skin ) === 'SkinMinervaBeta' [00:49:37] oh wait, that's the || [00:49:39] yeh.. [00:49:50] so either class_exists( 'BetaFeatures' ) doesn't exist for some reason [00:50:00] or beta features are not saving correctly [00:51:19] argg wtf is BetaFeaturesWhitelist [00:51:43] so apparently you can make something show up in beta features but isFeatureEnabled will fail if it's not listed in BetaFeaturesWhitelist ? [00:51:54] and it says "// DO NOT add entries here without OK from Greg Grossmeier or James Forrester." [00:52:46] So.. I don't know what to do here. 
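The behaviour jdlrobson describes above can be sketched as follows. This is a minimal Python illustration, not the BetaFeatures extension's actual code: `registered_features`, `whitelist`, `user_prefs` and both function names are hypothetical. It assumes only what the log states, namely that a feature can be listed in the preferences while `isFeatureEnabled` still fails if the feature is missing from `BetaFeaturesWhitelist`.

```python
# Hypothetical model of the gating described in the log; this is NOT the
# real BetaFeatures implementation, and all names are illustrative.

def is_listed_in_preferences(feature, registered_features):
    """A feature appears in the preferences UI once it is registered."""
    return feature in registered_features


def is_feature_enabled(feature, user_prefs, whitelist):
    """The runtime check also requires a whitelist entry (assumption from the log)."""
    return bool(user_prefs.get(feature)) and feature in whitelist


registered_features = {"read-more", "visualeditor"}
whitelist = {"visualeditor"}        # "read-more" is not whitelisted yet
user_prefs = {"read-more": True}    # the user has ticked the checkbox

listed = is_listed_in_preferences("read-more", registered_features)
enabled_before = is_feature_enabled("read-more", user_prefs, whitelist)

whitelist.add("read-more")          # adding the missing whitelist entry
enabled_after = is_feature_enabled("read-more", user_prefs, whitelist)

print(listed, enabled_before, enabled_after)
```

Under these assumptions the feature is listed and selectable the whole time, but the enabled check only passes once the whitelist entry exists, which matches the "shows up but doesn't work" symptom.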
[00:53:19] jdlrobson: add it [00:54:28] (03PS1) 10Madhuvishy: Change list of analytics engineers emails to new alias [puppet] - 10https://gerrit.wikimedia.org/r/258390 [00:54:43] (03PS1) 10Jdlrobson: Whitelist read more which is showing up in beta features but not working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258391 [00:54:57] that list also has dates that say when they should be automatically undeployed and everything except VE is past expiration [00:55:01] bd808: well that's one way ... ^ [00:55:15] oh it's not the day it was added? [00:55:42] (03PS2) 10Jdlrobson: Whitelist read more which is showing up in beta features but not working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258391 [00:55:42] looks like you are supposed to add 6 months [00:55:46] bd808: done [00:56:02] * jdlrobson wishes he got wine... [00:56:05] I'll sync it for you [00:56:15] only reason i came into office :-( [00:56:25] thanks bd808 [00:56:25] (03CR) 10BryanDavis: [C: 032] Whitelist read more which is showing up in beta features but not working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258391 (owner: 10Jdlrobson) [00:56:47] (03Merged) 10jenkins-bot: Whitelist read more which is showing up in beta features but not working [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258391 (owner: 10Jdlrobson) [00:56:48] i will send annoyed email about this whitelist which seems to have been clumsily put together [00:57:04] let's just clear out the expired entries [00:57:43] actually, send an e-mail [00:57:48] ori: yeh lemme [00:57:55] * ori lets you [00:58:00] ori: 'cause it's bad... the beta feature will show up in the preferences... 
but won't work [00:58:12] !log bd808@tin Synchronized wmf-config/InitialiseSettings.php: Add read-more to beta features whitelist (duration: 00m 30s) [00:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:58:31] jdlrobson: test please [00:58:35] bd808: that did it [00:58:42] what a frigging stupid bit of code/config [00:58:43] victory! [00:59:06] boom. [00:59:08] thanks bd808 and ori [00:59:13] beta features got a lot of drama back in the day [00:59:18] i had nothing to do with it, thank bd808 [00:59:28] ori thanks for looking [00:59:29] you rubber ducked for him [00:59:35] * jdlrobson will now go beat roan with a wine bottle :P [00:59:47] !log rebooting promethium [00:59:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:00:48] greg-g: FYI I made a executive decision to allow read-more to be enabled in the beta features whitelist. Send the goon squad to get me if that was the wrong thing to do. [01:01:00] we have a goon squad? [01:01:06] and i'm not in it? [01:01:10] * ori pouts [01:01:18] your too puny [01:01:39] Max and Joe do all the heavy beatings [01:13:20] !log ori@tin Synchronized php-1.27.0-wmf.8/includes/jobqueue: Ia44ec5ed4: Add per-partition JobQueueRedis aggregation + Fix bad regex in 6fe2f48df (duration: 00m 31s) [01:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:14:32] AaronSchulz: ^ [01:16:52] I see [01:17:53] ori: so https://logstash.wikimedia.org/#/dashboard/elasticsearch/redis show "used eval()", which confirms the switch. Looks quite. [01:18:36] (03CR) 10Brian Wolff: [C: 04-1] Set initial Staff password policy (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258387 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [01:21:45] ori: https://gerrit.wikimedia.org/r/#/c/258399/1 just for you [01:21:50] (03PS1) 10Andrew Bogott: Fix installation of sudo-ldap on labs instances. 
[puppet] - 10https://gerrit.wikimedia.org/r/258400 (https://phabricator.wikimedia.org/T120262) [01:30:44] (03PS2) 10Andrew Bogott: Fix installation of sudo-ldap on labs instances. [puppet] - 10https://gerrit.wikimedia.org/r/258400 (https://phabricator.wikimedia.org/T120262) [01:40:42] (03PS2) 10BBlack: new unified RSA and ECC certs [puppet] - 10https://gerrit.wikimedia.org/r/258362 (https://phabricator.wikimedia.org/T116618) (owner: 10RobH) [01:42:41] !log starting unified cert upgrade process [01:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:44:31] (03CR) 10BBlack: [C: 032] new unified RSA and ECC certs [puppet] - 10https://gerrit.wikimedia.org/r/258362 (https://phabricator.wikimedia.org/T116618) (owner: 10RobH) [01:45:54] PROBLEM - puppet last run on es2003 is CRITICAL: CRITICAL: puppet fail [01:52:30] jdlrobson: Meh. Yeah, ideally this would have gone through the process. [01:52:50] jdlrobson: And yeah, I need to just de-productise the dead ones. All of them have special exemptions, but it's not cool. [01:54:02] (03CR) 10Tim Landscheidt: Fix installation of sudo-ldap on labs instances. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/258400 (https://phabricator.wikimedia.org/T120262) (owner: 10Andrew Bogott) [01:55:57] (03CR) 10Andrew Bogott: Fix installation of sudo-ldap on labs instances. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/258400 (https://phabricator.wikimedia.org/T120262) (owner: 10Andrew Bogott) [02:00:03] PROBLEM - HHVM rendering on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:00:14] PROBLEM - Apache HTTP on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:01:04] PROBLEM - puppet last run on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:01:24] PROBLEM - HHVM processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[02:01:34] PROBLEM - nutcracker port on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:01:34] PROBLEM - SSH on mw1135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:01:54] PROBLEM - RAID on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:02:04] PROBLEM - configured eth on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:02:14] PROBLEM - salt-minion processes on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:02:24] PROBLEM - dhclient process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:02:24] PROBLEM - nutcracker process on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:02:34] PROBLEM - Check size of conntrack table on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:02:44] bblack: someone just reported that they got a no certificate error on wikipedia.org, is that expected? [02:02:54] PROBLEM - DPKG on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:03:04] PROBLEM - Disk space on mw1135 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:03:55] "The OCSP response does not include a status for the certificate being verified." [02:04:02] same here :/ [02:04:06] and I'm getting that too now [02:04:15] * Romaine gets the same no certificate errors [02:04:22] on Wikipedia, Wikidata, etc [02:04:31] ah that's going to be Firefox-only, it can be fixed pretty quickly [02:04:43] it's a process bug with how we do OCSP stapling and timing with the cert upgrade [02:05:21] I have to wait for the other salt command to complete first, please hold [02:07:59] ok, thanks. I updated the topic in -tech [02:08:28] (03PS3) 10Andrew Bogott: Fix installation of sudo-ldap on labs instances. [puppet] - 10https://gerrit.wikimedia.org/r/258400 (https://phabricator.wikimedia.org/T120262) [02:08:56] (03CR) 10Andrew Bogott: "PS3 works but it's messy as it runs the exec every time." 
[puppet] - 10https://gerrit.wikimedia.org/r/258400 (https://phabricator.wikimedia.org/T120262) (owner: 10Andrew Bogott) [02:09:02] and everything works here again [02:09:24] thanks [02:09:25] bblack: just saw it on phab, yes, firefox [02:09:30] works on reload [02:09:37] not for me yet :/ [02:10:04] oh, there we go [02:10:09] (03PS1) 10Jforrester: In VisualEditor on single edit tab wikis, set the default editor appropriately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258403 [02:10:11] (03PS1) 10Jforrester: VisualEditor: Provide framework for enabling an A/B test for IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258404 [02:10:13] (03PS1) 10Jforrester: VisualEditor: Don't set ShowBetaWelcome, now set in repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258405 [02:10:16] it's going server by server, depends on your client IP which you're mapped to, still running [02:10:57] Fixed in here now [02:11:12] (Just when I was going to write that no) :) [02:11:57] (03CR) 10jenkins-bot: [V: 04-1] VisualEditor: Provide framework for enabling an A/B test for IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258404 (owner: 10Jforrester) [02:12:25] (03CR) 10jenkins-bot: [V: 04-1] VisualEditor: Don't set ShowBetaWelcome, now set in repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258405 (owner: 10Jforrester) [02:13:35] RECOVERY - puppet last run on es2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [02:13:52] fixed for all now [02:15:04] !log unified cert upgrades complete [02:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:15:24] bblack: I haven't been following along. Is the new cert expected to have any kind of performance impact? 
(Apart from the temporary effects of the rollout) [02:15:26] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1871948 (10BBlack) [02:15:28] 6operations, 10Traffic, 7HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1871949 (10BBlack) [02:15:48] ori: no, it's exactly the same in perf terms [02:16:05] nod [02:16:14] other than the minor diff of adding a few bytes to add the "w.wiki" domainname to the SAN list (but that's not enough to bump an extra packet or anything) [02:16:52] right [02:17:11] 6operations, 10Traffic, 7HTTPS: acquire SSL certificate for w.wiki - https://phabricator.wikimedia.org/T91612#1871955 (10BBlack) 5Open>3Resolved a:3BBlack The cert is deployed now [02:18:43] ooh ohh [02:19:43] RECOVERY - configured eth on mw1135 is OK: OK - interfaces up [02:19:44] RECOVERY - salt-minion processes on mw1135 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:19:55] RECOVERY - dhclient process on mw1135 is OK: PROCS OK: 0 processes with command name dhclient [02:20:04] RECOVERY - nutcracker process on mw1135 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [02:20:04] RECOVERY - Check size of conntrack table on mw1135 is OK: OK: nf_conntrack is 0 % full [02:20:33] RECOVERY - DPKG on mw1135 is OK: All packages OK [02:20:34] RECOVERY - Disk space on mw1135 is OK: DISK OK [02:20:43] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 53 minutes ago with 0 failures [02:20:55] RECOVERY - HHVM processes on mw1135 is OK: PROCS OK: 6 processes with command name hhvm [02:21:04] RECOVERY - nutcracker port on mw1135 is OK: TCP OK - 0.000 second response time on port 11212 [02:21:05] RECOVERY - SSH on mw1135 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [02:21:25] RECOVERY - RAID on mw1135 is OK: OK: no RAID 
installed [02:21:43] RECOVERY - HHVM rendering on mw1135 is OK: HTTP OK: HTTP/1.1 200 OK - 70212 bytes in 1.179 second response time [02:21:45] RECOVERY - Apache HTTP on mw1135 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.074 second response time [02:22:06] (03PS1) 10BBlack: add w.wiki to VCL TLS redirect/HSTS [puppet] - 10https://gerrit.wikimedia.org/r/258406 (https://phabricator.wikimedia.org/T108649) [02:23:36] (03PS1) 10BBlack: w.wiki: unlink from wp.o, remove all hostnames within (unused) [dns] - 10https://gerrit.wikimedia.org/r/258407 (https://phabricator.wikimedia.org/T108649) [02:32:28] bd808: s'allgood (cc jdlrobson ) re read-more [02:32:33] (03PS5) 10Andrew Bogott: Fix installation of sudo-ldap on labs instances. [puppet] - 10https://gerrit.wikimedia.org/r/258400 (https://phabricator.wikimedia.org/T120262) [02:33:27] (03CR) 10BBlack: [C: 032] w.wiki: unlink from wp.o, remove all hostnames within (unused) [dns] - 10https://gerrit.wikimedia.org/r/258407 (https://phabricator.wikimedia.org/T108649) (owner: 10BBlack) [02:33:42] (03CR) 10Andrew Bogott: "It took four engineers to come up with that onlyif statement" [puppet] - 10https://gerrit.wikimedia.org/r/258400 (https://phabricator.wikimedia.org/T120262) (owner: 10Andrew Bogott) [02:34:13] !log restarted nodetool decommission on restbase1007, as I had to restart one of the stream targets [02:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:35:28] (03CR) 10BBlack: [C: 032] add w.wiki to VCL TLS redirect/HSTS [puppet] - 10https://gerrit.wikimedia.org/r/258406 (https://phabricator.wikimedia.org/T108649) (owner: 10BBlack) [02:36:46] bblack: yaay thank you :D [02:37:01] legoktm: :) [02:39:27] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.8) (duration: 17m 01s) [02:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:40:20] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS, 5Patch-For-Review: 
Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1872008 (10Legoktm) 5stalled>3Open No longer stalled! [02:40:47] legoktm: is the shortener going live? [02:41:00] well, not now :P [02:41:08] Probably in January after the deploy freeze is over [02:41:13] neat [02:41:21] We still have some blockers to finish up [02:42:45] 6operations, 10Traffic, 10Wikimedia-General-or-Unknown, 7HTTPS, 5Patch-For-Review: Set up "w.wiki" domain for usage with UrlShortener - https://phabricator.wikimedia.org/T108649#1872022 (10BBlack) 5Open>3Resolved a:3BBlack The domain/Traffic part is resolved here. The rest is up to apache/MW confi... [02:43:29] Gerrit does not like that unlink commit [02:43:57] It shows "D templates/w.wiki" and "A templates/w.wiki" [02:44:05] (03PS1) 10Jforrester: BetaFeatures: Update language and dates of 'retirement' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258409 [02:44:06] but the diffs go to the same place [02:44:08] which shows the A [02:44:18] yeah gerrit and softlinks don't get along well, but the underlying git commit is sane [02:46:13] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Dec 11 02:46:13 UTC 2015 (duration 6m 46s) [02:46:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:51:32] Reedy: ^ [03:21:11] !log upgraded codfw nodes to 2.1.12 [03:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:24:46] 6operations, 10SEO: secure.wikimedia.org entries still showing up in Google search results - https://phabricator.wikimedia.org/T93531#1872091 (10Krinkle) These google searches don't mean anything. See also https://phabricator.wikimedia.org/T87561#1673543, https://phabricator.wikimedia.org/T67402#1571061, and... 
[03:25:02] ^ we could just remove it :) [03:28:58] 6operations, 5Patch-For-Review: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790#1872113 (10Krinkle) Traffic is irrelevant imho. This was a user-facing service for many years. See also . It's not like bits.wikimedia.org or other int... [04:13:15] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 25.93% of data above the critical threshold [100000000.0] [04:52:33] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [05:15:33] PROBLEM - HHVM rendering on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:33] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:17:04] PROBLEM - RAID on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:17:04] PROBLEM - nutcracker port on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:17:14] PROBLEM - configured eth on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:17:24] PROBLEM - Check size of conntrack table on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:17:33] PROBLEM - nutcracker process on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:17:44] PROBLEM - DPKG on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:17:45] PROBLEM - Disk space on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:18:04] PROBLEM - salt-minion processes on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:18:05] PROBLEM - puppet last run on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:18:15] PROBLEM - HHVM processes on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:18:24] PROBLEM - dhclient process on mw1122 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:18:25] (03PS2) 10Yuvipanda: Change list of analytics engineers emails to new alias [puppet] - 10https://gerrit.wikimedia.org/r/258390 (owner: 10Madhuvishy) [05:18:33] (03CR) 10Yuvipanda: [C: 032 V: 032] Change list of analytics engineers emails to new alias [puppet] - 10https://gerrit.wikimedia.org/r/258390 (owner: 10Madhuvishy) [05:18:34] PROBLEM - SSH on mw1122 is CRITICAL: Server answer [05:25:35] RECOVERY - Disk space on mw1122 is OK: DISK OK [05:25:43] RECOVERY - salt-minion processes on mw1122 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:25:55] RECOVERY - HHVM processes on mw1122 is OK: PROCS OK: 6 processes with command name hhvm [05:26:04] RECOVERY - dhclient process on mw1122 is OK: PROCS OK: 0 processes with command name dhclient [05:26:15] RECOVERY - SSH on mw1122 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [05:26:35] RECOVERY - RAID on mw1122 is OK: OK: no RAID installed [05:26:36] RECOVERY - nutcracker port on mw1122 is OK: TCP OK - 0.000 second response time on port 11212 [05:26:45] RECOVERY - configured eth on mw1122 is OK: OK - interfaces up [05:26:55] RECOVERY - Check size of conntrack table on mw1122 is OK: OK: nf_conntrack is 0 % full [05:27:05] RECOVERY - nutcracker process on mw1122 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:27:23] RECOVERY - DPKG on mw1122 is OK: All packages OK [05:27:44] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:30:33] PROBLEM - Disk space on elastic1012 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [06:31:04] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:04] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=95%) [06:31:05] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:05] PROBLEM - puppet last run on 
db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:13] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:14] PROBLEM - puppet last run on mw2136 is CRITICAL: CRITICAL: puppet fail [06:31:24] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:25] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 2 failures [06:33:14] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:14] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:03] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:23] PROBLEM - puppet last run on mw2052 is CRITICAL: CRITICAL: Puppet has 1 failures [06:45:24] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 89 failures [06:46:54] !log deploy restbase c67a41e9d52: add an expensive title to re-render blacklist, per subbu's request [06:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:56:35] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:35] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:56:36] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:43] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [06:56:44] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:56:55] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:03] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:25] RECOVERY - puppet 
last run on mw2050 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:58:44] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:44] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:44] RECOVERY - puppet last run on mw2052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:10:57] !log deleting some indexing slow logs on elastic1012 and elastic1016 [07:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:11:45] RECOVERY - Disk space on elastic1012 is OK: DISK OK [07:12:24] RECOVERY - Disk space on elastic1016 is OK: DISK OK [07:36:56] 6operations, 10OCG-General-or-Unknown, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#1872312 (10Aklapper) Any idea who should be assigned to this / whose responsibility this might be? [08:12:41] PROBLEM - RAID on mw1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:12:58] (03PS2) 10Giuseppe Lavagetto: Use system-wide etcd configurations for the etcd driver [software/conftool] - 10https://gerrit.wikimedia.org/r/256480 [08:13:00] (03PS1) 10Giuseppe Lavagetto: Add confctl the ability to find all instances of an entity [software/conftool] - 10https://gerrit.wikimedia.org/r/258428 [08:13:31] (03CR) 10jenkins-bot: [V: 04-1] Use system-wide etcd configurations for the etcd driver [software/conftool] - 10https://gerrit.wikimedia.org/r/256480 (owner: 10Giuseppe Lavagetto) [08:13:42] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Puppet has 8 failures [08:13:44] (03CR) 10jenkins-bot: [V: 04-1] Add confctl the ability to find all instances of an entity [software/conftool] - 10https://gerrit.wikimedia.org/r/258428 (owner: 10Giuseppe Lavagetto) [08:14:31] RECOVERY - RAID on mw1002 is OK: OK: no RAID installed [08:15:40] (03PS1) 10Aaron Schulz: Set a high but finite "maxjobs" default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258429 [08:16:41] (03CR) 10Ori.livneh: [C: 031] "Makes sense." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258429 (owner: 10Aaron Schulz) [08:17:08] (03Abandoned) 10Giuseppe Lavagetto: Clarified the error message since we're in a multi-host setup now. [software/conftool] - 10https://gerrit.wikimedia.org/r/256481 (owner: 10Giuseppe Lavagetto) [08:17:28] (03Abandoned) 10Giuseppe Lavagetto: Fix tests [software/conftool] - 10https://gerrit.wikimedia.org/r/256482 (owner: 10Giuseppe Lavagetto) [08:26:11] 6operations, 10OCG-General-or-Unknown, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#1872381 (10ori) If we moved ocg to one of the Jessie boxes, which have redis 2.8, we'd be able to iterate on keys using [[ http://redis.io/commands/hscan | `HSCAN` ]]. 
[08:27:32] 6operations, 10OCG-General-or-Unknown, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#1872382 (10Joe) @Aklapper apparently no one wants to take responsibility for this. And it's a problem of course. [08:28:15] (03CR) 10Giuseppe Lavagetto: [C: 031] Set a high but finite "maxjobs" default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258429 (owner: 10Aaron Schulz) [08:39:11] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:42:20] 6operations, 10Wikimedia-Mailing-lists, 6Wiktionary: wiktionary-l: assign new moderators - https://phabricator.wikimedia.org/T110969#1872396 (10Nemo_bis) > Listadministrators seem inactive Not on wiki: https://en.wiktionary.org/w/index.php?title=eneuch&diff=prev&oldid=32884229 [08:42:52] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [5000000.0] [08:52:01] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: puppet fail [08:54:32] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0] [08:57:03] PROBLEM - HHVM rendering on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:57:41] PROBLEM - Apache HTTP on mw1145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:58:50] PROBLEM - puppet last run on mw1145 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:00:41] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [09:00:42] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:08:02] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [09:08:10] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [09:08:41] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 1 failures [09:09:00] PROBLEM - configured eth on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:09:11] PROBLEM - puppet last run on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:09:12] PROBLEM - DPKG on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:09:30] PROBLEM - Check size of conntrack table on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:09:41] PROBLEM - Labs LDAP on serpens is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:09:52] PROBLEM - Disk space on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:10:01] PROBLEM - dhclient process on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:10:20] PROBLEM - RAID on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:10:40] PROBLEM - salt-minion processes on serpens is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:13:51] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.049 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.155.135 [09:18:31] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [09:19:10] 6operations, 10DBA, 5Patch-For-Review, 5WMF-deploy-2015-12-15_(1.27.0-wmf.9): Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1872466 (10jcrespo) [09:19:39] (03CR) 10Merlijn van Deen: "See also the discussion on https://phabricator.wikimedia.org/T116813 -- a (maybe better) alternative is to run unattended-upgrades more re" [puppet] - 10https://gerrit.wikimedia.org/r/249489 (https://phabricator.wikimedia.org/T116813) (owner: 10Merlijn van Deen) [09:21:35] <_joe_> is serpens down? [09:26:19] <_joe_> !log ganeti-rebooting serpens, cannot get into console [09:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:29:21] RECOVERY - Labs LDAP on serpens is OK: LDAP OK - 0.131 seconds response time [09:29:31] RECOVERY - Disk space on serpens is OK: DISK OK [09:29:42] RECOVERY - dhclient process on serpens is OK: PROCS OK: 0 processes with command name dhclient [09:29:47] 6operations, 7Mail: Mails from MediaWiki seem to get (partially) lost - https://phabricator.wikimedia.org/T121105#1872474 (10Lydia_Pintscher) I should have gotten for example an email about this edit to lydia.pintscher@wikimedia.de: https://www.wikidata.org/w/index.php?title=Wikidata:Requests_for_comment/Impro... 
[09:29:51] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [09:30:01] RECOVERY - RAID on serpens is OK: OK: no RAID installed [09:30:30] RECOVERY - salt-minion processes on serpens is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:30:31] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [09:30:41] RECOVERY - configured eth on serpens is OK: OK - interfaces up [09:31:01] RECOVERY - puppet last run on serpens is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [09:31:01] RECOVERY - DPKG on serpens is OK: All packages OK [09:31:11] RECOVERY - Check size of conntrack table on serpens is OK: OK: nf_conntrack is 0 % full [09:31:51] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.030 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135 [09:35:32] PROBLEM - puppet last run on mw2177 is CRITICAL: CRITICAL: Puppet has 1 failures [09:39:34] (03PS2) 10Giuseppe Lavagetto: Add confctl the ability to find all instances of an entity [software/conftool] - 10https://gerrit.wikimedia.org/r/258428 [09:39:55] (03CR) 10jenkins-bot: [V: 04-1] Add confctl the ability to find all instances of an entity [software/conftool] - 10https://gerrit.wikimedia.org/r/258428 (owner: 10Giuseppe Lavagetto) [09:41:36] <_joe_> bbiab [09:43:52] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [09:45:50] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.053 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.155.135 [09:45:50] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.059 seconds response time. 
nagiostest.beta.wmflabs.org returns 208.80.155.135 [09:50:00] !log importing several tables from master to dbstore1002 and labs, lag will be slightly affected [09:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:53:56] I will restart the couple of failed HHVM processes in a second unless you tell me not to [09:54:05] gwicke: what happened in "02:34 gwicke: restarted nodetool decommission on restbase1007, as I had to restart one of the stream targets"? [09:59:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] service-runner migration for cxserver (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [09:59:32] RECOVERY - puppet last run on mw2177 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [10:11:24] (03CR) 10Mobrovac: [C: 04-1] service-runner migration for cxserver (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [10:12:38] godog: dunno for sure, but might be related to the recent big-page blacklisting [10:13:03] godog: so i figure the restart was needed to stop some superfluous compactions [10:13:42] mobrovac: ah ok, thanks! 
[10:20:00] 6operations, 10DBA: Implement slave_run_triggers_for_rbr at sanitarium for labs filtering - https://phabricator.wikimedia.org/T121207#1872581 (10jcrespo) 3NEW [10:20:57] 6operations, 10DBA: Implement slave_run_triggers_for_rbr at sanitarium for labs filtering - https://phabricator.wikimedia.org/T121207#1872590 (10jcrespo) [10:20:59] 6operations, 10DBA, 5Patch-For-Review, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1872589 (10jcrespo) [10:21:13] ^this is better than christmas [10:22:40] 6operations, 10DBA, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1542524 (10jcrespo) [10:23:02] (03CR) 10KartikMistry: service-runner migration for cxserver (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [10:23:29] (03PS22) 10KartikMistry: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) [10:25:09] (03PS3) 10Giuseppe Lavagetto: Add confctl the ability to find all instances of an entity [software/conftool] - 10https://gerrit.wikimedia.org/r/258428 [10:25:32] (03CR) 10jenkins-bot: [V: 04-1] Add confctl the ability to find all instances of an entity [software/conftool] - 10https://gerrit.wikimedia.org/r/258428 (owner: 10Giuseppe Lavagetto) [10:35:14] !log restarting HHVM on mw1122, mw1145 [10:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:37:01] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [10:37:32] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 69912 bytes in 1.995 second response time [10:38:58] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM overall, minor question in node.pp" (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [10:41:31] PROBLEM - RAID on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:41:52] PROBLEM - nutcracker port on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:41:53] PROBLEM - configured eth on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:42:11] PROBLEM - nutcracker process on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:42:41] PROBLEM - puppet last run on mw1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:43:22] RECOVERY - RAID on mw1005 is OK: OK: no RAID installed [10:43:41] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.000 second response time on port 11212 [10:43:41] RECOVERY - configured eth on mw1005 is OK: OK - interfaces up [10:43:52] RECOVERY - Apache HTTP on mw1145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.290 second response time [10:44:01] RECOVERY - HHVM rendering on mw1145 is OK: HTTP OK: HTTP/1.1 200 OK - 69875 bytes in 1.210 second response time [10:44:01] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:44:10] mmmmh [10:45:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] service-runner migration for cxserver (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [10:49:50] (03PS23) 10KartikMistry: service-runner migration for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) [10:53:14] <_joe_> !log restarting hhvm on mw1122 in order to collect heap dumps [10:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:54:05] (03CR) 10KartikMistry: service-runner migration for cxserver (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/250910 
(https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [10:57:34] <_joe_> !log rolling restart of HHVM across the jobrunners to ease memory consumption (this will go on during the day, hhvm restarts will be spaced 2 hours apart) [10:57:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:11:38] (03PS6) 10Mobrovac: RESTBase: Switch to service::node [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) [11:12:14] (03CR) 10Alexandros Kosiaris: [C: 031] "This one finally compiles nicely." [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [11:14:49] Phabricator is down at the moment [11:14:50] "Our servers are currently experiencing a technical problem." [11:15:45] mobrovac: btw please avoid rebasing PSs for ongoing reviews, it makes looking at diffs between PSs harder [11:15:55] hm, works again now [11:16:00] the "reference version" as gerrit calls it [11:16:56] ah k godog, i usually rebase to make sure nothing has changed underneath my feet [11:17:05] ostriches: Are you here at the moment? [11:17:39] Luke081515: he must be sleeping, ostriches is in San Francisco area so it is like 3am for him :-} [11:17:44] ah, ok [11:17:55] nothing very important, so I can wait ;) [11:17:57] (03CR) 10Mobrovac: [C: 031] "+1 from me too. 
Before c-p it on beta, we need to merge https://gerrit.wikimedia.org/r/244145 and build the deploy repo" [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [11:18:01] mobrovac: yeah I usually do a final rebase after the patch is ready to go [11:18:04] 6operations, 6Reading-Admin, 10Reading-Community-Engagement: UX strategic test: redirect small portion of unauthenticated desktop users to mobile web - https://phabricator.wikimedia.org/T117826#1872734 (10dr0ptp4kt) [11:18:39] (03PS4) 10Dereckson: filebackend: add configuration for codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197499 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [11:19:01] (03CR) 10Dereckson: "PS4: rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197499 (https://phabricator.wikimedia.org/T91754) (owner: 10Giuseppe Lavagetto) [11:19:51] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:19:57] 6operations, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-Requests, 5codfw-appserver-setup, 5wikis-in-codfw: Configure mediawiki to operate in the Dallas DC - https://phabricator.wikimedia.org/T91754#1872742 (10Dereckson) Rebased. The patch complains about a missing configuration for Labs environment. [11:20:20] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:21:33] (03CR) 10Mobrovac: "https://puppet-compiler.wmflabs.org/1473/ still looking good." 
[puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [11:22:10] (03CR) 10Mobrovac: RESTBase: Switch to service::node (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/257898 (https://phabricator.wikimedia.org/T118401) (owner: 10Mobrovac) [11:25:52] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:26:20] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:32:47] mobrovac: This also need review and merge, https://gerrit.wikimedia.org/r/#/c/258122/ (cxserver) [11:34:44] mobrovac: btw did you see my question re: labs/beta testing in https://gerrit.wikimedia.org/r/257898 ? [11:34:44] akosiaris: hm, see ^, might be that it'd be better to move cxserver to scb ? [11:35:03] godog: dang, missed it [11:35:18] * mobrovac blames gerrit and its emails [11:37:17] akosiaris: actually, no, strike that, let's minimise the possible impact for now [11:37:52] godog: no, haven't tested in labs, there are cherry-picks there related to scap3 which make it basically impossible to do it [11:37:58] godog: will devise a strategy [11:39:14] mobrovac: ah, which cherry-picks btw? [11:39:36] (03PS4) 10Giuseppe Lavagetto: Add confctl the ability to find all instances of an entity [software/conftool] - 10https://gerrit.wikimedia.org/r/258428 [11:39:58] (03CR) 10jenkins-bot: [V: 04-1] Add confctl the ability to find all instances of an entity [software/conftool] - 10https://gerrit.wikimedia.org/r/258428 (owner: 10Giuseppe Lavagetto) [11:40:50] kart_: re the cert patch, is it node 0.10 in general or node 0.10 specifically on trusty? [11:40:58] (important distinction) [11:41:22] godog: https://gerrit.wikimedia.org/r/252887 [11:42:02] RECOVERY - puppet last run on mw1004 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [11:42:18] mobrovac: for older node in trusty. 
[11:42:38] mobrovac: I merged it. [11:42:53] mobrovac: heh, yeah tricky indeed [11:43:11] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: puppet fail [11:43:34] godog: we'll probably remove that one, as now with switching to service::node extensive changes will be needed anyhow [11:43:57] mobrovac: oh. forgot. we need to update node_modules. [11:44:37] kart_: i'm wortking on it now [11:44:58] kart_: in the deploy repo you mean? [11:45:05] (that's what i'm worinkg on) [11:45:18] * mobrovac is apparently dyslexic today [11:46:06] yep [11:46:15] mobrovac: I'm also submitting patch :) [11:46:30] I'll drop then. [11:47:09] kart_: please, drop, with service-runner we have an automated way of doing it [11:47:13] kk thnx [11:47:26] cool. [11:47:28] Thanks! [11:47:47] mobrovac: When you're done; explain me. [11:47:58] godog: looks like that patch I made for the statsd stuff in mwcore really made a difference... [11:49:01] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:01] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:02] PROBLEM - puppet last run on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:49:11] went from around 85kpps on rx to around 30 with the end of the train last night [11:49:21] PROBLEM - SSH on mw1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:49:41] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:50:00] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:50:11] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:50:20] PROBLEM - configured eth on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:50:22] addshore: \o/ thanks a lot! 
yeah looks like >50% reduction [11:50:40] I wasn't expecting it to be anywhere near that :P [11:50:40] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:50:51] PROBLEM - RAID on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:51:01] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:51:10] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [11:51:57] kart_: https://gerrit.wikimedia.org/r/258435 [11:52:49] addshore: hehehe yeah sending one metric per packet is terribly inefficient [11:53:48] (03CR) 10Mobrovac: "The cxserver change has been merged. The build-repo update patch is https://gerrit.wikimedia.org/r/#/c/258435 ." [puppet] - 10https://gerrit.wikimedia.org/r/250910 (https://phabricator.wikimedia.org/T117657) (owner: 10KartikMistry) [11:54:16] akosiaris: kart_: for testing in beta, https://gerrit.wikimedia.org/r/#/c/258435/ needs to be merged [11:54:40] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient [11:54:57] akosiaris: kart_: i need to leave early today, so can't be around to assist in testing in beta, if you want feel free to merge it and test, otherwise we can do it monday [11:55:10] mobrovac: ok noted. 
thanks for letting us know [11:56:01] RECOVERY - Disk space on mw1006 is OK: DISK OK [11:56:37] 6operations, 10DBA, 6Labs, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1872856 (10jcrespo) [12:00:16] kart_: wrt building the deploy repo, cf https://github.com/wikimedia/service-template-node/blob/master/doc/deployment.md and https://wikitech.wikimedia.org/wiki/User:Mobrovac/Service_Deployment [12:00:30] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212 [12:01:23] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:01:30] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:01:43] kart_: note that for that you need docker installed - https://github.com/wikimedia/service-template-node/blob/master/doc/commands.md#docker [12:01:50] mobrovac: Thanks! [12:02:40] kart_: you can experiment with this stuff and create your own patches for the deploy repo and compare it to mine to see if all looks good [12:03:02] PROBLEM - dhclient process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:03:32] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:04:06] (03PS1) 10Dereckson: Namespace configuration on pa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258436 (https://phabricator.wikimedia.org/T120936) [12:04:28] (03CR) 10jenkins-bot: [V: 04-1] Namespace configuration on pa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258436 (https://phabricator.wikimedia.org/T120936) (owner: 10Dereckson) [12:05:30] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:05:30] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:05:44] (03CR) 10Dereckson: [C: 031] Enable global AbuseFilter at French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257868 (https://phabricator.wikimedia.org/T120568) (owner: 10Glaisher) [12:06:08] mobrovac: sure [12:06:32] PROBLEM - nutcracker port on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:09:32] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:11:30] PROBLEM - nutcracker process on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:11:31] PROBLEM - salt-minion processes on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:12:20] PROBLEM - Disk space on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:13:45] (03PS2) 10Dereckson: Namespace configuration on pa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258436 (https://phabricator.wikimedia.org/T120936) [12:13:48] (03CR) 10Florianschmidtwelzow: [C: 04-1] Namespace configuration on pa.wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258436 (https://phabricator.wikimedia.org/T120936) (owner: 10Dereckson) [12:14:43] (03CR) 10Florianschmidtwelzow: [C: 031] Namespace configuration on pa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258436 (https://phabricator.wikimedia.org/T120936) (owner: 10Dereckson) [12:15:04] (03CR) 10Luke081515: [C: 031] Namespace configuration on pa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258436 (https://phabricator.wikimedia.org/T120936) (owner: 10Dereckson) [12:15:50] (03CR) 10Florianschmidtwelzow: [C: 031] Enable global AbuseFilter at French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257868 (https://phabricator.wikimedia.org/T120568) (owner: 10Glaisher) [12:16:58] 6operations, 10DBA, 6Phabricator, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1872917 (10Aklapper) This is happening aga... 
[12:18:07] (03CR) 10Florianschmidtwelzow: [C: 031] BetaFeatures: Update language and dates of 'retirement' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258409 (owner: 10Jforrester) [12:18:11] RECOVERY - DPKG on mw1006 is OK: All packages OK [12:18:32] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [12:19:01] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [12:20:30] (03CR) 10Florianschmidtwelzow: VisualEditor: Don't set ShowBetaWelcome, now set in repo (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258405 (owner: 10Jforrester) [12:20:37] (03CR) 10Florianschmidtwelzow: [C: 04-1] VisualEditor: Provide framework for enabling an A/B test for IPs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258404 (owner: 10Jforrester) [12:22:09] (03PS1) 10Jcrespo: Applying configuration changes to S1 codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/258438 [12:22:31] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:23:37] (03PS1) 10Alexandros Kosiaris: varnish: allow etherpad to use websockets [puppet] - 10https://gerrit.wikimedia.org/r/258439 [12:24:11] PROBLEM - DPKG on mw1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:24:21] RECOVERY - Disk space on mw1006 is OK: DISK OK [12:24:40] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212 [12:24:40] RECOVERY - configured eth on mw1006 is OK: OK - interfaces up [12:24:52] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [12:24:52] RECOVERY - dhclient process on mw1006 is OK: PROCS OK: 0 processes with command name dhclient [12:25:02] RECOVERY - RAID on mw1006 is OK: OK: no RAID installed [12:25:21] RECOVERY - salt-minion processes on mw1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:25:21] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [12:25:22] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 51 minutes ago with 0 failures [12:25:40] RECOVERY - SSH on mw1006 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [12:26:01] RECOVERY - DPKG on mw1006 is OK: All packages OK [12:26:53] (03CR) 10Florianschmidtwelzow: [C: 04-1] Remove $wmgUseCirrus variable (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258026 (owner: 10EBernhardson) [12:30:30] 6operations, 10DBA, 6Phabricator, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1872947 (10jcrespo) 5Resolved>3Open So... 
[12:33:01] (03PS1) 10Dereckson: Import sources on gu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258441 (https://phabricator.wikimedia.org/T120346) [12:34:32] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 49 failures [12:35:39] (03CR) 10Florianschmidtwelzow: [C: 031] Update Image Area to 100MP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255822 (owner: 10Reedy) [12:36:18] 6operations, 10DBA, 6Phabricator, 5Patch-For-Review, 7WorkType-Maintenance: Phabricator creates MySQL connection spikes: Attempt to connect to phuser@m3-master.eqiad.wmnet failed with error #1040: Too many connections. - https://phabricator.wikimedia.org/T109279#1872963 (10jcrespo) I've set `SET GLOBAL w... [12:37:08] (03CR) 10Florianschmidtwelzow: [C: 04-1] Namespace config change on de.wikivoyage.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255361 (https://phabricator.wikimedia.org/T119420) (owner: 10Mdann52) [12:38:38] (03PS1) 10Dereckson: Namespace configuration on my.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258442 (https://phabricator.wikimedia.org/T119807) [13:02:51] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:08:51] (03PS1) 10Dereckson: Template editor group on hi.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258444 (https://phabricator.wikimedia.org/T120342) [13:32:14] 6operations, 10Traffic, 5Patch-For-Review, 7Pybal: pybal fails to detect dead servers under production lb IPs for port 80 - https://phabricator.wikimedia.org/T113151#1873108 (10mark) >>! In T113151#1861965, @mark wrote: >>>! In T113151#1858288, @Joe wrote: >> So testing this with a single apache in a pool,... 
[13:37:10] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 11 failures [13:38:12] _joe_: there's an lvs1002 alert for parsoidlb, I think it's that pybal bug [13:39:12] 6operations, 10RESTBase, 7Graphite: restbase should send metrics in batches - https://phabricator.wikimedia.org/T121231#1873124 (10fgiunchedi) 3NEW [13:43:52] pybal has no bugs [13:45:21] Just features [13:47:54] (03PS1) 10coren: Labs: monitor getent timing only on active fileserver [puppet] - 10https://gerrit.wikimedia.org/r/258448 [14:02:22] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:20:01] (03PS6) 10Rush: Fix installation of sudo-ldap on labs instances. [puppet] - 10https://gerrit.wikimedia.org/r/258400 (https://phabricator.wikimedia.org/T120262) (owner: 10Andrew Bogott) [14:20:43] (03CR) 10Rush: [C: 031] "yep this is needed" [puppet] - 10https://gerrit.wikimedia.org/r/258400 (https://phabricator.wikimedia.org/T120262) (owner: 10Andrew Bogott) [14:21:25] !log re-imaging promethium [14:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:22:31] (03CR) 10Andrew Bogott: [C: 032] Fix installation of sudo-ldap on labs instances. [puppet] - 10https://gerrit.wikimedia.org/r/258400 (https://phabricator.wikimedia.org/T120262) (owner: 10Andrew Bogott) [14:29:15] (03PS2) 10Jcrespo: Applying configuration changes to S1 codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/258438 [14:30:37] (03CR) 10Jcrespo: [C: 032] Applying configuration changes to S1 codfw servers [puppet] - 10https://gerrit.wikimedia.org/r/258438 (owner: 10Jcrespo) [14:31:38] "Rush: Fix installation of sudo-ldap on labs instances. (08e2f0c)" [14:32:08] andrewbogott ^? [14:32:52] it's andrewbogott's, did you catch on it on palladium or something? 
it's legit [14:33:04] oh sorry, forgot the ‘yes' [14:33:08] jynus: merged now [14:33:10] yes, will not merge without his permission [14:33:14] :-) [14:33:47] permission, approval, etc. [14:34:11] PROBLEM - puppet last run on mc2006 is CRITICAL: CRITICAL: puppet fail [14:36:01] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 7 failures [14:40:26] !log root@pybal-test2001:~# apt-get install quagga [14:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:50] awesome :) [14:40:56] although bird is all the rage nowadays :P [14:41:13] hehe [14:41:21] just need to have SOME credible bgp speaker [14:41:24] it's not actually gonna do anything [14:41:29] we can test with bird on another host perhaps [14:41:41] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1873187 (10BBlack) I've successfully tested loading mobile content through the text cache (local DNS hack of m-dot hostname to text cluster), and used the same URLs through both... [14:54:58] !log restarting and configuring S1:codfw mysqls (db2016,34,42,48,55,62,69,70) [14:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:55:07] they have been downtimed [14:55:45] but there may be increasing 5XX errors on codfw, but not from real traffic, only probes [14:57:01] (03CR) 10Alexandros Kosiaris: "just answering the" [puppet] - 10https://gerrit.wikimedia.org/r/258193 (owner: 10Andrew Bogott) [14:59:50] RECOVERY - puppet last run on mc2006 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:00:34] there are around 1000 errors/min from 10.192.16.4, ignore those [15:03:07] (03PS2) 10Ottomata: Initial debianization and release 1.3.1 [debs/python-sprockets-mixins-statsd] (debian) - 10https://gerrit.wikimedia.org/r/258189 (https://phabricator.wikimedia.org/T121112) [15:05:13] (03CR) 10Andrew Bogott: "oh, of course. 
Thanks Alex!" [puppet] - 10https://gerrit.wikimedia.org/r/258193 (owner: 10Andrew Bogott) [15:06:05] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1873204 (10Ottomata) @JAllemandou, @nuria ^ :) (We'll talk about this at standup today) [15:10:17] errors should be going down now [15:11:07] and it seems in case of an emergency, it takes 10 minutes to upgrade, restart a server and restart mysql [15:11:27] jynus: 14:55 < jynus> but there may be increasing 5XX errors on codfw, but not from real traffic, only probes ? [15:11:31] why only probes? [15:11:38] (03PS1) 10Luke081515: Enable interface-editor group at ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258453 (https://phabricator.wikimedia.org/T120348) [15:11:44] oh nevermind, I'm up to speed now! [15:11:47] there is no traffic there, bblack [15:11:54] well there is, just not at that layer heh [15:12:04] not on mine ;-) [15:12:05] paravoid: Wanna give https://gerrit.wikimedia.org/r/#/c/258448/ a quick look? [15:12:12] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 7 failures [15:12:13] even I get confused about the degree to which codfw is "live" sometimes :) [15:12:21] OH IOS... how i didn't miss you... 
[15:12:40] on a normal state, I would depool one by one, but not doing it saves me around 500 hours [15:13:07] that's a full quarter heh [15:13:08] of performing 3 commits per server and wait [15:13:20] (03PS1) 10Ottomata: Initial debianization and release 1.3.1 [debs/python-sprockets-mixins-statsd] - 10https://gerrit.wikimedia.org/r/258455 (https://phabricator.wikimedia.org/T121112) [15:13:32] (03PS1) 10Ottomata: Initial debianization and 1.2.1 release [debs/python-sprockets-clients-statsd] - 10https://gerrit.wikimedia.org/r/258456 (https://phabricator.wikimedia.org/T121112) [15:13:37] (03PS1) 10Ottomata: Initial debianization and release [debs/python-sprockets] - 10https://gerrit.wikimedia.org/r/258457 (https://phabricator.wikimedia.org/T121112) [15:13:54] I am also able to do it now that I am more "confident" about the infrastructure [15:14:32] (meaning that I understand a bit better how it works, unlike some months ago) [15:15:32] (03CR) 10Ottomata: [C: 032 V: 032] Initial debianization and release [debs/python-sprockets] (debian) - 10https://gerrit.wikimedia.org/r/258169 (https://phabricator.wikimedia.org/T121112) (owner: 10Ottomata) [15:15:37] (03CR) 10Ottomata: [C: 032 V: 032] Initial debianization and 1.2.1 release [debs/python-sprockets-clients-statsd] (debian) - 10https://gerrit.wikimedia.org/r/258187 (https://phabricator.wikimedia.org/T121112) (owner: 10Ottomata) [15:15:45] (03CR) 10Ottomata: [C: 032 V: 032] Initial debianization and release 1.3.1 [debs/python-sprockets-mixins-statsd] (debian) - 10https://gerrit.wikimedia.org/r/258189 (https://phabricator.wikimedia.org/T121112) (owner: 10Ottomata) [15:16:02] (03PS1) 10BBlack: cache_text: add mobile IPs to loopback [puppet] - 10https://gerrit.wikimedia.org/r/258458 (https://phabricator.wikimedia.org/T109286) [15:16:04] (03PS1) 10BBlack: mobile-lb: use text caches as LVS backends [puppet] - 10https://gerrit.wikimedia.org/r/258459 (https://phabricator.wikimedia.org/T109286) [15:23:01] (03PS2) 10coren: Labs: 
monitor getent timing only on active fileserver [puppet] - 10https://gerrit.wikimedia.org/r/258448 [15:29:07] BTW, yesterday at around 19 UTC there was a 20% reduction in the number of selects on enwiki codfw [15:31:35] akosiaris: heya, yt? i always seem to have these reprepro problems... [15:31:52] this time i know exactly what I've did, looks like I'm getting key errors? [15:31:53] ottomata: yeah but gimme 5 mins [15:32:04] k [15:32:45] ottomata, did you try https://wikitech.wikimedia.org/wiki/Reprepro#If_signing_fails [15:33:00] NO [15:33:02] riHhghghht [15:33:05] thanks, i am sudo -s right now [15:33:06] ok... [15:33:48] Yes, much better [15:33:48] DOH [15:33:51] will try to remember that [15:33:53] thanks jynus [15:33:57] (problem solved akosiaris) [15:34:05] I do not like "su -", what is the right parameter for sudo -s, but reload env? [15:34:12] sudo -E [15:34:20] sudo -i? [15:34:21] preserve environment [15:34:23] I will change it to that [15:34:32] no, actually we want the opposite [15:34:41] change the env to root's [15:34:57] aye k [15:35:11] sudo -E reprepro --ignore=wrongdistribution -C main include jessie-wikimedia [15:35:15] that's what I usually use [15:35:17] also [15:35:30] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [15:35:45] export HISTFILESIZE=50000 [15:35:45] export HISTSIZE=50000 [15:35:45] export REPREPRO_BASE_DIR=/srv/wikimedia # This is just for apt.wikimedia.org [15:35:45] export GNUPGHOME=/root/.gnupg # Not like I am ever going to put my GPG homedir on the cluster anyway [15:36:01] that's my .variables files, sourced from my .bashrc over in carbon [15:36:02] :) [15:36:10] yes, akosiaris's option is better [15:36:13] ahhh k [15:36:15] doing that [15:36:16] and my .profile [15:36:19] only export the minimal options [15:36:30] akosiaris: does that let you reprepro add as yourself, or just sudo the commands? 
[15:36:54] always sudo the commands [15:37:08] what it does is set the reprepro base directory env [15:37:18] so I can do it from practically anywhere in the filesystem [15:37:40] so I usually just run it from my homedir [15:37:53] as long as sudo -E is the invocation I am just fine [15:38:12] ok great [15:38:16] perfect [15:38:22] no more need to cd /srv/wikimedia, i like [15:38:23] using [15:39:33] (03CR) 10Ottomata: [C: 032 V: 032] Initial debianization and release 1.3.1 [debs/python-sprockets-mixins-statsd] - 10https://gerrit.wikimedia.org/r/258455 (https://phabricator.wikimedia.org/T121112) (owner: 10Ottomata) [15:39:44] (03CR) 10Ottomata: [C: 032 V: 032] Initial debianization and 1.2.1 release [debs/python-sprockets-clients-statsd] - 10https://gerrit.wikimedia.org/r/258456 (https://phabricator.wikimedia.org/T121112) (owner: 10Ottomata) [15:40:50] I've updated it, change if you can express it better: https://wikitech.wikimedia.org/wiki/Reprepro#If_signing_fails [15:41:44] also please join me in documenting common mistakes! [15:44:15] jynus: you mean sudo -E, not -i? ja?
[15:44:27] (03PS2) 10BBlack: VCL: no special handling for CentralAutoLogin [puppet] - 10https://gerrit.wikimedia.org/r/258207 (https://phabricator.wikimedia.org/T96847) [15:44:29] (03PS1) 10BBlack: cache_misc: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258463 (https://phabricator.wikimedia.org/T96847) [15:44:31] (03PS1) 10BBlack: cache_upload: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258464 (https://phabricator.wikimedia.org/T96847) [15:44:33] (03PS1) 10BBlack: cache_text/mobile: send randomized pass traffic directly to t1 backends [puppet] - 10https://gerrit.wikimedia.org/r/258465 (https://phabricator.wikimedia.org/T96847) [15:44:44] well, either you export the right variables and sudo -E [15:45:16] (03Abandoned) 10Ottomata: Initial debianization and release [debs/python-sprockets] - 10https://gerrit.wikimedia.org/r/258457 (https://phabricator.wikimedia.org/T121112) (owner: 10Ottomata) [15:45:28] or you set your home as root's, which is what sudo -i dies (not preserving the env) [15:45:31] PROBLEM - Apache HTTP on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:31] PROBLEM - HHVM rendering on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:45:33] *does [15:46:31] PROBLEM - DPKG on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:46:31] PROBLEM - puppet last run on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:46:40] PROBLEM - Disk space on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:46:41] PROBLEM - salt-minion processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:46:41] PROBLEM - RAID on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:47:00] PROBLEM - configured eth on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
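The two options described above (export the needed variables and call sudo -E, or take on root's environment with sudo -i) can be sketched roughly as follows. This is only an illustration: the helper name reprepro_cmd and the example.changes argument are hypothetical, the command is printed rather than executed, and whether sudo -E actually keeps the variables depends on the local sudoers policy.

```shell
#!/bin/sh
# Approach 1: export only the minimal variables reprepro and gpg need,
# then let sudo preserve the caller's environment with -E.
# The values are the ones quoted in the discussion above.
export REPREPRO_BASE_DIR=/srv/wikimedia
export GNUPGHOME=/root/.gnupg

# Hypothetical helper: prints the command instead of running it,
# so the sketch is safe to execute on any machine.
reprepro_cmd() {
    changes_file="$1"   # hypothetical .changes file to include
    printf 'sudo -E reprepro --ignore=wrongdistribution -C main include jessie-wikimedia %s\n' "$changes_file"
}

reprepro_cmd example.changes

# Approach 2 (not run here): "sudo -i" starts a login shell as root,
# so HOME and the rest of the environment become root's and nothing
# needs to be exported beforehand.
```

With the exports in place the command can be issued from any working directory, which is the point made above about no longer needing to cd into the repository first.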
[15:47:01] as I said, if you can express it better, feel free to correct me :-) [15:47:11] PROBLEM - dhclient process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:47:31] PROBLEM - nutcracker port on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:47:52] PROBLEM - nutcracker process on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:47:52] PROBLEM - SSH on mw1147 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:56] is this clearer? https://wikitech.wikimedia.org/wiki/Reprepro#If_signing_fails [15:47:58] FWIW, I always use "sudo -i" anywhere in WMF stuff, and I never have any issues with anything related. It's just my normal way to sudo [15:48:00] PROBLEM - Check size of conntrack table on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:48:10] PROBLEM - HHVM processes on mw1147 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:48:21] I use sudo -s, want to preserv my env when possible [15:48:41] well, I try to use sudo, so it gets registered [15:49:04] can be slightly more dangerous with "-s" though, probably not necc or good as general advice [15:49:38] (03PS1) 10Ottomata: Revert "Initial debianization and 1.2.1 release" [debs/python-sprockets-clients-statsd] - 10https://gerrit.wikimedia.org/r/258466 [15:50:00] (03PS1) 10Ottomata: Revert "Initial debianization and release 1.3.1" [debs/python-sprockets-mixins-statsd] - 10https://gerrit.wikimedia.org/r/258467 [15:50:17] (03CR) 10Ottomata: [C: 032 V: 032] Revert "Initial debianization and 1.2.1 release" [debs/python-sprockets-clients-statsd] - 10https://gerrit.wikimedia.org/r/258466 (owner: 10Ottomata) [15:50:20] (03CR) 10Ottomata: [C: 032 V: 032] Revert "Initial debianization and release 1.3.1" [debs/python-sprockets-mixins-statsd] - 10https://gerrit.wikimedia.org/r/258467 (owner: 10Ottomata) [15:52:41] RECOVERY - Disk space on mw1147 is OK: DISK OK [15:52:41] RECOVERY - DPKG on mw1147 is OK: All 
packages OK [15:52:42] RECOVERY - salt-minion processes on mw1147 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:52:50] RECOVERY - RAID on mw1147 is OK: OK: no RAID installed [15:53:01] RECOVERY - configured eth on mw1147 is OK: OK - interfaces up [15:53:21] RECOVERY - dhclient process on mw1147 is OK: PROCS OK: 0 processes with command name dhclient [15:53:32] RECOVERY - nutcracker port on mw1147 is OK: TCP OK - 0.000 second response time on port 11212 [15:54:00] RECOVERY - SSH on mw1147 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [15:54:00] RECOVERY - nutcracker process on mw1147 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:54:01] RECOVERY - Check size of conntrack table on mw1147 is OK: OK: nf_conntrack is 0 % full [15:54:11] RECOVERY - HHVM processes on mw1147 is OK: PROCS OK: 6 processes with command name hhvm [15:54:22] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0] [15:54:41] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:42] RECOVERY - Apache HTTP on mw1147 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.121 second response time [15:55:42] RECOVERY - HHVM rendering on mw1147 is OK: HTTP OK: HTTP/1.1 200 OK - 70129 bytes in 2.233 second response time [15:58:32] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [16:01:15] is gerrit spitting up 5xx errors for anyone else? [16:02:30] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:02:43] urandom, when visiting the home, any URL in particular? 
It works for me [16:03:37] jynus: anywhere, yeah, but it doesn't seem to happen w/ chrome [16:04:03] urandom, check with firefox in safe mode (supposing that was the issue) [16:06:16] for everyone else: you will get now lower level of 500 (100/s) on codfw mediawikis, at intervals, see last !log [16:06:37] sorry, I meant 100/minute [16:06:44] jynus: yeah, that works. what does that indicate, a problem with an extension? [16:06:46] 100/s is a lot of errors [16:06:50] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1873287 (10BBlack) [16:06:59] urandom, most probably, plugin, extension, config [16:09:11] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [16:15:31] PROBLEM - HHVM rendering on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:17:12] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:17:21] PROBLEM - SSH on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:17:50] PROBLEM - configured eth on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:17:52] PROBLEM - Check size of conntrack table on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:18:04] PROBLEM - dhclient process on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:18:22] PROBLEM - nutcracker port on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:18:31] PROBLEM - salt-minion processes on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:18:41] PROBLEM - DPKG on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:18:42] PROBLEM - nutcracker process on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:18:51] PROBLEM - Disk space on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:18:52] PROBLEM - RAID on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:11] PROBLEM - HHVM processes on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:19:11] PROBLEM - puppet last run on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:23:05] !log moved quagga install from pybal-test2001 to pybal-test2002 [16:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:34] <_joe_> paravoid: heh, it seems almost impossible not to incur in race conditions in adding/removing alerts in pybal; I'll recalculate them each time when responding [16:27:40] RECOVERY - SSH on mw1130 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [16:27:46] I'm not even sure if pybal should have an alerts endpoint [16:28:09] it seems more logical to me for pybal to expose its state and then the icinga check actually calculating whether it's something worth alerting or not [16:28:10] RECOVERY - configured eth on mw1130 is OK: OK - interfaces up [16:28:12] RECOVERY - Check size of conntrack table on mw1130 is OK: OK: nf_conntrack is 0 % full [16:28:16] but I don't feel strongly about it, ymmv :) [16:28:18] _joe_: ^ [16:28:31] RECOVERY - dhclient process on mw1130 is OK: PROCS OK: 0 processes with command name dhclient [16:28:41] RECOVERY - nutcracker port on mw1130 is OK: TCP OK - 0.000 second response time on port 11212 [16:28:51] RECOVERY - salt-minion processes on mw1130 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:29:01] RECOVERY - DPKG on mw1130 is OK: All packages OK [16:29:02] RECOVERY - nutcracker process on mw1130 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [16:29:11] RECOVERY - Disk space on mw1130 is OK: DISK OK [16:29:11] RECOVERY - RAID on mw1130 is OK: OK: no RAID installed [16:29:31] RECOVERY - HHVM processes on mw1130 is OK: PROCS OK: 6 processes with command name hhvm [16:29:32] 
RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.065 second response time [16:29:32] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:29:59] (03PS2) 10Mark Bergsma: Add BGP MED support [debs/pybal] (bgp-med) - 10https://gerrit.wikimedia.org/r/255544 [16:30:02] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 70122 bytes in 1.321 second response time [16:30:45] paravoid: I think the alerts endpoint was my idea. I like not having the configuration of functionality and alert logic separated. This way pybal configuration/code also encapsulates what the alert logic/limits should be independently of whatever monitor script/system we use. [16:31:09] obviously if icinga can't contact the alerts endpoint, that's a higher-level issue to be alerted on as well [16:32:27] the alternative would be to have pybal expose its internal state for all the pools, and then put logic in an icinga check to parse that and make logical sense of it and configure it for limits to alert on, etc (which might vary by-monitored-service) [16:34:28] stepping out more broadly than just pybal: in the general case with anything like this, I would expect there's a lot of overlap between functional service logic/config and alert logic/config, and it's simpler for the functional thing to know its own alerting and report it at a higher level instead of as raw data requiring logic/config Elsewhere. 
[16:34:54] it's either that or we have to factor out the same in puppet, so that it's DRY in puppet and then duplicated out to functional + alerting config [16:35:26] (03CR) 10Mark Bergsma: [C: 032] Add BGP MED support [debs/pybal] (bgp-med) - 10https://gerrit.wikimedia.org/r/255544 (owner: 10Mark Bergsma) [16:36:54] !log Installed bgp-med branch of operations/debs/pybal on pybal-test2001 [16:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:13] this is a bit academic and I can see both sides [16:37:39] in this case it doesn't matter very much -- the problem with in general embedding alerting like that is that you can easily miss alerts due to software bugs [16:38:05] if say pybal's etcd code is buggy and doesn't load any backends or loads half the backends or something, this isn't something it can alert on [16:43:00] <_joe_> paravoid: btw that's what I wanted to do :) [16:43:28] <_joe_> we tried to use a simpler, less invasive way of inspecting pybal, it's too complex, bummer [16:44:28] <_joe_> bblack: so, given all the config for both pybal and its monitoring is in puppet, twhat you ask is easy to do [16:49:53] 6operations, 10ops-eqiad, 5Patch-For-Review: rack/setup/deploy rdb1005 & rdb1006 - https://phabricator.wikimedia.org/T119543#1873362 (10Cmjohnson) [16:50:16] hello [16:50:28] I had a question regarding how chinese is handled on chinese wikipedia [16:50:31] as in how it is stored [16:50:42] chinese has two main variants (character sets) [16:51:41] utf8 ? [16:51:43] White_Cat, that is probably more a question for #wikimedia-tech or #mediawiki-help, maybe? [16:51:57] on the database there are only binary strings [16:52:05] (not even utf8) [16:52:09] 6operations, 10ops-eqiad, 5Patch-For-Review: rack/setup/deploy rdb1005 & rdb1006 - https://phabricator.wikimedia.org/T119543#1873371 (10Cmjohnson) a:5Cmjohnson>3aaron @aaron db1005 and rdb1006 are all set and installed, ssh accessible and ready to be put to work. 
[16:52:19] jynus: ah yes, I remember that now that you say it [16:52:48] so all encoding is solved at the application side [16:53:04] jynus probably [16:53:35] I was hoping maybe it was a server setting or something since the characters are converted to either set [16:54:42] paravoid: back on the academic argument though: it's impossible even for the separately-configured icinga check to catch all relevant code bugs. The purpose of monitoring isn't to catch code bugs in general, it's to catch when pybal is basically working ok and there's problems with pybal's backends that pybal can accurately see. [16:55:15] it is done by mediawiki transparently so operators do not see much of that, maybe someone else can help you on mediawiki-specific channels, I am sorry I am not very helpful [16:55:48] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Cirrus-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1873393 (10EBernhardson) [16:56:19] jynus no you have been very helpful [16:56:26] at least I know where to look now :) [16:56:29] Thanks [16:58:10] (03PS1) 10Filippo Giunchedi: ganglia: add ganglia::cluster exported resource [puppet] - 10https://gerrit.wikimedia.org/r/258473 (https://phabricator.wikimedia.org/T119520) [16:58:20] (03PS1) 10Dereckson: Allow sysop to grant and revoke transwiki on gu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258474 (https://phabricator.wikimedia.org/T120346) [16:59:11] (03CR) 10jenkins-bot: [V: 04-1] ganglia: add ganglia::cluster exported resource [puppet] - 10https://gerrit.wikimedia.org/r/258473 (https://phabricator.wikimedia.org/T119520) (owner: 10Filippo Giunchedi) [17:02:45] (03PS2) 10Filippo Giunchedi: ganglia: add ganglia::cluster exported resource [puppet] - 10https://gerrit.wikimedia.org/r/258473 (https://phabricator.wikimedia.org/T119520) [17:10:34] gwicke, mobrovac - do you know why TreatAsUntrusted is not working for the analytics api?
[17:11:18] https://www.mediawiki.org/wiki/Extension:Graph/pageviews works because graphoid does not send that header, but client side api request is not working [17:16:32] 6operations: Provide a means to run production and semi-production services on separate vlans - https://phabricator.wikimedia.org/T121240#1873437 (10GWicke) 3NEW [17:16:45] (03PS1) 10Luke081515: Enable flood group at lvwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258477 (https://phabricator.wikimedia.org/T121238) [17:24:24] 6operations: Provide a means to run production and semi-production services without access to the catch-all production networking environment - https://phabricator.wikimedia.org/T121240#1873473 (10GWicke) [17:35:07] (03CR) 10Chad: "Good catch inline from Florian. Otherwise lgtm." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258026 (owner: 10EBernhardson) [17:37:10] (03PS2) 10EBernhardson: Remove $wmgUseCirrus variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258026 [17:38:38] 6operations: Provide a means to run production and semi-production services without access to the catch-all production networking environment - https://phabricator.wikimedia.org/T121240#1873550 (10GWicke) [17:40:42] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 1 failures [17:42:11] (03PS2) 10Chad: Replace deprecated wgConf->localVHosts with wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226971 (https://phabricator.wikimedia.org/T106206) (owner: 10Alex Monk) [17:44:38] !log rolling restart of codfw's S1 mysqls finished [17:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:48:40] 6operations, 6Services: Provide a means to run production and semi-production services without access to the catch-all production networking environment - https://phabricator.wikimedia.org/T121240#1873591 (10GWicke) [17:49:01] 6operations, 6Services: Provide a means to run production and 
semi-production services without access to the catch-all production networking environment - https://phabricator.wikimedia.org/T121240#1873437 (10GWicke) [17:49:16] (03PS1) 10Giuseppe Lavagetto: hhvm::debug: change owner of /tmp/heaps [puppet] - 10https://gerrit.wikimedia.org/r/258480 [17:49:33] (03PS1) 10Jcrespo: Depool db1057 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258481 [17:49:36] Krenair: I rebased your vhosts thingie ^ [17:49:50] I grepped, that lgtm as well, we can do whenever. [17:50:21] thanks [17:51:05] (03CR) 10Chad: [C: 031] Remove $wmgUseCirrus variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258026 (owner: 10EBernhardson) [17:51:07] (03PS1) 10Papaul: Add mgmt DNS asset tag entries for new 8 misc servers Removed mgmt DNS asset tag for pollux Bug:T120885 T117423 [dns] - 10https://gerrit.wikimedia.org/r/258483 [17:51:27] (03PS2) 10Jcrespo: Depool db1057 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258481 [17:52:10] 6operations, 6Services: Provide a means to run production and semi-production services without access to the catch-all production networking environment - https://phabricator.wikimedia.org/T121240#1873597 (10GWicke) [17:52:39] (03CR) 10Jcrespo: [C: 032] Depool db1057 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258481 (owner: 10Jcrespo) [17:53:22] (03PS2) 10Dzahn: Fix viewing raw in phabricator on a pc [puppet] - 10https://gerrit.wikimedia.org/r/257629 (owner: 10Paladox) [17:53:23] * ostriches looks around for someone for an easy puppet review. 
[17:53:40] 6operations, 6Services: Provide a means to run production and semi-production services without access to the catch-all production networking environment - https://phabricator.wikimedia.org/T121240#1873437 (10GWicke) [17:53:59] (03CR) 10Dzahn: [C: 032] Fix viewing raw in phabricator on a pc [puppet] - 10https://gerrit.wikimedia.org/r/257629 (owner: 10Paladox) [17:54:19] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1057 for maintenance (duration: 00m 35s) [17:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:54:27] (03PS2) 10Giuseppe Lavagetto: hhvm::debug: change owner of /tmp/heaps [puppet] - 10https://gerrit.wikimedia.org/r/258480 [17:55:48] 6operations, 10ops-codfw: rack 8 new misc systems - https://phabricator.wikimedia.org/T120885#1873611 (10Papaul) [17:55:57] mutante: I fixed https://gerrit.wikimedia.org/r/#/c/257193/ with PS7. You can ignore the previous patches. [17:56:27] ostriches: i was looking at that patch set in the same moment :) [17:56:31] yea [17:56:35] :) [17:56:36] ty [17:56:38] User::loadFromDatabase, that is an error that I had never seen before [17:56:44] so let's do that, will cause a moment of gerrit restart [17:56:58] mutante: Peeps will live :) [17:57:06] jynus: What about it? [17:57:09] (03PS8) 10Dzahn: Fix redirections in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/257193 (owner: 10Paladox) [17:57:09] (do not worry, it is a spike of 50 only [17:57:27] and from API [17:57:34] (03CR) 10Paladox: "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/257629 (owner: 10Paladox) [17:57:45] but on the master, that is new [17:57:50] (03CR) 10Florianschmidtwelzow: [C: 031] Remove $wmgUseCirrus variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258026 (owner: 10EBernhardson) [17:57:52] (03CR) 10Dzahn: [C: 032] Fix redirections in gerrit [puppet] - 10https://gerrit.wikimedia.org/r/257193 (owner: 10Paladox) [17:59:14] (03CR) 10Dzahn: "has been applied on iridium" [puppet] - 10https://gerrit.wikimedia.org/r/257629 (owner: 10Paladox) [17:59:25] it could be one of those errors where signaling an error is the right thing, user is trying to edit a single page several times per second [18:00:02] so I am not worried [18:00:33] gerrit website just went down [18:00:58] nevermind seems to be working again [18:01:14] kaldari: the service restarted due to a config change [18:01:17] kaldari: I think it was this one https://gerrit.wikimedia.org/r/257193 [18:01:21] that should have fixed some redirects [18:01:34] ostriches: can we get the bot restarted .. [18:01:35] thanks [18:01:48] YuviPanda: Can you kick grrrit-wm? [18:02:03] mutante: so I doubt it work [18:02:04] and I was wondering if I missed a planned maintenance window. Thanks for explanation! :) [18:02:12] it worked* [18:02:14] ostriches, 2000 errors in a month, I think we can ignore it [18:02:27] mutante: Whee: https://phabricator.wikimedia.org/r/revision/operations/puppet;6809baaa12b40fb4471eda73def9c381f90d915c :D [18:02:28] no I am being redirected to https://phabricator.wikimedia.org/diffusion/OPUP/history/refs/heads/production/ [18:02:47] Ok, that's half of the bug fixed. [18:02:50] lol [18:02:51] ok [18:02:58] at least something works better [18:02:59] nice [18:03:00] :p better half than nothing :p [18:03:01] Phab expects just bare references. 
[18:03:08] Not fully qualified refs/* [18:03:10] btw, also on the phabricator side [18:03:15] the change got applied [18:03:19] just before this [18:03:24] That's a different thing, unrelated. [18:03:34] We still need to fix the refs/* handling. [18:03:39] 6operations, 6Services: Provide a means to run production and semi-production services without access to the catch-all production networking environment - https://phabricator.wikimedia.org/T121240#1873625 (10GWicke) [18:05:11] akosiaris: It was "working" before because the branch was getting stripped and you just ended up at the project's main page. Which looked *ok* but wasn't necessarily right, especially if you were following a link to a non-HEAD branch. [18:05:12] :) [18:05:27] ostriches: hehe, indeed [18:05:32] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:05:34] We're keeping the branch, just now in a format Phab needs to munge. We have a change up, gonna finish up today :) [18:07:20] 6operations, 6Services: Provide a means to run production and semi-production services without access to the catch-all production networking environment - https://phabricator.wikimedia.org/T121240#1873632 (10GWicke) [18:10:11] 6operations, 6Services: Provide a means to run production and semi-production services without access to the catch-all production networking environment - https://phabricator.wikimedia.org/T121240#1873638 (10GWicke) [18:13:38] !log demon@tin Synchronized wmf-config/: Remove ability to turn off Cirrus completely. It's a scary switch that would Break Everything. There's much better options to use if you need to tune its load or behavior. 
(duration: 00m 29s) [18:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:46] ebernhardson: Hehehe ^^^ [18:14:04] 6operations, 6Performance-Team, 5Patch-For-Review: Use cgroups to limit thumbor & subprocesses resource usage - https://phabricator.wikimedia.org/T120940#1873643 (10fgiunchedi) +#operations we'll need to figure out the cgroup bits for production, the vagrant part is at https://gerrit.wikimedia.org/r/#/c/258... [18:19:32] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 13.04% of data above the critical threshold [100000000.0] [18:28:14] !log demon@tin Synchronized wmf-config/: cleanup wgconf vhost config (duration: 00m 29s) [18:28:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:29:03] !log demon@tin Synchronized docroot/: cleanup wgconf vhost config (the sequel) (duration: 00m 29s) [18:29:07] Krenair: All done ^ :) [18:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:03] ostriches, oh, thanks [18:31:07] that's been on my todo list for months [18:31:40] I'm trying to clean out some of the weird stuff like that from the wmf-config backlog. [18:31:50] Like, non-wiki-config config debt. 
[18:35:14] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1873721 (10RobH) [18:36:53] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1840448 (10RobH) [18:40:36] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [18:48:53] 6operations, 10procurement: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1840448 (10RobH) [18:51:29] 6operations, 10hardware-requests: Upgrade restbase100[7-9] to match restbase100[1-6] hardware - https://phabricator.wikimedia.org/T119935#1840448 (10RobH) [18:54:18] ostriches: yeah I'll kick it [18:54:26] tyvm [18:54:49] !log demon@tin Synchronized wmf-config/CommonSettings.php: rm some deprecated profiling config (duration: 00m 28s) [18:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:57:41] (03PS1) 10Dzahn: introduce cygnus.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/258494 (https://phabricator.wikimedia.org/T118763) [18:59:31] YuviPanda: 253801/253802 : ldap config in wikitech.php. Want I can merge? 
[18:59:47] (not setting default thingies anymore b/c hiera) [19:00:15] ostriches: \o/ yes, should be noops as well [19:00:44] (03CR) 10Chad: [C: 032] wikitech: Do not set realm in ldap by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253801 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [19:00:53] (03CR) 10Chad: [C: 032] wikitech: Stop setting default classes for new instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253802 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [19:01:42] 6operations, 10ops-codfw, 5Patch-For-Review: return pollux to spares - https://phabricator.wikimedia.org/T117423#1873910 (10Dzahn) the DRAC of pollux still appears to be up on 10.193.1.50 i get a reply there and a DRAC login. the patch above would use that IP for a different server though to avoid conflict... [19:01:44] (03Merged) 10jenkins-bot: wikitech: Do not set realm in ldap by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253801 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [19:02:03] (03Merged) 10jenkins-bot: wikitech: Stop setting default classes for new instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253802 (https://phabricator.wikimedia.org/T101447) (owner: 10Yuvipanda) [19:03:46] (03PS2) 10Dzahn: introduce cygnus.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/258494 (https://phabricator.wikimedia.org/T118763) [19:04:01] (03CR) 10Dzahn: [C: 032] introduce cygnus.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/258494 (https://phabricator.wikimedia.org/T118763) (owner: 10Dzahn) [19:05:56] (03PS1) 10Legoktm: Re-enable CentralAuth-Bug39996 debug log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258496 (https://phabricator.wikimedia.org/T119736) [19:06:06] !log demon@tin Synchronized wmf-config/wikitech.php: rm some old ldap stuff. yay hiera! 
(duration: 00m 28s) [19:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:06:13] YuviPanda: ^^^ [19:06:40] ostriches is having a fun cleanup day :) [19:07:00] array_pop( $backlog ); [19:07:23] Eh, it's a little more than first off the list, but same idea :) [19:07:23] :D [19:07:26] ostriches: checking [19:07:28] * ebernhardson keeps dreaming of the $backlog->pop() upgrade to php [19:08:03] greg-g: could I get that ^ debug log addition deployed today? it's for a UBN in CentralAuth [19:09:17] (03Abandoned) 10Chad: Close wikimania2014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224770 (https://phabricator.wikimedia.org/T105675) (owner: 10Dereckson) [19:09:40] legoktm: Which one? [19:09:44] Ah I see it [19:09:45] https://gerrit.wikimedia.org/r/258496 [19:09:52] (03PS1) 10Dzahn: installserver: add cygnus.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/258497 (https://phabricator.wikimedia.org/T118763) [19:10:22] (03CR) 10Chad: [C: 031] Re-enable CentralAuth-Bug39996 debug log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258496 (https://phabricator.wikimedia.org/T119736) (owner: 10Legoktm) [19:11:23] legoktm: yah [19:11:42] and, from now on until the new year, ostriches is now the official "hey, can I?" person :) [19:11:52] ah, ok :D [19:11:58] akosiaris: ganeti: in eqiad i use 1003 but in codfw i dont use 2003 but 2001? [19:12:01] * ostriches puts on the hat [19:12:16] ostriches: any chance you want to deploy that for me? 
:) [19:12:36] Sure, one sec [19:12:42] (03CR) 10Chad: [C: 032] Re-enable CentralAuth-Bug39996 debug log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258496 (https://phabricator.wikimedia.org/T119736) (owner: 10Legoktm) [19:12:51] thanks :D [19:13:07] (03Merged) 10jenkins-bot: Re-enable CentralAuth-Bug39996 debug log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258496 (https://phabricator.wikimedia.org/T119736) (owner: 10Legoktm) [19:13:07] !log creating cygnus.codfw.wmnet on ganeti2001 [19:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:22] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: CentralAuth logging for Lego (duration: 00m 28s) [19:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:27] legoktm: there you go ^^^ [19:14:37] ty :) [19:14:56] (03CR) 10Dzahn: [C: 032] site: introduce cygnus.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/258493 (https://phabricator.wikimedia.org/T118763) (owner: 10Dzahn) [19:17:52] legoktm: Is this related to the Could not find local user data for.... errors I've been spotting? [19:18:02] ostriches: yes [19:18:15] Ok just double checking [19:20:33] Reedy: yo, you about? 
[19:23:42] (03PS2) 10Dzahn: installserver: add cygnus.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/258497 (https://phabricator.wikimedia.org/T118763) [19:26:00] (03PS3) 10Dzahn: installserver: add cygnus.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/258497 (https://phabricator.wikimedia.org/T118763) [19:26:42] (03CR) 10Dzahn: [C: 032] installserver: add cygnus.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/258497 (https://phabricator.wikimedia.org/T118763) (owner: 10Dzahn) [19:26:54] !log started foreachwiki checkLocalUser.php (CentralAuth) on terbium [19:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:27:14] 6operations, 10Deployment-Systems: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1874032 (10demon) [19:28:01] 6operations, 10Deployment-Systems: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1874037 (10Dzahn) woohoo :) all blockers closed? when are we deploying from mira ? [19:28:38] 6operations, 5wikis-in-codfw: Document what is left for having a full cluster installation in codfw - https://phabricator.wikimedia.org/T97322#1874042 (10demon) [19:28:42] 6operations, 10Deployment-Systems: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1874040 (10demon) 5Open>3Resolved And mira's happy too, afaict. [19:28:47] 6operations, 10Deployment-Systems: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1874045 (10demon) [19:28:49] :)) [19:28:57] ostriches: yay! [19:29:01] that's very cool [19:29:34] I was mainly giving it like a week to see if permissions blew up again, but I think we're set. 
[19:29:49] And mira's as warm as can be, it gets updated every scap/sync-* [19:33:29] !log demon@mira Synchronized README: testing (duration: 01m 35s) [19:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:33:40] weee [19:33:57] mutante: So, no reason we can't :p [19:34:26] Although, T120157. I'd like to remove the delta between mira/tin in /srv/deployment. [19:34:53] ostriches: greg-g: :)) maybe we can schedule one upcoming deploy to actually use mira? [19:34:55] ostriches: security patches auto-synced across? [19:36:36] Mhmm [19:36:59] The .git directories are sync'd, so the current checkout ref is the same, etc. [19:39:02] (03PS3) 10Ori.livneh: hhvm::debug: change owner of /tmp/heaps [puppet] - 10https://gerrit.wikimedia.org/r/258480 (owner: 10Giuseppe Lavagetto) [19:39:09] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm::debug: change owner of /tmp/heaps [puppet] - 10https://gerrit.wikimedia.org/r/258480 (owner: 10Giuseppe Lavagetto) [19:39:27] mutante: hehe, https://phabricator.wikimedia.org/T118602 :p [19:39:56] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: puppet fail [19:40:27] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: puppet fail [19:40:59] ostriches: easy now, we got a script,i'll do it [19:42:39] [fermium:~] $ sudo /usr/local/sbin/disable_list wikibugs-l [19:42:39] wikibugs-l disabled. Archives should be available at current location, all mail should be moderated and the list should not be on the listinfo page. 
[19:42:42] done [19:43:09] (this used to be much more annoying before that script) [19:44:15] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: puppet fail [19:44:17] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: puppet fail [19:44:25] PROBLEM - puppet last run on mw2099 is CRITICAL: CRITICAL: puppet fail [19:44:45] PROBLEM - puppet last run on mw1124 is CRITICAL: CRITICAL: puppet fail [19:44:54] uhm [19:44:56] PROBLEM - puppet last run on mw2065 is CRITICAL: CRITICAL: puppet fail [19:44:56] PROBLEM - puppet last run on mw2160 is CRITICAL: CRITICAL: puppet fail [19:44:58] checks puppetmaster [19:45:05] PROBLEM - puppet last run on mw2041 is CRITICAL: CRITICAL: puppet fail [19:45:05] PROBLEM - puppet last run on mw1087 is CRITICAL: CRITICAL: puppet fail [19:45:06] PROBLEM - puppet last run on mw2112 is CRITICAL: CRITICAL: puppet fail [19:45:06] PROBLEM - puppet last run on mw2201 is CRITICAL: CRITICAL: puppet fail [19:45:16] PROBLEM - puppet last run on mw2197 is CRITICAL: CRITICAL: puppet fail [19:45:16] PROBLEM - puppet last run on mw1032 is CRITICAL: CRITICAL: puppet fail [19:45:16] PROBLEM - puppet last run on mw1038 is CRITICAL: CRITICAL: puppet fail [19:45:16] PROBLEM - puppet last run on mw2103 is CRITICAL: CRITICAL: puppet fail [19:45:21] Duplicate declaration: Class[Hhvm] is already declared; [19:45:23] ^ [19:45:26] PROBLEM - puppet last run on mw1130 is CRITICAL: CRITICAL: puppet fail [19:45:30] kills bot temp. 
[19:45:35] PROBLEM - puppet last run on mw2170 is CRITICAL: CRITICAL: puppet fail [19:45:36] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: puppet fail [19:45:36] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: puppet fail [19:45:46] PROBLEM - puppet last run on mw2139 is CRITICAL: CRITICAL: puppet fail [19:45:46] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: puppet fail [19:45:47] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: puppet fail [19:45:47] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: puppet fail [19:46:00] ori: ? [19:46:01] ori: [19:46:07] duplicate definition ..hhvm [19:46:10] thanks, fixing [19:46:32] i should have spotted that, sorry [19:47:09] (03PS1) 10GWicke: Make /api/rest_v1/ work for test.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/258498 [19:47:33] 8 unreviewed patches and they're all puppet lol. [19:50:34] (03PS1) 10Ori.livneh: hhvm: declare /tmp/heaps in init.pp [puppet] - 10https://gerrit.wikimedia.org/r/258499 [19:50:49] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm: declare /tmp/heaps in init.pp [puppet] - 10https://gerrit.wikimedia.org/r/258499 (owner: 10Ori.livneh) [19:51:27] (03PS2) 10Yurik: Make /api/rest_v1/ work for test.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/258498 (https://phabricator.wikimedia.org/T120977) (owner: 10GWicke) [19:51:36] (03PS3) 10Yurik: Make /api/rest_v1/ work for test.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/258498 (https://phabricator.wikimedia.org/T120977) (owner: 10GWicke) [19:52:22] mutante: okay, fixed now, but log spam will probably continue for another 30 mins or so [19:52:38] i'll run puppet on all app servers via salt with a 10% batch size [19:53:07] yurik: thanks! [19:53:08] ori: thanks. 
the next puppet run will bring the bot back [19:55:07] gwicke, there is a empty space there - https://gerrit.wikimedia.org/r/#/c/258498/3/templates/varnish/text-backend.inc.vcl.erb [19:55:13] (03PS1) 10Jcrespo: Reconfigure db1057 (ferm, performance_schema, ssl) [puppet] - 10https://gerrit.wikimedia.org/r/258500 [19:55:50] gwicke, your editor usually can save it without trailing spaces [19:56:26] (03PS2) 10Jcrespo: Reconfigure db1057 (ferm, performance_schema, ssl) [puppet] - 10https://gerrit.wikimedia.org/r/258500 [19:57:51] yurik: yeah, but I think I don't have that enabled for puppet currently [19:58:08] let me fix in any case [19:59:40] (03CR) 10Jcrespo: [C: 032] Reconfigure db1057 (ferm, performance_schema, ssl) [puppet] - 10https://gerrit.wikimedia.org/r/258500 (owner: 10Jcrespo) [20:02:36] RECOVERY - puppet last run on mw2058 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:06:17] !log restarting and reloading mysql for upgrade and reconfiguration at db1057 [20:06:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:30] (03PS1) 10Paladox: Fix replicating open patches [puppet] - 10https://gerrit.wikimedia.org/r/258503 [20:07:07] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:07:08] (03PS2) 10Paladox: Fix replicating open patches [puppet] - 10https://gerrit.wikimedia.org/r/258503 [20:08:27] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:08:49] (03PS3) 10Paladox: Fix replicating open patches [puppet] - 10https://gerrit.wikimedia.org/r/258503 [20:09:47] (03PS4) 10GWicke: Make /api/rest_v1/ work for test.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/258498 (https://phabricator.wikimedia.org/T120977) [20:11:39] ostriches: forgot to tell you, it's all good (Re: wikitech patches). 
thanks [20:11:47] RECOVERY - puppet last run on mw2197 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:11:56] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:12:10] YuviPanda: no problem :) [20:12:25] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:12:26] RECOVERY - puppet last run on mw1089 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:12:46] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:12:46] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:12:46] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:12:55] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:12:56] RECOVERY - puppet last run on mw2099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:13:06] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:13:16] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:13:17] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:13:35] RECOVERY - puppet last run on mw2065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:13:36] RECOVERY - puppet last run on mw2160 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:13:36] RECOVERY - puppet last run on mw2116 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [20:13:36] RECOVERY - puppet last run on mw1048 is 
OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:13:36] RECOVERY - puppet last run on mw1087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:13:36] RECOVERY - puppet last run on mw2041 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:13:36] RECOVERY - puppet last run on mw2112 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [20:13:40] (03PS3) 10Papaul: Add mgmt DNS asset tag entries for new 8 misc servers Re-enter mgmt DNS asset tag for pollux since it is in the spare list Bug:T120885 Bug:T117423 Change-Id: I51ab2cdd2f864e6a13115700635f85d2dee2fb1a [dns] - 10https://gerrit.wikimedia.org/r/258483 (https://phabricator.wikimedia.org/T120885) [20:13:46] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:13:46] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:13:47] RECOVERY - puppet last run on mw1032 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:13:47] RECOVERY - puppet last run on mw1038 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:13:55] RECOVERY - puppet last run on mw2100 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [20:13:55] RECOVERY - puppet last run on mw1106 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:13:56] RECOVERY - puppet last run on mw2103 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:07] RECOVERY - puppet last run on mw2170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:07] RECOVERY - puppet last run on mw1024 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:14:07] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently 
enabled, last run 1 minute ago with 0 failures [20:14:16] RECOVERY - puppet last run on mw2064 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:14:25] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:14:25] RECOVERY - puppet last run on mw2139 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:26] RECOVERY - puppet last run on mw2153 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [20:14:26] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:14:35] RECOVERY - puppet last run on mw2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:36] RECOVERY - puppet last run on mw2164 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:14:37] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:14:46] RECOVERY - puppet last run on mw2031 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:14:46] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:46] RECOVERY - puppet last run on mw2140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:46] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:56] RECOVERY - puppet last run on mw2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:56] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:56] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:57] RECOVERY - puppet last run on mw1033 
is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:14:57] RECOVERY - puppet last run on mw2199 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [20:15:06] 6operations, 6Services: Provide a means to run production and semi-production services without access to the catch-all production networking environment - https://phabricator.wikimedia.org/T121240#1874177 (10GWicke) [20:15:17] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:15:17] RECOVERY - puppet last run on mw2010 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:15:26] RECOVERY - puppet last run on mw1069 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:15:35] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:15:36] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:15:37] RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:15:38] 6operations, 6Services: Provide a means to run production and semi-production services without access to the catch-all production networking environment - https://phabricator.wikimedia.org/T121240#1873437 (10GWicke) [20:15:45] RECOVERY - puppet last run on mw2083 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:15:46] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:15:46] RECOVERY - puppet last run on mw2080 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:15:46] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:15:55] RECOVERY - puppet last run on mw1166 is OK: OK: 
Puppet is currently enabled, last run 14 seconds ago with 0 failures [20:15:55] RECOVERY - puppet last run on mw1091 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:15:56] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:16:06] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:16:25] RECOVERY - puppet last run on mw2109 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:26] RECOVERY - puppet last run on mw2105 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:27] RECOVERY - puppet last run on mw2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:36] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:36] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:16:36] RECOVERY - puppet last run on mw2082 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:16:36] RECOVERY - puppet last run on mw2075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:36] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:16:36] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:16:37] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:16:37] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:37] RECOVERY - puppet last run on mw1107 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:46] RECOVERY - puppet 
last run on mw2143 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:16:46] RECOVERY - puppet last run on mw2004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:16:55] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:16:56] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:16:56] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:16:56] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:16:56] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:16:56] RECOVERY - puppet last run on mw2188 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:57] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:16:59] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:15] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:15] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [20:17:16] RECOVERY - puppet last run on mw2114 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:17] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:17:23] 6operations, 6Services: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874187 (10GWicke) [20:17:27] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures [20:17:27] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:35] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:17:36] RECOVERY - puppet last run on mw2184 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:36] RECOVERY - puppet last run on mw1066 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:17:37] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [20:17:45] RECOVERY - puppet last run on mw2117 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:45] RECOVERY - puppet last run on mw2067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:46] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:17:46] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:47] RECOVERY - puppet last run on mw2142 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [20:17:55] RECOVERY - puppet last run on mw2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:55] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [20:17:56] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:17:56] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:56] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:17:57] RECOVERY - puppet last run on mw2019 is OK: OK: 
Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:57] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:57] RECOVERY - puppet last run on mw2030 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:18:05] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:18:07] RECOVERY - puppet last run on mw2092 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:18:16] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [20:18:16] RECOVERY - puppet last run on mw2084 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [20:18:16] RECOVERY - puppet last run on mw2039 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:18:17] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:17] RECOVERY - puppet last run on mw2096 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:25] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [20:18:27] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:27] RECOVERY - puppet last run on mw2090 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [20:18:35] RECOVERY - puppet last run on mw2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:36] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:18:36] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:36] RECOVERY - 
puppet last run on mw2196 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:36] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:37] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:18:37] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:37] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:18:37] RECOVERY - puppet last run on mw2134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:38] RECOVERY - puppet last run on mw2055 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [20:18:46] RECOVERY - puppet last run on mw2049 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [20:18:46] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:18:55] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:15] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:19:17] RECOVERY - puppet last run on mw2085 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:19:17] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:17] RECOVERY - puppet last run on mw2047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:28] RECOVERY - puppet last run on mw2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:36] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 
failures [20:19:36] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:19:45] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:45] RECOVERY - puppet last run on mw2062 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [20:19:46] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:46] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:47] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [20:19:47] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:57] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:17] RECOVERY - puppet last run on mw2200 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:20:27] RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:35] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:36] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:37] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:46] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:56] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:20:56] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently 
enabled, last run 1 minute ago with 0 failures [20:21:06] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:21:06] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:06] RECOVERY - puppet last run on mw1084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:06] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:07] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:16] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:25] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:26] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:21:26] RECOVERY - puppet last run on mw2048 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [20:21:45] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:21:46] RECOVERY - puppet last run on mw2107 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:55] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:56] RECOVERY - puppet last run on mw2133 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:21:57] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:22:05] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:22:26] RECOVERY - puppet last run on mw1218 is 
OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:22:27] RECOVERY - puppet last run on mw1233 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:22:30] if you can read this you have won 1 million dollars [20:22:35] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:22:45] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:22:46] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:22:46] RECOVERY - puppet last run on mw2046 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:22:46] RECOVERY - puppet last run on mw2202 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:22:48] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:22:48] ostriches: Can I cherry-pick + deploy https://gerrit.wikimedia.org/r/258504 so the maintenance script that I'm trying to run will stop fataling? 
[20:22:56] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:22:56] RECOVERY - puppet last run on mw2053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:22:57] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:22:57] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:23:06] RECOVERY - puppet last run on mw1256 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:23:06] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:23:06] RECOVERY - puppet last run on mw1080 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:23:17] RECOVERY - puppet last run on mw2026 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:23:19] RoanKattouw: lgtm. 
[20:23:23] Thanks [20:23:27] RECOVERY - puppet last run on mw1072 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:23:27] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:23:29] (03CR) 10Dzahn: "changes that have comments like "this is a test, i dont know if this works" are not making reviewers exactly trigger-happy" [puppet] - 10https://gerrit.wikimedia.org/r/258503 (owner: 10Paladox) [20:23:36] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:23:45] RECOVERY - puppet last run on mw1074 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:23:46] RECOVERY - puppet last run on mw2144 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [20:23:47] RECOVERY - puppet last run on mw2032 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:23:56] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:23:56] RECOVERY - puppet last run on mw2027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:24:05] RECOVERY - puppet last run on mw2068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:24:07] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:24:36] RECOVERY - puppet last run on mw2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:26:05] if you can read this you have won 1 million dollars p858snake, 1 millon of https://en.wikipedia.org/wiki/D%C3%B3lar [20:28:55] lol [20:29:00] !log catrope@tin Synchronized php-1.27.0-wmf.8/extensions/Flow/maintenance/FlowPopulateRefId.php: Fix fatals due to missing wiki condition (T117786) (duration: 00m 29s) [20:29:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:29:11] (03CR) 10Paladox: "The comment message will be updated one the code looks good to the reviewer." [puppet] - 10https://gerrit.wikimedia.org/r/258503 (owner: 10Paladox) [20:29:22] (03PS4) 10Paladox: Fix replicating open patches [puppet] - 10https://gerrit.wikimedia.org/r/258503 [20:29:54] (03PS5) 10Paladox: Fix replicating open patches [puppet] - 10https://gerrit.wikimedia.org/r/258503 [20:31:52] 6operations, 6Services: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874218 (10GWicke) [20:32:22] (03PS3) 10coren: Labs: monitor getent timing only on active fileserver [puppet] - 10https://gerrit.wikimedia.org/r/258448 [20:33:14] 6operations, 6Services: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1873437 (10GWicke) [20:33:51] (03CR) 10coren: [C: 032] "Self review - trivial change (just change what server does the test) and we want to quiet the false alarms." [puppet] - 10https://gerrit.wikimedia.org/r/258448 (owner: 10coren) [20:33:56] 6operations, 6Services: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874227 (10GWicke) [20:34:23] greg-g: hi! There's a CentralNotice patch that there's been some talk of asking for a Friday deploy for. Not yet +2'd even though it could probably be pretty soon. May or may not be important for FR, it's actually not possible to be sure. Just to check, where does official policy stand on this sort of thing? Thx in advance! [20:36:45] AndyRussG: Official policy is roughly "in case of dire emergency break policy" [20:36:46] ostriches: ---^^ [20:37:02] AndyRussG: greg-g is on vacation as of ~an hour ago, ostriches is his delegate [20:37:13] bd808: that's what I thought [20:37:17] RoanKattouw: ah K thanks! 
[20:37:38] Also if you want me to deploy that, I will, as long as you can get Someone Important to say yes [20:37:42] (e.g. ostriches or K4) [20:37:56] ostriches: I need to know what counts as 'emergency' for wikitech, we have a patch to deploy as well to fix a breaking bug. [20:38:39] RoanKattouw: cool K, thanks much, that's great to know :) [20:40:46] (03PS1) 10Dzahn: cygnus/technetium: add empty admin group [puppet] - 10https://gerrit.wikimedia.org/r/258508 (https://phabricator.wikimedia.org/T118763) [20:41:22] (03CR) 10Chad: [C: 04-1] "The patches are being replicated. This patch isn't necessary." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/258503 (owner: 10Paladox) [20:43:21] RoanKattouw: Do I get a t-shirt that says Someone Important? [20:43:24] Coren: link? [20:43:53] ostriches: https://gerrit.wikimedia.org/r/#/c/258462/ [20:44:05] ostriches: i.e. project creation is broken atm [20:44:34] (03PS2) 10Dzahn: cygnus/technetium: add empty admin group [puppet] - 10https://gerrit.wikimedia.org/r/258508 (https://phabricator.wikimedia.org/T118763) [20:44:46] (03CR) 10Paladox: Fix replicating open patches (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/258503 (owner: 10Paladox) [20:44:51] (03Abandoned) 10Paladox: Fix replicating open patches [puppet] - 10https://gerrit.wikimedia.org/r/258503 (owner: 10Paladox) [20:45:03] (03PS3) 10Dzahn: cygnus/technetium: add empty admin group [puppet] - 10https://gerrit.wikimedia.org/r/258508 (https://phabricator.wikimedia.org/T118763) [20:45:04] Coren: I'm guessing the parent too? [20:45:20] (03CR) 10Paladox: "'The patches are being replicated. This patch isn't necessary.'" [puppet] - 10https://gerrit.wikimedia.org/r/258503 (owner: 10Paladox) [20:46:04] ostriches: Aye; they're related. [20:46:19] * ostriches stamps Coren's forehead [20:46:20] go forth [20:46:42] ostriches: Danke schoen. 
[20:46:56] (03CR) 10Dzahn: [C: 032] cygnus/technetium: add empty admin group [puppet] - 10https://gerrit.wikimedia.org/r/258508 (https://phabricator.wikimedia.org/T118763) (owner: 10Dzahn) [20:47:00] ostriches: hi! and WRT CN? [20:47:34] If it's CN/Fundraising I want K4 to say ok :) [20:48:09] !log setting db1057 as the parent mysql of db2016 for s1 replication [20:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:48:32] ostriches: K gotcha, makes sense :) thx!! [20:50:22] clearly like that, https://dbtree.wikimedia.org/ is prettier [20:51:03] PROBLEM - DPKG on multatuli is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:51:38] (that's me) [20:52:04] thanks for the heads up [20:53:03] RECOVERY - DPKG on multatuli is OK: All packages OK [20:53:55] (03PS4) 10Alex Monk: Add mgmt DNS asset tag entries for new 8 misc servers [dns] - 10https://gerrit.wikimedia.org/r/258483 (https://phabricator.wikimedia.org/T120885) (owner: 10Papaul) [20:55:18] !log stopping and reconfiguring replication on db2016 (only for some minutes) [20:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:55:30] ostriches: RoanKattouw: Just FYI, K4 is not online just now, we'll let you know if we have something... Thanks again! :) [20:56:18] (03CR) 10Yurik: [C: 031] Make /api/rest_v1/ work for test.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/258498 (https://phabricator.wikimedia.org/T120977) (owner: 10GWicke) [21:03:24] (03Abandoned) 10Yuvipanda: Revert "Labs: make labs-ip-alias-dump ignore addressless hosts" [puppet] - 10https://gerrit.wikimedia.org/r/257368 (owner: 10Yuvipanda) [21:07:49] anyone deploying right now? 
I need to deploy https://gerrit.wikimedia.org/r/#/c/258515/ [21:10:21] (03PS1) 10Ottomata: Adding Yubikey ssh key for otto [puppet] - 10https://gerrit.wikimedia.org/r/258517 [21:12:18] (03CR) 10Ottomata: [C: 032] Adding Yubikey ssh key for otto [puppet] - 10https://gerrit.wikimedia.org/r/258517 (owner: 10Ottomata) [21:16:09] (03PS1) 10BBlack: cache_text: do not rewrite zero.wm.o like a zero domain [puppet] - 10https://gerrit.wikimedia.org/r/258518 [21:16:53] (03PS2) 10BBlack: cache_text: do not rewrite zero.wm.o like a zero domain [puppet] - 10https://gerrit.wikimedia.org/r/258518 [21:17:00] (03CR) 10BBlack: [C: 032 V: 032] cache_text: do not rewrite zero.wm.o like a zero domain [puppet] - 10https://gerrit.wikimedia.org/r/258518 (owner: 10BBlack) [21:17:53] Coren or chasemp: I'm ready to sync OpenStackManager, are one of you able to test it to be sure the fix works? [21:18:11] twentyafterfour: Yep. Just say when. [21:18:28] syncing now [21:18:50] !log twentyafterfour@tin Synchronized php-1.27.0-wmf.8/extensions/OpenStackManager/: deploying https://gerrit.wikimedia.org/r/#/c/258516/ (duration: 00m 28s) [21:18:51] twentyafterfour: sure [21:18:51] done. [21:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:19:02] * Coren tests! [21:19:10] 6operations, 10DBA, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1874294 (10jcrespo) Enwiki channel (s1) is now also encrypted, using db1057 as a bridge. I am also testing ROW based replication on codfw, but that is out of the scope of this ticket. [21:19:57] twentyafterfour: I can haz success. [21:20:02] awesome [21:20:04] twentyafterfour: TY, dude. 
[21:20:12] Coren: you're welcome, no problem at all [21:20:39] (03CR) 10Yuvipanda: [C: 04-1] [WIP] apache: Add role to serve static sites on multiple hosts using apache (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/258096 (owner: 10Madhuvishy) [21:20:54] 6operations, 10DBA, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1874300 (10jcrespo) Enwiki on codfw is now using ROW-based replication as a test, to check regressions and compare its performance to mixed (mostly, statement) on eqiad. [21:21:16] Coren: should we hand-create groups for the projects that have been created recently [21:21:44] YuviPanda: I did it as a test, and it worked, but it's a pain at best. It's probably easier to trash the projects and recreate them. [21:21:57] bd808: ^ [21:22:12] Coren: we need to find out new service groups created in the meantime too [21:22:24] *nod* I can do that [21:23:16] YuviPanda: That, otoh, I'm not sure where to find. Unlike projects, there isn't an entry in ldap /besides/ the actual group I can match against. [21:23:43] YuviPanda: But unless I'm mistaken, any addition of a user to a service group will create the group. [21:24:06] Coren: can you try it with the 'quarry' group? I created it during the missing bits [21:24:40] That group is there and works fine already, afaict. [21:24:53] marc@tools-bastion-01:~$ getent group tools.quarry [21:24:53] tools.quarry:*:52788:yuvipanda [21:24:58] (03CR) 10Madhuvishy: [WIP] apache: Add role to serve static sites on multiple hosts using apache (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/258096 (owner: 10Madhuvishy) [21:25:24] YuviPanda: need to first get it to work from the right doc root - will look at it now [21:25:42] (03PS5) 10BBlack: Make /api/rest_v1/ work for test.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/258498 (https://phabricator.wikimedia.org/T120977) (owner: 10GWicke) [21:25:47] madhuvishy: ok! 
do you have a working test instance? [21:25:50] (CR) BBlack: [C: 2 V: 2] Make /api/rest_v1/ work for test.wikipedia.org [puppet] - https://gerrit.wikimedia.org/r/258498 (https://phabricator.wikimedia.org/T120977) (owner: GWicke) [21:26:01] YuviPanda: yes. [21:26:08] awesome [21:35:12] (CR) Dzahn: "what is the actual change you made here, Alex Monk?" [dns] - https://gerrit.wikimedia.org/r/258483 (https://phabricator.wikimedia.org/T120885) (owner: Papaul) [21:37:12] operations, Reading-Admin, Reading-Community-Engagement: STRATEGY TEST: redirect small portion of unauthenticated desktop users to mobile web - https://phabricator.wikimedia.org/T117826#1874370 (JKatzWMF) [21:37:32] operations, Reading-Admin, Reading-Community-Engagement: TEST: redirect small portion of unauthenticated desktop users to mobile web - https://phabricator.wikimedia.org/T117826#1784410 (JKatzWMF) [21:48:29] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [5000000.0] [21:52:29] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [21:53:37] operations, Traffic, HTTPS, Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1874420 (valhallasw) [21:56:53] (CR) Jforrester: VisualEditor: Provide framework for enabling an A/B test for IPs (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/258404 (owner: Jforrester) [21:56:56] (PS2) Jforrester: VisualEditor: Provide framework for enabling an A/B test for IPs [mediawiki-config] - https://gerrit.wikimedia.org/r/258404 [21:57:03] (PS2) Jforrester: VisualEditor: Don't set ShowBetaWelcome, now set in repo [mediawiki-config] - https://gerrit.wikimedia.org/r/258405 [21:58:47] (CR) Jforrester: VisualEditor: Don't set ShowBetaWelcome, now set in repo (1 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/258405 (owner: 10Jforrester) [21:59:02] (03CR) 10Jforrester: [C: 04-1] "Depends on Idb20542 in VE-MW being deployed everywhere first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258405 (owner: 10Jforrester) [22:00:11] (03PS5) 10Dzahn: Add mgmt DNS asset tag entries for new 8 misc servers [dns] - 10https://gerrit.wikimedia.org/r/258483 (https://phabricator.wikimedia.org/T120885) (owner: 10Papaul) [22:00:29] (03CR) 10Dzahn: [C: 032] Add mgmt DNS asset tag entries for new 8 misc servers [dns] - 10https://gerrit.wikimedia.org/r/258483 (https://phabricator.wikimedia.org/T120885) (owner: 10Papaul) [22:02:38] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0] [22:03:48] doing an out of band parsoid deploy to enforce some resource limits during parsing /cc greg-g [22:03:54] !log starting parsoid deploy [22:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:08:05] 6operations, 6Services: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874482 (10GWicke) [22:08:31] 6operations, 6Services: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1873437 (10GWicke) [22:09:47] !log restarted parsoid on wtp1002 as a canary [22:09:48] 6operations, 6Services: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1873437 (10GWicke) [22:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:12:01] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874487 (10GWicke) [22:15:40] looking good. resource limit working as intended. restarting parsoid on all nodes. [22:20:13] can some root kill process 14599 on wtp1004? 
[22:20:41] andrewbogott, ^ [22:20:59] subbu: one second... [22:21:52] i wasn't sure who was online at this time, so pinged you :) [22:21:58] PROBLEM - Parsoid on wtp1004 is CRITICAL: Connection refused [22:22:17] subbu: how’s that? [22:23:00] should recover soon. [22:23:07] there is one stuck process on wtp1016 .. let me find it for you. :) [22:23:38] 29886 on wtp1016 [22:23:59] RECOVERY - Parsoid on wtp1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.027 second response time [22:24:12] !log finished deploying parsoid sha ebd62ab5 [22:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:24:48] PROBLEM - Parsoid on wtp1016 is CRITICAL: Connection refused [22:25:29] andrewbogott, and 32341 on wtp1021 as well .. besides 29886 on wtp1016. [22:25:38] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000000.0] [22:26:08] PROBLEM - Parsoid on wtp1021 is CRITICAL: Connection refused [22:26:47] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [22:29:05] andrewbogott, you there? [22:29:15] yes, doing [22:29:32] that should be all of them [22:29:41] yes. [22:30:19] RECOVERY - Parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.026 second response time [22:30:51] andrewbogott, thanks! 
:) [22:31:07] RECOVERY - Parsoid on wtp1016 is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.012 second response time [22:32:44] (03PS1) 10Dzahn: nova::controller: remove firewall hole for http(s) [puppet] - 10https://gerrit.wikimedia.org/r/258535 (https://phabricator.wikimedia.org/T120449) [22:34:58] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0] [22:35:08] 6operations, 6Services, 7Security-General: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240#1874521 (10GWicke) [22:35:09] (03CR) 10Andrew Bogott: [C: 031] "What could possibly go wrong?" [puppet] - 10https://gerrit.wikimedia.org/r/258535 (https://phabricator.wikimedia.org/T120449) (owner: 10Dzahn) [22:40:46] 6operations, 10ops-eqiad, 6Labs, 10Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1874524 (10coren) @cmjohnson: We need to get our hands on another then; perhaps put in a req for one or is there one in Dallas we could ship accross? 
[22:41:48] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 1.00% above the threshold [1000000.0] [22:43:19] (03CR) 10Dzahn: "doing https://gerrit.wikimedia.org/r/#/c/258535/ instead to solve this" [puppet] - 10https://gerrit.wikimedia.org/r/257034 (https://phabricator.wikimedia.org/T120449) (owner: 10Andrew Bogott) [22:43:58] (03PS2) 10Dzahn: nova::controller: remove firewall hole for http(s) [puppet] - 10https://gerrit.wikimedia.org/r/258535 (https://phabricator.wikimedia.org/T120449) [22:44:24] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1476/ compiler shows this does not touch silver but only labcontrol" [puppet] - 10https://gerrit.wikimedia.org/r/258535 (https://phabricator.wikimedia.org/T120449) (owner: 10Dzahn) [22:46:48] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0] [22:53:35] (03CR) 10Dzahn: "fixed with firewalling, i think we don't need this one, it would not hurt though either" [puppet] - 10https://gerrit.wikimedia.org/r/257034 (https://phabricator.wikimedia.org/T120449) (owner: 10Andrew Bogott) [22:54:41] (03Abandoned) 10Andrew Bogott: Remove default apache configs from labs puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/257034 (https://phabricator.wikimedia.org/T120449) (owner: 10Andrew Bogott) [22:57:39] (03PS2) 10CSteipp: Set initial Staff password policy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258387 (https://phabricator.wikimedia.org/T104370) [22:58:10] (03CR) 10CSteipp: Set initial Staff password policy (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/258387 (https://phabricator.wikimedia.org/T104370) (owner: 10CSteipp) [23:06:21] ostriches: RoanKattouw: hi! We've decided it looks good to go ahead with our patch, so I'm just looking at ways to summon K4 (she was sick this morning, I learned)... Hope we're still potentially good, thanks again in advance [23:13:29] ostriches: Hi! 
[23:13:48] ostriches: I wanted to check in about the Friday deployment we were proposing... [23:16:14] ostriches: I've been through this a few times now, and I wanted to clarify that K4-713 doesn't want to be the ultimate sign-off on any of this stuff--the spirit of the law is actually that December deployment should include coordination with any fr-tech member, not the "head" by any means. [23:18:41] Fair enough :) [23:19:28] * awight lops off own head ;) [23:20:34] It's always slightly reckless to Fri-deploy, but we're talking about a lot of potential donations. For context, this is a bug that reduces CentralNotice impressions by about 55%... [23:20:56] I'll be on the line for cleaning up any fallout over the weekend. [23:21:08] ostriches: awight: so we'll go ahead with the deploy? it's just this patch https://gerrit.wikimedia.org/r/#/c/258414/ (that we'll cherry-pick onto our deploy branch) [23:21:27] +1 from me [23:22:48] awight: didn't the list get amended last time to fix that? [23:23:42] mutante: Sorry, which list? [23:23:53] awight: people who sign-off [23:24:05] i remember the same discussion from before [23:24:11] like you just said [23:24:44] Hmmm [23:24:53] just seems easier to change that [23:24:59] mutante: I have pretty bad deja vu about this ;) but hopefully that change was made. fr-tech membership is currently: cwd ejegg AndyRussG XenoRyet eileen awight [23:25:58] oic, there's some text at https://wikitech.wikimedia.org/wiki/Deployments#End_of_year.2Fbeginning_of_new_year_.27code_freezes.27 that incorrectly says "greg / katie" [23:26:02] I also approve this deploy... [23:26:09] yea, the deja vu was the only reason i said something [23:26:26] Sorry about all this... it's chafing for us as well ;) [23:26:38] not about this specific one, just thought it would be easier than repeat it [23:27:17] still we're past that date... maybe the main issue is that it's Friday? [23:27:24] mutante: If you have URLs related to this signoff list, please forward. 
I'm gonna fix the Deployments page, but I don't know where else this is mentioned. [23:28:05] awight: eh:) where does "K4-713 doesn't want to be the ultimate sign-off" come from then [23:28:40] mutante: I think her first email said as much, otherwise it's been a few clarifying conversations IRL. [23:28:56] * ostriches makes an iced coffee [23:29:37] ok, so IRL means it doesn't really exist.. shrug [23:30:17] the e-mail says "total lockdown on non-critical changes deployed to [23:30:17] the Central Notice extension in Q2, unless there is a member of [23:30:17] fr-tech fully driving those changes." [23:30:36] But for other things it seems to say tell her or Greg [23:30:42] Yeah there was just some miscommunication. If awight is driving this I'm all ok with it. [23:30:50] * ostriches adjusts his greg hat [23:30:52] ostriches: wee thanks! [23:31:24] RoanKattouw: hi! still up 4 a deploy? [23:31:30] Sure [23:32:25] Yeah if awight says yes, I'm good [23:32:40] Yep ^ [23:32:55] I say yay! [23:33:13] * awight is still fruitlessly trying to find the Code Freeze Manifesto [23:34:28] awight: manifestos are for mondays. it's friday! [23:35:04] my comments were unrelated to this specific change or trust or a specific user and i don't want to slow anyone down. it was only about repeating the discussion about rules, instead of just changing the rules, so you don't have to repeat this every time. that was all [23:35:31] go ahead, you all agree [23:36:12] mutante: it was fine ;) [23:36:32] All I could find was https://wikitech.wikimedia.org/wiki/Deployments#End_of_year.2Fbeginning_of_new_year_.27code_freezes.27 [23:36:35] no rules were changed or broken. it was just a miscommunication :) [23:36:37] I've tweaked the language... [23:36:50] AndyRussG: OK, let's talk details, what am I deploying? 
Here's the wmf_deploy one https://gerrit.wikimedia.org/r/#/c/258641/ [23:38:19] mutante: Thanks for pointing that out, hopefully we make the protocol more clear next year. Hopefully that Deployments section I changed ^ will be good enough for now. [23:38:54] awight: i think it is, cool! [23:42:36] RoanKattouw: https://gerrit.wikimedia.org/r/#/c/258645 [23:42:56] RoanKattouw: ostriches: this should deploy should be documented on Deployments page, tho, right? [23:43:04] Yes plz [23:43:11] Add an entry [23:43:29] ostriches: K will do :) [23:53:08] ostriches: K Deployment page updated [23:53:25] tyvm [23:58:12] AndyRussG: To verify, I'm deploying one change, called "Improve impression diet state machine", is that right? [23:58:22] RoanKattouw: yep! :) [23:58:42] That's probably one of the strangest collections of nouns I've ever seen in one sentence :D [23:58:46] OK here goes [23:59:27] !log catrope@tin Synchronized php-1.27.0-wmf.8/extensions/CentralNotice: Improve impression diet state machine (duration: 00m 32s) [23:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:59:37] AndyRussG: Done ---^^